The Use of Random IDs in Software Development and Everyday Life

The German version of this article has been published in JavaSPEKTRUM 4/2021 and can be downloaded and read as a PDF file.

Abstract

Why do ID numbers always have to be sequential, URL paths always have to follow the same systematics, and exception messages have to contain meaningless texts that do not help in an emergency anyway? Isn’t it rather the case that the answer “Because that’s the way it’s done” introduces irreversible dependencies and causes long-term additional work? Let’s play through the unusual approach of consistently relying on random identifiers in software development as well as in everyday life.

Introduction

Regular readers of my articles will probably have noticed that the links to the downloadable source code always seem slightly “cryptic”. For example, consider my article from JavaSPEKTRUM 2/2021 titled “Future and CompletableFuture” [HE4L]. There, the source code was referenced by the URL link.simplexacode.ch/va9s. The same article also contained the link link.simplexacode.ch/ewvs to one of my blog posts about floating point numbers.

At the time of printing, the first link referred to https://www.simplexacode.ch/source-code/javaspektrum_2021_2___completablefuture_2021.02.zip. However, had this link been printed as is, a) it would have been much more cumbersome for readers to type in, and b) the link would have been frozen in this form and would have to be maintained accordingly forever. My own URL shortener, by contrast, is quite simple: on my company’s Apache web server, it is nothing more than a manually maintained table of HTTP redirects in the .htaccess file. Its usefulness, however, is enormous: my company SimplexaCode is free to rename the directory source-code/ to sourcecode/ or quellcode/ one day, to implement a completely new file hierarchy, or to upload a new, corrected version of the source code (for example with the suffix _2022.12.zip, because in general no files are released without a version identifier).

The crucial consideration is that such a redirection table is needed even without a URL shortener, namely as soon as the first change becomes necessary. If the fixed directory source-code/ was printed or otherwise published in the past, but it should now be quellcodes/java/, then redirects from source-code/ to quellcodes/java/ have to be set up explicitly. Such tables become very ugly very fast, at the latest after the second change, and the risk of forgetting a link that was once published externally is relatively high. If, on the other hand, only short URLs are published, such refactoring is pretty simple: the “grooming” of the redirection table is trivial, and the caller is not even remotely affected.
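To illustrate the principle, here is a minimal Java sketch of such a redirection table. It is not the Apache configuration described above: the class `ShortlinkTable` and its methods `publish`, `retarget`, and `resolve` are hypothetical names chosen purely for this example, and the table is kept in memory for simplicity.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory redirection table: published short code -> current target URL. */
final class ShortlinkTable {

  private final Map<String, String> targets = new ConcurrentHashMap<>();

  /** Publishes a new short code and refuses to overwrite a code that is already in use. */
  void publish(String code, String targetUrl) {
    if (targets.putIfAbsent(code, targetUrl) != null) {
      throw new IllegalArgumentException("Short code already in use: " + code);
    }
  }

  /** Points an already published short code to a new target, e.g., after a directory rename. */
  void retarget(String code, String newTargetUrl) {
    if (targets.replace(code, newTargetUrl) == null) {
      throw new IllegalArgumentException("Unknown short code: " + code);
    }
  }

  /** Returns the current target of a short code, if the code has been published. */
  Optional<String> resolve(String code) {
    return Optional.ofNullable(targets.get(code));
  }
}
```

The published short code never changes; only the value behind it does. Renaming a directory therefore amounts to a single `retarget` call per affected entry, while every printed link stays valid.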

The Inevitability of Randomness

An important requirement is that such short URLs consistently come from a proper (pseudo) random generator. We will see in a moment how these are generated. Alternatives to random generators, on the other hand, are not convincing:

  • Making up random identifiers in one’s head | Humans are very bad random generators. For example, most people would consider the abbreviation duck to be much less “random” than qypz, since the latter has “more strange letters”. But both abbreviations are exactly equally likely.
  • Slapping random characters on the keyboard by hand | Try naming 100 files this way, each with four random letters. I bet no later than the 20th file there will be the first name conflict in the common folder—and it will probably be called asdf.
  • Speaking abbreviations | Should my mentioned JavaSPEKTRUM source code on the topic of “Future and CompletableFuture” thus have had the abbreviation fucf? Or rather cofu? Or futu? Or cplf? It is only a matter of time until the first conflict would occur here as well. Any reader who has ever worked in a company with more than 30 employees can probably confirm this fact with historical (two-digit) employee abbreviations. If a system ever existed at all, it is lost after the first collision, and the abbreviations eventually have to be looked up and maintained in a table again.
  • Sequential identifiers | In this case, should my first shortlink after founding my company have been aaaa (or 0001), and the shortlink to the source codes of the mentioned article been aaaq (or 0017)? What unintended information is unveiled if the reader can recognize that this is only (or already?) the 17th shortlink? How difficult is it to look at the other 16? Does the author do the same with his invoices, sending an invoice with the number 3 in summer 2022 (or conversely an invoice with the number 6521)?

UUIDs

Before we look at further application examples for the use of random identifiers, we must first clarify where we can easily get such random IDs. Most experienced software developers will probably already be acquainted with Universally Unique Identifiers (UUIDs). These are 32-digit hexadecimal numbers (not counting the hyphens), divided into five groups, which look like this: 3ecc17f8-d9c7-4de6-8b47-efb1447afd94

In Java, such random UUIDs can be easily generated using the UUID.randomUUID() method [3LPH]. While version 1 of the UUID specification still included MAC address and timestamp of the creator—which in my opinion is unacceptable—today’s prevailing version 4 is purely based on random values [EXFQ]. Nevertheless, its use in interaction with humans is cumbersome, since the IDs are not only far too long, but also difficult to read, communicate, and compare.
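A minimal sketch of the call mentioned above; the class name `RandomUuidDemo` is only a placeholder for this example.

```java
import java.util.UUID;

final class RandomUuidDemo {

  public static void main(String[] args) {
    UUID id = UUID.randomUUID();

    System.out.println(id);                      // a different value on every run
    System.out.println(id.version());            // 4 for randomly generated UUIDs
    System.out.println(id.toString().length());  // 36 = 32 hexadecimal digits + 4 hyphens
  }
}
```

The 36 characters of the string representation are what makes these IDs so unwieldy when people have to read them, type them, or compare them.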

User-Friendly and Flexible Random Identifiers

Therefore, when I founded my company more than four years ago, I already thought about how random identifiers could be generated in a more user-friendly and flexible way. The following considerations formed the basis for the design of such strings:

  • They consist of Arabic numerals and Latin letters.
  • They are always generated as blocks of four. Longer random strings are created by concatenating such blocks of four, which are then separated by hyphen-minus (the only special character).
  • Numbers and letters easily confused by the human eye are removed from the generator character set.
  • Depending on the intended use, there are four variants with respect to upper and lower case.
| Variant | Generator Character Set | Examples |
|---|---|---|
| uppercase only | without 0DO, 1I, 2Z, 5S, 8B | 4NQ9, XUYL-ALEU, AN6P-R3YR-N7HC-XAW3 |
| lowercase only | without 1l | h8d4, 3e6b-nms2, s7ng-axy2-ocsb-szqz |
| uppercase and lowercase (i.e., the random ID consists of uppercase as well as lowercase letters) | without 0DO, 1Il, 2Z, 5S, 8B, Kk | pXm4, 6x49-wfMX, AMh9-vX3Q-WCwP-r7Ya |
| uppercase or lowercase (i.e., the random ID is displayed completely in uppercase or lowercase letters, but this is not constrained in advance) | without 0DO, 1IL, 2Z, 5S, 8B (in uppercase representation) | XTYW, KJEH-RKV6, 7ARN-336F-YHHF-CVY9 |
| | without 0do, 1il, 2z, 5s, 8b (in lowercase representation) | xtyw, kjeh-rkv6, 7arn-336f-yhhf-cvy9 |

The grouping into blocks of four is also based on findings from cognitive psychology. The famous Miller’s Number states that short-term memory can process or remember 7 ± 2 information units (chunks) [AAPH]. However, this value, which dates back to 1956, is considered outdated today. Current research speaks of 4 to at most 5 chunks [PNQF]. You can easily observe this yourself: a credit card block of four is relatively straightforward to memorize for typing, whereas blocks of five, such as those used for reference numbers on Swiss payment slips, are noticeably more awkward.

As a service to you, dear reader, the Java source code for generating the described random identifiers is available at [FVEK]. In addition, a website is available at [PKC4], where you can get a new random data set analogous to the examples in the table above with just one click. That page also describes how to use the web service for machine queries.
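For a rough impression of how little code such a generator needs, here is a minimal sketch. It is explicitly not the source code linked at [FVEK]: the class `RandomIdSketch`, its methods, and the choice of `SecureRandom` are assumptions made for this example, and only the "lowercase only" and "uppercase only" variants are shown.

```java
import java.security.SecureRandom;

/** Hypothetical sketch of a block-of-four ID generator (not the code published at [FVEK]). */
final class RandomIdSketch {

  /** "Lowercase only" variant: digits and lowercase letters without 1 and l (34 characters). */
  static final String LOWERCASE = charset("0123456789abcdefghijklmnopqrstuvwxyz", "1l");

  /** "Uppercase only" variant: without 0DO, 1I, 2Z, 5S, 8B (25 characters). */
  static final String UPPERCASE = charset("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ", "0DO1I2Z5S8B");

  private static final SecureRandom RANDOM = new SecureRandom();

  /** Removes every excluded character from the full alphabet. */
  static String charset(String alphabet, String excluded) {
    StringBuilder result = new StringBuilder();
    for (char c : alphabet.toCharArray()) {
      if (excluded.indexOf(c) < 0) {
        result.append(c);
      }
    }
    return result.toString();
  }

  /** Generates the given number of blocks of four, separated by hyphen-minus. */
  static String generate(String charset, int blocks) {
    if (blocks < 1) {
      throw new IllegalArgumentException("blocks must be at least 1");
    }
    StringBuilder id = new StringBuilder();
    for (int block = 0; block < blocks; block++) {
      if (block > 0) {
        id.append('-');
      }
      for (int i = 0; i < 4; i++) {
        id.append(charset.charAt(RANDOM.nextInt(charset.length())));
      }
    }
    return id.toString();
  }

  public static void main(String[] args) {
    System.out.println(generate(LOWERCASE, 1));  // same shape as h8d4
    System.out.println(generate(UPPERCASE, 4));  // same shape as AN6P-R3YR-N7HC-XAW3
  }
}
```

Every character is drawn independently and uniformly from the character set, so the generator keeps no state and knows nothing about the identifiers it has produced before.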

Application Examples in Software Development

The following three examples show how random identifiers can be used in software development:

  • Log Entries, Exception and Error messages | Hand on heart: When analyzing errors in large log files, who does not limit themselves to the class and method name and—if available—the line number of the error, just to continue the search directly in the source code from this point on? Is it really worth the effort to write allegedly “speaking” log entries, exception, and error messages? Experience has shown that the generation of such messages is perceived by developers as tedious and unimportant work, which is also reflected in the orthographic and grammatical quality of such messages, not to mention a uniform format.

    The following listing shows two examples of an alternative based on random identifiers. If an error occurs, a full text search for "Exception Code WR7U" or "Error Code CREA" immediately takes you to the right place in the code. Since the creation of such messages is much easier, it also motivates more log entries and state checks according to the fail-fast principle; and this in turn benefits the code quality far more than infrequent but detailed log and error messages.

    Objects.requireNonNull(value, "Exception Code WR7U");
    
    if (!Modifier.isPublic(modifiers)) {
      throw new AssertionError("Error Code CREA");
    }
  • HTML IDs and JavaScript | Anyone who programs a lot in HTML and JavaScript knows the “ID flood”. Every HTML element must be given a unique ID in order to be addressable by the script. You guessed it: systematic self-speaking identifiers need a lot of cognitive energy, and the question is whether this effort is really worth it. After all, systematic identifiers are only useful if they can be (re)constructed unambiguously and without effort. If an element ID ultimately has to be looked up again and again in order to find out what it refers to, then it can also be assigned a random identifier directly.

    The listing below shows such an example. The crucial point here is again that a full-text search in the source code for the random identifier "ia2n" will certainly lead to the target faster than a search for, say, "examTopicsProblem1LearningObjectives" or even "exTopsProb1LearnObjs". And if Problem 1 is to become Problem 5 later, the random identifier remains completely unaffected.

    <div class="learning_objectives" id="ia2n">
      <h2>Learning Objectives | Problem 1</h2>
      <!-- [...] -->
    </div>
    
    <button onclick="showLearningObjectives('ia2n')">Show Learning Objectives</button>
  • enums with Fixed Values for Persistence | Legacy code may contain old or homegrown persistence frameworks. Enumerations (classically, but not necessarily enums) have a fixed value stored for each element, which is then used for mapping to and from the database.

    The following listing shows an example of persisting colors. One unfavorable aspect is that it started with a strictly sequential order 1, 2, 3, 4. There was absolutely no reason for this. In fact, if the color ORANGE is to be added later, it would have to be given the number 5, although it would have fit better between RED and YELLOW. Furthermore, when inspecting database contents, similar values from other columns (color will probably not be the only such column) can complicate the overview and the reverse lookup to the source code.

    public enum Color {
      RED(1),
      YELLOW(2),
      GREEN(3),
      BLUE(4);
    
      private final int value;
    
      private Color(int value) {
        this.value = value;
      }
    
      public int getValue() {
        return value;
      }
    }

    If, on the other hand, Color.RED had the value "WQWX" (or the random number 8630, if it has to be an integer), then a code lookup using full-text search would be trivial. By the way, you can find a comprehensive look behind the scenes of enums in my JavaSPEKTRUM article at [UYW7].
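The enum variant with random values could be sketched roughly as follows. Only the value "WQWX" for RED comes from the text; the other codes are made up for this example, and the reverse lookup `fromValue` is a hypothetical helper of the kind a homegrown persistence layer might need.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

enum Color {
  RED("WQWX"),     // value taken from the text
  ORANGE("T7FM"),  // made-up value; ORANGE fits between RED and YELLOW without renumbering
  YELLOW("H3NA"),  // made-up value
  GREEN("P9CX"),   // made-up value
  BLUE("E4RU");    // made-up value

  // Built once after all constants exist; toMap fails on duplicate values.
  private static final Map<String, Color> BY_VALUE =
      Arrays.stream(values()).collect(Collectors.toMap(Color::getValue, Function.identity()));

  private final String value;

  Color(String value) {
    this.value = value;
  }

  public String getValue() {
    return value;
  }

  /** Maps a persisted value back to its enum constant. */
  public static Color fromValue(String value) {
    Color color = BY_VALUE.get(value);
    if (color == null) {
      throw new IllegalArgumentException("Unknown color value: " + value);
    }
    return color;
  }
}
```

A value such as "WQWX" found in a database column can then be traced back to exactly one place in the source code with a full-text search, and an accidentally reused value is detected as soon as the enum class is loaded.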

Application Examples in General IT

In addition to the shortlinks already described in detail, there are also various possible applications for random identifiers in IT outside of software development:

  • Anchors and cross-references in large documents (especially in LaTeX), for example to other sections, figures or tables, to entries in the bibliography, or file names for images to be included, can advantageously be assigned random identifiers. The same argumentation as with HTML IDs and JavaScript described above applies here as well: a system with speaking identifiers needs extreme discipline, which is not only error-prone, but probably not worth the effort at all.
  • Manually assigned IP addresses and port numbers in a home or corporate network are, in my experience, often allocated sequentially (or in address blocks). New devices are added and old devices disappear. At the end of the day, a table has to be consulted and kept up to date anyway. So why not use random internal IP addresses and port assignments right from the start?

Application Examples in Everyday Life

Randomness can also be used to great advantage in everyday life. A very impressive example of this is Amazon’s warehouse management. The online shipping giant uses a method known as chaotic warehousing. Here, items are stored wherever there is space, irrespective of any sequence, product category, or the like. On [QVAV] and [G399] you can see two videos showing how this works behind the scenes at Amazon.

Students know about the advantages of mixed and therefore random flashcards, especially when learning vocabulary. It is obvious that learning a fixed vocabulary list from top to bottom would usually result in a bias towards the upper words. I myself try to avoid such bias in everyday life as well.

I choose the music album I want to listen to today via random generator. Of course, the order of these songs is also random. If I cannot keep up with answering piled-up (but not urgent) e-mails, I let randomness decide which e-mail I deal with next. If a book consists of small but independent chapters (e.g., “Effective Java”), I let chance determine which one to read; otherwise I would probably only ever read the first half of a book. When I create exam questions for my students, I use a list of all chapters relevant to the exam and then pick one randomly to design a problem for.

You see, I personally care a lot about avoiding bias in every area of life. And yes, maybe I’m a little bit crazy too … So let’s move on to do some math.

Number of Variations

How many different identifiers can be created with the described generators? For the sake of simplicity, let’s consider only a block of four lowercase letters, such as those used for my shortlinks mentioned at the beginning of this article. The generator character set consists of 10 digits and 26 lowercase letters, excluding the confusingly similar characters 1 and l. This makes 34 available characters, which are drawn according to an urn model with replacement and with order. In mathematics, these are also referred to as variations. There are 34 possibilities per character, so for four characters there are a total of 34 × 34 × 34 × 34 = 34⁴ = 1,336,336 variations. This is more than sufficient for this application purpose (short URLs).

If two blocks of four are used instead of one, then there are already 34⁸ ≈ 1.79 × 10¹², or 1.79 trillion possibilities. With four blocks of four, there are even 34¹⁶ ≈ 3.19 × 10²⁴, i.e., a 25-digit number of possible identifiers. The following table lists the number of available characters and the resulting number of variations for each upper and lower case variant.

| Variant | Number of Generator Characters | Number of Blocks of Four | Number of Variations |
|---|---|---|---|
| uppercase only | 25 | 1 | 390,625 ≈ 3.91 × 10⁵ |
| | | 2 | 152,587,890,625 ≈ 1.53 × 10¹¹ |
| | | 4 | 23,283,064,365,386,964,000,000 ≈ 2.33 × 10²² |
| lowercase only | 34 | 1 | 1,336,336 ≈ 1.34 × 10⁶ |
| | | 2 | 1,785,793,904,896 ≈ 1.79 × 10¹² |
| | | 4 | 3,189,059,870,763,704,000,000,000 ≈ 3.19 × 10²⁴ |
| uppercase and lowercase | 48 | 1 | 5,308,416 ≈ 5.31 × 10⁶ |
| | | 2 | 28,179,280,429,056 ≈ 2.82 × 10¹³ |
| | | 4 | 794,071,845,499,378,500,000,000,000 ≈ 7.94 × 10²⁶ |
| uppercase or lowercase | 24 | 1 | 331,776 ≈ 3.32 × 10⁵ |
| | | 2 | 110,075,314,176 ≈ 1.10 × 10¹¹ |
| | | 4 | 12,116,574,790,945,107,000,000 ≈ 1.21 × 10²² |

For comparison: purely random UUIDs can represent 2¹²² ≈ 5.32 × 10³⁶ different values, but they also require a total of 36 characters of space. “Short versions” of UUIDs do not exist. The uppercase and lowercase variant of my generator, on the other hand, could fit 7 blocks of four (34 characters including the hyphen-minus separators) into these 36 characters and thus generate 48²⁸ ≈ 1.19 × 10⁴⁷ different identifiers, which is 22 billion times more. This again illustrates how unergonomic UUIDs with their hexadecimal representation really are.
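The figures in this section are easy to recompute. The following sketch uses `BigInteger` to obtain exact values; the class name `VariationCount` and the method `variations` are placeholders for this example.

```java
import java.math.BigInteger;

final class VariationCount {

  /** Number of variations for the given character set size: size^(4 * blocks). */
  static BigInteger variations(int charsetSize, int blocks) {
    return BigInteger.valueOf(charsetSize).pow(4 * blocks);
  }

  public static void main(String[] args) {
    // uppercase only, lowercase only, uppercase and lowercase, uppercase or lowercase
    int[] charsetSizes = {25, 34, 48, 24};
    for (int size : charsetSizes) {
      for (int blocks : new int[] {1, 2, 4}) {
        System.out.printf("%d characters, %d block(s): %,d%n",
            size, blocks, variations(size, blocks));
      }
    }

    // Random (version 4) UUIDs carry 122 random bits.
    System.out.printf("UUID version 4: %,d%n", BigInteger.valueOf(2).pow(122));
    // Seven blocks of four from the 48-character set fit into 34 characters.
    System.out.printf("48 characters, 7 blocks: %,d%n", variations(48, 7));
  }
}
```

Because the calculation is exact, the trailing digits of the very large results can differ from figures that were rounded for display, such as the four-block values in the table.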

Collision Probabilities

Even though we now know exactly how many different IDs can be generated in theory, this information helps us only to a limited extent, since the random generator does not know the identifiers generated so far and could therefore theoretically regenerate an already existing string. So the question is how big the probability is that among n generated identifiers at least two of them have the same value.

This question corresponds to the birthday problem or birthday paradox [J6QX], which is well known (at least among mathematicians). Do you already know it? “How many people must be in a room at least so that the probability of two or more of them having their birthday on the same date (i.e., day and month) exceeds 50 %?” The answer will surprise you. You will find it at the end of this article at [TUVQ].

A generalization of the birthday problem applied to the field of cryptography is the so-called birthday attack. Not only does the underlying mathematics answer the question of how high the probability for collisions is (usually for hash functions), but it can also specify how many values n have to be generated from a value set N before they contain at least one equal value for a given probability p [QQWV]. The above mentioned website [PKC4] for generating random identifiers, which I offer as a service, also contains a calculator for the number of IDs that can be generated for an adjustable collision probability.

To give you a quick taste and sense of the magnitudes, let’s consider randomly generated primary keys for a database table consisting of two blocks of four upper and lower case letters (e.g., soug-Ys9P). According to the table above, there are 28 trillion different such identifiers. How many values can we generate in advance and load into the database in one step such that the probability of a collision is less than 1 per thousand?

The calculation for this results in approx. 237,000 entries. In other words, you would have to create an average of 1000 database tables, each with 237,000 pre-generated random IDs, before a collision occurs in a table.

With four blocks of four (e.g., Ggqc-T3XE-zVgi-GQ3q), you can already generate about 1.26 × 10¹², i.e., 1.26 trillion IDs before the collision probability reaches 1 per thousand. You will surely never need that many. If you want to use “only” 40 billion such IDs, the chance of a collision is about 1 in 1 million. If you are happy with 1 million table entries, you can use such IDs with odds of 10¹⁵ to 1 against any collision.
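These magnitudes can be checked with the usual approximation for the birthday problem: for n random values drawn from N possible ones, the collision probability is roughly p ≈ 1 − e^(−n²/(2N)), which gives n ≈ √(2N · ln(1/(1 − p))) when solved for n. The sketch below applies this approximation; it is not the calculator offered at [PKC4], and the class and method names are placeholders.

```java
final class BirthdayBound {

  /** Approximate number of random IDs from N possible values until the collision probability reaches p. */
  static double maxIds(double valueSetSize, double collisionProbability) {
    // n = sqrt(2 * N * ln(1 / (1 - p))); log1p keeps precision for small p
    return Math.sqrt(2.0 * valueSetSize * -Math.log1p(-collisionProbability));
  }

  /** Approximate probability of at least one collision among n random IDs from N possible values. */
  static double collisionProbability(double ids, double valueSetSize) {
    // p = 1 - exp(-n^2 / (2 * N)); expm1 keeps precision for tiny exponents
    return -Math.expm1(-ids * ids / (2.0 * valueSetSize));
  }

  public static void main(String[] args) {
    double twoBlocks = Math.pow(48, 8);    // uppercase and lowercase, two blocks of four
    double fourBlocks = Math.pow(48, 16);  // uppercase and lowercase, four blocks of four

    System.out.printf("%.0f%n", maxIds(twoBlocks, 0.001));               // roughly 237,000
    System.out.printf("%.3e%n", maxIds(fourBlocks, 0.001));              // roughly 1.26e12
    System.out.printf("%.3e%n", collisionProbability(4e10, fourBlocks)); // roughly 1e-6
    System.out.printf("%.3e%n", collisionProbability(1e6, fourBlocks));  // below 1e-15
  }
}
```

The approximation is very accurate as long as n is small compared to N, which is the case for all examples in this section.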

Conclusion

There are various systems for labeling data in computer science or objects in everyday life:

  • Sequences: e.g., 1, 2, 3, …; 10, 20, 30, …; A, B, C, …
  • Value ranges: e.g., colors (red, green, blue, …); cities (Munich, Zurich, Hanover, …); planets and/or moons
  • “Speaking” identifiers: e.g., the room temperature sensor shown below

[Figure: Room temperature sensor in a large indoor swimming pool with seemingly “speaking” identifiers]

In this article, I have additionally shown you the advantages and background of random identifiers and given you the appropriate tools for the job. Should you ever be faced with the task of finding suitable identifiers for a “data set” of any kind in the future, then I recommend that you ask yourself in advance the following questions:

  • Do the individual objects have a natural order or inherent sequence that really justifies labeling them sequentially?
  • Can it be guaranteed that the order can still be adhered to in the future, that pre-reserved blocks will not overflow, and that no gaps will appear?
  • Does the system of supposedly speaking identifiers really have an added value, or will they ultimately have to be looked up and maintained in a table nevertheless?
  • Do the identifiers certainly not reveal any involuntary information to the outside world, either directly or indirectly?

If not all of these questions can be answered with a convincing “yes” for the intended use, then the random identifiers presented here offer an alternative that may be unfamiliar at first but is certainly convincing. After all, it would be extremely annoying if a newly introduced system were to collapse over time, but could not be changed afterwards either.

I look forward to reading about your new computer science and everyday experiences with random identifiers. You can reach me at [MWJG] ;-)

Literature, Links, and Notes

[3LPH]
Java Platform, Standard Edition & Java Development Kit Version 19 API Specification, Class UUID, docs.oracle.com/en/java/javase/19/docs/api/java.base/java/util/UUID.html
[3YKU]
Christian Heitzmann, Theorie und Praxis von Zufallszahlengeneratoren, in: JavaSPEKTRUM 2/2019
[AAPH]
George A. Miller, The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information, in: Psychological Review 63, 1956
[EXFQ]
Internet Engineering Task Force (IETF), RFC 4122, A Universally Unique IDentifier (UUID) URN Namespace, www.rfc-editor.org/info/rfc4122
[FVEK]
Source code for download, link.simplexacode.ch/x48n
[G399]
YouTube, Inside Amazon Teil 2: So funktioniert ein Amazon Logistikzentrum!, www.youtube.com/watch?v=GzxyY4NxeWg
[HE4L]
Christian Heitzmann, Future und CompletableFuture, in: JavaSPEKTRUM 2/2021
[MWJG]
info@simplexacode.ch
[J6QX]
Wikipedia, Geburtstagsparadoxon, de.wikipedia.org/wiki/Geburtstagsparadoxon
[PKC4]
SimplexaCode, ID-Generator, link.simplexacode.ch/bk4k
[PNQF]
Daniel J. Levitin, The Organized Mind: Thinking Straight in the Age of Information Overload, Chapter 2: The First Things to Get Straight: How Attention and Memory Work, Dutton, 2014
[QQWV]
Wikipedia, Birthday attack, en.wikipedia.org/wiki/Birthday_attack
[QVAV]
YouTube, Inside Amazon: So funktioniert ein Amazon Logistikzentrum! - Teil 1, www.youtube.com/watch?v=hiZIk45tga8
[TUVQ]
At least 23 people
[UYW7]
Christian Heitzmann, Der Einsatz von enums abseits reiner Aufzählungen, in: JavaSPEKTRUM 1/2019
[W6A6]
RANDOM.ORG, www.random.org

Shortlink to this blog post: link.simplexacode.ch/2iei (2022.12)
