How was there no collision among 50,000 random 7-digit hex strings? (The Birthday Problem)


I've encountered some code that generates a number of UUIDs via UUID.randomUUID(), takes the last 7 hex digits of each (UUID.randomUUID() produces version 4 UUIDs, whose random bits are uniformly distributed, so those digits are effectively random), and uses that as a key to insert rows into a database.

I wondered what the probability of collision was. I remembered the Birthday Problem. This is an instance of that problem, is it not? Instead of 365 days in a year, there are 16^7 possible strings. Then according to that Wikipedia page, the probability of collision given n strings is

p(n; d) ≈ 1 - ((d-1)/d)^(n(n-1)/2)

where d is 16^7.

A colleague asserted that they were able to insert 50,000 rows without an issue. I doubted this, because the probability of collision with n = 50,000 and d = 16^7 is

1-((16^7-1)/16^7)^(50000*(50000-1)/2) = 99.05%
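
For reference, here's a quick standalone check (my own sketch, separate from the code in question) that plugs n = 50,000 and d = 16^7 into that approximation:

// Rough check of the birthday-problem approximation for d = 16^7 and n = 50,000.
public class BirthdayCheck {
    public static void main(String[] args) {
        double d = Math.pow(16, 7);          // 268,435,456 possible 7-hex-digit strings
        long n = 50_000;
        double pairs = n * (n - 1) / 2.0;    // number of distinct pairs of strings
        double pNoCollision = Math.pow((d - 1) / d, pairs);
        System.out.printf("P(at least one collision) = %.4f%n", 1 - pNoCollision);
        // Prints roughly 0.9905, i.e. a ~99% chance of at least one collision.
    }
}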

But then I checked our database. Indeed the 50,000 inserts succeeded.

How was that possible? (Besides that they got really lucky.) Am I misapplying the Birthday Problem?


Edit

Here's a minimal test that shows there should have been collisions.

import java.util.UUID;
import java.util.HashSet;

public class MyClass {
    public static void main(String[] args) {
        final int N = 50000;
        for (int trial = 0; trial < 10; trial++) {
            HashSet<String> strings = new HashSet<>();
            boolean success = true;
            for (int i = 0; i < N; i++) {
                String uuid = UUID.randomUUID().toString();
                String id = uuid.substring(uuid.length() - 7);
                if (strings.contains(id)) {
                    success = false;
                    // System.out.println(id);
                    break;
                }
                strings.add(id);
            }
            System.out.println(success);
        }
    }
}

For N=50,000, 10 out of 10 attempts had a collision—which would be expected of a 99% collision rate. For N=10,000, I see 0, 1, 2, or 3 out of 10 attempts with collisions, which all fall within expectations for a 17% collision rate.
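
(For completeness: with N = 10,000 the same formula gives 1 - ((16^7-1)/16^7)^(10000*(10000-1)/2) ≈ 17%, so across 10 independent runs you'd expect about 1.7 runs with a collision, which makes 0 to 3 entirely plausible.)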

3 Answers

Andrew Cheong (BEST ANSWER)

Finally I have an explanation. Thank you everyone for helpful ideas.

tl;dr - I assumed there must have been a unique constraint at the time of the insert, because there were indeed 50,000 distinct codes in the database. It turned out there was no such constraint at that time. There were in fact duplicates among the original 50,000; a month later they were found and fixed with a one-time SQL statement (by appending a "1" to each duplicate).

Full explanation:

This code was about generating one-time use promo codes such as SUMMER23-65ACFF9. This is how I reasoned that indeed 50,000 distinct codes had been inserted into the database:

[screenshot: a query counting distinct promo_code_id values, returning 50,000]

This table didn't have a timestamp field (e.g. last_modified), or I might have had a hint sooner. I only knew that the batch of 50,000 promo codes was inserted on January 6th, 2023, three months ago.


I browsed the repo at the last commit before January 6th, 2023 to see if something about the code back then might have allowed 50,000 codes to succeed. The relevant code was this:

[screenshot: the insert code at that commit, including the calls to .rollback()]

I believed that, because of the calls to .rollback(), the insert of 50,000 rows was being done atomically. In other words, if a single insert failed, all the inserts up until then should have been rolled back.
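
The code itself isn't reproduced here, but the pattern looks roughly like this hypothetical JDBC sketch (table and column names are illustrative, not the actual schema):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch of an all-or-nothing batch insert: autocommit is off,
// and if any single insert fails (e.g. on a unique constraint), the whole
// batch is rolled back. The promo_code table name is made up for illustration.
public class AtomicInsertSketch {
    static void insertCodes(Connection conn, Iterable<String> codes) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO promo_code (promo_code_id) VALUES (?)")) {
            for (String code : codes) {
                ps.setString(1, code);
                ps.executeUpdate();
            }
            conn.commit();      // reached only if every insert succeeded
        } catch (SQLException e) {
            conn.rollback();    // any failure undoes all inserts so far
            throw e;
        }
    }
}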

(So one possibility was that my colleague(s) just kept retrying and retrying until they hit the 1% jackpot. But when I asked them, that didn't seem to be the case. They did not recall having to retry at all.)

I did wonder if the constraint for unique promo_code_id existed at the time of insertion. I checked that it did exist:

[screenshot: the unique constraint on promo_code_id]

I was hoping to find a timestamp for when this constraint was created but didn't immediately see one. I should have pursued that a little further, but I didn't, because I had already queried for a count of distinct promo_code_ids and gotten 50,000 (see the first screenshot above). Constraint or no constraint, the probability of ending up with 50,000 distinct codes was still less than 1%. This is where I made a faulty assumption: that the codes hadn't been tampered with since.

Today I came across a Liquibase XML change in February (a month after the "successful" 50,000 inserts) where we apparently added the constraint:

[screenshot: the Liquibase changeset from February adding the unique constraint, along with its accompanying SQL]

But notice the SQL that comes with that change, besides adding the unique constraint. We essentially ran a script to add a "1" to the end of all duplicate codes. So that is how the "insert of 50,000 codes without duplicates" succeeded: It didn't—it was without constraint in the first place, and corrected afterward.
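
To make the fix-up concrete, here's a tiny hypothetical illustration of the idea (not the actual migration SQL): the first occurrence of a code is kept, and later duplicates get a "1" appended.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the one-time fix-up: the first occurrence of a
// code is kept as-is, and any later duplicate gets a "1" appended.
public class DedupSketch {
    public static void main(String[] args) {
        List<String> codes = List.of("65ACFF9", "65ACFF9", "1B2C3D4");
        Set<String> seen = new HashSet<>();
        for (String code : codes) {
            // add() returns false if the code was already seen, i.e. a duplicate.
            String fixed = seen.add(code) ? code : code + "1";
            System.out.println(fixed);   // 65ACFF9, 65ACFF91, 1B2C3D4
        }
    }
}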

rzwitserloot

Running the exact code you pasted, I get false, 10 times. Which means: Yes, collision occurs, quite reliably. Continuing to run until a 'true' is returned, I get:

7685
27479
15262
46177
18297
17230
14050
12091
39249
8921

Which seems to match the math; that's about what you'd expect. (These numbers represent the point of collision: e.g. the 39249th id added collided with an already existing id.)

I assume that you have asked this question because you didn't actually run this code on your machine or misinterpreted what 'false' means.

If truly you run the EXACT code you pasted and you get anything but 10x false (well, 1 in 120 or so would be a true by happenstance), then something very, very bizarre is going wrong, most likely related to exactly how random number generation works on your bizarro JVM. But, that sounds very farfetched indeed.

Here is the slightly adjusted code to print point-of-duplication:

        for (int trial = 0; trial < 10; trial++) {
            HashSet<String> strings = new HashSet<>();
            int i = 0;
            while (true) {
                i++;
                String uuid = UUID.randomUUID().toString();
                String id = uuid.substring(uuid.length() - 7);
                if (strings.contains(id)) {
                    // Print how many ids were generated before the first duplicate.
                    System.out.println(i);
                    break;
                }
                strings.add(id);
            }
        }

Some explanations for what your friend has observed. These are wild stabs in the dark:

  • They do have collisions and misconstrued the situation: Whatever checks they did to ensure there are no collisions was done erroneously.
  • Their code will pick up on a collision and 'reroll the dice' (see the sketch after this list). In that case your application will grind to a halt over time: as you get closer and closer to 16^7 entries in the database, the number of rerolls on the random UUID until you luck into one that doesn't collide grows rapidly. Your friend didn't tell you this, or forgot that their code does this.
  • The query as designed doesn't actually insert the last 7 characters of the UUID, but instead inserts the whole UUID.
  • The query as designed doesn't actually insert the last 7 characters of the UUID, but instead the db default which is e.g. an auto increment (a db SEQUENCE).
  • The UUID generator your friend uses isn't a 'normal' one (that randoms it up) but a modified variant; various such variants exist; they could for example have their own UUID generator that just generates 000001, 000002, and so on. In which case you won't get a collision until 16^7 records are added.
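
Here's a hypothetical sketch of the 'reroll the dice' pattern from the second bullet (not your friend's actual code, just an illustration):

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical "reroll on collision" generator: keep drawing a new 7-hex-digit
// suffix until it doesn't collide with one already used. This always ends with
// distinct ids, so no insert would ever fail; the cost is that the number of
// rerolls climbs as the 16^7 id space fills up.
public class RerollSketch {
    public static void main(String[] args) {
        Set<String> used = new HashSet<>();
        for (int n = 0; n < 50_000; n++) {
            String id;
            do {
                String uuid = UUID.randomUUID().toString();
                id = uuid.substring(uuid.length() - 7);
            } while (!used.add(id));   // add() returns false on a duplicate -> reroll
        }
        System.out.println("Generated " + used.size() + " distinct ids");  // 50000
    }
}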
OldFrank

Interesting statistics - I'm missing the business-side view in this discussion. Assume that 2% - 5% of targeted customers actually use the promo code: what is the chance of a customer getting a "your promo code has already been used" message AND going to the competition? The rate of customer loss is probably not worth the effort of all this complex dual-platform uniqueness enhancement. From a statistical viewpoint, adding a "1" at the DB level won't fix a triple occurrence, but as stated above, it won't matter.