Numeric ID to very short unique strings

271 Views Asked by At

I have rather long IDs 1000000000109872 and would like to represent them as strings.

However all the libraries for Rust I've found such as hash_ids and block_id produce strings that are way bigger.

Ideally I'd like 4 to maybe 5 characters, numbers are okay but only uppercase letters. Doesn't need to be cryptographically secure as long as it's unique.

Is there anything that fits my needs?


I've tried this website: https://v2.cryptii.com/decimal/base64 and for 1000000000109872 I get 4rSw, this is very short which is great. But it's not uppercase.

3

There are 3 best solutions below

8
Samwise On

This is the absolute best you can do if you want to guarantee no collisions without having any specific guarantees on the range of the inputs beyond "unsigned int" and you want it to be stateless:

def base_36(n: int) -> str:
    if not isinstance(n, int):
        raise TypeError("Check out https://mypy.readthedocs.io/")
    if n < 0:
        raise ValueError("IDs must be non-negative")
    if n < 10:
        return str(n)
    if n < 36:
        return chr(n - 10 + ord('A'))
    return base_36(n // 36) + base_36(n % 36)

print(base_36(1000000000109872))  # 9UGXNOTWDS

If you're willing to avoid collisions by keeping track of id allocations, you can of course do much better:

ids: dict[int, int] = {}
def stateful_id(n: int) -> str:
    return base_36(ids.setdefault(n, len(ids)))

print(stateful_id(1000000000109872))  # 0
print(stateful_id(1000000000109454))  # 1
print(stateful_id(1000000000109872))  # 0

or if some parts of the ID can be safely truncated:

MAGIC_NUMBER = 1000000000000000
def truncated_id(n: int) -> str:
    if n < MAGIC_NUMBER:
        raise ValueError(f"IDs must be >= {MAGIC_NUMBER}")
    return base_36(n - MAGIC_NUMBER)

print(truncated_id(1000000000109872))  # 2CS0
0
محمد جعفر نعمة On

Short Answer: Impossible.

Long Answer: You're asking to represent 10^16 digits in 36^5 (5 uppercase chars).

Actually, an uppercase/number char would be a one of 36 cases (10 numbers + 26 chars). But, 36^5 = 60,466,176 is less than 10^9, which wouldn't work.

Since 36^10 < 10^16 < 36^11, you'll need at least 11 uppercase chars to represent your (10^16) long IDs.

0
Finomnis On

As you already stated that there is even a checksum inside the original ID, I assume the new representation should contain all of its data.

In this case, your question is strongly related to lossless compression and information content.

Information content says that every data contains a certain amount of information. Information can be measured in bits.

The sad news is that now matter what, you cannot magically reduce your data to less bits. It will always keep the same amount of bits. You can just change the representation to store those bits as compact as possible, but you cannot reduce the number.

You might think of jpg or compressed movies, that are stored very compact; the problem there is they are lossy. They discard information not perceived by the human eye/ear and just delete them.

In your case, there is no trickery possible. You will always have a smallest and a largest ID that you handed out. And all the IDs between your smallest and largest ID have to be distinguishable.

Now some math. If you know the amount of possible states of your data (e.g. the amount of distinguishable IDs), you can compute the required information content like this: log2(N), where N is the number of possible states.

So let's say you have 1000000 different IDs, that would mean you need log2(1000000) = 19.93 bits to represent those IDs. You will never be able to reduce this number to anything less.

Now to actually represent them: You say you want to store them in in a string of 26 different uppercase letters or 10 different digits. This is called a base36 encoding.

Each digit of this can carry log2(36) = 5.17 bits of information. Therefore, to store your 1000000 different IDs, you need at least 19.93/5.17 = 3.85 digits.

This is exactly what @Samwise's answer shows you. His answer is the mathematically most optimal way to encode this. You will never get better than his answer. And the amount if digits will always grow if the amount of possible IDs you want to represent grows. There's just no mathematical way around that.