Tiny URL system design


I have read and watched many YouTube videos and articles, and they all propose the same solution:

  1. Use a distributed counter (e.g. ZooKeeper)
  2. Cap the counter at roughly 3.5 trillion
  3. Convert the counter value to base62

This is all fine while the counter value is small, e.g. generated counter value 120001 => base62 value FMJQmhBR.

But when the counter produces a large value, the length of the base62 string also grows: generated counter value 120003658 => base62 value HRGZF8RiHC6y.
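For reference, the counter-to-base62 step (steps 1-3 above) can be sketched as follows. This uses the conventional 0-9, A-Z, a-z alphabet; note that with this alphabet 120001 encodes to only three characters, so the 8-character examples above presumably come from a different alphabet or an offset counter. The point is that the output length grows with the counter, which is exactly the problem being asked about:

```python
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def to_base62(number: int) -> str:
    """Convert a non-negative counter value to its base62 representation."""
    if number == 0:
        return BASE62[0]
    digits = []
    while number:
        number, rem = divmod(number, 62)  # peel off the least-significant base62 digit
        digits.append(BASE62[rem])
    return "".join(reversed(digits))

print(to_base62(120001))     # -> VDV   (3 characters)
print(to_base62(120003658))  # -> 87WQs (5 characters)
```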

So how can this be a solution for a tiny URL with an exact length of 8 characters?

https://www.linqz.io/2018/10/how-to-build-a-tiny-url-service-that-scales-to-billions.html
https://www.youtube.com/watch?v=eCLqmPBIEYs
https://www.youtube.com/watch?v=JQDHz72OA3c&t=1862s


2 Answers

Answer by ti7

Given that you are not hashing every URL, but a vaguely-predictable number, you could hash the result and take the first N bits.

However, there are many options for handling collisions:

  • ignore them - they will be rare (ideally)
  • choose the next value
  • hash the result again (with your input)
  • increment the size of the returned string
  • ...
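A minimal sketch of the hash-then-retry idea, combining the "hash the result again (with your input)" option above with a salt. The in-memory `taken` set is a stand-in for the real lookup table, and all names here are illustrative, not part of any specific implementation:

```python
import hashlib

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def short_code(url: str, length: int = 8) -> str:
    """Hash the URL and map the first 8 bytes of the digest to a fixed-length base62 string."""
    number = int.from_bytes(hashlib.sha256(url.encode()).digest()[:8], "big")
    code = ""
    for _ in range(length):
        number, rem = divmod(number, 62)
        code += BASE62[rem]
    return code

def assign_code(url: str, taken: set) -> str:
    """On collision, rehash the URL with a salt appended until a free code is found."""
    salt = 0
    code = short_code(url)
    while code in taken:
        salt += 1
        code = short_code(f"{url}#{salt}")  # "hash the result again (with your input)"
    taken.add(code)
    return code
```

Because SHA-256 avalanches, the salted rehash produces an unrelated code each retry, so collisions resolve in an expected handful of attempts even as the table fills.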

Here's a great video on cuckoo hashing (a hash-based structure relevant here):
https://www.youtube.com/watch?v=HRzg0SzFLQQ


Here's an example in Python which derives an 8-character string from the hash that should be fairly unique (this could then be stored in a sorted data structure mapping it to a URL).

It works by first hashing the value with an avalanching hash (SHA-256) and then looping to find the shortest prefix of the hex string that forms a base62 string of 8 characters or fewer.

This could be made much more efficient (for example, by binary search over the prefix length), but it may be clearer as-is, and the right approach depends hugely on unspecified algorithm requirements.

import hashlib

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

m = hashlib.sha256()
m.update("https://stackoverflow.com/questions/65714033/tiny-url-system-design".encode())
digest  = m.digest()    # hash as bytes       b',\xdb3\x8c\x98g\xd6\x8b\x99\xb6\x98#.\\\xd1\x07\xa0\x8f\x1e\xb4\xab\x1eg\xdd\xda\xd6\xa3\x1d\xb0\xb2`9'
hex_str = digest.hex()  # string of hex chars 2cdb338c9867d68b99b698232e5cd107a08f1eb4ab1e67dddad6a31db0b26039

for hashlen in range(len(hex_str), 0, -1):  # try successively shorter prefixes of the hex string
    number = int(hex_str[:hashlen], 16)  # first_n_chars(str(hex)) -> decimal
    val = ""
    while number != 0:
        val = "{}{}".format(BASE62[number % 62], val)  # prepend new base62 digit
        number = number // 62  # integer division
    if len(val) <= 8:  # stop at the first prefix that fits in 8 characters
        break

print(val)  # E0IxW0zn

base62 logic from How to fix the code for base62 encoding with Python3?

Answer by Tom

First: there absolutely is a compression limit. If your chosen representation has a maximum length, that imposes a hard limit on your key space.

Let's unpack that a little. Let's say you've got 80 guests at a party, and you want to give each guest a unique label (for their drink cup or something). If you've decided that each label will be a single letter from the English alphabet, then you only have enough unique labels for 26 guests.

Second: FMJQmhBR is not an efficient way to represent the number 120001. That number takes only 17 bits in binary: 11101010011000001. 16 bits fit in just two 8-bit characters, and three such characters can accommodate nearly 17 million unique values. And that's without any kind of special, ZIP-like compression.
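These figures are easy to check directly:

```python
n = 120001
print(n.bit_length())  # -> 17 (bits needed to represent 120001)
print(bin(n))          # -> 0b11101010011000001
print(2 ** 16)         # two 8-bit characters: 65536 distinct values
print(2 ** 24)         # three 8-bit characters: 16777216 ("nearly 17 million")
```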

--

I think most URL shorteners work essentially by assigning a counting number to each URL that someone shortens. So, the very first URL that gets submitted will be given ID=1: they save the whole URL in the database and associate it with that number. The second URL gets ID=2, etc.

That's pretty crude, though. For a variety of reasons, they don't want to hand those IDs out in order. But if they know how long they want the identifiers to be, it's not hard to hand those IDs out in random order:

  • When someone submits a URL, the system picks a random number between 0 and the highest-possible ID. If the URL identifiers are all supposed to be 8 ASCII characters, that means they pick a random number between 0 and 2^(8*8) = 1.844674407e19.
  • Then they check their DB to see if they've handed out that ID. If they have, they pick a different random number. They repeat this until they pick an ID that hasn't been handed out. (I think there are more efficient algorithms for this, but the effect is the same and this is easiest to understand.)
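The two steps above can be sketched as follows, with an in-memory set standing in for the database check; the names are illustrative, and the ID range follows the answer's 2^(8*8) figure for 8 full 8-bit characters:

```python
import secrets

MAX_ID = 2 ** (8 * 8)  # highest-possible ID for 8 eight-bit characters, per the answer above

def allocate_id(handed_out: set) -> int:
    """Pick random candidates until one hasn't been handed out, then record and return it."""
    while True:
        candidate = secrets.randbelow(MAX_ID)
        if candidate not in handed_out:  # the "check their DB" step
            handed_out.add(candidate)
            return candidate
```

As the answer notes, this retry loop is the easiest version to understand; while the keyspace is mostly empty, the first random pick almost always succeeds.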