I was reading this video lecture related to universal hashing. It shows the example of hashing IP addresses. Each IP address consists of 4, 32 bit integers, (x1,x2,x3,x4) with any xi having the maximum value of 255.
The tutorial says that the size of the hash table should be greater than 255 or any of the xis. Why is it so?
(For those of you who haven't seen the video, this comes up around 20:45).
The class of functions defined this way are functions of the form
where n is the number of buckets, each xi is in the range 0 to n - 1, each ai is in the range 0 to n - 1, and n is prime.
Your question is why all the xi's have to be less than n. The reason has to do with the proof that this family of hash functions is universal. As Tim explains in the video, one of the ways that you can prove that the hash function is universal is to consider two distinct inputs (call them x and y). That means that they have to differ in some component, and the idea is to assume without loss of generality that they differ in the fourth component. That is, x4 ≠ y4. With a bit of math, under that assumption, you can show that the probability that you get a collision is equal to the probability that this statement is true:
Here, since we chose the hash function randomly, all of the ai's are random. The key insight is that if you treat a1, a2, and a3 as fixed values, then the right-hand side of this equation just some fixed number k. You're then interested in the probability that
Because we're assuming that n > x4, that n > y4, that x4 ≠ y4, and that n is prime. This tells us two very important facts:
x4 - y4 ≠ 0 mod n. This is the main reason that we need n to be larger than the xi's. We'll see why in a minute.
x4 - y4 is coprime with n. Why is that? Well, we know that n is a prime number. Since n > x4 and n > y4, we know that x4 - y4 must be strictly between n-1 and -(n-1). Because we're assuming that n is prime, there are non nontrivial divisors of n in that range, so x4 - y4 and n are coprime.
To see why these facts matter, let's consider two cases for k. First, it's possible that k = 0. In that case, in what case will a4(x4 - y4) = k (mod n)? Since x4 - y4 ≠ 0, we know that this only happens if a4 = 0. Since a4 takes on a uniformly-random value in the range 0 through n-1 inclusive, the probability that we get a collision in this case is 1/n.
Next, suppose that k ≠ 0. In that case, what has to happen for a4(x4 - y4) = k (mod n) to be true? Well, we know that x4 - y4 ≠ 0, so it must have a multiplicative inverse modulo n because x4 - y4 is coprime with n. In fact, it has exactly one multiplicative inverse. The only possible choice of a4 that will cause this to be true is the choice where a4 is the multiplicative inverse of x4 - y4 modulo n multiplied by k. There's exactly one choice of number in the range 0 through n-1 that works, so the probability of picking it is 1/n.
Notice that if x4 - y4 were equal to zero modulo n, the above reasoning wouldn't work. In the first case, every choice of a4 would cause a collision, so the collision probability would be 1. In the second case, no choice of a4 could cause a collision, so the collision probability would be 0. These conditions would invalidate the proof.
Hope this helps!