Linear Hashing calculation?

1.9k Views Asked by At

I am currently studying for my exams and have came up against this question:

(5d) Suppose we are using linear hashing, and start with an empty table with 2 buckets (M = 2), split = 0 and a load factor of 0.9. Explain the steps we go through when the following hashes are added (in order):

5,7,12,11,9

The answer provided for this is:

*— —5— (0,1)
* — —5,7 —

  • split —*—5,7— — (0,1,2)

—12*—5,7— — —

  • split —12—5—*—7— (0,1,2,3)
  • split =M, M = 2*M, split = 0

*—12—5— —7—
*—12—5— —7,11—

  • split —*—5— —7,11—12— (0,1,2,3,4)

—*—5,9— —7,11—12—

  • split — —9*— —7,11—12—5— (0,1,2,3,4,5)

This answer doesn't make any sense to me and the lecturer did not go through this.

How do I tackle this question?

1

There are 1 best solutions below

2
On BEST ANSWER

I edited your question because the answer looks like a list of descriptions of the hash table state as each operation is performed. Did your professor cover linear hashing at all? The Wikipedia description mention a load factor precisely, but it's in the original LH paper by Witold Litwin. it's integral to when a controlled split occurs. I also found these descriptions:

Let l denote the Linear Hashing scheme’s load factor, i.e., l = S/b where S is the total number of records and b is the number of buckets used.

Linear Hashing by Zhang, et al (PDF)

  • The linear hashing algorithm performs splits in a deterministic order, rather than splitting at a bucket that overflowed. The splits are performed in linear order (bucket 0 first, then bucket 1, then 2, ...), and a split is performed when any bucket overflows. If the bucket that overflows is not the bucket that is split (which is the common case), overflow techniques such as chaining are used, but the common case is that few overflow buckets are needed.

snip

  • Instead of splitting on every collision, you can do a split when the "load" (which is bytes stored / (num buckets * bucket size), i.e. utilization of the data structure) crosses some watermark. This is called controlled splitting; the previously described is called uncontrolled splitting.

Linear Hashing: A new Tool for File and Table Addressing Witold Litwin, Summary by: Steve Gribble and Armando Fox, Online Berkley.edu retrieved June 16

So basically, a load factor is a means of predictably controlling when a split will occur. One implementation of linear hashing appears to be called 'uncontrolled split' which adds a new bucket and performs a split whenever a collision occurs. Using a load factor of 0.9 only has a split occur when 90% of the tables buckets are full - or rather, would be full based on the prediction that the buckets are uniformly assigned to.

Based on this and the Wikipedia article I just read, the setup is this:

  • Table is initially empty with two buckets (N = 2) - - (numbered 0 and 1)
    • N for number of buckets makes so much more sense to me than M, so I'm using that in my answer.
    • Apparently N is never changed even as new buckets are added to the table.
  • Our growth factor (L for bucket level) is 0. It is incremented every time every bucket in the table has been split once, which coincides with when our table has doubled in size.
  • Step pointer S (also called a split pointer) points to 0th bucket. It indicates which bucket will have a split applied to it next.

This follows the wikipedia article description I linked to above. Now we need to cover the hash and bucket assignment.

  • A decent hash function for integers you expect to have a normal distribution is to just use the integer itself. So for an input integer I, our hash H(I) is just I. I think this follows the answer key, which is good because the question is unanswerable without knowing H.
  • To determine which bucket an integer I is added to, one of two function values will be used, depending on whether or not the assignment points to before or after S.
    • First, calculate H(I) mod (N x 2L), which is really just I mod (N x 2L). I'm going to call this B(I) below for brevity (also for bucket). Call this the assignment address A.
    • If A is greater than or equal to S, we assign input I to address A and move on.
    • If A (B(I)) is less than S, we actually use a different hash function, I'll call B'(I), which is calculated as I mod (N x 2L + 1), giving us an actual assignment address of A'.
    • I think the reasoning for this is to keep the assignment to buckets more even as buckets are split along the way, but I don't have the mathematical proof of its importance.

I think the * in the answer's notation above denotes the location of the split pointer S. In my notation for the rest of the question below:

  • Let - denote an empty bucket, i denote a bucket with the Integer i in it, and i,j denote a bucket with both i and j in it.
  • So the first step of your answer key "— —5— (0,1)" is saying bucket 0 is empty and bucket 1 has 5 in it. I would rewrite this as - 5 for clarity.

I'm thinking your answer breakdown reads like this:

  1. Add 5 to the table.
    • The linear hashing algorithm puts it into the second bucket (index 1) because:
    • B(5) = 5 mod (2 x 20) = 5 mod (2 x 1) = 5 mod 2 = 1
    • 1 is greater than S, which is still 0, so we use 1 as the address.
    • Table now has - 5 (0th bucket empty, 1st bucket with 5 in it.
    • N, L, and S are unchanged
  2. Add 7 to the table.

    • B(7) = 7 mod 2 = 1, so 7 is added to the same bucket as 5. S still hasn't changed, so again 1 is used as the address.
    • Table now has - 5,7
    • A split occurs! Not because a bucket has overflowed, but because the load factor has been exceeded. 2 items added, 2 total buckets, 2/2 = 1.0 > 0.9 = do a split.
      • First a new bucket is added at the end of the table.
      • S is incremented to 1. N is not incremented. L is unchanged
      • The split is done on a bucket. A split means all the items in the bucket get their assignment recalculated based on the new hash table size. However, one key to linear hashing is that the actual buckets are split in order, so the 0th bucket is split even though the 1st bucket is the one thats full.
    • Post split, the table is now - 5,7 -, with buckets 0 and 2 empty, and 1 still with 5 and 7 in it.
  3. Add 12 to the table.

    • B(12) = 12 mod (2 x 20) = 12 mod 2 = 0
    • S is 1 and B(12) is 0, so we calculate B'(12) instead for our address.
    • Coincidentally, this is 12 mod (2 x 20 + 1) = 12 mod 4, which is still 0, so 12 is added to the 0th bucket.
    • Table now has 12 5,7 -, only the 3rd, new bucket is empty.
    • A split occurs again, because 3/3 = 1.0 > 0.9. This split promises to be more interesting than the last!
    • A new bucket is added to the end of the table, giving us 12 5,7 - -
    • S = 1, so the bucket with 5,7 is split. That means new buckets are picked for 5 and 7.
    • Increment S to 2. This is done after the split target bucket is picked, but before the new buckets are assigned. This ensures the new table is more evenly distributed (again, my supposition, don't have proof).
    • 5 mod 2 = 1, 1 < S, calculate 5 mod 2 x 21 = 5 mod 4 = 1. 5 is re-assigned to its same bucket.
    • 7 mod 2 = 1, 1 < S, calculate 7 mod 2 x 21 = 7 mod 4 = 3. 7 is re-assigned to 3.
    • Table now has 12 5 - 7
    • S = 2, N still equals 2, and L still = 0. S has now reached N x 2L = 2 x 20 = 2, so S is reset to 0 and L is incremented to 1.
  4. Add 11 to the table.

    • B(11) = 11 mod (2 x 21) = 11 mod 4 = 3. 11 is assigned to the 3rd bucket.
    • Table now has 12 5 - 7,11, 4 items and 4 buckets, so a split occurs again.
    • S is 0 again, so the 0th bucket with 12 is reassigned after a new bucket is added. S is incremented to 1 before choosing a new bucket for 12.
    • B(12) = 12 mod (2 x 21) = 12 mod 4 = 0. 0 < 1, so recalculate
    • B'(12) = 12 mod (2 x 21+1) = 12 mod 8 = 4. 12 is assigned to the 4th bucket.
    • Table now contains - 5 - 7,11 12
  5. Add 9 to the table.

I'll leave the steps to the last one for you. There are a few nuances to the LH algorithm that I'm not quite grasping. I might ask additional questions about them. But hopefully that's enough for you to get going on. In the future, I would recommend asking the course instructor directly.