I was working on linear probing. Which hashes the values on mod of table size and wrote some code for it.
public class LinearProbing
{
private int table[];
private int size;
LinearProbing(int size)
{
this.size=size;
table=new int[size];
}
public void hash(int value)
{
int key=value%size;
while(table[key]!=0)
{
key++;
if(key==size)
{
key=0;
}
}
table[key]=value;
}
public void display()
{
for(int i=0;i<size;i++)
{
System.out.println(i+"->"+table[i]);
}
}
}
It works fine for every value except zero(0). When zero is in values to be hashed, as in java array each index is initially initiated with zero. Checking with zero to see whether the index is free or not causing trouble if zero is to be hashed and can be overwritten. I also checked with equality with null but it raises an error type mismatch.
Does anyone have any suggestion?
Computers don't work that way, at least, not without paying a rather great cost.
Specifically, a
new int[10]quite literally just creates a contiguous block of memory that is precisely large enough to hold 10intvariables, and not a bit more than that. Specifically, each int will cover 32 bits, and those bits can be used to represent precisely 2^32 different things. Think about it: If I give you a panel of 3 light switches, and all you get to do is walk in, flip some switches, and walk back out again, then I walk in and I get to look at what you have been flipping, and that is all the communication channel we ever get, we can pre-arrange for 8 different signals. Why 8? Because that's2^3. A bit is like that lightswitch. It's on, or off. There is no other option, and there is no 'unset'. There is no way to represent 'oh, you have not been in the room yet' unless we 'spend' one of our 8 different arrangements on this signal, leaving only 7 left.Thus, if you want each 'int' to also know 'whether it has been set or not', and for 'not set yet' to be different from any of the valid values, you need an entire new bit, and given that modern CPUs don't like doing work on sub-word units, that one bit is excessively expensive. In either case, you have to program it.
For example:
This rather complex piece of machinery 'packs' that additional 'is it set?' bit into a separate array called
set, which we can get away with making 1/32nd the size of the whole thing, as each int contains 32 bits and we just need 1 bit to mark an index slot as 'unset'. Unfortunately, this means we need to do all sorts of 'bit wrangling', and thus we're using the bitwise OR operator (|=), and bit shifts (<<and>>) to isolate the right bit.This is why, usually, this is not the way, bit wrangling isn't cheap.
It's a much, much better idea to take away exactly one of the 2^32 different values a hash can be. You could choose 0, but you can also choose some arbitrarily chosen value; there is a very minor benefit to picking a large prime number. Let's say 7549.
Now all you need to do is decree a certain algorithm: The practical hash of a value is derived from this formula:
Tada: This algorithm means '7549' is free. No practical hash can ever be 7549. That means we can now use 7549 as marker as meaning 'unset'.
The fact that 6961 is now doubled up is technically not relevant: Any hash bucket system cannot just state that equal hashes means equal objects - after all, there are only 2^32 hashes, so collisions are mathematically impossible to avoid. That's why e.g. java's own HashMap doesn't JUST compare hashes - it also calls
.equals. If you shove 2 different (as in, not.equals) objects in the same map that so happen to hash to the same value, HashMap is fine with it. Hence, having more conflicts around 6961 is not particularly relevant.The additional cost associated with the additional chance of collision on 6961 is vastly less than the additional cost associated with keeping track of which buckets have been set or not. After all, assuming good hash distribution, our transformation algorithm that frees up 7549 means 1 in 4 billion items happens to collide twice more likely. That's... an infinitesimal occurrence on top of another infinitesimal, it's not going to matter.
NB: 6961 and 7549 are randomly chosen prime numbers. Prime numbers are merely slightly less likely to collide, it's not crucial that you pick primes here.