I need to explore the topic of integer hashes for a specific application. I have a few requirements:
- integer to integer hash
- "semi" reversibility. I know a hash will not be 1-1 reversible, so please try to understand what I have in mind in terms of an n-1 hash. Let's say I have an original domain of numbers 0...n that I hash into a smaller domain 0...k. If I hash using function f(n) = k, then I want something reversible in the sense that I also have an "inverse" g(k) = {n1,n2,n3, ..., nj} are all the possible domain members that hash to k
- reasonably even and "randomish" distribution
- for my "inverse" function g, I have a tight bound on the size of the set returned, and this size is roughly the same for any given k
- fast integer hash
To explain the application a bit here... I am operating in a very memory restricted environment. I intend to not allow collisions. That is, if there is a collision with an existing value in the table, the insert operation just fails. That's ok. I don't need every insert to succeed. I am ready to make that trade off in favor of space and speed. Now the key thing is this, when I store the value in the table I need to absolutely minimize the number of bits represented. What I am hoping for is basically:
If I hash to value k, I can immediately narrow down what I store to a small subset of the original domain. If the hash is "semi reversible" and if I can enumerate all the possible domain elements hashing to k, then I can order them and assign the ordinal to each possibility. Then I would like to store that much smaller ordinal rather than the original value which will require hopefully many fewer bits. Then I should be able to fully reverse this by enumerating to the ith possibility for stored ordinal i.
The importance of the tight bound on the size of he inverse set g(k) is because I need to know how many bits I need to allocate for each ordinal, and I want to keep things relatively simple by allocating the same number of bits to each table entry. Yes. I will probably be working on smaller than a byte values though. The original domain will be of a relatively small size to start with.
I'm interested in any of your thoughts and any examples anyone might have reference to. I think this should be doable, but I would like to get an idea of the range of possible solutions.
Thanks in advance! Mark
There is no such thing as a free lunch.
If you have an even distribution then
g(k1)
will haven/k
values for eachk1
. So you end up having to storek*n/k
orn
values, which happens to be the same number you started with.You should probably be looking for compression algorithms rather than hash functions. It will improve your google charma.
That said, it is hard to suggest a compression algorithm without knowing the distribution of numbers. If it is truly random, then it will be hard to compress.