I need a 64-bit to 16-bit perfect hash function for a sparsely populated list of keys.
I have a dictionary in Python with 48326 keys, each 64 bits long. I would like to create a minimal perfect hash for this list of keys. (I don't want to wait a few days to calculate the MPH, so I am OK with it mapping to a 16-bit hash instead.)

The objective is to eventually port this dictionary to C as an array that contains the dict values, with the index calculated by the minimal perfect hash function taking the key as input. I cannot use external hashing libraries in the C port of the application I am building.
Question: Is there a Python library that will take my keys as input and give me the hashing parameters (based on a defined hashing algorithm) as output?
I found the library perfection 2.0.0, but since my keys are in 64-bit form it just hung, even when I tested it on a subset of 2000 keys.
EDIT: As suggested in the comments, I looked at Steve Hanov's algorithm and modified the hash function to take a 64-bit integer (changing the values of the FNV prime and offset as per this wiki page).

While I got a result, unfortunately the maps return negative index values. I can make that work, but it means adding another 4 cycles to the hash calculation to check for a negative index, which I would like to avoid.
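For reference, the lookup in that scheme looks roughly like this (a sketch of Hanov's two-level lookup, where `fnv1a_64` stands in for my modified 64-bit FNV hash); the negative-index branch is the extra check I'd like to avoid:

```python
def lookup(G, V, key):
    # G is the intermediate table, V the values table (both len(dictionary) long)
    d = G[fnv1a_64(0, key) % len(G)]
    if d < 0:
        # negative entries are direct (offset-by-one) indices into V
        return V[-d - 1]
    # non-negative entries are the salt for a second hash
    return V[fnv1a_64(d, key) % len(V)]
```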
Personally, I'd just generate a table with `gperf`, or for a large number of keys, with CMPH, and be done with it.

If you must do this in Python, then I found this blog post with some Python 2 code that is very efficient at turning string keys into a minimal perfect hash, using an intermediary table.
Adapting the code in the post to your requirements produces a minimal perfect hash for 50k items in under 0.35 seconds.
The changes I made:

- the keys are 64-bit integers, so they are converted to bytes with `int.to_bytes()` before being fed to the FNV hash;
- instead of marking direct slots with negative values, the intermediate table stores `(boolean, integer)` tuples, so no sign check is needed at lookup time.

The adapted code:
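A minimal sketch of that adaptation, assuming a 32-bit FNV-1a hash run over the key's 8 big-endian bytes (hence the same `& 0xffffffff` masking mentioned further down); `fnv_hash_int` matches the name used below, the other helper names are illustrative:

```python
# 32-bit FNV-1a parameters
FNV_PRIME = 0x01000193
FNV_OFFSET = 0x811c9dc5


def fnv_hash_int(d, key):
    """Hash a 64-bit integer key; each value of d gives a different hash."""
    if d == 0:
        d = FNV_OFFSET
    for byte in key.to_bytes(8, 'big'):
        d = ((d ^ byte) * FNV_PRIME) & 0xffffffff
    return d


def create_minimal_perfect_hash(dictionary):
    """Return (intermediate, values) tables for the keys in dictionary.

    intermediate[i] is a (flag, integer) tuple:
      (True, index) -> values[index] holds the entry directly
      (False, d)    -> rehash the key with salt d to find its slot
    """
    size = len(dictionary)
    buckets = [[] for _ in range(size)]
    intermediate = [(False, 0)] * size
    values = [None] * size

    # Group the keys into buckets by their unsalted hash.
    for key in dictionary:
        buckets[fnv_hash_int(0, key) % size].append(key)

    # Largest buckets first: find a salt d that maps every key in the
    # bucket to a distinct, still-empty slot.
    buckets.sort(key=len, reverse=True)
    b = 0
    for b, bucket in enumerate(buckets):
        if len(bucket) <= 1:
            break
        d, item, slots = 1, 0, []
        while item < len(bucket):
            slot = fnv_hash_int(d, bucket[item]) % size
            if values[slot] is not None or slot in slots:
                d, item, slots = d + 1, 0, []   # collision: try the next salt
            else:
                slots.append(slot)
                item += 1
        intermediate[fnv_hash_int(0, bucket[0]) % size] = (False, d)
        for i, key in enumerate(bucket):
            values[slots[i]] = dictionary[key]

    # Single-key buckets go straight into the remaining free slots.
    freelist = [i for i, value in enumerate(values) if value is None]
    for bucket in buckets[b:]:
        if not bucket:
            break
        slot = freelist.pop()
        intermediate[fnv_hash_int(0, bucket[0]) % size] = (True, slot)
        values[slot] = dictionary[bucket[0]]

    return intermediate, values


def perfect_hash_lookup(intermediate, values, key):
    """Look up a key known to be in the original dictionary."""
    flag, d = intermediate[fnv_hash_int(0, key) % len(intermediate)]
    if flag:
        return values[d]
    return values[fnv_hash_int(d, key) % len(values)]
```

Storing the direct-index case as a `(True, slot)` tuple is what removes the negative-value check you wanted to avoid.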
The above produces two lists with 50k entries each; the values in the first table are `(boolean, integer)` tuples, with the integers in the range `[0, tablesize)` (in theory the values could range up to 2^16, but I'd be very surprised if it ever took 65k+ attempts to find a slot arrangement for your data). Your table size is < 50k, so the above arrangement makes it possible to store each entry in this list in 4 bytes (`bool` and `short` make 3, but alignment rules add one byte of padding) when expressing this as a C array.

A quick test to see if the hash tables are correct and produce the right output again:
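For example, with a dictionary of random 64-bit keys standing in for your real data, reusing the helpers sketched above:

```python
import random

# 50k random 64-bit keys mapped to arbitrary 16-bit values
dictionary = {random.getrandbits(64): random.getrandbits(16) for _ in range(50000)}

intermediate, values = create_minimal_perfect_hash(dictionary)
assert all(
    perfect_hash_lookup(intermediate, values, key) == value
    for key, value in dictionary.items()
)
```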
You only need to implement the lookup function in C:

- The `fnv_hash_int` operation can take a pointer to your 64-bit integer, then just cast that pointer to an array of 8-bit values and increment an index 8 times to access each separate byte; use a suitable function to ensure big-endian (network) order.
- You don't need to mask with `0xffffffff` in C, as overflow on a C integer value is automatically discarded anyway.
- The table length never changes; `len(intermediate) == len(values) == len(dictionary)`, and can be captured in a constant.
- Each intermediate entry can be a struct with `flag` being a `bool` and `d` as an unsigned `short`; that's just 3 bytes, plus 1 padding byte to align on a 4-byte boundary. The datatype in the `values` array depends on the values in your input dictionary.

If you forgive my C skills, here's a sample implementation:
mph_table.h
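A sketch of what the header could look like; the value type is assumed to be `uint16_t` here, adjust it to whatever your dictionary values need:

```c
#ifndef MPH_TABLE_H
#define MPH_TABLE_H

#include <stdbool.h>
#include <stdint.h>

/* One entry of the intermediate table: either a direct index into the
   values array (flag == true) or a salt for a second hash (flag == false). */
typedef struct {
    bool flag;
    unsigned short d;
} mph_entry;

/* Look up the value stored for a 64-bit key known to be in the table. */
uint16_t mph_lookup(uint64_t key);

#endif /* MPH_TABLE_H */
```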
mph_table.c
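A sketch of the matching source; the generated header name (`mph_generated.h`) and the `TABLE_SIZE`, `intermediate` and `values` identifiers are illustrative, they are simply whatever the generator below writes out:

```c
#include "mph_table.h"
#include "mph_generated.h"   /* defines TABLE_SIZE, intermediate[] and values[] */

#define FNV_PRIME  0x01000193u
#define FNV_OFFSET 0x811c9dc5u

/* 32-bit FNV-1a over the 8 bytes of the key, most significant byte first.
   Unsigned overflow simply wraps, so no explicit 0xffffffff masking is needed. */
static uint32_t fnv_hash_int(uint32_t d, uint64_t key)
{
    int i;
    if (d == 0)
        d = FNV_OFFSET;
    for (i = 7; i >= 0; i--)
        d = (d ^ (uint8_t)(key >> (i * 8))) * FNV_PRIME;
    return d;
}

uint16_t mph_lookup(uint64_t key)
{
    mph_entry entry = intermediate[fnv_hash_int(0, key) % TABLE_SIZE];
    if (entry.flag)
        return values[entry.d];                        /* direct index */
    return values[fnv_hash_int(entry.d, key) % TABLE_SIZE];
}
```

This version extracts the bytes with shifts rather than a pointer cast, which gives big-endian order regardless of the host and so sidesteps the byte-swapping question entirely.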
which would rely on a generated header file, produced from:
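One way to write that header from the Python tables (a sketch; the file name and the assumption that the values fit an unsigned 16-bit type are mine):

```python
def write_mph_header(path, intermediate, values):
    """Write the intermediate and values tables out as static C arrays."""
    with open(path, 'w') as f:
        f.write('/* Generated file -- do not edit. */\n\n')
        f.write('#define TABLE_SIZE %d\n\n' % len(values))
        f.write('static const mph_entry intermediate[TABLE_SIZE] = {\n')
        for flag, d in intermediate:
            f.write('    {%s, %d},\n' % ('true' if flag else 'false', d))
        f.write('};\n\n')
        f.write('static const uint16_t values[TABLE_SIZE] = {\n')
        for value in values:
            f.write('    %d,\n' % value)
        f.write('};\n')


intermediate, values = create_minimal_perfect_hash(dictionary)
write_mph_header('mph_generated.h', intermediate, values)
```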
where `dictionary` is your input table.

Compiled with `gcc -O3`, the hash function is inlined (loop unrolled) and the whole `mph_lookup` function clocks in at 300 CPU instructions. A quick benchmark looping through all 50k random keys I generated shows my 2.9 GHz Intel Core i7 laptop can look up 50 million values for those keys per second (0.02 microseconds per key).