I have a large set of strings, on the order of ~10^12, and I need to choose an appropriate data structure so that, given a string, I can retrieve an associated integer value in something like O(log(n)) or O(m) time, where 'n' is the number of strings in the set and 'm' is the length of each string.
We can expect that our set of strings, each of length 'm' and encoded over some alphabet of size 'q', covers nearly all possible strings of this length. For example, imagine we have 10^12 all-unique binary strings of length m = 40: since there are 2^40 (about 1.1 * 10^12) such strings, we would have covered ~91% of the set of all possible binary strings of this length.
As such, I'm concerned about finding an appropriate hashing function for the strings that avoids collisions. Is there a good one I can use? How long will it take me to index my set of n strings?
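For concreteness, here is the kind of O(m) hash I have in mind - a plain 64-bit FNV-1a sketch (the constants are the standard FNV ones; the bucket scheme and table size are just illustrative assumptions on my part). With n ~ 10^12 keys even a 64-bit hash will see some collisions (birthday bound roughly n^2 / 2^65), so the table would still need to verify the full key:

    #include <stdint.h>
    #include <stddef.h>

    /* Classic 64-bit FNV-1a over a string of length m: O(m) time. */
    static uint64_t fnv1a_64(const unsigned char *s, size_t m)
    {
        uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
        for (size_t i = 0; i < m; i++) {
            h ^= s[i];
            h *= 0x100000001b3ULL;            /* FNV prime */
        }
        return h;
    }

    /* Map a key to one of 2^k buckets using the top k bits of the hash
     * (k must be between 1 and 63). */
    static size_t bucket_of(const unsigned char *s, size_t m, unsigned k)
    {
        return (size_t)(fnv1a_64(s, m) >> (64 - k));
    }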
Or should I go with a suffix tree? We know that Ukkonen's algorithm allows linear-time construction, and my guess is that this will save space given the large number of similar strings.
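For comparison, this is the kind of prefix sharing I'm picturing - a minimal pointer-per-bit trie over fixed-length binary strings (my own illustrative sketch, not an actual suffix-tree construction). It gives O(m) lookups, though at 10^12 keys the per-node pointer overhead means a real build would need a compressed or succinct layout:

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct TrieNode {
        struct TrieNode *child[2];
        int64_t value;                 /* payload, valid only at depth m */
    } TrieNode;

    static TrieNode *node_new(void)
    {
        return calloc(1, sizeof(TrieNode));
    }

    /* Insert a key given as an array of m bytes, each 0 or 1. */
    static void trie_insert(TrieNode *root, const unsigned char *bits,
                            size_t m, int64_t value)
    {
        TrieNode *cur = root;
        for (size_t i = 0; i < m; i++) {
            if (!cur->child[bits[i]])
                cur->child[bits[i]] = node_new();
            cur = cur->child[bits[i]];
        }
        cur->value = value;
    }

    /* Returns 0 and writes *out on success, -1 if the key is absent. */
    static int trie_lookup(const TrieNode *root, const unsigned char *bits,
                           size_t m, int64_t *out)
    {
        const TrieNode *cur = root;
        for (size_t i = 0; i < m; i++) {
            cur = cur->child[bits[i]];
            if (!cur)
                return -1;
        }
        *out = cur->value;
        return 0;
    }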
...
Hi Bob,
Long answer short: the classic HASH + B-tree approach is strong and superfast.
Whether 10 million or 10 billion strings are stored in such a structure doesn't matter - you always have a very low maximum number of seeks.
Well, you need 10^12 = 1,000,000,000,000 keys - that's 1 trillion, which surprises me; even my heaviest string corpora are in the range of 1 billion.
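Just to put the seek claim in numbers (the bucket and fanout figures here are my own illustration, not taken from the linked code): hash each key into a large bucket array and keep a small B-tree per bucket, and the tree stays only a level or two deep even at 10^12 keys:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double n = 1e12;                 /* number of keys            */
        double buckets = 4294967296.0;   /* 2^32 hash buckets         */
        double fanout = 256.0;           /* B-tree branching factor   */

        /* ~233 keys land in each bucket, so the per-bucket B-tree is
         * a single level: one bucket probe plus one node read. */
        double per_bucket = n / buckets;
        double depth = ceil(log(per_bucket) / log(fanout));
        printf("keys per bucket: %.0f, B-tree depth: %.0f\n",
               per_bucket, depth);
        return 0;
    }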
Just check my implementation in C at: http://www.sanmayce.com/#Section13Level
The fastest hash table-lookup function in C is here:
http://www.sanmayce.com/Fastest_Hash/index.html#KT_torture3
It is 300-500% faster than strong CRC32 8-slice variants (both Castagnoli's & Koopman's polynomials) while featuring a similar collision rate.