Finding closest Hamming distance


I have N < 2^n randomly generated n-bit numbers stored in a file for which lookups are expensive. Given a number Y, I have to search the file for a number that is at most Hamming distance k from Y. A naive search requires C(n,1) + C(n,2) + ... + C(n,k) lookups in the worst case, which is not feasible in my case. I tried storing the distribution of 1s and 0s at each bit position in memory and prioritizing my lookups. So I stored the probability of bit i being 0/1:

Pr(bi=0), Pr(bi=1) for all i from 0 to n-1.

But it didn't help much, since N is too large and the distribution of 1s and 0s is almost uniform at every bit position. Is there a way this can be done more efficiently? For now, you can assume n = 32 and N = 2^24.
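To see why the naive candidate enumeration blows up, here is a quick check of how many probes C(n,1) + C(n,2) + ... + C(n,k) implies for n = 32 at a few values of k:

```python
from math import comb

n = 32
for k in range(1, 6):
    # Number of distinct n-bit values at Hamming distance 1..k from Y,
    # i.e. the worst-case lookup count for the naive approach.
    candidates = sum(comb(n, i) for i in range(1, k + 1))
    print(f"k={k}: {candidates} candidate lookups")
```

Even at k = 3 that is already thousands of expensive file lookups per query.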

There are 5 answers below.

Answer (2 votes):

Perhaps you could store the set as a graph, with each number linked to its closest neighbors by Hamming distance; then all you need to do is follow a link from one number to another to find the next closest one. Use an index mapping each number to its file offset so you don't have to search the graph for Y when you need its nearby neighbors.

You also say you have 2^24 numbers, which according to Wolfram Alpha (http://www.wolframalpha.com/input/?i=2^24+*+32+bits) is only 64 MB. Could you just put it all in RAM to make the accesses faster? Perhaps that would happen automatically with caching on your machine.

Answer (1 vote):

If your application can afford some extensive preprocessing, you could, as you generate the n-bit numbers, compute all the other numbers that are at most distance k from each one and store them in a lookup table, something like a map from each number to the list of its near neighbors. riri claims you can fit it all in memory, so hash tables might work well; otherwise you'd probably need a B+ tree for the map. Of course, this is expensive, as you mentioned before, but if you can do it beforehand you get fast lookups later, either O(1) or O(log(N) + log(2^k)).
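A minimal sketch of that preprocessing, assuming the generated numbers are available as a plain list (the `numbers` parameter is a stand-in for the file contents, not part of the original post):

```python
from itertools import combinations

def within_distance(x, n, k):
    """Yield every n-bit number at Hamming distance 1..k from x."""
    for d in range(1, k + 1):
        for bits in combinations(range(n), d):
            y = x
            for b in bits:
                y ^= 1 << b  # flip the chosen bits
            yield y

def build_table(numbers, n, k):
    """Map each number to the list of set members within distance k of it."""
    present = set(numbers)
    return {x: [y for y in within_distance(x, n, k) if y in present]
            for x in numbers}
```

Each entry costs sum of C(n,i) neighbor generations, so this front-loads exactly the work the question wanted to avoid at query time.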

Answer (1 vote):

If by "lookup" you mean searching your entire file for a specified number, and then repeating the "lookup" for each possible match, then it should be faster to just read through the whole file once, checking each entry's Hamming distance to the specified number as you go. That way you read through the file once instead of C(n,1) + C(n,2) + ... + C(n,k) times.
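The single-pass scan is straightforward; here is a sketch assuming one hex value per line (the file format is an assumption, since the post doesn't specify one):

```python
def scan_file(path, y, k):
    """One pass over a file of 32-bit hex values (one per line);
    return every entry within Hamming distance k of y."""
    matches = []
    with open(path) as f:
        for line in f:
            x = int(line, 16)
            # popcount of the XOR is the Hamming distance
            if bin(x ^ y).count("1") <= k:
                matches.append(x)
    return matches
```

For N = 2^24 entries this is a few seconds of sequential I/O, which is hard to beat without an index.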

Answer (0 votes):

Google gives a solution to this problem for k=3, n=64, N=2^34 (much larger corpus, fewer bit flips, larger fingerprints) in this paper. The basic idea is that for small k, n/k is quite large, and hence you expect that nearby fingerprints should have relatively long common prefixes if you form a few tables with permuted bit orders. I am not sure it will work for you, however, since your n/k is quite a bit smaller.
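The permuted-table scheme can be sketched roughly as follows; the table count, prefix length, and data layout here are illustrative choices, not taken from the paper. Each table applies a random bit permutation, stores the permuted values sorted, and a query scans only the entries sharing a prefix with the permuted Y:

```python
import random
from bisect import bisect_left

def permute(x, perm, n):
    """Apply a bit permutation to an n-bit number."""
    y = 0
    for new_pos, old_pos in enumerate(perm):
        y |= ((x >> old_pos) & 1) << new_pos
    return y

def build_tables(numbers, n, num_tables, seed=0):
    """Each table: a random bit permutation plus sorted permuted values."""
    rng = random.Random(seed)
    tables = []
    for _ in range(num_tables):
        perm = list(range(n))
        rng.shuffle(perm)
        entries = sorted((permute(x, perm, n), x) for x in numbers)
        tables.append((perm, entries))
    return tables

def query(tables, y, n, k, prefix_bits):
    """Check only candidates whose permuted value shares the top
    prefix_bits with the permuted y, in any table."""
    out = set()
    for perm, entries in tables:
        py = permute(y, perm, n)
        lo = py & ~((1 << (n - prefix_bits)) - 1)  # start of prefix range
        hi = lo + (1 << (n - prefix_bits))
        i = bisect_left(entries, (lo, 0))
        while i < len(entries) and entries[i][0] < hi:
            x = entries[i][1]
            if bin(x ^ y).count("1") <= k:
                out.add(x)
            i += 1
    return out
```

With enough tables, a true neighbor is likely to keep its differing bits out of the prefix in at least one permutation; shorter prefixes trade more candidates scanned for fewer misses.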

Answer (1 vote):

You can use quantum computation to speed up your search process while minimizing the required number of steps. I think Grover's search algorithm will be helpful to you, as it provides a quadratic speedup for the search problem.