I have N < 2^n randomly generated n-bit numbers stored in a file the lookup for which is expensive. Given a number Y, I have to search for a number in the file that is at most k hamming dist. from Y. Now this calls for a C(n 1) + C(n 2) + C(n 3)...+C(n,k) worst case lookups which is not feasible in my case. I tried storing the distribution of 1's and 0's at each bit position in memory and prioritized my lookups. So, I stored probability of bit i being 0/1:
Pr(bi=0), Pr(bi=1) for all i from 0 to n-1.
But it didn't help much since N is too large and have almost equal distribution of 1/0 in every bit location. Is there a way this thing can be done more efficiently. For now, you can assume n=32, N = 2^24.
Perhaps you could store it as a graph, with links to the next closest numbers in the set, by hamming distance, then all you need to do is follow one of the links to another number to find the next closest one. Then use an index to keep track of where the numbers are by file offset, so you don't have to search the graph for Y when you need to find its nearby neighbors.
You also say you have 2^24 numbers, which according to wolfram alpha (http://www.wolframalpha.com/input/?i=2^24+*+32+bits) is only 64MB. Could you just put it all in ram to make the accesses faster? Perhaps that would happen automatically with caching on your machine?