How do I find the most common words in a very big file (over 1 GB) using 1 KB or less of memory?


I have a very big text file with dozens of millions of words, one word per line. I need to find the top 10 most common words in that file. There are some restrictions: only the standard library may be used, and memory usage must stay under 1 KB.

It is guaranteed that any 10 words in that file are short enough to fit within the stated memory limit, and that there will be enough memory left for a few other variables such as counters.

The only solution I have come up with is to use another text file as additional memory and a buffer. But that seems like a bad and slow way to deal with the problem.

Are there any better, more efficient solutions?


There are 2 answers below.

BEST ANSWER

If you can create a new file, however big, you can build a simple disk-based tree database holding each word and its frequency so far. Each update costs O(log n), with n growing from 1 to N words; add the final scan of the whole N-sized tree and the total is O(N log N).
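The question names no language, but as one possible reading of this answer, Python's standard-library `dbm` module already provides a disk-backed key-value database. A minimal sketch (the `top_ten` helper and the `counts.db` file name are made up, and the backend's own buffering is ignored in the strict 1 KB accounting):

```python
import dbm
import heapq

def top_ten(words_path, db_path="counts.db"):
    # "n" always creates a fresh database file on disk
    with dbm.open(db_path, "n") as db:
        with open(words_path) as f:
            for line in f:
                word = line.strip().encode()
                if word:
                    # read-modify-write one on-disk counter per word
                    db[word] = str(int(db.get(word, b"0")) + 1).encode()
        # final scan: keep only the ten largest (count, word) pairs
        # (caveat: some dbm backends materialize keys() as a list,
        # which would exceed a strict 1 KB budget - this is a sketch
        # of the algorithm, not of the byte accounting)
        return heapq.nlargest(10, ((int(db[k]), k.decode()) for k in db.keys()))
```

Only the current word and the running top ten ever sit in Python variables; every counter lives on disk.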

If you cannot create a new file, you'll need to perform an in-place sort of the whole file, which will cost about O(N²). That's closer to O((N/k)²), I think, with k the average number of words you can keep in memory for the simplest bubble sort - but that is (1/k²)·O(N²) = K·O(N²), which is still O(N²). At that point you can rescan the file one final time; after each run of a word you'll know whether that word can enter your top ten, and at which position. So you need to fit just twelve words in memory (the top ten, the current word, and the word just read from the file). 1 KB should be enough.

So, the auxiliary file is actually the fastest option.

SECOND ANSWER

You can first sort this file (it is possible with limited memory, but will of course require disk I/O - see How do I sort very large files as a starter).
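The sort this answer refers to is an external merge sort: sort chunks that fit in memory, write each to its own temporary file, then stream a k-way merge. A rough Python sketch, assuming the language and the `external_sort` name (the chunk size is invented for illustration, and every input line is assumed to end with a newline):

```python
import heapq
import os
import tempfile

def external_sort(src_path, dst_path, chunk_size=100):
    tmpdir = tempfile.mkdtemp()
    chunks = []
    with open(src_path) as f:
        while True:
            # read at most chunk_size lines - the part that must fit in memory
            lines = [line for _, line in zip(range(chunk_size), f)]
            if not lines:
                break
            lines.sort()
            path = os.path.join(tmpdir, "chunk%d" % len(chunks))
            with open(path, "w") as out:
                out.writelines(lines)
            chunks.append(path)
    # k-way merge: heapq.merge streams the sorted runs without loading them
    files = [open(p) for p in chunks]
    with open(dst_path, "w") as out:
        out.writelines(heapq.merge(*files))
    for fh in files:
        fh.close()
```

With a real 1 KB budget the chunk size would be tiny and the number of runs large, but the structure of the algorithm is the same.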

Then you will be able to read the sorted file line by line and calculate the frequency of each word, one run at a time. Store the first 10 words with their counts; after that, whenever a word's frequency is higher than the lowest one stored, add it to the internal array and remove the least frequent entry. This way you keep only the 10 most frequent words in memory during this stage.
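The run-counting pass can be sketched like this (Python is an assumption; `top_ten_from_sorted` is a made-up name, and a small min-heap stands in for the 10-slot array the answer describes):

```python
import heapq

def top_ten_from_sorted(sorted_path):
    best = []                  # min-heap of at most ten (count, word) pairs
    current, count = None, 0

    def push(word, cnt):
        heapq.heappush(best, (cnt, word))
        if len(best) > 10:
            heapq.heappop(best)   # drop the least frequent of the eleven

    with open(sorted_path) as f:
        for line in f:
            word = line.strip()
            if word == current:
                count += 1       # still inside the same run of equal lines
            else:
                if current is not None:
                    push(current, count)
                current, count = word, 1
        if current is not None:
            push(current, count)  # flush the final run
    return sorted(best, reverse=True)
```

Because equal words are adjacent in a sorted file, only the current word, its counter, and the ten candidates need to be in memory at once.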

As @John Bollinger mentioned, if your requirement is to print all top-10 words - if, for example, every word in the file has the same frequency, so they are all "top" - then this approach will not work. In that case you need to calculate the frequency of each word, store the results in a file, sort that file, and then print the top 10 including all words tied with the 10th one.