Searching word in a file txt file that contains 100,000,000 words in c++

288 Views Asked by At

I have a txt file with 100000000 words in every new line.

I want to write a function that takes an input of the word and searches if the word is there or not in the txt file.

I have tried this with map and trie method but I'm getting std:bac_alloc error, this is due to that large number of words can anyone suggest how to solve the issue

2

There are 2 best solutions below

2
On

Data structures are quite important when programming. If possible I would recommend that you use something like a binary tree. This would require sorting the text file though. If you cannot sort the text file, the best way would be to just iterate over the text file until you get the word that you wanted. Also, your comment should contain more information as to allow us to more easily diagnose your problem

0
On

I assume you want to search this word list over and over. Because for a small number of searches just search linear through the file.

Parsing the word list into a suffix tree takes about 20 times the size of the file, more if not optimized. Since you ran out of memory constructing a trie of the word list I assume it's really big. So lets not keep it in memory but process it a bit so you can search faster.

The solution I would propose is to do a dictionary search.

So first turn every whitespace into a newline so you have one word per line instead of multiple lines with multiple words and then sort the file and store it. While you are at it you can remove duplicates. That is our dictionary. Remember the length of the longest word (L) while you do that.

To access the dictionary you need a helper function to read a word at offset X, which can be at the middle of some word. The function should seek to the offset - L and read 2 * L bytes into a buffer. Then from the middle of the buffer search backward and forward to find the word at offset X.

Now to search you open the dictionary and read the word at offset left=0 and offset right = size_of_file, i.e. the first and last word. If your search term is less then the first word or larger then the last word you are done, word not found. If you found the search term you are done too.

Next in a binary search you would take the std::midpoint of left and right, read the word at that offset and check if the search term is less or more and recurse into that interval. This would require O(log n) reads to find the word or determine it's not present.

A dictionary search can do better. Instead of using the midpoint you can approximate where the word should be in the dictionary. Say your dictionary goes from "Aal" to "Zoo" and you are searching for "Zebra". Would you open the dictionary in the middle? No, you would open it near the end because Zerba is much closer to Zoo than Aal. So you need a function that gives you a value (M) between 0 and 1 of where a search term is located relative to the left and right word. Your "midpoint" for the search is then (right - left) * M. Then, like with binary search, determine if the search term is in the left or right interval and recurse.

A dictionary search takes only log log n reads on average if the word list has reasonably uniform distribution.