Search many strings over a very large text


I have about 2 million strings, and I need to search for each of them over 1 TB of text data. Searching for every string separately is not a good solution, so I was thinking about a better way: building a data structure such as a trie over all of the strings, in other words a trie in which each node is a word. Is there any good algorithm, data structure, or library (in C++) for this purpose?


Let me make the question more concrete:

For instance, I have these strings:

s1: "I love you"
s2: "How are you"
s3: "What's up dude"

And I have many texts, such as:

t1: "Hi, my name is Omid and I love computers. How are you guys?"
t2: "Your every wish will be done, they tell me..."
t3, t4, ..., t10000

I then want to take each text and search it for all of the strings. For this example the result would simply be: t1 contains s2 and nothing else. I am looking for an efficient way to match the strings, not a naive separate scan for each one every time.
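To make what I mean by a "trie" concrete, here is a minimal character-level sketch of the kind of structure I have in mind, an Aho-Corasick automaton: a trie over all the patterns plus failure links, so each text is scanned once no matter how many patterns there are. All the names below are my own illustration, not any library's API:

```cpp
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Node {
    std::unordered_map<char, int> next; // outgoing trie edges
    int fail = 0;                       // longest proper suffix also present in the trie
    std::vector<int> out;               // ids of patterns that end at this node
};

class AhoCorasick {
    std::vector<Node> nodes{1}; // node 0 is the root
public:
    void addPattern(const std::string& p, int id) {
        int cur = 0;
        for (char c : p) {
            auto it = nodes[cur].next.find(c);
            if (it == nodes[cur].next.end()) {
                nodes[cur].next[c] = static_cast<int>(nodes.size());
                nodes.emplace_back();
                cur = static_cast<int>(nodes.size()) - 1;
            } else {
                cur = it->second;
            }
        }
        nodes[cur].out.push_back(id);
    }

    // Compute failure links breadth-first, shallow nodes before deep ones.
    void build() {
        std::queue<int> q;
        for (auto& kv : nodes[0].next) q.push(kv.second);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (auto& [c, v] : nodes[u].next) {
                int f = nodes[u].fail;
                while (f != 0 && nodes[f].next.count(c) == 0) f = nodes[f].fail;
                auto it = nodes[f].next.find(c);
                nodes[v].fail = (it != nodes[f].next.end()) ? it->second : 0;
                // Inherit the fail target's outputs so overlapping matches are reported.
                const auto& fo = nodes[nodes[v].fail].out;
                nodes[v].out.insert(nodes[v].out.end(), fo.begin(), fo.end());
                q.push(v);
            }
        }
    }

    // One pass over the text; reports (pattern id, end offset) for every match.
    std::vector<std::pair<int, std::size_t>> search(const std::string& text) const {
        std::vector<std::pair<int, std::size_t>> hits;
        int cur = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            char c = text[i];
            while (cur != 0 && nodes[cur].next.count(c) == 0) cur = nodes[cur].fail;
            auto it = nodes[cur].next.find(c);
            cur = (it != nodes[cur].next.end()) ? it->second : 0;
            for (int id : nodes[cur].out) hits.emplace_back(id, i);
        }
        return hits;
    }
};

int main() {
    AhoCorasick ac;
    ac.addPattern("I love you", 1);
    ac.addPattern("How are you", 2);
    ac.addPattern("What's up dude", 3);
    ac.build();
    const std::string t1 = "Hi, my name is Omid and I love computers. How are you guys?";
    for (auto [id, end] : ac.search(t1))
        std::cout << "t1 contains s" << id << " (ends at offset " << end << ")\n";
}
```

Building the automaton costs time proportional to the total length of the patterns, and each scan costs time proportional to the text length plus the number of matches, which is the kind of behavior I am hoping for.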


2 Answers

Answer 1 (score 0):

I'm sorry to post a link-only answer, but if you don't mind reading research papers, the definitive reference on string matching algorithms seems to me to be http://www-igm.univ-mlv.fr/~lecroq/string/ and the accompanying research paper by Simone Faro and Thierry Lecroq, in which they compare the relative performance of no fewer than 85 different string matching algorithms. I'm pretty sure there is one fitting your need among them.

Answer 2 (score 7):

I would strongly suggest that you use CLucene (http://clucene.sourceforge.net/), which is a port of the Apache Lucene project. It will build an inverted index for you and make text searching very fast. If changing languages is an option, consider doing this in Java, since the CLucene version is a bit out of date. Java will be slower but has more features.
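To illustrate what the inverted index gives you (this is a toy sketch of the concept only, not CLucene's actual API, and all names in it are hypothetical): each token maps to the ids of the texts containing it, so a query only has to look at a few candidate texts instead of scanning the whole 1 TB.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Two of the example texts from the question.
    std::vector<std::string> texts = {
        "Hi, my name is Omid and I love computers. How are you guys?",
        "Your every wish will be done, they tell me...",
    };

    // Inverted index: token -> ids of the texts that contain it.
    std::unordered_map<std::string, std::vector<std::size_t>> index;
    for (std::size_t id = 0; id < texts.size(); ++id) {
        std::istringstream in(texts[id]);
        std::string token;
        while (in >> token)            // naive whitespace tokenizer
            index[token].push_back(id);
    }

    // A lookup touches only one postings list, not every text.
    for (std::size_t id : index["wish"])
        std::cout << "\"wish\" occurs in t" << id + 1 << "\n";
}
```

A real analyzer also lowercases, strips punctuation, and deduplicates postings; CLucene handles that tokenization and keeps the index on disk, and a phrase query like "How are you" can be answered by intersecting the postings lists of its tokens before verifying positions.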