Search many strings over a very large text


I have about 2 million strings, and I need to search for each of them over 1 TB of text data. Searching for every string separately is not a good solution, so I was thinking about a better way: building a data structure such as a trie over all of the strings, in other words a trie in which each node is a word. Is there any good algorithm, data structure, or library (in C++) for this purpose?


Let me make the question more concrete:

For instance, I have these strings:

s1: "I love you"
s2: "How are you"
s3: "What's up dude"

And I have many texts, such as:

t1: "Hi, my name is Omid and I love computers. How are you guys?"
t2: "Your every wish will be done, they tell me..."
t3, t4, ..., t10000

I then want to take each text and search it for all of the strings. For this example the result would simply be: t1 contains s2 and nothing else. I am looking for an efficient way to match the strings, not a naive separate scan for each one every time.
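To make what I mean by a "trie" concrete, here is a minimal character-level sketch of the kind of structure I have in mind, an Aho-Corasick automaton: a trie over all the patterns plus failure links, so each text is scanned once no matter how many patterns there are. All the names below are my own illustration, not any library's API:

```cpp
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Node {
    std::unordered_map<char, int> next; // outgoing trie edges
    int fail = 0;                       // longest proper suffix also present in the trie
    std::vector<int> out;               // ids of patterns that end at this node
};

class AhoCorasick {
    std::vector<Node> nodes{1}; // node 0 is the root
public:
    void addPattern(const std::string& p, int id) {
        int cur = 0;
        for (char c : p) {
            auto it = nodes[cur].next.find(c);
            if (it == nodes[cur].next.end()) {
                nodes[cur].next[c] = static_cast<int>(nodes.size());
                nodes.emplace_back();
                cur = static_cast<int>(nodes.size()) - 1;
            } else {
                cur = it->second;
            }
        }
        nodes[cur].out.push_back(id);
    }

    // Compute failure links breadth-first, shallow nodes before deep ones.
    void build() {
        std::queue<int> q;
        for (auto& kv : nodes[0].next) q.push(kv.second);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (auto& [c, v] : nodes[u].next) {
                int f = nodes[u].fail;
                while (f != 0 && nodes[f].next.count(c) == 0) f = nodes[f].fail;
                auto it = nodes[f].next.find(c);
                nodes[v].fail = (it != nodes[f].next.end()) ? it->second : 0;
                // Inherit the fail target's outputs so overlapping matches are reported.
                const auto& fo = nodes[nodes[v].fail].out;
                nodes[v].out.insert(nodes[v].out.end(), fo.begin(), fo.end());
                q.push(v);
            }
        }
    }

    // One pass over the text; reports (pattern id, end offset) for every match.
    std::vector<std::pair<int, std::size_t>> search(const std::string& text) const {
        std::vector<std::pair<int, std::size_t>> hits;
        int cur = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            char c = text[i];
            while (cur != 0 && nodes[cur].next.count(c) == 0) cur = nodes[cur].fail;
            auto it = nodes[cur].next.find(c);
            cur = (it != nodes[cur].next.end()) ? it->second : 0;
            for (int id : nodes[cur].out) hits.emplace_back(id, i);
        }
        return hits;
    }
};

int main() {
    AhoCorasick ac;
    ac.addPattern("I love you", 1);
    ac.addPattern("How are you", 2);
    ac.addPattern("What's up dude", 3);
    ac.build();
    const std::string t1 = "Hi, my name is Omid and I love computers. How are you guys?";
    for (auto [id, end] : ac.search(t1))
        std::cout << "t1 contains s" << id << " (ends at offset " << end << ")\n";
}
```

Building the automaton costs time proportional to the total length of the patterns, and each scan costs time proportional to the text length plus the number of matches, which is the kind of behavior I am hoping for.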


2 Answers

Answer 1 (score 0):

I'm sorry to post a link-only answer, but if you don't mind reading research papers, the definitive reference on string matching algorithms seems to me to be http://www-igm.univ-mlv.fr/~lecroq/string/ and the accompanying research paper by Simone Faro and Thierry Lecroq, in which they compare the relative performance of no fewer than 85 different string matching algorithms. I'm pretty sure there is one fitting your need among them.

Answer 2 (score 7):

I would strongly suggest that you use CLucene (http://clucene.sourceforge.net/), which is a port of the Apache Lucene project. It will build an inverted index for you and make text searching very fast. If changing languages is an option, consider doing this in Java, since the CLucene version is a bit out of date. Java will be slower but has more features.
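To illustrate what the inverted index gives you (this is a toy sketch of the concept only, not CLucene's actual API, and all names in it are hypothetical): each token maps to the ids of the texts containing it, so a query only has to look at a few candidate texts instead of scanning the whole 1 TB.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Two of the example texts from the question.
    std::vector<std::string> texts = {
        "Hi, my name is Omid and I love computers. How are you guys?",
        "Your every wish will be done, they tell me...",
    };

    // Inverted index: token -> ids of the texts that contain it.
    std::unordered_map<std::string, std::vector<std::size_t>> index;
    for (std::size_t id = 0; id < texts.size(); ++id) {
        std::istringstream in(texts[id]);
        std::string token;
        while (in >> token)            // naive whitespace tokenizer
            index[token].push_back(id);
    }

    // A lookup touches only one postings list, not every text.
    for (std::size_t id : index["wish"])
        std::cout << "\"wish\" occurs in t" << id + 1 << "\n";
}
```

A real analyzer also lowercases, strips punctuation, and deduplicates postings; CLucene handles that tokenization and keeps the index on disk, and a phrase query like "How are you" can be answered by intersecting the postings lists of its tokens before verifying positions.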