I have about 2 million strings and I need to search for each of them in roughly 1 TB of text data. Searching for them one by one is not the best approach, so I was thinking of building a data structure such as a trie over all of the strings, in other words a trie in which each node is a word. Is there a good algorithm, data structure, or library (in C++) for this purpose?
Let me describe the question in more detail.
For instance, I have these strings:
s1: "I love you"
s2: "How are you"
s3: "What's up dude"
And I have many texts like:
t1: "Hi, my name is Omid and I love computers. How are you guys?"
t2: "Your every wish will be done, they tell me..."
t3, t4, ..., t10000
Then I want to go through each text and search for each of the strings in it. For this example the result would simply be: t1 contains s1 and nothing else. I am looking for an efficient way to search for all the strings, not naively scanning for each of them one at a time.
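To make the trie idea concrete, here is a rough sketch of the word-level trie I have in mind (the class and member names are just placeholders, not an existing library):

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of a word-level trie: each edge is labeled with a whole word rather
// than a single character, and a node is marked if one of the phrases ends there.
struct TrieNode {
    std::unordered_map<std::string, std::unique_ptr<TrieNode>> children;
    bool isPhraseEnd = false;
};

class PhraseTrie {
public:
    void insert(const std::string& phrase) {
        TrieNode* node = &root_;
        std::istringstream in(phrase);
        std::string word;
        while (in >> word) {
            auto& child = node->children[word];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->isPhraseEnd = true;
    }

    // Checks whether any inserted phrase starts at position `start`
    // of an already-tokenized text.
    bool matchesAt(const std::vector<std::string>& words, size_t start) const {
        const TrieNode* node = &root_;
        for (size_t i = start; i < words.size(); ++i) {
            auto it = node->children.find(words[i]);
            if (it == node->children.end()) return false;
            node = it->second.get();
            if (node->isPhraseEnd) return true;
        }
        return false;
    }

private:
    TrieNode root_;
};
```

With something like this, scanning a text means tokenizing it once and calling matchesAt at every word position, so the cost per text no longer multiplies by the number of phrases.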
I would strongly suggest that you use CLucene (http://clucene.sourceforge.net/), a port of the Apache Lucene project. It builds an inverted index and makes text searching very fast. If changing languages is an option, consider doing this in Java, as the CLucene port is somewhat out of date; the Java version will be slower but has more features.
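To give a feel for the inverted-index idea, here is a minimal sketch in plain C++ (the class and method names are hypothetical, not the CLucene API; the library handles tokenization rules, on-disk storage, scoring, and phrase queries for you):

```cpp
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Toy inverted index: maps each word to the set of text ids containing it.
class InvertedIndex {
public:
    void addText(int textId, const std::string& text) {
        std::istringstream in(text);
        std::string word;
        while (in >> word)
            index_[word].insert(textId);
    }

    // Returns ids of texts containing every word of the query phrase.
    // These are candidates; a positional check would confirm the exact phrase.
    std::vector<int> candidates(const std::string& phrase) const {
        std::istringstream in(phrase);
        std::string word;
        std::vector<int> result;
        bool first = true;
        while (in >> word) {
            auto it = index_.find(word);
            if (it == index_.end()) return {};
            if (first) {
                result.assign(it->second.begin(), it->second.end());
                first = false;
            } else {
                std::vector<int> kept;
                for (int id : result)
                    if (it->second.count(id)) kept.push_back(id);
                result.swap(kept);
            }
        }
        return result;
    }

private:
    std::unordered_map<std::string, std::unordered_set<int>> index_;
};
```

Because the index is built once over the 1 TB of text, each of the 2 million phrases only has to intersect a few posting lists instead of scanning every text.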