Imagine there is a big array of Strings S. From that array I need to get only those strings which contain a specific substring. For example if my array is
String s [] = {"hello world", "back to hell", "say hello world"};
and my keyword is "hello", then it should return me first and last element.
I tried using KMP and Boyer-Moor algorithms to check each string in array whether it contains a substring or not, but it takes too much time.
Then I learned about Aho-Corasick algorithm. I am still looking it up, but seemingly it needs an array of substrings and one big string to match,while what I want is exactly opposite.
So I was looking for a suggestions on how to modify Aho-Corasick algorithm for my purposes, or another means to achieve those. Would be thankful for any suggestions.
Is there an efficient algorithm to extract from an array of string only those which contain a specific substring?
317 Views Asked by Akbar Amanov At
1
Build a suffix tree using Ukkonen algorithm or the one suggested in this source(PDF):
Then use the created suffix tree to search for a given pattern. The problem is to find all occurrences of pattern P (length m) in a suffix tree T. According to the above source:
Note that the length of the text (or the number of strings in the array) does not affect search efficiency. Therefore, you could pay for constructing a suffix tree once and then use it many times to search efficiently for short pattern strings.
EDIT: if you are in a hurry and do not mind a little bit of extra time complexity, you can construct suffix arrays instead of suffix trees using this approach(PDF) in just O(n*log^2(n)) with a very small piece of code. Here is the core idea of this approach:
And here is the pseudo-code reproduced from the above source:
After running this code,
P
will contain the suffix array. Search using this approach is also straightforward: