N-gram extractor

155 Views Asked by At

I want to solve a following algorithmic problem. Given a set of strings s1, s2, ..., sn. I want to find a string s_e, which contains all n-grams, which exist in all input string.

Consider an example with three following strings:

ABABC
BABCA
ABCBA

The n-grams are (I mark n-grams which are missing in at least one string with asterisk):

A B C
AB BA BC CA* CB*
ABA* BAB* ABC BCA* BCB* CBA*
ABAB* BABC* ABCA* ABCB* BCBA*
ABABC* BABCA* ABCBA*

So the solution should contain n-grams: "A", "B", "C", "AB", "BA", "BC", "ABC". Thus the shortest such string is "BABC". It contains an n-gram, which does not exist in all input strings, namely "BABC", but that's OK.

And the resulting string I called n-gram extractor, because by iterating through all n-grams of the result you can get all n-grams, which are present in all of the input strings.

If this is a know problem, I would be happy to know a name for it. (was: I thought I have an idea for a solution, but turned out I don't).

For the input it is safe to assume that square of the size of the alphabet (n^2) is smaller than the length of the shortest string.

0

There are 0 best solutions below