There are 1 million of equal length strings(short string).For example
abcdefghi
fghixyzyz
ghiabcabc
zyzdddxfg
. . .
I want to find pair-wise overlap of two string.The overlap of A"abcdefghi" and B"fghixyzyz" is "fghi",which is the maximal suffix of A , the maximal prefix of B ,satisfy the suffix and the prefix are equal.
Is there efficient algorithm which can find the overlap of any two strings in the set?
One of the efficient ways is to build a general suffix tree for the string set. To find the overlap between string x and y:
Follow the path label for string y in the general suffix tree. The deepest node along this path that is incident to the terminal symbol of string x has a path label which is equivalent to the suffix-prefix overlap x->y.
For more details see page 137 ("Solving the all-pairs suffix-prefix problem in linear time") of Gusfield's book "Algorithms on Strings, Trees, And Sequences".
Caution: this uses a LOT of memory if your dataset is large (millions/billions of strings).