Find the overlap of a set of equal length string?

974 Views Asked by At

There are 1 million of equal length strings(short string).For example

abcdefghi

fghixyzyz

ghiabcabc

zyzdddxfg

. . .

I want to find pair-wise overlap of two string.The overlap of A"abcdefghi" and B"fghixyzyz" is "fghi",which is the maximal suffix of A , the maximal prefix of B ,satisfy the suffix and the prefix are equal.

Is there efficient algorithm which can find the overlap of any two strings in the set?

1

There are 1 best solutions below

0
On

One of the efficient ways is to build a general suffix tree for the string set. To find the overlap between string x and y:

Follow the path label for string y in the general suffix tree. The deepest node along this path that is incident to the terminal symbol of string x has a path label which is equivalent to the suffix-prefix overlap x->y.

For more details see page 137 ("Solving the all-pairs suffix-prefix problem in linear time") of Gusfield's book "Algorithms on Strings, Trees, And Sequences".

Caution: this uses a LOT of memory if your dataset is large (millions/billions of strings).