LCS algorithm: How to find out from a Table, how many longest common subsequences are found?

595 Views Asked by At

I implement the Longest Common Subsequence problem in C#. I need to detect ALL the common maximal subsequences between two strings.

To do this, I create a table using Needleman-Wunsch algorithm to store the LCS sequence for each step of the calculation.

Is there any chance to determine, how many maximal subsequences were found (using a table)?

Depending on this I want to choose a method how to collect each subsequence. The point is, for one subsequence recursion is not required, so it will give a better performance. And its crucial for my task.

Here is a code snippet, where the basic functions from the project are implemented:

    private static int[][] GetMatrixLCS(string x, string y)
        {
            var lenX = x.Length;
            var lenY = y.Length;
            matrixLCS = new int[lenX + 1][];
            for (var i = 0; i < matrixLCS.Length; i++)
            {
                matrixLCS[i] = new int[lenY + 1];
            }
            for (int i = 0; i <= lenX; i++)
            {
                for (int j = 0; j <= lenY; j++)
                {
                    if (i == 0 || j == 0)
                        matrixLCS[i][j] = 0;
                    else
                    if (x[i - 1] == y[j - 1])
                        matrixLCS[i][j] = matrixLCS[i - 1][j - 1] + 1;
                    else
                        matrixLCS[i][j] = Math.Max(matrixLCS[i - 1][j], matrixLCS[i][j - 1]);
                }
            }
            return matrixLCS;
        }

    static HashSet<string> FindAllLcs(string X, string Y, int lenX, int lenY)
        {
            var set = new HashSet<string>();
            if (lenX == 0 || lenY == 0)
                return emptySet;
            if (X[lenX - 1] == Y[lenY - 1])
            {
                var tempResult = FindAllLcs(X, Y, lenX - 1, lenY - 1);
                foreach (var temp in tempResult)
                    set.Add(temp + X[lenX - 1]);
                return set;
            }
            if (matrixLCS[lenX - 1][lenY] >= matrixLCS[lenX][lenY - 1])
                set = FindAllLcs(X, Y, lenX - 1, lenY);
            if (matrixLCS[lenX][lenY - 1] >= matrixLCS[lenX - 1][lenY])
                set.UnionWith(FindAllLcs(X, Y, lenX, lenY - 1));
            return set;
        }

And the example with two types of inputs and expected outputs:

    public void SingleOutput()
    {
    var sequence = LCS.FindLCS("ABC", "AB");
    Assert.AreEqual(1, sequence.Length);
    Assert.AreEqual("AB", sequence[0]);
    }

    public void MultipleOutput() 
    { 
    var sequence = LCS.FindLCS("BCAB", "ABC"); 
    Assert.AreEqual(2, sequence.Length); 
    Assert.AreEqual("AB", sequence [0]);
    Assert.AreEqual("BC", sequence [1]);
    }

Any help would be strongly appreciated.

2

There are 2 best solutions below

0
On

The easiest way is to record all matches during the iteration through the matrix using the naive implementation.

I needed allLCS() for tests, if other algorithms provide a valid solution, which must be one of all possible LCS.

The code is on github.

It's in Perl, but easy to understand. Iterate through the matrix and add the matches in the cells. At end the bottom right cell contains the length of the LCS. That's the naive algorithm. Now record at each match the coordinates as an array [i,j] in a hash with the match count as key.

# get all LCS of two arrays
# records the matches by rank
sub allLCS {
  my ($self,$X,$Y) = @_;

  my $m = scalar @$X;
  my $n = scalar @$Y;

  my $ranks = {}; # e.g. '4' => [[3,6],[4,5]]
  my $c = [];
  my ($i,$j);

  for (0..$m) {$c->[$_][0]=0;}
  for (0..$n) {$c->[0][$_]=0;}
  for ($i=1;$i<=$m;$i++) {
    for ($j=1;$j<=$n;$j++) {
      if ($X->[$i-1] eq $Y->[$j-1]) {
        $c->[$i][$j] = $c->[$i-1][$j-1]+1;
        push @{$ranks->{$c->[$i][$j]}},[$i-1,$j-1];
      }
      else {
        $c->[$i][$j] =
          ($c->[$i][$j-1] > $c->[$i-1][$j])
            ? $c->[$i][$j-1]
            : $c->[$i-1][$j];
      }
    }
  }
  my $max = scalar keys %$ranks;
  return $self->_all_lcs($ranks,1,$max);
} 

At the end this collection of recorded matches is connected by the method _all_lcs():

sub _all_lcs {
  my ($self,$ranks,$rank,$max) = @_;

  my $R = [[]];

  while ($rank <= $max) {
    my @temp;
    for my $path (@$R) {
      for my $hunk (@{$ranks->{$rank}}) {
        if (scalar @{$path} == 0) {
          push @temp,[$hunk];
        }
        elsif (($path->[-1][0] < $hunk->[0]) && ($path->[-1][1] < $hunk->[1])) {
          push @temp,[@$path,$hunk];
        }
      }
    }
    @$R = @temp;
    $rank++;
  }
  return $R;
}

The code is inspired by the paper

Ronald I. Greenberg. Fast and Simple Computation of All Longest Common Subsequences

0
On

I think it is possible to consider the dynamic programming a little differently. Maybe it can work:

#include <bits/stdc++.h>

using namespace std;


struct TLSTValue {
    int len;
    int cnt;
};


void update(TLSTValue& v, const TLSTValue& u) {
    if (u.cnt == 0) {
        return;
    }
    if (u.len > v.len) {
        v.len = u.len;
        v.cnt = u.cnt;
    } else if (u.len == v.len) {
        v.cnt += u.cnt;
    }
}

int main(int /* argc */, char** /* argv */)
{
    string a, b;
    while (cin >> a >> b) {
        int n = a.size();
        int m = b.size();

        vector<vector<int>> nxt(n, vector<int>(m));
        for (int j = 0; j < m; ++j) {
            int lst = n;
            for (int i = n - 1; i >= 0; --i) {
                if (a[i] == b[j]) {
                    lst = i;
                }
                nxt[i][j] = lst;
            }
        }

        vector<vector<TLSTValue>> f(n + 1, vector<TLSTValue>(m + 1, {0, 0}));
        f[0][0]= {0, 1};
        TLSTValue ans = {0, 0};
        for (int i = 0; i <= n; ++i) {
            unordered_set<char> st;
            for (int j = 0; j <= m; ++j) {
                update(ans, f[i][j]);
                if (j) {
                    update(f[i][j], f[i][j - 1]);
                }
                if (st.count(b[j])) {
                    continue;
                }
                st.insert(b[j]);
                if (i < n && j < m && f[i][j].cnt && nxt[i][j] < n) {
                    update(f[nxt[i][j] + 1][j + 1], {f[i][j].len + 1, f[i][j].cnt});
                }
            }
        }
        cout << a << " and " << b << ": length = " << ans.len << ", count = " << ans.cnt << endl;
    }

    cerr << "Time execute: " << clock() / (double)CLOCKS_PER_SEC << " sec" << endl;
    return 0;
}

nxt[i][j] is first position starting i position in string a a character with position j in string b. f[i][j] is length and count LCS which ends in a character i - 1 in string a and before position j in string b.

You can try the code here.

Output on some tests:

ABC and AB: length = 2, count = 1
BCAB and ABC: length = 2, count = 2
A and AAA: length = 1, count = 1
AAA and A: length = 1, count = 1
AAAB and ABBB: length = 2, count = 1
ABBB and AAAB: length = 2, count = 1