How to modify a longest common subsequence algorithm to a longest common contiguous subsequence algorithm?

350 Views Asked by At

I'm wondering how to change the class I have so that it can produce the longest contiguous common subsequence between two strings. This code right now is setup so that I can get the length of the longest common subsequence of the two strings. I want the final result string to be unbroken (it can be found in both strings with no characters in between). I would like it if, for example, sending the method ACGGTTGTCGCAGTCC and TGTAGCAG would result in length 4 for GCAG, instead of what it does now, which is length 7 for TGTGCAG.

public class GenomeSequencer {
    private String x;
    private String y;
public String getLongestCommon() {
        int[][] table = lcsLength();
        return Integer.toString(table[x.length()][y.length()]);
    }

    private int[][] lcsLength() {
        int m = x.length();
        int n = y.length();

        int[][] c = new int[m + 1][n + 1];

        for (int i = 0; i <= m; i++) {
            c[i][0] = 0;
        }
        for (int j = 0; j <= n; j++) {
            c[0][j] = 0;
        }
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (x.charAt(i - 1) == y.charAt(j - 1)) {
                    c[i][j] = c[i - 1][j - 1] + 1;
                } else if (c[i - 1][j] >= c[i][j - 1]) {
                    c[i][j] = c[i - 1][j];
                } else {
                    c[i][j] = c[i][j - 1];
                }
            }
        }
        return c;

    }

}
1

There are 1 best solutions below

0
On BEST ANSWER

If you really want to reuse the old solution, consider why it doesn't work anymore.

Of the three transitions, the one when letters are equal should survive, and the two skipping a letter in one of the strings should disappear.

So, the transitions can be just:

                if (x.charAt(i - 1) == y.charAt(j - 1)) {
                    c[i][j] = c[i - 1][j - 1] + 1;
                } else {
                    c[i][j] = 0;
                }

Note that, with this approach, the answer is no longer at c[m][n]: instead, you should search the whole table for the maximum value.


The simplicity of the transitions suggests there are faster solutions to the longest common substring problem (substring is a consecutive subsequence). And indeed there is a linear yet complex solution involving a suffix tree (mentioned in the above link), or simpler solutions using hashes and binary search for the answer's length.