Why "Longest Common Subsequence" prohibits "substitution" using edit distance methodology

264 Views Asked by At

Edit distance is a famous class of problems including:

Different types of edit distance allow different sets of string operations. For instance:

  1. The Levenshtein distance allows deletion, insertion and substitution.
  2. The longest common subsequence (LCS) distance allows only insertion and deletion, not substitution.
  3. The Hamming distance allows only substitution. Hence, it only applies to strings of the same length.
  4. The Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition (swapping) of two adjacent characters.
  5. The Jaro distance allows only transposition.

Source

It mentions:

The longest common subsequence (LCS) distance allows only insertion and deletion, not substitution.

My doubts:

  1. I am confused as to why it does not allow substitution. As far as I understand the algorithm, if there is a character that can be potentially matched to a character in the target string, it won't be substituted or deleted.

  2. Why does it allow deletion and insertion? Can't it happen that a matching character is deleted because it occurs later in the target string?

I am having a hard time understanding this. For analysis purposes, I have written an algorithm which displays the solution in a grid, like so (It displays trail in color but can't show on SO):

$ java dynamic_programming.EditDistance republicans democrats true
|  |  |d |e |m |o |c |r |a |t |s |
|  |0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |
|r |1 |2 |3 |4 |5 |6 |5 |6 |7 |8 |
|e |2 |3 |2 |3 |4 |5 |6 |7 |8 |9 |
|p |3 |4 |3 |4 |5 |6 |7 |8 |9 |10|
|u |4 |5 |4 |5 |6 |7 |8 |9 |10|11|
|b |5 |6 |5 |6 |7 |8 |9 |10|11|12|
|l |6 |7 |6 |7 |8 |9 |10|11|12|13|
|i |7 |8 |7 |8 |9 |10|11|12|13|14|
|c |8 |9 |8 |9 |10|9 |10|11|12|13|
|a |9 |10|9 |10|11|10|11|10|11|12|
|n |10|11|10|11|12|11|12|11|12|13|
|s |11|12|11|12|13|12|13|12|13|12|
=========================
12 // (New Edit Distance, Without Substitution)
2

There are 2 best solutions below

7
JamieNguyen On

It's an interesting question. Let me try to clarify what do you mean by insertion, deletion and substitution. Consider this code for Levenshtein distance that considers the whole set of operations

  for j from 1 to n:
    for i from 1 to m:
      if s[i] = t[j]:
        substitutionCost := 0
      else:
        substitutionCost := 1

      d[i, j] := minimum(d[i-1, j] + 1,                   // deletion
                         d[i, j-1] + 1,                   // insertion
                         d[i-1, j-1] + substitutionCost)  // substitution

For the Levenshtein distance, substitution is allowed, therefore if s[i] != t[j] then we should, in fact, use the substitution here since it incurs the cost of 1 unit (in other words, d[i][j] - d[i-1][j-1] <= 1). However, in the LCS distance, you may observe that substitution is not allowed, and d[i][j] - d[i-1][j-1] <= 2, or to say that the cost of up to 2 can be incurred going from d[i-1][j-1] to d[i][j]. This is intuitively true for LCS if you imagine a tracable version of LCS.

Let s = "abcd", t = "abce", consider d[4][4].

  • If substitution is disallowed, you can't magically go abcd -> abce, but you have to take two operations abcd -> abc -> abce, the cost is 2.
  • If substitution is allowed, however, the cost is 1.

It can be noted that the true LCS distance = |s| + |t| - 2LCS(s, t) = 4 + 4 - 2*3 = 2, which shows that the result is consistent if and only if substitution is not allowed.

2
mohammad zawahra On

From what I understood that when using LCS there's three different ways to get from point for example (0,0) to point (1,1) which are

  1. Path One: 0,0 -> 0,1 -> 1,1
  2. Path Two: 0,0 -> 1,0 -> 1,1
  3. Path Three: 0,0 -> 1,1 (diagonal one)

Small graph

All these paths have same length and are considered to be the same in LCS algorithm, all have same score of 0 (if there is no match) even though diagonal path is better and have less editing distance (one substition < one deletion + one insertion).

Given that, while doing Backtracing it can pick any of these paths resulting in sometimes picking insert+delete path over diagonal one. Resulting longer editing distance (I tried it multiple times on two DNA sequences there's always overestimation).

Solution

Now, we can solve this problem easily and still use LCS by using rewards,

  1. Indel paths ---> 0 sometimes we will be forced to use them if the two strings don't have same length
  2. Diagonal Mismatch --> 1 more preferable than than indel paths
  3. Diagonal Match --> 2 best path

Now, after creating the 2D DP array we can run backtracing code and count Indels and Substitions.

Python Code

    def DP(seqA: str, seqB: str) -> list:
        r, c = len(seqA), len(seqB)
        nodes =  [ [0]*(c+1) for i in range(r+1) ]
        for i in range(1, r+1):
            for j in range(1, c+1):
                nodes[i][j] = max(
                        nodes[i - 1][j],
                        nodes[i][j - 1],
                        nodes[i-1][j-1] + (seqA[i-1] == seqB[j-1]) + 1
                    )     
        return nodes
    
    def BacktrackAndResult(DP: list[list], seqA, seqB):
        i, j = len(DP)-1, len(DP[0])-1
        mismatches, indel  = 0, 0
        while True:
            if (i == 0 and j == 0):
                break
            else:
                
                if ( j > 0 and i > 0 ):
                    m = (seqA[i-1] == seqB[j-1]) 
                    if (DP[i-1][j-1] + m + 1 == DP[i][j]):
                        if (not m):
                            mismatches+=1
                        else:
                            common+=1
                        i-=1
                        j-=1
                        continue
                if ( i > 0 and DP[i][j] == DP[i-1][j]):
                    indel+=1
                    i-=1
                    continue
                if ( j > 0 and DP[i][j] == DP[i][j-1]):
                    indel+=1
                    j-=1
                    continue
        return (mismatches+indel)
    
    word1, word2 = "ATGT", "AAGT"    
    x = DP(word1, word2)
    print(BacktrackAndResult(x, word1, word2))