Edit distance is a famous class of problems including:
Different types of edit distance allow different sets of string operations. For instance:
- The Levenshtein distance allows deletion, insertion and substitution.
- The longest common subsequence (LCS) distance allows only insertion and deletion, not substitution.
- The Hamming distance allows only substitution. Hence, it only applies to strings of the same length.
- The Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition (swapping) of two adjacent characters.
- The Jaro distance allows only transposition.
It mentions:
The longest common subsequence (LCS) distance allows only insertion and deletion, not substitution.
My doubts:
I am confused as to why it does not allow substitution. As far as I understand the algorithm, if there is a character that can be potentially matched to a character in the target string, it won't be substituted or deleted.
Why does it allow deletion and insertion? Can't it happen that a matching character is deleted because it occurs later in the target string?
I am having a hard time understanding this. For analysis purposes, I have written an algorithm which displays the solution in a grid, like so (It displays trail in color but can't show on SO):
$ java dynamic_programming.EditDistance republicans democrats true
| | |d |e |m |o |c |r |a |t |s |
| |0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |
|r |1 |2 |3 |4 |5 |6 |5 |6 |7 |8 |
|e |2 |3 |2 |3 |4 |5 |6 |7 |8 |9 |
|p |3 |4 |3 |4 |5 |6 |7 |8 |9 |10|
|u |4 |5 |4 |5 |6 |7 |8 |9 |10|11|
|b |5 |6 |5 |6 |7 |8 |9 |10|11|12|
|l |6 |7 |6 |7 |8 |9 |10|11|12|13|
|i |7 |8 |7 |8 |9 |10|11|12|13|14|
|c |8 |9 |8 |9 |10|9 |10|11|12|13|
|a |9 |10|9 |10|11|10|11|10|11|12|
|n |10|11|10|11|12|11|12|11|12|13|
|s |11|12|11|12|13|12|13|12|13|12|
=========================
12 // (New Edit Distance, Without Substitution)
It's an interesting question. Let me try to clarify what do you mean by insertion, deletion and substitution. Consider this code for Levenshtein distance that considers the whole set of operations
For the Levenshtein distance, substitution is allowed, therefore if
s[i] != t[j]then we should, in fact, use the substitution here since it incurs the cost of 1 unit (in other words,d[i][j] - d[i-1][j-1] <= 1). However, in the LCS distance, you may observe that substitution is not allowed, andd[i][j] - d[i-1][j-1] <= 2, or to say that the cost of up to 2 can be incurred going from d[i-1][j-1] to d[i][j]. This is intuitively true for LCS if you imagine a tracable version of LCS.Let
s = "abcd", t = "abce", considerd[4][4].It can be noted that the true LCS distance = |s| + |t| - 2LCS(s, t) = 4 + 4 - 2*3 = 2, which shows that the result is consistent if and only if substitution is not allowed.