Python's Sequence Matcher gives incomplete match

939 Views Asked by At

I am trying to find matching blocks between two strings using Python's SequenceMatcher. The strings are "ABCDPQRUVWXYZ" and "PQRABCDUVWXYZ". However when I apply get_matching_blocks(), the string "PQR" is not found as a matching block.

from difflib import SequenceMatcher

str1 = "ABCDPQRUVWXYZ"
str2 = "PQRABCDUVWXYZ"

matchAll = SequenceMatcher(None, str1, str2, False).get_matching_blocks()

for i in range(0, len(matchAll)):
    print(str1[matchAll[i].a: matchAll[i].a + matchAll[i].size])
3

There are 3 best solutions below

0
On BEST ANSWER

This might do what you want - won't find overlapping matches though (revised to include string locations in s1 and s2 of the substrings):

str1 = "ABCDEPQRUVWXYZ" # added extra non-matching character
str2 = "PQRABCDUVWXYZ"

def find_subs(s1, s2):
    subs = []
    loc = 0
    while s1:
        s1_copy = s1
        while s1_copy:
            while s1_copy and s1_copy not in s2:
                s1_copy = s1_copy[:-1]
            if s1_copy:
                subs.append((loc, s2.index(s1_copy), s1_copy))
                loc += len(s1_copy)
                s1 = s1[len(s1_copy):]
            else:
                s1 = s1[1:]
                loc += 1
            s1_copy = s1                
    return subs

print(find_subs(str1, str2))

prints:

[(0, 3, 'ABCD'), (5, 0, 'PQR'), (8, 7, 'UVWXYZ')]
2
On

The docs state that:

get_matching_blocks()

Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.

If the function returned "PQR" in your example, the j wouldn't be monotonically increasing, as it would go from the "A" index for the "ABCD" match, back to the "P" index for the "PQR" match.

0
On

Thank to all coders who answered my post.

As a solution, I experimented and found another solution using

SequenceMatcher's find_longest_match() 

method. This consists of basically finding the longest match between two strings repeatedly and then by replacing the matched longest string each time with garbage characters. This works well too.