The goal is to extract the matching "words" (bounded by \b|$|\s) from the output of difflib's SequenceMatcher.get_matching_blocks(). For example, given:
s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"
The expected matching blocks to extract are:
['HYC00', 'Causal Travel', '14', 'Laptop', 'Bookbag College Boys Men Work Daypack']
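For context, when SequenceMatcher is given two strings it compares them character by character, so the blocks it reports can start or end in the middle of a word. The snippet below is only a small illustration using shortened versions of the two titles (the shortened strings `a` and `b` are an assumption made for brevity); it prints each raw block next to the text it covers.

```python
from difflib import SequenceMatcher

# Shortened, illustrative inputs (not the full titles from the question).
a = "HYC00 Schulrucksack Damen, Causal Travel"
b = "HYC00 School Backpack Women, Causal Travel"

# Each block is a named tuple Match(a, b, size): a[a:a+size] == b[b:b+size].
blocks = SequenceMatcher(None, a, b).get_matching_blocks()
for m in blocks:
    # repr() makes leading and trailing spaces in the matched text visible.
    print(m, repr(a[m.a:m.a + m.size]))
```

The last block is always a dummy entry with size 0, which is why the extraction code has to tolerate empty substrings.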
The simple cases are those where the matching blocks from difflib are already bounded by \b|$|\s, e.g.
import re
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

def is_substring_a_phrase(substring, s1):
    if substring:
        # Check that the matching substring starts at a word boundary and is
        # followed by whitespace or the end of the string. re.escape() keeps
        # characters such as "+" or "(" in the substring from being read as regex syntax.
        match = re.findall(rf"\b{re.escape(substring)}(?=\s|$)", s1)
        if match:
            return match[0]

def matcher(s1, s2):
    x = SequenceMatcher(None, s1, s2)
    for m in x.get_matching_blocks():
        # Extract the substring.
        full_substring = s1[m.a:m.a+m.size].strip()
        match = is_substring_a_phrase(full_substring, s1)
        if match:
            yield match
            continue

# matcher() is a generator, so wrap it in list() to see the results.
list(matcher(s1, s2))
[out]:
['14', 'Laptop', 'Bookbag College Boys Men Work Daypack']
To capture HYC00 and Causal Travel as well, note that their matching blocks are "HYC00 Sch" and "men, Causal Travel" respectively, so we have to do some "chomping": remove the leftmost, the rightmost, or both the leftmost and rightmost partial "words", i.e.
def matcher(s1, s2):
    x = SequenceMatcher(None, s1, s2)
    for m in x.get_matching_blocks():
        # Extract the substring.
        full_substring = s1[m.a:m.a+m.size].strip()
        match = is_substring_a_phrase(full_substring, s1)
        if match:
            yield match
            continue
        # Extract the left chomp substring (drop the first partial word).
        left = " ".join(s1[m.a:m.a+m.size].strip().split()[1:])
        match = is_substring_a_phrase(left, s1)
        if match:
            yield match
            continue
        # Extract the right chomp substring (drop the last partial word).
        right = " ".join(s1[m.a:m.a+m.size].strip().split()[:-1])
        match = is_substring_a_phrase(right, s1)
        if match:
            yield match
            continue
        # Extract the left-and-right chomp substring (drop both partial words).
        leftright = " ".join(s1[m.a:m.a+m.size].strip().split()[1:-1])
        match = is_substring_a_phrase(leftright, s1)
        if match:
            yield match
            continue

# matcher() is a generator, so wrap it in list() to see the results.
list(matcher(s1, s2))
[out]:
['HYC00',
'Causal Travel',
'14',
'Laptop',
'Bookbag College Boys Men Work Daypack']
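One possible way to fold the four near-identical branches into a single loop is to iterate over the slices that describe each chomp. This is only a sketch: the names `is_phrase` and `matcher_compact` are hypothetical, the pattern is escaped with `re.escape`, and the words are re-joined with single spaces, so it assumes the input has no runs of multiple spaces.

```python
import re
from difflib import SequenceMatcher

def is_phrase(substring, text):
    """Return the substring if it occurs in text bounded by \\b and (\\s or end)."""
    if not substring:
        return None
    found = re.search(rf"\b{re.escape(substring)}(?=\s|$)", text)
    return found.group(0) if found else None

# Tried in order: no chomp, left chomp, right chomp, left-and-right chomp.
CHOMPS = (slice(None), slice(1, None), slice(None, -1), slice(1, -1))

def matcher_compact(s1, s2):
    for m in SequenceMatcher(None, s1, s2).get_matching_blocks():
        words = s1[m.a:m.a + m.size].strip().split()
        for chomp in CHOMPS:
            match = is_phrase(" ".join(words[chomp]), s1)
            if match:
                yield match
                break  # first successful chomp wins, as in the if/continue chain

print(list(matcher_compact("foo bar", "foo baz")))  # ['foo']
```

The order of `CHOMPS` mirrors the order of the original branches, so for single-spaced input it should make the same choice per block as the longer version.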
While the code snippet above works as expected, my questions, in parts, are:
- Is there some way to avoid the repeated code for the various chomps and the multiple if-else branches used to extract the matching blocks bounded by \b|$|\s?
- Is there a direct way to tell .get_matching_blocks() to return only the parts bounded by \b|$|\s?
- Are there other ways of achieving the same objective without using get_matching_blocks in this messy manner?
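On the second and third points, one alternative worth considering: SequenceMatcher accepts any sequences of hashable items, not only strings, so it can be run on lists of words instead of characters. Every block is then word-aligned by construction and no chomping is needed. The sketch below assumes that a "word" is simply a whitespace-delimited token (so "Damen," keeps its comma); the function name `matching_phrases` is hypothetical.

```python
from difflib import SequenceMatcher

def matching_phrases(s1, s2):
    """Yield runs of whole words shared by s1 and s2, in s1 order."""
    # Assumption: whitespace tokenization; punctuation stays attached to words.
    words1, words2 = s1.split(), s2.split()
    # autojunk=False stops very frequent tokens being ignored in long inputs.
    sm = SequenceMatcher(None, words1, words2, autojunk=False)
    for block in sm.get_matching_blocks():
        if block.size:  # skip the zero-length dummy block at the end
            yield " ".join(words1[block.a:block.a + block.size])

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

print(list(matching_phrases(s1, s2)))
# ['HYC00', 'Causal Travel', '14', 'Laptop', 'Bookbag College Boys Men Work Daypack']
```

The trade-off is that matching becomes all-or-nothing per token: words that differ only in attached punctuation or case no longer match unless they are normalized before comparison.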
From @megaing's comment
[out]: