Chomping left, right string by whitespace to iterate regex matches


The goal is to extract the matching "words" (bounded by \b|$|\s) from the output of difflib's SequenceMatcher.get_matching_blocks(). For example, given:

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"

s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

The expected matching blocks to extract are:

['HYC00', 'Causal Travel', '14', 'Laptop', 'Bookbag College Boys Men Work Daypack']

The simple cases are those where the matching blocks from difflib are already bounded by \b|$|\s, e.g.

import re
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"

s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

def is_substring_a_phrase(substring, s1):
  if substring:
    # Check if matching substring is bounded by word boundary.
    # re.escape keeps regex metacharacters in the substring (e.g. "." or "+") literal.
    match = re.findall(rf"\b{re.escape(substring)}(?=\s|$)", s1)
    if match:
      return match[0]

def matcher(s1, s2):
  x = SequenceMatcher(None, s1, s2)
  for m in x.get_matching_blocks():
    # Extract the substring.
    full_substring = s1[m.a:m.a+m.size].strip()
    match = is_substring_a_phrase(full_substring, s1)
    if match:
      yield match
      continue

list(matcher(s1, s2))  # matcher is a generator, so materialize it to see the results

[out]:

['14', 'Laptop', 'Bookbag College Boys Men Work Daypack']

To capture HYC00 and Causal Travel, note that the matching blocks are HYC00 Sch and men, Causal Travel respectively, so we have to do some "chomping": remove the leftmost, the rightmost, or both the leftmost and rightmost partial "words", i.e.

def matcher(s1, s2):
  x = SequenceMatcher(None, s1, s2)
  for m in x.get_matching_blocks():
    # Extract the substring.
    full_substring = s1[m.a:m.a+m.size].strip()
    match = is_substring_a_phrase(full_substring, s1)
    if match:
      yield match
      continue

    # Extract the left chomp substring.
    left = " ".join(s1[m.a:m.a+m.size].strip().split()[1:])
    match = is_substring_a_phrase(left, s1)
    if match:
      yield match
      continue


    # Extract the right chomp substring.
    right = " ".join(s1[m.a:m.a+m.size].strip().split()[:-1])
    match = is_substring_a_phrase(right, s1)
    if match:
      yield match
      continue


    # Extract the left-and-right chomp substring.
    leftright = " ".join(s1[m.a:m.a+m.size].strip().split()[1:-1])
    match = is_substring_a_phrase(leftright, s1)
    if match:
      yield match
      continue

list(matcher(s1, s2))  # matcher is a generator, so materialize it to see the results

[out]:

['HYC00',
 'Causal Travel',
 '14',
 'Laptop',
 'Bookbag College Boys Men Work Daypack']

While the code snippet above works as expected, my questions, in parts:

  • Is there some way to avoid the repeated code for the various chomps and the multiple if-else branches when extracting the matching blocks bounded by \b|$|\s?
  • Is there a direct way to tell .get_matching_blocks() to return only the parts bounded by \b|$|\s?
  • Are there other ways of achieving the same objective without using get_matching_blocks in this messy manner?
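
On the first question, the four branches in the snippet above differ only in which tokens are dropped from the block, so one way to remove the repetition is to loop over the four slices. This is a minimal sketch of that idea, not a tested replacement: the names CHOMPS, is_phrase and chomp_to_phrase are hypothetical helpers introduced here, and the slices are tried in the same order as in the original code.

```python
import re
from difflib import SequenceMatcher

# Hypothetical helper names; the slices mirror the original order:
# whole block, drop first token, drop last token, drop both.
CHOMPS = (slice(None), slice(1, None), slice(None, -1), slice(1, -1))

def is_phrase(candidate, text):
    # True if candidate occurs in text starting at a word boundary
    # and followed by whitespace or the end of the string.
    if not candidate:
        return False
    return re.search(rf"\b{re.escape(candidate)}(?=\s|$)", text) is not None

def chomp_to_phrase(block_text, text):
    # Try each chomp in turn and return the first candidate that is a phrase.
    tokens = block_text.split()
    for chomp in CHOMPS:
        candidate = " ".join(tokens[chomp])
        if is_phrase(candidate, text):
            return candidate
    return None

def matcher(s1, s2):
    # Character-level matching, as in the question.
    sm = SequenceMatcher(None, s1, s2)
    for m in sm.get_matching_blocks():
        phrase = chomp_to_phrase(s1[m.a:m.a + m.size], s1)
        if phrase:
            yield phrase
```

Calling list(matcher(s1, s2)) is intended to behave like the longer version above; adding or reordering chomps only means editing the CHOMPS tuple.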

There is 1 best solution below

alvas On

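From @megaing's comment: split both strings on whitespace and run SequenceMatcher on the token lists instead of on the raw characters, so every matching block is already made of whole "words".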

from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"

s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"


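tokens1 = s1.split()
x = SequenceMatcher(None, tokens1, s2.split())

for m in x.get_matching_blocks():
    # Skip the zero-length sentinel block that always ends the list.
    if not m.size:
        continue
    # Extract the substring.
    full_substring = " ".join(tokens1[m.a:m.a+m.size])
    print(full_substring)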

[out]:

HYC00
Causal Travel
14
Laptop
Bookbag College Boys Men Work Daypack
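
The same token-level idea can be wrapped in a small function that returns a list instead of printing. This is a sketch under a few assumptions: matching_phrases is a hypothetical name, tokens are produced by plain str.split() (so runs of whitespace collapse to single spaces in the result), and autojunk is switched off so that frequent tokens in long inputs are not discarded by difflib's junk heuristic.

```python
from difflib import SequenceMatcher

def matching_phrases(s1, s2):
    # Hypothetical helper: compare whitespace-separated tokens, not characters.
    tokens1, tokens2 = s1.split(), s2.split()
    sm = SequenceMatcher(None, tokens1, tokens2, autojunk=False)
    return [
        " ".join(tokens1[m.a:m.a + m.size])
        for m in sm.get_matching_blocks()
        if m.size  # drop the zero-length sentinel block at the end
    ]
```

Because punctuation stays attached to its token, "Damen," and "Women," are compared as whole tokens and never produce a partial match such as "men,".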