Count the number of times a phrase is near another phrase, within N words of each other


I need to count the number of times a specific phrase occurs within 3 words of another specific phrase, for each row of a dataframe's string column. Order does not matter.

To illustrate: with X = "black cat", Y = "is my", and a proximity distance of 3, the string "The black cat is my black cat" would give a count of two (two unique pairs found). "The black cat by the window is my black cat" would also give two matches. However, "The black cat by the big window is my black cat" would give only one match.

Here is my example data, broken code, and desired output:

import pandas as pd

data = [['ABC123', 'test sentence here has these test words'],
        ['ABC456', 'test sentence here contains these test words in test sentence form'],
        ['ABC789', 'the third test sentence has no more additional test words']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
print(df)

Record ID | String
----------|-----------------------
ABC123    | test sentence here has these test words
ABC456    | test sentence here contains these test words in test sentence form
ABC789    | the third test sentence has no more additional test words


import pandas as pd  

def phrase_finder(df, text_column, search_phrase, near_phrase, distance):
    results = 0
    for text in df[text_column]:
        for substring in text.split(search_phrase):
            words = substring.split()
            if len(words) <= distance + 1 and near_phrase in substring:
                results += 1
    return results if results else None

search_phrase = "test sentence"
near_phrase = "test words"
distance = 3

print(phrase_finder(df, 'String', search_phrase, near_phrase, distance))

ID        | Number of Matches
----------|-----------------------
ABC123    | 1
ABC456    | 2
ABC789    | 0

This is a direct follow-up to Find word near other word, within N# of words

I was instructed to create a separate question for this rather than posting it on the other one as a follow-up.

There are 2 answers below.

BEST ANSWER

I believe O-O-O was somewhat right about regex - it is a major unsustainable PITA in your use case, IMHO. That said, the problem is quite tricky...

What regex does well is string tokenization. I have applied a rather straightforward approach:

  1. Find all matches for substring 1
  2. Find all matches for substring 2
  3. Count words between these matches

Not sure what we are supposed to do if the substrings overlap. The code is as follows. Just string slicing and word counting, no mind-boggling magic here (the less magic in the production code, the better!):

import re

def phrase_finder(text: str, str1: str, str2: str, distance: int) -> int:
    results = 0
    # compare every occurrence of str1 with every occurrence of str2
    for match1 in re.finditer(str1, text):
        for match2 in re.finditer(str2, text):
            if match1.end() < match2.start():
                # str1 occurs first: count the words between the two matches
                between_matches = text[match1.end():match2.start()]
                if len(re.findall(r'\w+', between_matches)) <= distance:
                    results += 1
            elif match2.end() < match1.start():
                # str2 occurs first: same check in the other direction
                between_matches = text[match2.end():match1.start()]
                if len(re.findall(r'\w+', between_matches)) <= distance:
                    results += 1
            else:
                # the matches overlap - what do we do here?
                pass
    return results

Test cases:

phrase_finder('The black cat is my black cat', 'black cat', 'is my', 3)
# 2
phrase_finder('The black cat by the window is my black cat', 'black cat', 'is my', 3)
# 2
phrase_finder('The black cat by the big window is my black cat', 'black cat', 'is my', 3)
# 1

import pandas as pd
from functools import partial

data = [
    ['', 0],
    ['A', 0],
    ['B', 0],
    ['A B', 1],
    ['B A', 1],
    ['A A B', 2],
    ['A B B', 2],
    ['A B C', 1],
    ['A C C C B', 1], 
    ['A C C C C B', 0], 
    ['A B A', 2], 
    ['A B A A', 3],
    ['A B A A A', 4],
    ['A B A B A', 6]
]
df = pd.DataFrame(data, columns=['text', 'expected_output'])
df['result'] = df['text'].apply(partial(phrase_finder, str1=r'A', str2=r'B', distance=3))
df
#       text    expected_output result
# 0                 0               0
# 1     A           0               0
# 2     B           0               0
# 3     A B         1               1
# 4     B A         1               1
# 5     A A B       2               2
# 6     A B B       2               2
# 7     A B C       1               1
# 8     A C C C B   1               1
# 9     A C C C C B 0               0
# 10    A B A       2               2
# 11    A B A A     3               3
# 12    A B A A A   4               4
# 13    A B A B A   6               6

And it is symmetric as well.
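
To double-check the symmetry claim, a quick extra test (my own, not part of the original test set) is to swap the two phrases and compare against the previous results, reusing the df and imports from above:

# Swapping str1 and str2 should give identical counts, because phrase_finder
# handles both orderings of the matches (extra check, same test df as above)
df['result_swapped'] = df['text'].apply(partial(phrase_finder, str1=r'B', str2=r'A', distance=3))
print((df['result'] == df['result_swapped']).all())
# True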

There is one notable pitfall here, however:

phrase_finder(r'AA B A C AAA', r'A', r'B', 3)
# -> 6

The correct way to call it in this case is by supplying word boundaries for regexes (note the r prefix as well!):

phrase_finder(r'AA B A C AAA', r'\bA\b', r'\bB\b', 3)
# -> 1
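
To tie this back to the question's dataframe, the same function can be applied row by row. Here is a sketch (my own, reusing pd, partial and phrase_finder from above; the plain phrases work without \b here since they only occur as whole words):

data = [['ABC123', 'test sentence here has these test words'],
        ['ABC456', 'test sentence here contains these test words in test sentence form'],
        ['ABC789', 'the third test sentence has no more additional test words']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])

df['Number of Matches'] = df['String'].apply(
    partial(phrase_finder, str1=r'test sentence', str2=r'test words', distance=3))
print(df[['Record ID', 'Number of Matches']])
# this should give 1, 2 and 0 for ABC123, ABC456 and ABC789, matching the desired output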

ANOTHER ANSWER

import re

def count_proximity(s, s1, s2, t):
  # positions of every occurrence of each phrase
  xs = [m.start() for m in re.finditer(s1, s)]
  ys = [m.start() for m in re.finditer(s2, s)]

  # walk both position lists with two pointers, always advancing the earlier
  # match, and count a pair when at most t words lie between the two phrases
  count = i = j = 0
  while i < len(xs) and j < len(ys):
    x = xs[i]
    y = ys[j]
    if x <= y:
      count += int(len(s[x + len(s1) : y].split()) <= t)
      i += 1
    else:
      count += int(len(s[y + len(s2) : x].split()) <= t)
      j += 1

  return count


s1 = "black cat"
s2 = "is my"
t = 3

for s in [
  "The black cat is my black cat",
  "The black cat by the window is my black cat",
  "The black cat by the big window is my black cat"
]:
  print(count_proximity(s, s1, s2, t))
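
Run against the three example sentences, this should print 2, 2 and 1, in line with the pair counts described in the question; for the per-row dataframe output, the same function can be applied with df['String'].apply, as sketched under the accepted answer.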