Count the number of times a phrase is near another phrase, within N words of each other


I need to count the number of times a specific phrase occurs within 3 words of another specific phrase, for each row of a dataframe's string column. Order does not matter.

To illustrate: with X = "black cat", Y = "is my", and a proximity distance of 3, the string "The black cat is my black cat" would give a count of two (two unique pairs found). "The black cat by the window is my black cat" would also give two matches. However, "The black cat by the big window is my black cat" would give only one match.

Here is my example data, broken code, and desired output:

import pandas as pd

data = [['ABC123', 'test sentence here has these test words'],
        ['ABC456', 'test sentence here contains these test words in test sentence form'],
        ['ABC789', 'the third test sentence has no more additional test words']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
print(df)

Record ID | String
----------|-----------------------
ABC123    | test sentence here has these test words
ABC456    | test sentence here contains these test words in test sentence form
ABC789    | the third test sentence has no more additional test words


import pandas as pd  

def phrase_finder(df, text_column, search_phrase, near_phrase, distance):
    results = 0
    for text in df[text_column]:
        for substring in text.split(search_phrase):
            words = substring.split()
            if len(words) <= distance + 1 and near_phrase in substring:
                results += 1
    return results if results else None

search_phrase = "test sentence"
near_phrase = "test words"
distance = 3

print(phrase_finder(df, 'String', search_phrase, near_phrase, distance))

ID        | Number of Matches
----------|-----------------------
ABC123    | 1
ABC456    | 2
ABC789    | 0

This is a direct follow-up to Find word near other word, within N# of words

I was instructed to create a separate question for this rather than posting it on the other one as a follow-up.

There are 2 answers below.

BEST ANSWER

I believe O-O-O was somewhat right about regex - it is a major unsustainable PITA in your use case, IMHO. That said, the problem is quite tricky...

What regex does well is string tokenization. I have applied a rather straightforward approach:

  1. Find all matches for substring 1
  2. Find all matches for substring 2
  3. Count words between these matches

Not sure what we are supposed to do if the substrings overlap. The code is as follows. Just string slicing and word counting, no mind-boggling magic here (the less magic in the production code, the better!):

import re

def phrase_finder(text: str, str1: str, str2: str, distance: int) -> int:
    results = 0
    # compare every occurrence of str1 with every occurrence of str2
    for match1 in re.finditer(str1, text):
        for match2 in re.finditer(str2, text):
            if match1.end() < match2.start():
                # str1 occurs first: count the words between the two matches
                between_matches = text[match1.end():match2.start()]
                if len(re.findall(r'\w+', between_matches)) <= distance:
                    results += 1
            elif match2.end() < match1.start():
                # str2 occurs first: same check in the other direction
                between_matches = text[match2.end():match1.start()]
                if len(re.findall(r'\w+', between_matches)) <= distance:
                    results += 1
            else:
                # the matches overlap - what do we do here?
                pass
    return results

Test cases:

phrase_finder('The black cat is my black cat', 'black cat', 'is my', 3)
# 2
phrase_finder('The black cat by the window is my black cat', 'black cat', 'is my', 3)
# 2
phrase_finder('The black cat by the big window is my black cat', 'black cat', 'is my', 3)
# 1

import pandas as pd
from functools import partial

data = [
    ['', 0],
    ['A', 0],
    ['B', 0],
    ['A B', 1],
    ['B A', 1],
    ['A A B', 2],
    ['A B B', 2],
    ['A B C', 1],
    ['A C C C B', 1], 
    ['A C C C C B', 0], 
    ['A B A', 2], 
    ['A B A A', 3],
    ['A B A A A', 4],
    ['A B A B A', 6]
]
df = pd.DataFrame(data, columns=['text', 'expected_output'])
df['result'] = df['text'].apply(partial(phrase_finder, str1=r'A', str2=r'B', distance=3))
df
#       text    expected_output result
# 0                 0               0
# 1     A           0               0
# 2     B           0               0
# 3     A B         1               1
# 4     B A         1               1
# 5     A A B       2               2
# 6     A B B       2               2
# 7     A B C       1               1
# 8     A C C C B   1               1
# 9     A C C C C B 0               0
# 10    A B A       2               2
# 11    A B A A     3               3
# 12    A B A A A   4               4
# 13    A B A B A   6               6

And it is symmetric as well.
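
To double-check the symmetry claim, a quick extra test (my own, not part of the original test set) is to swap the two phrases and compare against the previous results, reusing the df and imports from above:

# Swapping str1 and str2 should give identical counts, because phrase_finder
# handles both orderings of the matches (extra check, same test df as above)
df['result_swapped'] = df['text'].apply(partial(phrase_finder, str1=r'B', str2=r'A', distance=3))
print((df['result'] == df['result_swapped']).all())
# True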

There is one notable pitfall here, however:

phrase_finder(r'AA B A C AAA', r'A', r'B', 3)
# -> 6

The correct way to call it in this case is by supplying word boundaries for regexes (note the r prefix as well!):

phrase_finder(r'AA B A C AAA', r'\bA\b', r'\bB\b', 3)
# -> 1
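
To tie this back to the question's dataframe, the same function can be applied row by row. Here is a sketch (my own, reusing pd, partial and phrase_finder from above; the plain phrases work without \b here since they only occur as whole words):

data = [['ABC123', 'test sentence here has these test words'],
        ['ABC456', 'test sentence here contains these test words in test sentence form'],
        ['ABC789', 'the third test sentence has no more additional test words']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])

df['Number of Matches'] = df['String'].apply(
    partial(phrase_finder, str1=r'test sentence', str2=r'test words', distance=3))
print(df[['Record ID', 'Number of Matches']])
# this should give 1, 2 and 0 for ABC123, ABC456 and ABC789, matching the desired output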

ANOTHER ANSWER

import re

def count_proximity(s, s1, s2, t):
  # positions of every occurrence of each phrase
  xs = [m.start() for m in re.finditer(s1, s)]
  ys = [m.start() for m in re.finditer(s2, s)]

  # walk both position lists with two pointers, always advancing the earlier
  # match, and count a pair when at most t words lie between the two phrases
  count = i = j = 0
  while i < len(xs) and j < len(ys):
    x = xs[i]
    y = ys[j]
    if x <= y:
      count += int(len(s[x + len(s1) : y].split()) <= t)
      i += 1
    else:
      count += int(len(s[y + len(s2) : x].split()) <= t)
      j += 1

  return count


s1 = "black cat"
s2 = "is my"
t = 3

for s in [
  "The black cat is my black cat",
  "The black cat by the window is my black cat",
  "The black cat by the big window is my black cat"
]:
  print(count_proximity(s, s1, s2, t))
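
Run against the three example sentences, this should print 2, 2 and 1, in line with the pair counts described in the question; for the per-row dataframe output, the same function can be applied with df['String'].apply, as sketched under the accepted answer.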