Match star * character at end of word boundary \b


In building a lightweight tool that detects censored profanity usage, I noticed that detecting special characters at the end of a word boundary is quite difficult.

Using a tuple of strings, I build an OR'd regular expression wrapped in word boundaries:

import re

PHRASES = (
    'sh\\*t',  # easy
    'sh\\*\\*',  # difficult
    'f\\*\\*k',  # easy
    'f\\*\\*\\*',  # difficult
)

MATCHER = re.compile(
    r"\b(%s)\b" % "|".join(PHRASES), 
    flags=re.IGNORECASE | re.UNICODE)

The problem is that the * is not something that can be detected next to a word boundary \b.

print(MATCHER.search('Well f*** you!'))  # Fail - Does not find f***
print(MATCHER.search('Well f***!'))  # Fail - Does not find f***
print(MATCHER.search('f***'))  # Fail - Does not find f***
print(MATCHER.search('f*** this!'))  # Fail - Does not find f***
print(MATCHER.search('secret code is 123f***'))  # Pass - Should not match
print(MATCHER.search('f**k this!'))  # Pass - Should find 

Any ideas for setting this up in a convenient way to support phrases that end in special characters?

4

There are 4 best solutions below

1
On BEST ANSWER

The * is not a word character, so there is no match when it is followed by \b and then a non-word character (or the end of the string): \b only matches between a word character and a non-word character.

Assuming the initial word boundary is fine, but you want to match sh*t but not sh*t*, or match f***! but not f***a, how about simulating your own word boundary with a negative lookahead?
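
A quick way to see this behaviour, assuming Python's re module as in the question (the variable names are only for illustration):

```python
import re

# '*' followed by a space: both are non-word characters, so no \b exists there.
star_then_space = re.search(r"f\*\*\*\b", "Well f*** you!")

# '*' followed by a letter: a boundary does exist, so \b succeeds here,
# which is the opposite of what the question wants.
star_then_letter = re.search(r"f\*\*\*\b", "Well f***you!")

print(star_then_space)
print(star_then_letter)
```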

\b(...)(?![\w*])

See this demo at regex101

If needed, the opening word boundary \b can be replaced by a negative lookbehind: (?<![\w*])
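
A minimal sketch of this idea in Python, using the lookbehind and lookahead on both sides. It assumes the phrases are stored as plain text and escaped with re.escape; the helper name find_phrase is hypothetical and not part of the question's code.

```python
import re

# Phrases from the question, written as plain text and escaped programmatically.
PHRASES = ("sh*t", "sh**", "f**k", "f***")

# Custom "boundaries": no word character or '*' directly before or after a phrase.
MATCHER = re.compile(
    r"(?<![\w*])(?:%s)(?![\w*])" % "|".join(re.escape(p) for p in PHRASES),
    flags=re.IGNORECASE,
)

def find_phrase(text):
    """Return the matched phrase, or None if nothing matches."""
    m = MATCHER.search(text)
    return m.group(0) if m else None

for sample in ("Well f*** you!", "Well f***!", "secret code is 123f***", "sh*t*"):
    print(repr(sample), "->", find_phrase(sample))
```

Because the lookarounds are zero-width, the match itself contains only the phrase, not the neighbouring characters.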

2
On

You could embed the boundary requirements in each string, like this:

'\\bsh\\*t\\b', 
'\\bsh\\*\\*',  
'\\bf\\*\\*k\\b',  
'\\bf\\*\\*\\*', 

then r"(%s)" % "|".join(PHRASES)

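Or, if the regex engine supports lookaround conditionals (PCRE does; Python's built-in re module only accepts a group number or name as the condition, so the pattern below will not compile there), it's done like this: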

'sh\\*t', 
'sh\\*\\*',  
'f\\*\\*k',  
'f\\*\\*\\*', 

then r"(?(?=\w)\b)(%s)(?(?<=\w)\b)" % "|".join(PHRASES)
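
In Python's built-in re module a conditional can only test a group number or name, so a similar effect can be had by deciding at build time, per phrase, whether each end needs a \b. A rough sketch, assuming plain-text phrases; the helper name add_boundaries is hypothetical.

```python
import re

PHRASES = ("sh*t", "sh**", "f**k", "f***")  # plain text, escaped below

def add_boundaries(phrase):
    """Escape a non-empty phrase; add a boundary only next to a word character."""
    head = r"\b" if re.match(r"\w", phrase[0]) else ""
    tail = r"\b" if re.match(r"\w", phrase[-1]) else ""
    return head + re.escape(phrase) + tail

MATCHER = re.compile(
    "|".join(add_boundaries(p) for p in PHRASES),
    flags=re.IGNORECASE,
)

print(add_boundaries("sh**"))
print(MATCHER.search("Well f*** you!"))
print(MATCHER.search("secret code is 123f***"))
```

Note that a phrase ending in * gets no trailing check at all in this variant, so f*** is also found inside f****.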

0
On

Use your knowledge of the starts and endings of the phrases and use them with corresponding matchers.
Here is a static version, but it is easy to sort incoming new phrases automatically according to the start and ending.

import re

PHRASES1 = (
    'sh\\*t',  # easy
    'f\\*\\*k',  # easy
)
PHRASES2 = (
    'sh\\*\\*',  # difficult
    'f\\*\\*\\*',  # difficult
)
PHRASES3 = (
    '\\*\\*\\*hole', 
)
PHRASES4 = (
    '\\*\\*\\*sonofa\\*\\*\\*\\*\\*',  # difficult at both ends
)
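MATCHER1 = re.compile(
    r"\b(%s)\b" % "|".join(PHRASES1),
    flags=re.IGNORECASE | re.UNICODE)
# Inside a character class, ^ and $ are literal characters, so
# "start of string or whitespace" and "whitespace or end of string"
# are written as the groups (?:^|\s) and (?:\s|$) instead.
MATCHER2 = re.compile(
    r"\b(%s)(?:\s|$)" % "|".join(PHRASES2),
    flags=re.IGNORECASE | re.UNICODE)
MATCHER3 = re.compile(
    r"(?:^|\s)(%s)\b" % "|".join(PHRASES3),
    flags=re.IGNORECASE | re.UNICODE)
MATCHER4 = re.compile(
    r"(?:^|\s)(%s)(?:\s|$)" % "|".join(PHRASES4),
    flags=re.IGNORECASE | re.UNICODE)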
0
On

I don't fully understand your statement that * is not something that can be found next to a word boundary. However, if I understand what you are looking for correctly from the comments, I think this would work:

\b[\w]\*+[\w]*
  • Word boundary
  • Followed by a single word character, like f
  • Followed by one or more *
  • Optionally followed by more word characters, like k

Example:

https://regexr.com/4nqie
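
A small Python check of this pattern against strings like those in the question; the helper name first_match is hypothetical.

```python
import re

PATTERN = re.compile(r"\b[\w]\*+[\w]*")

def first_match(text):
    """Return the first substring matched by PATTERN, or None."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

for sample in ("Well f*** you!", "f**k this!", "sh** happens", "secret code is 123f***"):
    print(repr(sample), "->", first_match(sample))
```

As written, the pattern allows only one word character before the stars, so sh** is not found. It is also generic rather than tied to a phrase list, so it matches any single character followed by stars.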