In building a lightweight tool that detects censored profanity usage, I noticed that detecting special characters at the end of a word boundary is quite difficult.
Using a tuple of strings, I build an OR'd word boundary regular expression:
import re
PHRASES = (
    'sh\\*t',  # easy
    'sh\\*\\*',  # difficult
    'f\\*\\*k',  # easy
    'f\\*\\*\\*',  # difficult
)
MATCHER = re.compile(
    r"\b(%s)\b" % "|".join(PHRASES),
    flags=re.IGNORECASE | re.UNICODE)
The problem is that the * is not something that can be detected next to a word boundary \b.
print(MATCHER.search('Well f*** you!')) # Fail - Does not find f***
print(MATCHER.search('Well f***!')) # Fail - Does not find f***
print(MATCHER.search('f***')) # Fail - Does not find f***
print(MATCHER.search('f*** this!')) # Fail - Does not find f***
print(MATCHER.search('secret code is 123f***')) # Pass - Should not match
print(MATCHER.search('f**k this!')) # Pass - Should find
Any ideas for setting this up in a convenient way to support phrases that end in special characters?
The * is not a word character, so there is no match when it is followed by a \b and then a non-word character (or the end of the string). Assuming the initial word boundary is fine, but you want to match sh*t but not sh*t*, or match f***! but not f***a, how about simulating your own word boundary with a negative lookahead such as (?![\w*]) in place of the closing \b? See this demo at regex101.
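As a minimal sketch of that idea, assuming the phrase list from the question: the closing \b is swapped for the lookahead (?![\w*]), so a match may not be followed by a word character or another asterisk. The helper name find is hypothetical and only there to make the results easy to read.

```python
import re

# Same phrases as in the question, already escaped for use in a regex.
PHRASES = (
    r'sh\*t',
    r'sh\*\*',
    r'f\*\*k',
    r'f\*\*\*',
)

# \b still guards the start of the phrase. (?![\w*]) is the hand-rolled
# closing boundary: the next character must not be a word character or '*'.
MATCHER = re.compile(
    r"\b(?:%s)(?![\w*])" % "|".join(PHRASES),
    flags=re.IGNORECASE,
)


def find(text):
    """Hypothetical helper: return the matched phrase, or None."""
    match = MATCHER.search(text)
    return match.group(0) if match else None


print(find('Well f*** you!'))           # f***
print(find('Well f***!'))               # f***
print(find('f***'))                     # f***
print(find('secret code is 123f***'))   # None
print(find('sh*t*'))                    # None
print(find('f***a'))                    # None
```

The lookahead consumes nothing, so it also succeeds at the end of the string, which is the case the trailing \b could not handle.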
If needed, the opening word boundary \b can be replaced by a negative lookbehind: (?<![\w*])
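A sketch of the variant with both boundaries simulated, under the same assumptions as the question's phrase list. The '@ss' entry is a hypothetical addition, not from the question, included only to show a phrase that starts with a non-word character, which a leading \b could not anchor after a space. The helper name find is likewise hypothetical.

```python
import re

PHRASES = (
    r'sh\*t',
    r'sh\*\*',
    r'f\*\*k',
    r'f\*\*\*',
    r'@ss',  # hypothetical entry: begins with a non-word character
)

# (?<![\w*]) replaces the opening \b: the phrase must not be preceded by
# a word character or '*'. (?![\w*]) does the same job at the end.
MATCHER = re.compile(
    r"(?<![\w*])(?:%s)(?![\w*])" % "|".join(PHRASES),
    flags=re.IGNORECASE,
)


def find(text):
    """Hypothetical helper: return the matched phrase, or None."""
    match = MATCHER.search(text)
    return match.group(0) if match else None


print(find('you @ss'))      # @ss
print(find('Well f***!'))   # f***
print(find('*f***'))        # None (preceded by '*')
print(find('123f***'))      # None (preceded by a digit)
```

Note the difference from \b: with the lookbehind, '*f***' is rejected because the phrase is preceded by an asterisk, whereas \b would still see a boundary between '*' and 'f'.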