I have a regex to match a word in a long text, like this:
import re

word = "word"
text = "word subword word"

def char_regex_ascii(word):
    # Escape the word and require a word boundary on both sides
    return r"\b{}\b".format(re.escape(word))

r = re.compile(char_regex_ascii(word), flags=re.X | re.UNICODE)
for m in r.finditer(text):
    print(m.group())
output:
word
word
The reason for \b is that I don't want to find substrings, only full words. For example, I'm not interested in matching the word "word" inside the text "subword"; I only want full words as results, that is, words preceded or followed by spaces, commas, dots, or any other punctuation.
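It works in the majority of cases, but if the word ends with a dot, like w.o.r.d., it doesn't match, because the final \b of the regex comes after a dot. A \b only matches between a word character and a non-word character, and here the dot is followed by a space, so both sides are non-word characters.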
word = "w.o.r.d."
text = "w.o.r.d. subword word"

def char_regex_ascii(word):
    # Same pattern as before: a word boundary on both sides
    return r"\b{}\b".format(re.escape(word))

r = re.compile(char_regex_ascii(word), flags=re.X | re.UNICODE)
for m in r.finditer(text):
    print(m.group())
output:
(nothing)
I see that using \B makes it work, but then I would have to check the beginning and end of every word and try all the combinations of \b and \B for each of the many words I need to find.
word = "w.o.r.d."
text = "w.o.r.d. subword word"

def char_regex_ascii(word):
    # \B at the end: the position after the final dot is not a word boundary
    return r"\b{}\B".format(re.escape(word))

r = re.compile(char_regex_ascii(word), flags=re.X | re.UNICODE)
for m in r.finditer(text):
    print(m.group())
output:
w.o.r.d.
Does a general approach exist?
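You could use the regex pattern \w+(?:\.?\w+)* along with re.findall. The pattern used here defines a "word" as being: one or more word characters, optionally followed by any number of groups, each made of an optional dot and one or more word characters.
Under this definition, acronym-style terms such as w.o.r.d. would be captured as matches, although without the trailing dot, since the pattern has to end on a word character.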
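As a rough sketch, the pattern can be tried with re.findall on the sample text from the question. The second part is not from the answer: it shows one possible alternative that replaces \b with the lookarounds (?<!\w) and (?!\w), so that a specific word can be matched whether or not it starts or ends with punctuation. The helper name whole_word_regex is hypothetical and used only for illustration.

```python
import re

text = "w.o.r.d. subword word"

# Part 1: the pattern from the answer, applied to the whole text.
# Each match starts and ends on a word character, so the trailing
# dot of "w.o.r.d." is not part of the match.
tokens = re.findall(r"\w+(?:\.?\w+)*", text)
print(tokens)  # ['w.o.r.d', 'subword', 'word']


# Part 2 (assumption, not from the answer): search for one given word,
# requiring that it is not glued to a word character on either side.
def whole_word_regex(word):
    # (?<!\w) : no word character immediately before the match
    # (?!\w)  : no word character immediately after the match
    return re.compile(r"(?<!\w){}(?!\w)".format(re.escape(word)))


for word in ("word", "w.o.r.d."):
    found = [m.group() for m in whole_word_regex(word).finditer(text)]
    print(word, "->", found)
```

With the lookarounds, "word" is found once (the occurrence inside "subword" is rejected) and "w.o.r.d." is found with its final dot, without choosing between \b and \B per word.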