Word boundaries to match strings containing dots (.) at the beginning/end

I have a regex to match a word in a long text, like this:

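import re

word = "word"
text = "word subword word"

def char_regex_ascii(word):
    # \b on both sides restricts the match to whole words
    return r"\b{}\b".format(re.escape(word))

r = re.compile(char_regex_ascii(word), flags=re.X | re.UNICODE)
for m in r.finditer(text):
    print(m.group())  # print the matched text, not the Match object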

output:

word
word

I use \b because I don't want to match substrings, only full words. For example, I'm not interested in matching word inside subword; I only want whole words, that is, words preceded or followed by spaces, commas, dots, or any other punctuation.

This works in most cases, but if the word ends with a dot, like w.o.r.d., it doesn't match: the final \b falls between the dot and the following space, which are both non-word characters, so there is no word boundary at that position.

import re

word = "w.o.r.d."
text = "w.o.r.d. subword word"

def char_regex_ascii(word):
    # the trailing \b fails here: "." and " " are both non-word characters
    return r"\b{}\b".format(re.escape(word))

r = re.compile(char_regex_ascii(word), flags=re.X | re.UNICODE)
for m in r.finditer(text):
    print(m.group())

output:

(nothing)

I see that using \B at the end makes it work, but then I would have to inspect the beginning and end of every word I search for and try all the combinations of \b and \B.

import re

word = "w.o.r.d."
text = "w.o.r.d. subword word"

def char_regex_ascii(word):
    # trailing \B succeeds between two non-word characters ("." and " ")
    return r"\b{}\B".format(re.escape(word))

r = re.compile(char_regex_ascii(word), flags=re.X | re.UNICODE)
for m in r.finditer(text):
    print(m.group())

output:

w.o.r.d.

Does a general approach exist?

There is 1 best solution below

You could use the regex pattern \w+(?:\.?\w+)*, along with re.findall:

import re

text = "w.o.r.d. subword word"
matches = re.findall(r'\w+(?:\.?\w+)*', text)
print(matches)  # ['w.o.r.d', 'subword', 'word']

The pattern used here defines a "word" as being:

\w+         one or more word characters
(?:
    \.?\w+  followed by optional dot and one or more
            word characters
)*          zero or more times

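Under this definition, acronym-style terms such as w.o.r.d. are matched as a single token. Note that the trailing dot is not part of the match (the result is w.o.r.d), because every dot in the pattern must be followed by at least one word character.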
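
If the goal is to find one specific term, including a leading or trailing dot, a possible alternative is to replace \b with the lookarounds (?<!\w) and (?!\w). These only assert that no word character touches the match, so they should behave the same whether the term starts or ends with a letter or with punctuation. The sketch below is an assumption about what "full word" means here, and the helper name whole_word_regex is hypothetical, not something from the question:

```python
import re

def whole_word_regex(word):
    # Hypothetical helper (not from the original post).
    # (?<!\w): the match must not be preceded by a word character.
    # (?!\w):  the match must not be followed by a word character.
    return r"(?<!\w){}(?!\w)".format(re.escape(word))

text = "w.o.r.d. subword word"

for word in ("w.o.r.d.", "word"):
    pattern = re.compile(whole_word_regex(word))
    print(word, [m.span() for m in pattern.finditer(text)])

# w.o.r.d. [(0, 8)]
# word [(17, 21)]
```

With this pattern the occurrence of word inside subword is still skipped, and no per-word choice between \b and \B is needed. One consequence of this choice: a term ending in a dot that is immediately followed by a letter or digit (for example w.o.r.d.x) is rejected.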