How does regex.WORD affect the behavior of \b?

I'm using the PyPI module regex for regex matching. Its documentation says:

  • Default Unicode word boundary

    The WORD flag changes the definition of a ‘word boundary’ to that of a default Unicode word boundary. This applies to \b and \B.

But nothing seems to have changed:

>>> r1 = regex.compile(r".\b.", flags=regex.UNICODE)
>>> r2 = regex.compile(r".\b.", flags=regex.UNICODE | regex.WORD)
>>> r1.findall("русский  ελλανικα")
['й ', ' ε']
>>> r2.findall("русский  ελλανικα")
['й ', ' ε']

I didn't observe any difference...?

Best answer

The WORD flag changes how a word boundary is defined.

Given this example:

import regex

t = 'A number: 3.4 :)'

print(regex.search(r'\b3\b', t))
print(regex.search(r'\b3\b', t, flags=regex.WORD))

The first prints a match, while the second prints None. Why? A default Unicode word boundary is determined by a set of rules, whereas the default \b simply matches wherever a \w character is adjacent to a \W character or to the start or end of the string (\w is still Unicode-aware).

In the example, the default \b splits 3.4 because the period is a \W character, so the position between 3 and the period counts as a word boundary. Under the default Unicode word boundary rules, WB11 and WB12 forbid a break around a "." that sits between digits, as in "3.4", so the position after 3 is not a word boundary.

See all the Unicode word boundary rules here: https://unicode.org/reports/tr29/#Word_Boundary_Rules
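
To make the difference visible, here is a minimal sketch that lists the positions where \b matches under each definition. The helper name boundary_positions is made up for this illustration and is not part of the regex module. The exact lists may vary with the version of regex you have installed, so the sketch only prints the positions that the default definition treats as a boundary and the WORD definition does not.

```python
import regex

def boundary_positions(text, flags=0):
    # Hypothetical helper: indices at which the zero-width assertion \b matches.
    return [m.start() for m in regex.finditer(r"\b", text, flags=flags)]

text = "A number: 3.4 :)"

default_positions = boundary_positions(text)
unicode_positions = boundary_positions(text, regex.WORD)

# Positions that count as a boundary only under the default \w/\W definition.
only_default = sorted(set(default_positions) - set(unicode_positions))
print(only_default)
```

Index 11 is the position between 3 and the period. It should appear in the printed list, which is why \b3\b stops matching once WORD is set.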

Conclusion:

Both definitions work with Unicode or your LOCALE, but the WORD flag applies an additional set of rules for locating word boundaries, rather than relying only on the transition between a \w and a \W character, where "a word is defined as a sequence of word characters [\w]".