I'm using the PyPI module regex for regex matching. It says
Default Unicode word boundary
The WORD flag changes the definition of a ‘word boundary’ to that of a default Unicode word boundary. This applies to \b and \B.
But nothing seems to have changed:
>>> r1 = regex.compile(r".\b.", flags=regex.UNICODE)
>>> r2 = regex.compile(r".\b.", flags=regex.UNICODE | regex.WORD)
>>> r1.findall("русский ελλανικα")
['й ', ' ε']
>>> r2.findall("русский ελλανικα")
['й ', ' ε']
I didn't observe any difference...?
The difference between matching with and without the WORD flag is the way word boundaries are defined. Given this example:
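A minimal sketch that fits the description below, assuming the third-party regex module is installed from PyPI. The sample string and the variable names are illustrative assumptions, not taken from the module's documentation:

```python
import regex  # third-party module from PyPI, not the standard-library re

text = "A number: 3.4 :)"  # illustrative sample string (an assumption)

# Default definition: \b sits between a \w and a \W character, so the
# period after "3" creates a boundary and a lone "3" can be matched.
default_match = regex.search(r"\b3\b", text)

# WORD flag: \b follows the default Unicode word boundary rules, which
# do not break inside a number such as "3.4", so a lone "3" is not found.
word_match = regex.search(r"\b3\b", text, flags=regex.WORD)

print(default_match)
print(word_match)
```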
The first prints a match while the second returns None. Why? Because the “Unicode word boundary” definition is a set of rules for distinguishing word boundaries, while Python’s default word boundary is simply any position between a \w and a \W character (where \w is still Unicode alphanumeric).
In the example, 3.4 was split by Python’s default word boundary because a \W character, the period, is present, so there is a word boundary on either side of it. Under the Unicode word boundary rules, one rule states “Forbidden Breaks on “.””, with “3.4” as an example, so the period is not considered a word boundary.
See all the Unicode word boundary rules here: https://unicode.org/reports/tr29/#Sentence_Boundary_Rules
Conclusion:
They both work with Unicode or your LOCALE, but the WORD flag provides an additional set of rules for distinguishing word boundaries, beyond just the empty string next to a \W, since “a word is defined as a sequence of word characters [\w]”.
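This also suggests why the string in the question shows no difference. A rough sketch, again assuming the regex module is installed: text made only of letters and a space has the same boundaries under both definitions, while a decimal number such as "3.4" does not. The variable names are hypothetical:

```python
import regex  # third-party module from PyPI

default_re = regex.compile(r".\b.", flags=regex.UNICODE)
word_re = regex.compile(r".\b.", flags=regex.UNICODE | regex.WORD)

letters = "русский ελλανικα"  # letters and one space, as in the question
number = "3.4"                # a period between two digits

# Letters and a space: both definitions put boundaries in the same places.
same_on_letters = default_re.findall(letters) == word_re.findall(letters)

# A decimal number: only the default definition sees a boundary at the period.
default_on_number = default_re.findall(number)
word_on_number = word_re.findall(number)

print(same_on_letters)
print(default_on_number)
print(word_on_number)
```

To see a difference in practice, the input needs a case that the Unicode rules treat specially, such as punctuation between digits.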