I'm using the PyPI module regex for regex matching. Its documentation says:
Default Unicode word boundary
The WORD flag changes the definition of a ‘word boundary’ to that of a default Unicode word boundary. This applies to \b and \B.
But nothing seems to have changed:
>>> r1 = regex.compile(r".\b.", flags=regex.UNICODE)
>>> r2 = regex.compile(r".\b.", flags=regex.UNICODE | regex.WORD)
>>> r1.findall("русский ελλανικα")
['й ', ' ε']
>>> r2.findall("русский ελλανικα")
['й ', ' ε']
I didn't observe any difference...?
The difference between using and not using the WORD flag is the way word boundaries are defined. Given this example:
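A minimal sketch of such a comparison, assuming the pattern \b3\b and a sample string containing 3.4 (both are illustrative choices, not taken from the regex documentation):

```python
import regex  # third-party module from PyPI: pip install regex

# Sample string chosen for illustration (an assumption).
text = "A number: 3.4 :)"

# Default \b: a boundary sits wherever a \w character meets a \W character,
# so there is one between "3" and ".".
default_match = regex.search(r"\b3\b", text)

# WORD flag: \b follows the default Unicode word boundary rules, which do
# not break inside a numeric sequence such as "3.4".
word_match = regex.search(r"\b3\b", text, flags=regex.WORD)

print(default_match)  # a match object for "3"
print(word_match)     # None
```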
The first will print a match while the second returns None. Why? Because the “default Unicode word boundary” is defined by a set of rules for locating word boundaries, while Python's default word boundary is simply any position where a \w character meets a \W character or the start or end of the string (\w here is still the Unicode alphanumeric set).

In the example, 3.4 was split by Python's default word boundary: the period is a \W character, so there is a word boundary on either side of it. Under the default Unicode word boundary rules, WB11 and WB12 forbid a break within a numeric sequence such as “3.4”, so the positions around the period are not word boundaries.

See all the Unicode word boundary rules here: https://unicode.org/reports/tr29/#Word_Boundary_Rules
Conclusion:
They both work with Unicode or your LOCALE, but the WORD flag applies an additional set of rules for locating word boundaries, beyond the default of matching the empty string wherever a \w character meets a \W character, since “a word is defined as a sequence of word characters [\w]”.