It seems a little strange to me that `\w` matches `[a-zA-Z0-9_]`. I wonder why `0-9` and `_` are counted among word characters and why `-` is not.
If I want to split the sentence `This is counter-example.` with `(\w*\b)`, it will split the word counter-example into two parts. Similarly, `(count.*?\b)` matches only `counter`.
Would it be possible to have something like `\b` that treats `-` as a word character (as if it were part of `\w`)? Or have I misunderstood the usage of `\b`? Are there some examples of its standard usage?
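For reference, the behaviour described above can be reproduced in Java roughly as follows. This is only a minimal sketch: the class name `WordBoundaryDemo` and the helper `matches` are made up for illustration, and empty matches are skipped because `\w*\b` can also match the empty string at a boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: collects every non-empty match of regex in input.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // \w*\b can match the empty string at a boundary; skip those.
            if (!m.group().isEmpty()) {
                found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String sentence = "This is counter-example.";
        // '-' is not a word character, so a boundary falls on each side of it.
        System.out.println(matches("\\w*\\b", sentence));     // [This, is, counter, example]
        System.out.println(matches("count.*?\\b", sentence)); // [counter]
    }
}
```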
The fact that `\w` matches the underscore along with letters and digits is historical: it was first introduced to match C identifiers. Well, this is true for Java's `\w` (yes, by default `\w` will not match accented characters in Java).
`\b`, however, is an anchor, and it is not necessarily defined as the frontier between a word character and a non-word character; in fact, its exact definition is implementation-dependent. There is not really an anchor which does what you want, but if you want to match words and dashes, your best bet is `\w*(-\w*)*`. Again, the `normal* (special normal*)*` pattern!
(And BTW, `\b` is a "word anchor" in some dialects only; other implementations define `\<` and `\>` instead, for the beginning-of-word and end-of-word anchors respectively.)
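The suggestion above could be sketched in Java along these lines. Note the assumptions: the class `HyphenWords` and its methods are hypothetical names, and the pattern used is `\w+(?:-\w+)*`, a stricter variant of `\w*(-\w*)*` that requires at least one word character per segment so that it never matches the empty string or a lone `-`. The second helper only illustrates the remark about accented characters, using the standard `Pattern.UNICODE_CHARACTER_CLASS` flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenWords {
    // normal+ (special normal+)* with normal = \w and special = '-'.
    // Assumption: a stricter variant of \w*(-\w*)* that cannot match "".
    static final Pattern WORD = Pattern.compile("\\w+(?:-\\w+)*");

    // Returns all words in input, keeping hyphenated words in one piece.
    static List<String> words(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Tests whether s is a single \w character, optionally Unicode-aware.
    static boolean isWordChar(String s, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\w", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(words("This is counter-example.")); // [This, is, counter-example]
        System.out.println(isWordChar("\u00e9", false));       // false: default \w is ASCII-only
        System.out.println(isWordChar("\u00e9", true));        // true
    }
}
```

If a word may legitimately start or end with a dash, the original `\w*(-\w*)*` form is more permissive, at the cost of also matching empty strings.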