Does Perl's \w match all alphanumeric characters defined in the Unicode standard? For example, will \w match all (say) Chinese and Russian alphanumeric characters?
I wrote a simple test script (see below) which suggests that \w does indeed match "as expected" for the non-ASCII alphanumeric characters I tested. But the testing is obviously far from exhaustive.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 source text
binmode(STDOUT, ':utf8');
my @ok;
$ok[0] = "abcdefghijklmnopqrstuvwxyz";
$ok[1] = "éèëáàåäöčśžłíżńęøáýąóæšćôı";
$ok[2] = "şźüęłâi̇ółńśłŕíáυσνχατςęςη";
$ok[3] = "τσιαιγολοχβςανنيرحبالтераб";
$ok[4] = "иневоаслкłјиневоцедањеволс";
$ok[5] = "рглсывызтоμςόκιναςόγο";
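foreach my $ok (@ok) {
    # Every character in the string must be a \w character
    die "\\w failed to match one of the test strings\n" unless $ok =~ /^\w+$/;
}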
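perldoc perlunicode says that character classes in regular expressions match characters rather than bytes, using the properties in the Unicode character database, and gives \w matching a Japanese ideograph as an example.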
So it looks like the answer to your question is "yes".
However, you might want to use the \p{} construct to directly access specific Unicode character properties. You can probably use \p{L} (or, shorter, \pL) for letters and \pN for numbers and feel a little more confident that you'll get exactly what you want.