I ran into this issue while using Perl one-liners to substitute some UTF-8 text in files. I am aware of the hacks in "How to handle utf8 on the command line (using Perl or Python)?", but they don't work in this case. The OS is Linux and the locale is set to UTF-8.
# make file to contain pattern
$echo Текст на юникоде>file
$cat file
Текст на юникоде
# also grep finds it
$grep "Текст на юникоде" file
Текст на юникоде
# the perl hacks mentioned in the referenced question don't work:
$perl -C63 -n -e "print if m{Текст на юникоде}" file
# does not show anything
$perl -Mutf8 -n -e "print if m{Текст на юникоде}" file
# does not show anything
# although it handles parameters correctly
$perl -e 'print "$ARGV[0]\n"' "Текст на юникоде"
Текст на юникоде
# and inside -e options as well
$perl -e 'print "Текст на юникоде\n"'
Текст на юникоде
# when I create a perl script to find the pattern, it works:
echo "while (<>) {print if m{Текст на юникоде}}">find.pl
$cat find.pl
while (<>) {print if m{Текст на юникоде}}
$perl find.pl file
Текст на юникоде
# and even this strange way it works:
perl -ne '$m="Текст на юникоде";print if m{$m}' file
Текст на юникоде
So here is my question: is there a simpler way to use UTF-8 patterns with the m and s operators within Perl one-liners, and why doesn't the simple approach work?
Thank you!
Just in case:
$uname -a
Linux ubuntu16-pereval 4.4.0-190-generic #220-Ubuntu SMP Fri Aug 28 23:02:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$locale
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
-C63 applies various flags to tell Perl that input and output files are in UTF-8. -Mutf8 tells the Perl compiler that your source code is in UTF-8.

-C63 affects how Perl sees the data in "file". -Mutf8 affects how Perl sees the code in your -e option. In order for Perl to understand that the input file and the source code should both be interpreted as UTF-8, you need both options.

Update: Oh, and I should probably add that the simplest option works as well (but for all the wrong reasons!)
In this case, it works because Perl interprets both the input and the source code as being made up of single-byte Latin-1 characters. Please don't do this :-)