I'm using Google Translate to convert some error codes into Farsi with Perl. Farsi is just one example; I've found the same issue in other languages, but for this discussion I'll stick to this single example:
The translated text of "Geometry data card error" works fine (Example 1) but translating "Appending a default 111 card" (Example 2) gives the "Wide character" error.
Both examples can be run from the terminal; they are just print statements.
I've tried the usual things like these, but to no avail:
use utf8;
use open ':std', ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
Example 1: This works
perl -Mutf8 -le 'print "\x{d8}\x{ae}\x{d8}\x{b7}\x{d8}\x{a7}\x{db}\x{8c} \x{da}\x{a9}\x{d8}\x{a7}\x{d8}\x{b1}\x{d8}\x{aa} \x{d8}\x{af}\x{d8}\x{a7}\x{d8}\x{af}\x{d9}\x{87} \x{d9}\x{87}\x{d9}\x{86}\x{d8}\x{af}\x{d8}\x{b3}\x{db}\x{8c}"'
خطای کارت داده هندسی
Example 2: This produces Wide char warnings and prints noise
perl -Mutf8 -le 'print "\x{d8}\x{a7}\x{d9}\x{81}\x{d8}\x{b2}\x{d9}\x{88}\x{d8}\x{af}\x{d9}\x{86} \x{db}\x{8c}\x{da}\x{a9} \x{da}\x{a9}\x{d8}\x{a7}\x{d8}\x{b1}\x{d8}\x{aa} \x{d9}\x{be}\x{db}\x{8c}\x{d8}\x{b4}\x{200c}\x{d9}\x{81}\x{d8}\x{b1}\x{d8}\x{b6} 111"'
Wide character in print at -e line 1.
# <terminal noise, not Farsi text>
Using Curl
If I make the same request with curl, I get this:
curl 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=auto&tl=fa&hl=fa&dt=t&ie=UTF-8&oe=UTF-8&otf=1&ssel=0&tsel=0&tk=xxxx&dt=dj&q=%41%70%70%65%6E%64%69%6E%67%20%61%20%64%65%66%61%75%6C%74%20%31%31%31%20%63%61%72%64'
[[["افزودن یک کارت پیش\u200cفرض 111","Appending a default 111 card",null,null,3,null,null,[[]],[[["982c75c78c6c8e6005ec3a4021a7f785","tea_GrecoIndoEuropeA_en2elfahykakumksq_2021q3.md"]]]]],null,"en",null,null,null,1,[],[["en"],null,[1],["en"]]]
Notice the \u200c in the JSON output above, which is a "Zero Width Non-Joiner" Unicode character. When JSON::from_json parses the \u200c it blows up:
perl -Mutf8 -MJSON -e 'print from_json("[\"\\u200c\"]")->[0];'
Wide character in print at -e line 1.
I can "fix" it like this:
my $c = $res->content;
$c =~ s/\\u[0-9a-f]{4}//;
my $json = from_json($c);
and then the output text is correct (right-to-left):
افزودن یک کارت پیشفرض 111
Question: What is going on here?
- Is this a bug in Perl or in JSON?
- Should \u200c be parsed properly in some other way?
There's a lot of stuff going on here. I think a lot of it, especially in the first two examples, stems from not understanding the difference between perl's two string modes (byte oriented and Unicode codepoint oriented).
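As a minimal sketch of the two string modes (the variable names are illustrative, and the bytes are simply the first three letters of Example 1), the core Encode module converts between them:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A byte string: six UTF-8 bytes, copied from the start of Example 1.
my $bytes = "\xd8\xae\xd8\xb7\xd8\xa7";

# Decoding turns those bytes into three Unicode codepoints (a character string).
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Encode explicitly on the way out, so STDOUT only ever receives bytes
# and no "Wide character" warning is raised.
print encode('UTF-8', $chars), "\n";
```

The same text has length 6 as bytes and length 3 as characters, which is why mixing the two forms in one string produces garbage.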
Example 1 is a raw byte string holding bytes that happen to be UTF-8 encoded, and they are passed through unchanged; as long as the terminal displaying the output expects UTF-8, they'll be rendered correctly. Example 2 contains a 'wide' character (one with a value greater than 255, here \x{200c}), which makes it a Unicode string, where each \x{NN} value greater than 127 is a Unicode codepoint that is encoded as multiple bytes in UTF-8. Printing it causes mojibake and a warning because standard output is byte oriented unless an encoding layer is applied.
As I suggested in a comment, reading perluniintro (and the other Unicode-related documentation) is a good start for learning how things work.
But on to the actual task, extracting text from the JSON returned by your curl commands... I'd use jq instead if this is for a shell script:
Compare to the equivalent perl one-liner:
The -CS argument tells perl that standard input, output, and error are all UTF-8 encoded. You could also use -CO to make that just standard output, and use decode_json() instead, which expects raw UTF-8 encoded bytes instead of a Unicode string.
And in a script instead of a one-liner, using the OO interface to JSON and tuning how input strings should be encoded using its methods, plus the open pragma (or binmode, or an encoding layer for open) instead of the -C option, is the way to go.
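A minimal sketch of that script form, under some assumptions: $raw is a hypothetical stand-in for the bytes returned by $res->content, and JSON::PP (core, with the same OO interface) is used so the example runs without installing JSON.

```perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));  # UTF-8 layers on STDIN, STDOUT and STDERR
use JSON::PP;

# Hypothetical stand-in for $res->content: raw bytes as received over the wire.
my $raw = '[[["\u067e\u06cc\u0634\u200c\u0641\u0631\u0636 111","Appending a default 111 card"]]]';

# ->utf8 tells the decoder that its input is UTF-8 encoded bytes,
# which is the same expectation decode_json() has.
my $json = JSON::PP->new->utf8;
my $data = $json->decode($raw);

# The decoded value is a character string; the layer on STDOUT encodes it.
my $text = $data->[0][0][0];
print "$text\n";
```

With the output layer in place there is no need to strip \u200c from the response before parsing it.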