I'm trying to get our chat system to support UTF-8, but I'm failing. If, on the client side, I send the following message, passed through encodeURIComponent:
- îûôó
And put this on the PHP end:
error_log(print_r(array(
$_POST['message'],
urldecode($_POST['message']),
rawurldecode($_POST['message']),
utf8_decode($_POST['message']),
utf8_decode(urldecode($_POST['message'])),
utf8_decode(rawurldecode($_POST['message']))
), true));
This is the output in my error log:
Array
(
[0] => %C3%AE%C3%BB%C3%B4%C3%B3
[1] => îûôó
[2] => îûôó
[3] => %C3%AE%C3%BB%C3%B4%C3%B3
[4] => îûôó
[5] => îûôó
)
So all is fine. However, if I use these, both copied from Wikipedia (Russian language and Japanese language pages, respectively):
- русский язык
- 日本語
It all goes to hell!
Array
(
[0] => %D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8%D0%B9%20%D1%8F%D0%B7%D1%8B%D0%BA
[1] => руÑÑкий Ñзык
[2] => руÑÑкий Ñзык
[3] => %D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8%D0%B9%20%D1%8F%D0%B7%D1%8B%D0%BA
[4] => ??????? ????
[5] => ??????? ????
)
Array
(
[0] => %E6%97%A5%E6%9C%AC%E8%AA%9E
[1] => 日本語
[2] => 日本語
[3] => %E6%97%A5%E6%9C%AC%E8%AA%9E
[4] => ???
[5] => ???
)
What do I need to do to make this work?
You have over-URL-encoded your input. The GET/POST/REQUEST superglobals have already taken care of URL-decoding input strings where necessary, so you should not need to URL-decode them manually.
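To make the double encoding concrete, here is a minimal JavaScript sketch. The helper serverSideDecode is hypothetical: it only emulates the single round of decoding a server applies to a form-encoded body, using the standard URLSearchParams class. The field name message comes from the question; everything else is an assumption for illustration.

```javascript
// Hypothetical helper: emulates the one round of URL-decoding a server
// applies to an application/x-www-form-urlencoded request body.
function serverSideDecode(body) {
  return new URLSearchParams(body).get('message');
}

const text = 'русский язык';

// Encoded once: the server's automatic decoding restores the text.
const encodedOnce = 'message=' + encodeURIComponent(text);

// Encoded twice: every '%' becomes '%25', so a single round of decoding
// still leaves percent-escapes behind.
const encodedTwice = 'message=' + encodeURIComponent(encodeURIComponent(text));

const receivedOnce = serverSideDecode(encodedOnce);
const receivedTwice = serverSideDecode(encodedTwice);

console.log(receivedOnce);  // the original text
console.log(receivedTwice); // still percent-encoded, like element [0] in the question
```

If the second value matches what shows up in $_POST['message'], the client is probably encoding once more than it needs to.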
Have a look at whatever is causing this request (an XMLHttpRequest?) and remove the excess call to encodeURIComponent(). For example, if you are using jQuery ajax() and passing in POST data as an object, jQuery will be calling encodeURIComponent() for you, and you don't need to do it yourself as well.
The garbled text in your log (such as руÑÑкий Ñзык) is UTF-8 misinterpreted as Windows code page 1252 (Western European, similar to ISO-8859-1).
Most likely you have successfully saved UTF-8 bytes to your log file, but whatever you're reading the log file in doesn't realise that it should be rendered as UTF-8.
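The effect can be reproduced in a few lines. This is a sketch assuming Node.js (it relies on Node's Buffer). It decodes as latin1 (ISO-8859-1) rather than code page 1252, which gives the same result for this string because the two encodings agree on bytes 0xA0 to 0xFF.

```javascript
// UTF-8 bytes for the first test string from the question (two bytes per character).
const utf8Bytes = Buffer.from('îûôó', 'utf8');

// Wrong: treat every byte as one ISO-8859-1 character, as a viewer
// that does not know the file is UTF-8 would.
const misread = utf8Bytes.toString('latin1');

// Right: decode the same bytes as UTF-8.
const correct = utf8Bytes.toString('utf8');

console.log(misread); // Ã®Ã»Ã´Ã³
console.log(correct); // îûôó
```

The bytes are identical in both cases; only the interpretation differs, which is why changing the viewer's encoding is enough to fix the display.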
The utf8_decode results in your first test only look right because the characters you used to test it also exist in code page 1252.
utf8_decode is misleadingly named; what it actually does is convert a UTF-8 byte sequence to an ISO-8859-1 byte sequence that would represent the same string. You usually want to work in UTF-8 and not ISO-8859-1, so you should in general avoid utf8_decode.
The question marks in the second test are understandable: Cyrillic characters don't exist in code page 1252.
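As a rough illustration of that conversion, the JavaScript sketch below emulates the visible effect of utf8_decode on already-decoded text. The function toLatin1Lossy is hypothetical and is not PHP's implementation: it keeps characters that fit in ISO-8859-1 (code points up to 0xFF) and substitutes one '?' for each character that does not.

```javascript
// Hypothetical emulation of the visible effect of PHP's utf8_decode:
// characters representable in ISO-8859-1 survive, all others become '?'.
function toLatin1Lossy(str) {
  // Array.from iterates by code point, so each character yields one result.
  return Array.from(str, (ch) => (ch.codePointAt(0) <= 0xff ? ch : '?')).join('');
}

console.log(toLatin1Lossy('îûôó'));         // îûôó
console.log(toLatin1Lossy('русский язык')); // ??????? ????
console.log(toLatin1Lossy('日本語'));        // ???
```

The conversion is lossy by design, so once the question marks appear the original characters cannot be recovered from that output.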
Assuming you are sending your error_log output to a file, and trying to read the file, stick with the plain UTF-8 bytes, and read your logs in a decent text editor that lets you see and choose the encoding; ideally a modern one that defaults to UTF-8. Alternatively you can persuade Notepad to read a Unicode file by saving as UTF-16 or UTF-8 and including a Byte Order Mark at the start. (It's kind of wrong to include a BOM in a UTF-8 file, but lots of tools in the Microsoft world do it.)