How to properly handle UTF-8 in PHP?


I'm trying to get our chat system to support UTF-8, but I'm failing. If, on the client side, I send the following message, passed through encodeURIComponent:

  • îûôó

And put this on the PHP end:

error_log(print_r(array(
    $_POST['message'],
    urldecode($_POST['message']),
    rawurldecode($_POST['message']),
    utf8_decode($_POST['message']),
    utf8_decode(urldecode($_POST['message'])),
    utf8_decode(rawurldecode($_POST['message']))
), true));

This is the output in my error log:

Array
(
    [0] => %C3%AE%C3%BB%C3%B4%C3%B3
    [1] => îûôó
    [2] => îûôó
    [3] => %C3%AE%C3%BB%C3%B4%C3%B3
    [4] => îûôó
    [5] => îûôó
)

So all is fine. However, if I use these, both copied from Wikipedia (Russian language and Japanese language pages, respectively):

  • русский язык
  • 日本語

It all goes to hell!

Array
(
    [0] => %D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8%D0%B9%20%D1%8F%D0%B7%D1%8B%D0%BA
    [1] => руÑÑкий Ñзык
    [2] => руÑÑкий Ñзык
    [3] => %D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8%D0%B9%20%D1%8F%D0%B7%D1%8B%D0%BA
    [4] => ??????? ????
    [5] => ??????? ????
)
Array
(
    [0] => %E6%97%A5%E6%9C%AC%E8%AA%9E
    [1] => 日本語
    [2] => 日本語
    [3] => %E6%97%A5%E6%9C%AC%E8%AA%9E
    [4] => ???
    [5] => ???
)

What do I need to do to make this work?


Best answer:
$_POST['message'], => [0] => %C3%AE%C3%BB%C3%B4%C3%B3

You have over-URL-encoded your input. The GET/POST/REQUEST superglobals have already taken care of URL-decoding input strings where necessary, so you should not need to URL-decode them manually.

Have a look at whatever is causing this request (an XMLHttpRequest?) and remove the excess call to encodeURIComponent(). For example, if you are using jQuery ajax() and passing in POST data as an object, jQuery will be calling encodeURIComponent() for you, so you don't need to do it yourself as well.
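A minimal sketch of the double-encoding effect described above, assuming PHP 7 or later. Here rawurlencode() stands in for the client's encodeURIComponent() (the two agree for these characters), and parse_str() stands in for the single round of URL-decoding PHP applies when it fills $_POST. The variable names are illustrative only.

```php
<?php
// "îûôó" as raw UTF-8 bytes, so the example does not depend on the file's encoding.
$message = "\xC3\xAE\xC3\xBB\xC3\xB4\xC3\xB3";

// One round of percent-encoding: what a form post or an AJAX library does on its own.
$encodedOnce = rawurlencode($message);

// A second round: the effect of an extra encodeURIComponent() call in the client code.
$encodedTwice = rawurlencode($encodedOnce);

// parse_str() decodes a query string once, as PHP does when populating $_POST.
parse_str('message=' . $encodedOnce, $postCorrect);
parse_str('message=' . $encodedTwice, $postDouble);

var_dump($postCorrect['message'] === $message);    // the original bytes come back
var_dump($postDouble['message'] === $encodedOnce); // still percent-encoded
```

With a single round of encoding, $_POST['message'] can be used directly, with no urldecode() call.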

urldecode($_POST['message']), => îûôó

This is UTF-8 misinterpreted as Windows code page 1252 (Western European, similar to ISO-8859-1).

Most likely you have successfully saved UTF-8 bytes to your log file, but whatever you're reading the log file in doesn't realise that it should be rendered as UTF-8.
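The misreading can be reproduced in PHP itself. This is only a sketch, assuming the mbstring extension is loaded: it treats the UTF-8 bytes as if they were Windows-1252 and re-encodes the result, which is roughly what a viewer with the wrong encoding setting does.

```php
<?php
$utf8 = "\xC3\xAE\xC3\xBB\xC3\xB4\xC3\xB3"; // "îûôó" encoded as UTF-8

// Pretend the bytes are Windows-1252, as a misconfigured viewer would,
// and re-encode the resulting characters as UTF-8 so they can be printed.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

echo $misread, PHP_EOL;       // each 2-byte character shows up as two characters
echo bin2hex($utf8), PHP_EOL; // the stored bytes themselves are unchanged
var_dump(mb_check_encoding($utf8, 'UTF-8')); // and they are still valid UTF-8
```

The data in the log is intact; only the interpretation of the bytes differs.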

utf8_decode(urldecode($_POST['message'])), => îûôó

This only works because the characters you have used to test it also exist in ISO-8859-1 (and in code page 1252, which shares those characters). utf8_decode is misleadingly named; what it actually does is convert a UTF-8 byte sequence to an ISO-8859-1 byte sequence that would represent the same string. You usually want to work in UTF-8 and not ISO-8859-1, so you should in general avoid utf8_decode.

русский язык => ??????? ????

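Understandable: Cyrillic characters don't exist in ISO-8859-1 or code page 1252, so utf8_decode replaces each one with a question mark. The same happens to the Japanese text.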
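One way to see which strings are damaged by a conversion to ISO-8859-1 is a round trip. The sketch below assumes PHP 7 or later with mbstring; survivesLatin1() is a hypothetical helper name, and mb_convert_encoding() is used in place of utf8_decode(), which is deprecated as of PHP 8.2.

```php
<?php
// Returns true when every character of $utf8 can be represented in ISO-8859-1.
function survivesLatin1(string $utf8): bool
{
    // Roughly the conversion utf8_decode() performs, written with mbstring.
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

    // Convert back; characters with no ISO-8859-1 equivalent were
    // substituted on the way in, so the result no longer matches.
    $roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

    return $roundTrip === $utf8;
}

// The three test strings from the question, written as UTF-8 byte escapes.
var_dump(survivesLatin1("\xC3\xAE\xC3\xBB\xC3\xB4\xC3\xB3"));             // îûôó
var_dump(survivesLatin1("\xD1\x80\xD1\x83\xD1\x81\xD1\x81\xD0\xBA\xD0\xB8\xD0\xB9")); // русский
var_dump(survivesLatin1("\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"));         // 日本語
```

Only the first string survives, which matches the log output in the question.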

Assuming you are sending your error_log output to a file, and trying to read the file, stick with the plain UTF-8 bytes, and read your logs in a decent text editor that lets you see and choose the encoding; ideally a modern one that defaults to UTF-8. Alternatively you can persuade Notepad to read a Unicode file by saving as UTF-16 or UTF-8 and including a Byte Order Mark at the start. (It's kind of wrong to include a BOM in a UTF-8 file, but lots of tools in the Microsoft world do it.)

Another answer:

Go UTF-8 across the whole stack:

  • Database tables
  • Database connection
  • PHP default character set setting
  • String functions

Database Tables:

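Set the db collation to utf8_unicode_ci.
Set all text/varchar fields to utf8_unicode_ci.
(In MySQL, utf8 is a 3-byte encoding that cannot store characters outside the Basic Multilingual Plane, such as emoji; if you need those, use utf8mb4 and utf8mb4_unicode_ci instead.)
Set the database connection to be UTF-8 by executing the following query: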

SET NAMES 'utf8'
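If the application connects through PDO, the connection charset can also be set in the DSN rather than with a separate query. This is a sketch under assumptions: MySQL via the pdo_mysql driver, a hypothetical helper buildMysqlDsn(), and placeholder host, database and credential names that are not from the original post.

```php
<?php
// Hypothetical helper: builds a pdo_mysql DSN that fixes the connection charset.
function buildMysqlDsn(string $host, string $dbName, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

$dsn = buildMysqlDsn('localhost', 'chat'); // placeholder host and database name
echo $dsn, PHP_EOL;

// Connecting needs a running MySQL server, so it is left commented out here:
// $pdo = new PDO($dsn, $user, $password);  // $user and $password are placeholders
// $pdo->exec("SET NAMES 'utf8'");          // the query from the text, as an alternative
```

Either approach makes the server treat the bytes sent over that connection as UTF-8; which one fits depends on the database library in use.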

PHP Charset

Use:

ini_set('default_charset', 'utf-8'); 

PHP String Functions

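Most of PHP's built-in string functions operate on bytes, not characters, so they miscount or split multi-byte UTF-8 characters. Where character boundaries matter, use the mb_* equivalents.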

e.g. mb_strlen instead of strlen
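A short sketch of the difference, assuming the mbstring extension is loaded and the file is saved as UTF-8. It uses the Japanese test string from the question.

```php
<?php
$word = '日本語'; // three characters, nine bytes in UTF-8

$byteCount = strlen($word);             // counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // counts characters

// substr() cuts on byte offsets and can split a character in half;
// mb_substr() cuts on character boundaries.
$firstByte = substr($word, 0, 1);
$firstChar = mb_substr($word, 0, 1, 'UTF-8');

var_dump($byteCount, $charCount);
var_dump(mb_check_encoding($firstByte, 'UTF-8')); // a lone lead byte is not valid UTF-8
var_dump($firstChar === '日');
```

Passing the encoding explicitly avoids depending on the mbstring internal encoding setting.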

HTML:

Set the charset with a meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">