mb_convert_encoding() with UTF-16 input in PHP >= 8.1


I'm updating a PHP app that imports CSV files encoded in UTF-16 (from Google Keyword Planner) and converts the values to UTF-8.

Up to PHP 8.0 this works as expected, but from PHP 8.1 onward a ? is appended to the values after the conversion from UTF-16 to UTF-8:

var_dump(mb_convert_encoding("\0008\0008\0000\000", "UTF-8", "UTF-16"));

// Output with PHP 8.1.3 - 8.1.13, 8.2.0:
// string(4) "880?"

// Output with PHP 7.4.32, 8.0.8 - 8.0.26:
// string(3) "880"

Best answer

Your source string is equal to "\x00\x38\x00\x38\x00\x30\x00", which is 7 bytes. That is an invalid length for UTF-16, where every character takes either 2 or 4 bytes.

  • PHP 7 (and PHP 8.0) silently accepted the first 6 bytes and dropped the 7th, so you were lucky,
  • while PHP 8.1 and later produce more correct output: the first 6 bytes decode as big-endian UTF-16 to "880", and the ? tells you that there is an incomplete 4th character, because only 1 byte is left for it.
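The length problem described above can be checked before converting. The following is a minimal sketch, not a fix prescribed by the answer: utf16_to_utf8_strict() is a hypothetical helper name (it is not part of mbstring), and rejecting odd-length input with an exception is an assumption about what the importing code would want.

```php
<?php
// Hypothetical helper (not part of mbstring): converts UTF-16 to UTF-8,
// but refuses input whose byte length is odd instead of converting it.
function utf16_to_utf8_strict(string $bytes, string $from = 'UTF-16'): string
{
    if (strlen($bytes) % 2 !== 0) {
        throw new InvalidArgumentException(
            'UTF-16 input must have an even byte length, got ' . strlen($bytes)
        );
    }
    return mb_convert_encoding($bytes, 'UTF-8', $from);
}

// The 7 bytes from the question, written without mixing escapes and literals.
$input = "\x00\x38\x00\x38\x00\x30\x00";

var_dump(strlen($input));  // int(7)
var_dump(bin2hex($input)); // string(14) "00380038003000"

try {
    utf16_to_utf8_strict($input);
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n"; // reports the odd length
}
```

Failing loudly like this makes a stray or truncated byte visible at import time, rather than showing up later as a ? in the stored values.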

Solution: provide proper input. You may also have misread the octal notation: "\000" is a single NUL byte, and the digit that follows it is a literal character. This is much easier to see if you don't mix escape sequences and literals:

approach                   | only 6 bytes (value '880')                 | make it 8 bytes (value '8800')
full hexadecimal notation  | "\x00\x38\x00\x38\x00\x30"                 | "\x00\x38\x00\x38\x00\x30\x00\x30"
mixed hexadecimal notation | "\x008\x008\x000"                          | "\x008\x008\x000\x000"
full octal notation        | "\000\070\000\070\000\060"                 | "\000\070\000\070\000\060\000\060"
mixed octal notation       | "\0008\0008\0000"                          | "\0008\0008\0000\0000"
concatenated (clearest)    | "\x00" . '8' . "\x00" . '8' . "\x00" . '0' | "\x00" . '8' . "\x00" . '8' . "\x00" . '0' . "\x00" . '0'
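To confirm that the notations in the 6-byte column really describe the same bytes, they can be dumped side by side. This is only an illustrative sketch; the array name and labels are made up for the example.

```php
<?php
// The five spellings of the 6-byte input, keyed by the notation used.
$sixByteForms = [
    'full hexadecimal'  => "\x00\x38\x00\x38\x00\x30",
    'mixed hexadecimal' => "\x008\x008\x000",
    'full octal'        => "\000\070\000\070\000\060",
    'mixed octal'       => "\0008\0008\0000",
    'concatenated'      => "\x00" . '8' . "\x00" . '8' . "\x00" . '0',
];

foreach ($sixByteForms as $label => $bytes) {
    // Every line reports 6 bytes and the hex dump 003800380030.
    printf("%-18s %d bytes  %s\n", $label, strlen($bytes), bin2hex($bytes));
}
```

Using bin2hex() together with strlen() is a quick way to see what a string literal actually contains when escapes and plain digits sit next to each other.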
Second answer

Avoid PHP and simply use MySQL's LOAD DATA INFILE. Be sure to set the character set to utf16 or utf16le, depending on the endianness.