PHP's mb_detect_encoding() doesn't understand the MacRoman encoding. My app lets users upload data in CSV format, and I need to convert it to UTF-8 myself because the users are not tech-savvy: I will never be able to get all of them to understand and control the encoding of their files.
This is what I'm doing:
// Candidate encodings, tried in this order with strict detection enabled.
$encoding_detection_order = array('UTF-8', 'UTF-7', 'ASCII', 'ISO-8859-1', 'EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP');
$encoding = mb_detect_encoding($value, $encoding_detection_order, true);
// Convert from whatever was detected to UTF-8.
$converted_value = iconv($encoding, 'UTF-8//TRANSLIT', $value);
This works great for most situations, but if my user is on a Mac and saves the CSV in MacRoman encoding, the above code will usually wrongly detect the text as ISO-8859-1, which causes iconv() to produce bad output.
For example, the accented e in Jaimé has a hex value of 0x8e in MacRoman. In ISO-8859-1 as most software treats it (that is, Windows-1252; strict ISO-8859-1 assigns 0x8e to a control character), 0x8e is Ž, so when I convert it to UTF-8 I just get the UTF-8 version of Ž when I should be getting é.
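To make the mismatch concrete, here is a minimal sketch that decodes the same bytes under two different assumed source encodings. The encoding names 'MACINTOSH' and 'CP1252' are assumptions about what the local iconv build accepts (both glibc and GNU libiconv normally do). I use CP1252 rather than ISO-8859-1 here because Ž lives at 0x8e in Windows-1252, while strict ISO-8859-1 maps that byte to a control character.

```php
<?php
// "Jaimé" as it would be stored in a MacRoman file: é is the single byte 0x8e.
$bytes = "Jaim\x8e";

// Decode the same bytes under two different assumptions about the source.
$asMacRoman = iconv('MACINTOSH', 'UTF-8', $bytes); // 0x8e -> é (U+00E9)
$asCp1252   = iconv('CP1252', 'UTF-8', $bytes);    // 0x8e -> Ž (U+017D)

// Print the UTF-8 bytes as hex so the result does not depend on the terminal.
echo bin2hex($asMacRoman), "\n"; // 4a61696dc3a9
echo bin2hex($asCp1252), "\n";   // 4a61696dc5bd
```

Both conversions succeed without any error, which is exactly why the wrong guess goes unnoticed: every byte is a valid character in both encodings.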
I need to be able to dynamically differentiate MacRoman from other encodings so that I convert it properly.
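To illustrate the kind of differentiation I mean, here is a rough sketch of one possible heuristic, not a proven solution. It decodes the bytes with each candidate encoding and prefers the one that yields more non-ASCII lowercase letters, on the assumption that accented characters in names are usually lowercase. The function name guess_single_byte_encoding() is hypothetical, and the iconv names 'MACINTOSH' and 'CP1252' are assumptions about the local iconv build.

```php
<?php
// Hypothetical helper: pick the candidate encoding whose decoding contains
// the most non-ASCII lowercase letters. Returns null if no candidate decodes.
function guess_single_byte_encoding(string $bytes, array $candidates = ['MACINTOSH', 'CP1252']): ?string
{
    $best = null;
    $bestScore = -1;

    foreach ($candidates as $candidate) {
        // iconv() returns false (with a notice) on bytes the encoding does not define.
        $utf8 = @iconv($candidate, 'UTF-8', $bytes);
        if ($utf8 === false) {
            continue;
        }

        // Count characters that are both non-ASCII and lowercase letters.
        $score = preg_match_all('/(?![\x00-\x7F])\p{Ll}/u', $utf8);
        if ($score === false) {
            continue;
        }

        // Strictly greater: on a tie, the earlier candidate in the list wins.
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $candidate;
        }
    }

    return $best;
}

echo guess_single_byte_encoding("Jaim\x8e"), "\n"; // MacRoman é vs. Windows-1252 Ž
echo guess_single_byte_encoding("Jaim\xe9"), "\n"; // MacRoman È vs. Windows-1252 é
```

This would only make sense as a fallback after the input has already failed UTF-8 validation, and pure-ASCII input always ties, so the first candidate is returned in that case. It will also guess wrong on text where uppercase accented letters dominate.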