PHP cannot parse CSV correctly (file is in UTF-16LE)


I am trying to parse a CSV file using PHP.
The file uses commas as the delimiter and double quotes around fields that contain commas, for example:

foo,"bar, baz",foo2

The problem is that fields containing commas are split at the internal comma. I get:

  • "2
  • rue du ..."

Instead of: 2, rue du ....


Encoding:
The file doesn't seem to be in UTF-8. It has weird characters at the beginning (apparently not a BOM; they look like this when converted from ASCII to UTF-8: ÿþ) and accents are not displayed.

  • My code editor (Atom) reports the encoding as UTF-16 LE
  • mb_detect_encoding() on the CSV lines returns ASCII

But conversion fails:

  • mb_convert_encoding() works when converting from ASCII, but returns Asian characters when converting from UTF-16LE
  • iconv() returns Notice: iconv(): Wrong charset, conversion from UTF-16LE/ASCII to UTF8 is not allowed.

Parsing:
I first tried to parse the file with this one-liner using str_getcsv():

$csv = array_map('str_getcsv', file($file['tmp_name']));

I then tried with fgetcsv() :

$arr = [];
$f = fopen($file['tmp_name'], 'r');
while (($l = fgetcsv($f)) !== false) {
    $arr[] = $l;
}
fclose($f);

Either way, my address field comes back in two parts. But when I try this code sample, the fields are parsed correctly:

$str = 'foo,"bar, baz",foo2,azerty,"ban, bal",doe';
$data = str_getcsv($str);
echo '<pre>' . print_r($data, true) . '</pre>';
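The difference between the two cases can be reproduced without the uploaded file. The following is a minimal sketch, assuming the mbstring extension is available; it re-encodes the sample line from the question as UTF-16LE and parses it once as raw bytes and once after conversion. The variable names are illustrative only.

```php
<?php
// Illustration only: the sample line from the question, re-encoded as UTF-16LE.
$utf8Line  = 'foo,"bar, baz",foo2';
$utf16Line = mb_convert_encoding($utf8Line, 'UTF-16LE', 'UTF-8');

// Parse the raw UTF-16LE bytes. Every character is followed by a NUL byte,
// so each field after a comma starts with NUL rather than with the quote.
$rawFields = str_getcsv($utf16Line, ',', '"', '\\');

// Convert to UTF-8 first, then parse: the quoted field stays in one piece.
$convertedFields = str_getcsv(
    mb_convert_encoding($utf16Line, 'UTF-8', 'UTF-16LE'),
    ',',
    '"',
    '\\'
);

var_dump(count($rawFields));   // expected to be more than 3
print_r($convertedFields);
```

On the raw bytes the quote is no longer the first byte of its field, so it is most likely not treated as an enclosure and the field is cut at the internal comma.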

To sum up, my questions are:

  • What are the characters at the beginning of the file?
  • How can I be sure about the encoding? (Atom reads the file as UTF-16 LE and doesn't display weird characters at the beginning)
  • What makes the CSV parsing functions fail?
  • If I should rely on something else to parse the lines of the CSV, what could I use?

There are 2 best solutions below

Best answer:

I finally solved it myself:

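I submitted the file to online encoding detection websites, which reported UTF-16LE. Reading up on UTF-16LE, I learned that such files can start with a BOM (Byte Order Mark): the bytes FF FE, which show up as ÿþ when read as a single-byte encoding.
My previous attempts used file(), which returns an array of the file's lines, and fopen() with fgetcsv(), which returns a resource that is still read line by line. Either way, each line was handled separately, still as raw UTF-16LE bytes.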
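To check for a BOM without relying on an editor or a website, the first few bytes of the file can be compared against the known signatures. This is a minimal sketch; the function name detectBomEncoding is hypothetical, and it only recognises files that actually start with a BOM.

```php
<?php
/**
 * Hypothetical helper: guess the encoding from a leading byte order mark.
 * Returns null when no known BOM is present.
 */
function detectBomEncoding(string $bytes): ?string
{
    // Longer signatures first: the UTF-32LE BOM starts with the UTF-16LE one.
    $signatures = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($signatures as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

// "fo" encoded as UTF-16LE, preceded by the FF FE byte order mark.
echo detectBomEncoding("\xFF\xFEf\x00o\x00"), "\n"; // prints UTF-16LE
```

For an uploaded file, the first four bytes would be enough, for example file_get_contents($file['tmp_name'], false, null, 0, 4).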

The solution that occurred to me was to convert the whole file at once instead of converting each line separately. Here is a working version:

$f = file_get_contents($file['tmp_name']);          // Get the whole file as string
$f = mb_convert_encoding($f, 'UTF-8', 'UTF-16LE');  // Convert the file to UTF-8
$f = preg_split("/\R/u", $f);                       // Split it by line breaks (u: treat input as UTF-8)
$f = array_map('str_getcsv', $f);                   // Parse lines as CSV data

The address fields are no longer split at their internal commas.

Second answer:

Thank you for your answer, this is working better for me than the other solutions I found.

I came up with a small improvement: it removes the first two bytes of the file (the BOM), then reads and CSV-parses the file line by line, for when each record needs to be processed individually.

Removing those first two bytes proved necessary for me; otherwise str_getcsv() did not handle the first field of the first line correctly.

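    $f = substr(file_get_contents($file['tmp_name']), 2);   // Drop the 2-byte BOM (FF FE)
    $f = mb_convert_encoding($f, 'UTF-8', 'UTF-16LE');      // Convert the rest to UTF-8
    $f = preg_split("/\R/u", $f);                           // Split into lines
    foreach ($f as $line) {
      // Tab-delimited here; pass ',' instead for comma-separated data
      $record = str_getcsv($line, "\t", '"');
      //...work with the record...
    }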
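Both answers split the converted text on line breaks before parsing, which would also split a quoted field that itself contains a line break. One possible variation, sketched here under the assumption that the file fits in memory and that mbstring is available, writes the converted text to a php://temp stream and lets fgetcsv() find the record boundaries. The function name parseUtf16LeCsv and its parameters are hypothetical.

```php
<?php
/**
 * Hypothetical helper: read a UTF-16LE CSV file into an array of UTF-8 rows.
 * The delimiter is a parameter because the two answers use different ones.
 */
function parseUtf16LeCsv(string $path, string $delimiter = ','): array
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }

    // Drop the UTF-16LE BOM (bytes FF FE) if the file starts with one.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $raw = substr($raw, 2);
    }
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

    // Put the converted text in a temporary stream so fgetcsv() can read it.
    $stream = fopen('php://temp', 'r+');
    fwrite($stream, $utf8);
    rewind($stream);

    $rows = [];
    while (($row = fgetcsv($stream, 0, $delimiter, '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv() returns [null] for a blank line
        }
        $rows[] = $row;
    }
    fclose($stream);

    return $rows;
}
```

With the upload from the question this would be called as parseUtf16LeCsv($file['tmp_name']), or with "\t" as the second argument for tab-delimited data.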