Symfony 4.2 utf8 data cut off when saving to MySQL

518 Views Asked by At

I'm saving a heading from a CSV file to the database.

Viewed with less on Ubuntu the file starts like this:

    Date,Supermarket,Speciality,Takeaway,Caf<E9>/restaurant
    1/06/2019,0.039175903,-0.01496395,0.03603785,0.029072835
    1/07/2019,0.039399919,-0.008250166,0.022385733,0.015478668

The heading data is ($csvHeader)

Array
(
    [0] => Date
    [1] => Supermarket
    [2] => Speciality
    [3] => Takeaway
    [4] => Caf�/restaurant
)

ord(substr($csvHeader,3,1)) === 233

This is read with the following function

protected function getCsvHeaders()
{
    $fh = fopen( $this->getCsvPath(), 'r+' );
    $firstrow = fgetcsv( $fh );
    fclose( $fh );
    return $firstrow;
}

This is saved to a table DataConfiguration:

$dataConf
  ->setColumns(serialize($csvHeader));

which is set to utf8mb4:

    show create table data_configuration;
    +--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Table              | Create Table                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
    +--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | data_configuration | CREATE TABLE `data_configuration` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `data_set_id` int(11) NOT NULL,
    `file_type_id` int(11) NOT NULL,
    `columns` varchar(7500) COLLATE utf8mb4_unicode_ci NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `unique_idx` (`data_set_id`,`file_type_id`),
    KEY `IDX_54A0B1FD70053C01` (`data_set_id`),
    KEY `IDX_54A0B1FD9E2A35A8` (`file_type_id`),
    CONSTRAINT `FK_54A0B1FD70053C01` FOREIGN KEY (`data_set_id`) REFERENCES `data_set` (`id`),
    CONSTRAINT `FK_54A0B1FD9E2A35A8` FOREIGN KEY (`file_type_id`) REFERENCES `file_type` (`id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=13176 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci |

Doctrine seems to be configured for utf8mb4 as well:

doctrine:
    dbal:
        # configure these for your database server
        driver: 'pdo_mysql'
#        server_version: '5.7'
        charset: utf8mb4
        default_table_options:
            charset: utf8mb4
            collate: utf8mb4_unicode_ci

        url: '%env(resolve:DATABASE_URL)%'
        options:
            1001: true

However the data gets cut off at the utf8 character and subsequent unserialize fails. I can reproduce this on my Ubuntu 18/ AWS RDS environment as well as my local MacOS/Brew environment.

What other avenues can I explore to solve this problem?

1

There are 1 best solutions below

7
On

You are parsing a text file with fgetcsv(). Its documentation states you can encounter issues when using single-byte encoded files:

The locale settings are taken into account by this function. If LC_CTYPE is e.g. en_US.UTF-8, files in one-byte encodings may be read wrongly by this function.

https://www.php.net/manual/en/function.fgetcsv.php

If your file contains French characters that exist beyond the basic ASCII table, you may set this variable to another value:

  1. List installed locales : sh locale -a

en_US.utf8

fr_FR.iso885915

...

The output may vary. I cannot tell you a locale that's guaranteed to exist on your machine. You have to pick something like ISO-8859-1, Windows-1252, not UTF-8.

  1. Before calling fgetcsv(), set the locale to something that matches the file encoding:
setlocale(LC_CTYPE, 'fr_FR.iso885915');
  1. Call fgetcsv()

Alternatively, you can manually convert the encoding:

$row_utf8 = mb_convert_encoding($row_raw, "Windows-1252", "UTF-8");

What does file your.csv gives?