I'm saving a heading from a CSV file to the database.
Viewed with less on Ubuntu the file starts like this:
Date,Supermarket,Speciality,Takeaway,Caf<E9>/restaurant
1/06/2019,0.039175903,-0.01496395,0.03603785,0.029072835
1/07/2019,0.039399919,-0.008250166,0.022385733,0.015478668
The heading data is ($csvHeader)
Array
(
[0] => Date
[1] => Supermarket
[2] => Speciality
[3] => Takeaway
[4] => Caf�/restaurant
)
ord(substr($csvHeader,3,1)) === 233
This is read with the following function
protected function getCsvHeaders()
{
$fh = fopen( $this->getCsvPath(), 'r+' );
$firstrow = fgetcsv( $fh );
fclose( $fh );
return $firstrow;
}
This is saved to a table DataConfiguration:
$dataConf
->setColumns(serialize($csvHeader));
which is set to utf8mb4:
show create table data_configuration;
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| data_configuration | CREATE TABLE `data_configuration` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`data_set_id` int(11) NOT NULL,
`file_type_id` int(11) NOT NULL,
`columns` varchar(7500) COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_idx` (`data_set_id`,`file_type_id`),
KEY `IDX_54A0B1FD70053C01` (`data_set_id`),
KEY `IDX_54A0B1FD9E2A35A8` (`file_type_id`),
CONSTRAINT `FK_54A0B1FD70053C01` FOREIGN KEY (`data_set_id`) REFERENCES `data_set` (`id`),
CONSTRAINT `FK_54A0B1FD9E2A35A8` FOREIGN KEY (`file_type_id`) REFERENCES `file_type` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=13176 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci |
Doctrine seems to be configured for utf8mb4 as well:
doctrine:
dbal:
# configure these for your database server
driver: 'pdo_mysql'
# server_version: '5.7'
charset: utf8mb4
default_table_options:
charset: utf8mb4
collate: utf8mb4_unicode_ci
url: '%env(resolve:DATABASE_URL)%'
options:
1001: true
However the data gets cut off at the utf8 character and subsequent unserialize fails. I can reproduce this on my Ubuntu 18/ AWS RDS environment as well as my local MacOS/Brew environment.
What other avenues can I explore to solve this problem?
You are parsing a text file with
fgetcsv()
. Its documentation states you can encounter issues when using single-byte encoded files:https://www.php.net/manual/en/function.fgetcsv.php
If your file contains French characters that exist beyond the basic ASCII table, you may set this variable to another value:
sh locale -a
The output may vary. I cannot tell you a locale that's guaranteed to exist on your machine. You have to pick something like ISO-8859-1, Windows-1252, not UTF-8.
fgetcsv()
, set the locale to something that matches the file encoding:fgetcsv()
Alternatively, you can manually convert the encoding:
What does
file your.csv
gives?