Cloud SQL: Python Client API defaults to utf8mb3 character set on CSV file import

886 Views Asked by At

I am using Google Cloud's import API to load CSV file in Google Cloud Storage to a Cloud SQL MySQL 5.7 database:

https://cloud.google.com/sql/docs/mysql/admin-api/v1beta4/instances/import

The import API generates a LOAD DATA INFILE command to MySQL database that assumes 'utf8' character set. However, the data in my CSV files uses 'utf8mb4' encoding which is a superset of 'utf8'. This causes the load process to fail when some strings cannot be encoded to 'utf8':

Exception: CloudSQL Exception: {'kind': 'sql#operation', 'selfLink': 'https://www.googleapis.com/sql/v1beta4/projects/***', 'targetProject': '***', 'targetId': '***', 'targetLink': 'https://www.googleapis.com/sql/v1beta4/projects/***', 'name': '0211c99e-0633-42f1-9ee1-069473308273', 'operationType': 'IMPORT', 'status': 'RUNNING', 'user': '***', 'insertTime': '2019-01-14T02:36:39.861Z', 'startTime': '2019-01-14T02:36:39.972Z', 'error': {'kind': 'sql#operationErrors', 'errors': [{'kind': 'sql#operationError', 'code': 'ERROR_RDBMS', 'message': "Import CSV error: Error 1300: Invalid utf8 character string: ''Afikanisitani|'Apekanikana|A Phu Han (Afghanistan)|A Phú Hãn '\n"}]}, 'importContext': {'kind': 'sql#importContext', 'uri': '***', 'database': '**', 'importUser': '', 'csvImportOptions': {'table': '***'}}}

Relevant piece: "'message': "Import CSV error: Error 1300: Invalid utf8 character string:"

Is there a way to add 'utf8mb4' CHARACTER SET using the import API (or any other character set for that matter)?

I tried experimenting with adding 'character set': 'utf8mb4' in 'csvImportOptions' dictionary, but it appears import API only expects 'table' and 'columns' keys in that dict.

Note that if I directly run the LOAD DATA INFILE command from MySQL client, I am able to import the CSV with no issues:

LOAD DATA INFILE 'myCSVFile.csv'
INTO TABLE 'my table'
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
from pprint import pprint

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()

service = discovery.build('sqladmin', 'v1beta4', credentials=credentials)

# Project ID of the project that contains the instance.
project = 'my-project'  # TODO: Update placeholder value.

# Cloud SQL instance ID. This does not include the project ID.
instance = 'my-instance'  # TODO: Update placeholder value.

instances_import_request_body = {
    "importContext": {
        "kind": "sql#importContext",
        "fileType": "CSV",
        "uri": gcs_uri,
        "database": database,
        "csvImportOptions": {
            "table": table
        }
    }
}

request = service.instances().import_(project=project, instance=instance, body=instances_import_request_body)
response = request.execute()

ADDITIONAL DATA POINT It is clear to me that the LOAD DATA INFILE query that Google's API is generating is defaulting to 'utf8' character set.

Fails with the same error message as the API

LOAD DATA INFILE 'problematic.csv'
INTO TABLE my_table
**CHARACTER SET utf8**
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\"'

ERROR 1300 (HY000): Invalid utf8 character string: ''Afikanisitani|'Apekanikana|A Phu Han (Afghanistan)|A Phú Hãn '

Works:

LOAD DATA INFILE 'problematic.csv'
INTO TABLE my_table
**CHARACTER SET utf8mb4**
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\"'

Query OK, 75641 rows affected (1.14 sec)
Records: 75641  Deleted: 0  Skipped: 0  Warnings: 0

The documentation here is incorrect: https://cloud.google.com/sql/docs/mysql/import-export/importing

  LOAD DATA INFILE ... ***CHARACTER SET 'utf8mb4'***
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\"'.
1

There are 1 best solutions below

8
On

The solution is to set the following flag on your Google Cloud SQL instance:

character-set-server: utf8mb4

Configuring Database Flags

CLI Command:

gcloud sql instances patch [INSTANCE_NAME] --database-flags character-set-server=utf8mb4

The database flags are listed under the settings as the collection databaseFlags:

gcloud sql instances describe [INSTANCE_NAME]

Warning: Any flag not included in the patch command will be set back to its default value.