How can I import CSV data with newline characters in its values into a Google BigQuery table?


My bq load command and error

$ bq load --replace --source_format=CSV --autodetect temp.BQTable ./source.csv
Upload complete.
Waiting on bqjob_123 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'my-project-id-123:bqjob_123': Error
while reading data, error message: Error detected while parsing row starting at position: 333. Error: Missing close
double quote (") character.
Failure details:
- Error while reading data, error message: CSV processing encountered
too many errors, giving up. Rows: 0; errors: 1; max bad: 0; error
percent: 0

Excel and the Node.js module I use for CSV serialization can both handle values containing newlines just fine.

Papa Parse, the npm module I use for serialization, uses '\r\n' as its default row delimiter, so there shouldn't be any ambiguity there (see the sketch after the sample below).

Here's some sample CSV data, i.e., source.csv:

id,name
1,"this value
includes a newline"
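
For reference, here's a minimal sketch of how that sample could be produced with Papa Parse (the rows array is hypothetical, made up to match the sample above):

const Papa = require('papaparse');

// Hypothetical row matching the sample CSV above.
const rows = [{ id: 1, name: 'this value\nincludes a newline' }];

// Papa Parse wraps any field containing \r or \n in double quotes
// and separates rows with '\r\n' by default.
const csv = Papa.unparse(rows, { newline: '\r\n' });
console.log(csv);
// id,name
// 1,"this value
// includes a newline"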

bq and Cloud SDK versions

$ gcloud components update


Your current Cloud SDK version is: 289.0.0
You will be upgraded to version: 337.0.0

┌─────────────────────────────────────────────────────────────────────────────┐
│                      These components will be updated.                      │
├─────────────────────────────────────────────────────┬────────────┬──────────┤
│                         Name                        │  Version   │   Size   │
├─────────────────────────────────────────────────────┼────────────┼──────────┤
│ BigQuery Command Line Tool                          │     2.0.67 │  < 1 MiB │
│ BigQuery Command Line Tool (Platform Specific)      │     2.0.65 │  < 1 MiB │
│ Cloud SDK Core Libraries                            │ 2021.04.16 │ 18.0 MiB │
│ Cloud SDK Core Libraries (Platform Specific)        │ 2021.03.12 │  < 1 MiB │
│ Cloud Storage Command Line Tool                     │       4.61 │  3.9 MiB │
│ Cloud Storage Command Line Tool (Platform Specific) │       4.59 │  < 1 MiB │
│ gcloud cli dependencies                             │ 2021.04.16 │ 10.8 MiB │
│ gcloud cli dependencies                             │ 2021.04.16 │  < 1 MiB │
└─────────────────────────────────────────────────────┴────────────┴──────────┘

I was still on version 289 because newer versions were patched to stop you from importing a file (CSV, JSON, etc.) of 100 MB or larger into BigQuery using bq load :(
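
As an aside, if that local-upload size cap is the only reason to stay on the old SDK, loading the file from Cloud Storage should sidestep it, since the limit applies to local uploads rather than gs:// sources (gs://my-bucket is a hypothetical staging bucket):

$ gsutil cp ./source.csv gs://my-bucket/source.csv
$ bq load --replace --source_format=CSV --autodetect temp.BQTable gs://my-bucket/source.csv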

2 Answers

Accepted Answer

Add the flag below to your command:

--allow_quoted_newlines=true
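
Applied to the command from the question, that would be:

$ bq load --replace --source_format=CSV --autodetect --allow_quoted_newlines=true temp.BQTable ./source.csv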

Not sure if this will work with \r\n, but it definitely works for me with \n-encoded newlines.

Hope it helps.

Another Answer

Remove all newline and carriage return characters from all string values before writing the CSV.

JavaScript

const _ = require('lodash');

const rows = [{
  id: 1,
  name: `this value
includes a newline`,
}];

// Use map() and keep the result: forEach() discards the objects
// returned by _.mapValues, so the original version of this code
// never actually changed anything.
const cleaned = rows.map(row =>
  _.mapValues(row, value => typeof value === 'string'
    ? value.replace(/[\r\n]+/g, ' ')
    : value
  )
);

This isn't a good solution, since it discards data and assumes those newlines weren't important. Let's hope there are no newlines in your primary or foreign key fields.