I have multiple CSV files stored in a Google Cloud Storage bucket. I have created an Airflow DAG task that takes all the files in that bucket and creates one external table, since they all follow the same schema, just with different data. Now I am running into a problem trying to query this data. The error I am receiving is:
Error while reading table: irm-eap-edp-ingestion-nonprod.route4me.routes_external, error message: Missing close quote character (").; line_number: 78 byte_offset_to_start_of_line: 19620 column_index: 7 column_name: "destination_name" column_type: STRING value: "Survey # 87/1, 87..." File: gs://irm-eap-edp-ingestion-nonprod-us-archive/route4me_data/Detailed-Routes-2023-08-29.csv
So it is saying that in the 2023-08-29 file there is a problem with the "destination_name" column.
I opened this file in VSCode, and on that line the destination_name value is spread over four lines:
"Survey # 87/1, 87/2, 88, 89/1 and 122, Byranahalli Village, Kasaba Hobli, Nelamangala Taluk, Bangalore North – 562123"
So I imagine BigQuery is tripping over the multiple newlines in this record. However, in my task I have enabled "allow_quoted_newlines": True, which I assumed would have fixed exactly this problem of newlines inside quotes. Is this not the correct option? What am I doing wrong? Here is my Airflow task:
create_external_table = BigQueryUpsertTableOperator(
    task_id=f"create_external_{TABLE}_table",
    dataset_id=DATASET,
    project_id=INGESTION_PROJECT_ID,
    table_resource={
        "tableReference": {"tableId": f"{TABLE}_external"},
        "externalDataConfiguration": {
            "sourceFormat": "CSV",
            "schema": {"fields": schema},
            "allow_quoted_newlines": True,
            "allow_jagged_rows": True,
            "autodetect": True,
            "sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
        },
        "labels": labeler.get_labels_bigquery_table_v2(
            target_project=INGESTION_PROJECT_ID,
            target_dataset=DATASET,
            target_table=f"{TABLE}_external",
        ),
    },
)
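One thing I am not sure about: the BigQuery Tables REST resource, which table_resource is supposed to mirror, spells these options in camelCase and nests them under csvOptions, rather than taking snake_case keys at the top level of externalDataConfiguration. A sketch of just that fragment as I understand the REST shape (not my working config; schema, ARCHIVE_BUCKET, and DATASET are the same names as in the task above):

# Sketch of the externalDataConfiguration value using the REST
# resource's own field names: CSV parsing options live in a nested
# csvOptions object with camelCase keys.
external_data_configuration = {
    "sourceFormat": "CSV",
    "schema": {"fields": schema},
    "csvOptions": {
        "allowQuotedNewlines": True,
        "allowJaggedRows": True,
    },
    "autodetect": True,
    "sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
}

Is that the shape the operator expects here, or does it translate my snake_case keys, meaning the problem is elsewhere?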