Error creating an external table in BQ off partitioned parquet files

195 Views Asked by At

I'm trying to create an external table in BigQuery off partitioned parquet data found in GCS. I'm doing this by running the following SQL statement:

CREATE OR REPLACE EXTERNAL TABLE `project-id.poc.parquet_test`
WITH PARTITION COLUMNS
OPTIONS (
uris = ['gs://poc/beam_parquet_poc_ii/*.parquet'],
format = 'PARQUET',
hive_partition_uri_prefix = 'gs://poc/beam_parquet_poc_ii',
require_hive_partition_filter = false);

However, I am getting this error message when I run this:

Error while reading table: parquet_test, error message: Failed to add partition key CREATED_DATE (type: TYPE_DATE) to schema, because another column with the same name was already present. This is not allowed. Full partition schema: [CREATED_DATE:TYPE_DATE, PIPELINE_NAME:TYPE_STRING].

The data is stored in GCS with paths like:

gs://poc/beam_parquet_poc_ii/CREATED_DATE=2023-08-17/PIPELINE_NAME=poc-1692308092558/00000-of-00002.parquet 
gs://poc/beam_parquet_poc_ii/CREATED_DATE=2023-08-17/PIPELINE_NAME=poc-1692308092558/00001-of-00002.parquet 
...

The data looks like this:

                         TRANSACTION_ID  EVENT_ID  \
0  2cd36716-5e36-4924-9895-812a19d9c64e  3688390      

STORE_NBR  ITEM_NBR  \
     5649  654509568   

         PIPELINE CREATED_DATE  
poc-1692308092558   2023-08-17  
...

Any thoughts on how to fix this? Is it that I cant have the columns I'm partitioning by inside the underlying parquet files?

1

There are 1 best solutions below

0
On

After talking to some people, I realized that the underlying data in the parquet files cannot have columns named the same as what I'm using in the partitioning scheme. So I dropped CREATED_DATE & PIPELINE_NAME and that solved the problem