I'm trying to create an external table in BigQuery off partitioned parquet data found in GCS. I'm doing this by running the following SQL statement:
CREATE OR REPLACE EXTERNAL TABLE `project-id.poc.parquet_test`
WITH PARTITION COLUMNS
OPTIONS (
uris = ['gs://poc/beam_parquet_poc_ii/*.parquet'],
format = 'PARQUET',
hive_partition_uri_prefix = 'gs://poc/beam_parquet_poc_ii',
require_hive_partition_filter = false);
However, I am getting this error message when I run this:
Error while reading table: parquet_test, error message: Failed to add partition key CREATED_DATE (type: TYPE_DATE) to schema, because another column with the same name was already present. This is not allowed. Full partition schema: [CREATED_DATE:TYPE_DATE, PIPELINE_NAME:TYPE_STRING].
The data is stored in GCS with paths like:
gs://poc/beam_parquet_poc_ii/CREATED_DATE=2023-08-17/PIPELINE_NAME=poc-1692308092558/00000-of-00002.parquet
gs://poc/beam_parquet_poc_ii/CREATED_DATE=2023-08-17/PIPELINE_NAME=poc-1692308092558/00001-of-00002.parquet
...
The data looks like this:
TRANSACTION_ID EVENT_ID \
0 2cd36716-5e36-4924-9895-812a19d9c64e 3688390
STORE_NBR ITEM_NBR \
5649 654509568
PIPELINE CREATED_DATE
poc-1692308092558 2023-08-17
...
Any thoughts on how to fix this? Is it that I cant have the columns I'm partitioning by inside the underlying parquet files?
After talking to some people, I realized that the underlying data in the parquet files cannot have columns named the same as what I'm using in the partitioning scheme. So I dropped CREATED_DATE & PIPELINE_NAME and that solved the problem