Amazon S3 parquet file - Transferring to GCP / BQ


Good morning everyone. I have a GCS bucket containing files that were transferred from our Amazon S3 bucket. The files are in .gz.parquet format. I am trying to set up a transfer from the GCS bucket to BigQuery using the BigQuery Data Transfer Service, but I am running into issues with the Parquet file format.

When I create a transfer and specify the file format as Parquet, I receive an error stating that the data is not in Parquet format. When I specify CSV instead, the transfer completes but weird values appear in my table, as shown in the linked screenshot: Results 2

I have tried the following URIs:

  • bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.parquet, file format Parquet. Result: error, file not in Parquet format.

  • bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.gz.parquet, file format Parquet. Result: error, file not in Parquet format.

  • bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.gz.parquet, file format CSV. Result: transfer completes, but weird values.

  • bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.parquet, file format CSV. Result: transfer completes, but weird values.
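For reference, a quick way to check what one of the objects actually is would be to read its first bytes: Parquet files begin with the 4-byte magic string PAR1, while a gzip stream begins with the bytes 1f 8b. A sketch assuming gsutil is available, with a hypothetical object name in place of a real one:

# hypothetical object name; substitute one of the real files in the bucket
gsutil cat -r 0-3 gs://bucket-name/folder-1/folder-2/dt=2021-01-01/b=1/geo/part-00000.gz.parquet | xxd

If that prints PAR1, the objects are genuine Parquet (and the .gz in the name likely just refers to Parquet's internal compression codec); if it prints 1f 8b, the whole file is externally gzip-compressed, which BigQuery cannot read as Parquet, and the files would need to be decompressed first.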

Does anyone have any idea how I should proceed? Thank you in advance!


There are 2 best solutions below


There is dedicated documentation explaining how to load Parquet data from a Cloud Storage bucket into BigQuery, linked below. Could you please go through it and let us know if it still does not solve your problem?

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
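For example, a minimal Parquet load with the bq tool, following the pattern on that page, would look like the sketch below (the dataset, table, and the concrete date in the URI are placeholders to adapt):

bq load \
  --source_format=PARQUET \
  mydataset.mytable \
  "gs://bucket-name/folder-1/folder-2/dt=2021-01-01/b=1/geo/*.parquet"

Note this loads one run's files; the {run_time|...} templating in your URIs only works inside the Data Transfer Service.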

Regards, Anbu.


Judging by the look of your URIs, the page you are looking for is the BigQuery documentation on loading hive-partitioned Parquet files.

You can try something like the following in Cloud Shell:

# replace dataset.table with your own dataset and table;
# the URI pattern reuses the bucket and folders from the question
bq load --source_format=PARQUET --autodetect \
  --hive_partitioning_mode=STRINGS \
  --hive_partitioning_source_uri_prefix=gs://bucket-name/folder-1/folder-2/ \
  dataset.table \
  "gs://bucket-name/folder-1/folder-2/*"
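With --hive_partitioning_mode=STRINGS, the partition keys in the path (dt and b here) are loaded as STRING columns; AUTO would infer their types instead. The wildcard URI at the end has to fall under the --hive_partitioning_source_uri_prefix so that the keys after the prefix can be detected.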