Why a temporary GCS bucket is needed to write a dataframe to BigQuery: pyspark


Recently I faced an issue while writing dataframe data into BigQuery using pyspark. Here it is:

pyspark.sql.utils.IllegalArgumentException: u'Temporary or persistent GCS bucket must be informed

After researching the issue, I found that a temporary GCS bucket has to be set in spark.conf:

bucket = "temp_bucket"
spark.conf.set('temporaryGcsBucket', bucket)
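
For context, a minimal sketch of what the full write looks like once the bucket is configured (the bucket, project, dataset, and table names here are placeholders, not from the original post, and it assumes the spark-bigquery connector jar is already on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-write-example").getOrCreate()

# Placeholder name; must be an existing GCS bucket the job can write to.
spark.conf.set("temporaryGcsBucket", "temp_bucket")

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Default (indirect) write: data is staged in the temp bucket, then loaded into BigQuery.
df.write.format("bigquery") \
    .option("table", "my_project.my_dataset.my_table") \
    .mode("append") \
    .save()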

I think BigQuery has no concept of a file backing a table, the way Hive does.

I would like to know more about this: why do we need a temporary GCS bucket to write data into BigQuery?

I searched for the reason behind this but couldn't find it.

Please clarify.

There is 1 answer below.


The Spark BigQuery connector has two write modes (writeMethod) for writing data into BigQuery: 1. Direct, 2. Indirect. The parameter is optional, and the default is Indirect.

Indirect
You can specify the indirect mode explicitly with option("writeMethod", "indirect"), although it is the default. This mode requires you to specify a temporary GCS bucket; if you don't, you get the error above. The reason the temporary bucket is needed:

The connector writes the data to BigQuery by first buffering all the data into a Cloud Storage temporary table. Then it copies all data from Cloud Storage into BigQuery in one operation.

Taken from the GCS Spark example docs here.
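
A rough sketch of an indirect write in PySpark, passing the bucket as a write option instead of through spark.conf (the table and bucket names are placeholders, and it assumes a DataFrame df and the spark-bigquery connector are already available):

# Indirect write: the connector stages the data in the given GCS bucket,
# then runs a single BigQuery load job from that staging location.
df.write.format("bigquery") \
    .option("writeMethod", "indirect") \
    .option("temporaryGcsBucket", "temp_bucket") \
    .mode("append") \
    .save("my_project.my_dataset.my_table")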

Direct

In this mode, the data is written directly to BigQuery using the BigQuery Storage Write API.

In Scala (and likewise in PySpark) you can specify it with option("writeMethod", "direct"), which eliminates the need for a temporary bucket.
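
Since the question is about PySpark, a similar sketch for the direct mode (again, the table name is a placeholder and df is assumed to exist; connector support for this option depends on the connector version):

# Direct write: rows are sent to BigQuery via the Storage Write API,
# so no temporary GCS bucket is required.
df.write.format("bigquery") \
    .option("writeMethod", "direct") \
    .mode("append") \
    .save("my_project.my_dataset.my_table")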

You can read more about the BigQuery connector here.