Recently I faced an issue while writing DataFrame data into BigQuery using PySpark. Here it is:
pyspark.sql.utils.IllegalArgumentException: u'Temporary or persistent GCS bucket must be informed
After researching the issue, I found that a temporary GCS bucket has to be set in spark.conf:

bucket = "temp_bucket"
spark.conf.set('temporaryGcsBucket', bucket)
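
For context, here is a minimal sketch of the full write I ended up with (the bucket and table names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-write").getOrCreate()

# Stage the data in this GCS bucket before it is loaded into BigQuery
spark.conf.set('temporaryGcsBucket', "temp_bucket")

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

df.write \
    .format("bigquery") \
    .option("table", "my_dataset.my_table") \
    .mode("append") \
    .save()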
I think BigQuery, unlike Hive, has no concept of a file backing a table.
I would like to know why we need a temporary GCS bucket to write data into BigQuery.
I searched for the reason behind this but couldn't find it.
Please clarify.
The Spark BigQuery connector has two write modes (writeMethod) for writing data into BigQuery: 1. Direct, 2. Indirect. writeMethod is an optional parameter, and the default is Indirect.
Indirect
You can specify the indirect option like this:

option("writeMethod", "indirect")

It is optional, and indirect is the default. This mode requires you to specify a temporary GCS bucket; if you don't, you will get the error above. The temporary bucket is needed because, in indirect mode, the connector first writes the DataFrame as files to the GCS bucket and then loads those files into BigQuery. Taken from the GCS Spark example docs here.
Direct
In this method the data is written directly to BigQuery using the BigQuery Storage Write API.
In Scala you can specify it like this:

option("writeMethod", "direct")

which eliminates the need for a temporary bucket. You can read more about the BigQuery connector here.