Upload a PySpark DataFrame to BigQuery as a Dataproc job

I'm trying to submit a PySpark job to a Dataproc cluster. My PySpark job uploads a DataFrame to BigQuery. When I submit it through the cluster's submit job option, the job fails with an error. But when I provide this jar, "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar", in the jar files parameter of submit job, the job executes successfully.

What I want is a way to avoid providing this jar at run time and to run the job by giving the location of the .py file alone. How can I do that? Is it somehow possible to specify this jar within the .py file itself?

I tried the approach below to provide the jar in the .py file itself, but it doesn't seem to work.

from pyspark.sql import SparkSession

# Attempt to pull in the BigQuery connector jar at session creation time.
spark = SparkSession.builder.master('yarn') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
    .appName('df-to-bq-sample').enableHiveSupport().getOrCreate()

Can anyone suggest a way to do this? I do not want to use the CLI for it. Thank you!

1 Answer

First of all, the mentioned jar (the spark-bigquery connector) is required when reading from or writing to BigQuery. If you don't want to add it at job submission, you can install the BigQuery connector jar at cluster creation using the connectors init action, like this:

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \
    --metadata spark-bigquery-connector-version=0.24.2
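With the connector installed on the cluster this way, you can submit the job by pointing at the .py file alone, and the session builder no longer needs a spark.jars setting. A minimal sketch of what the session setup could then look like (reusing the app name from the question):

from pyspark.sql import SparkSession

# The BigQuery connector was installed at cluster creation via the init action,
# so no spark.jars configuration is needed here.
spark = SparkSession.builder \
    .master('yarn') \
    .appName('df-to-bq-sample') \
    .enableHiveSupport() \
    .getOrCreate()

The connector version is pinned once through the spark-bigquery-connector-version metadata key, so the dependency is resolved per cluster rather than per job.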