Upload a PySpark DataFrame to BigQuery as a Dataproc job

I'm trying to submit a PySpark job to a Dataproc cluster. My PySpark job uploads a DataFrame to BigQuery. When I submit it through the cluster's submit job option, the job fails with an error. But when I provide this jar, "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar", in the jar files parameter of submit job, the job executes successfully.

What I want is a way to avoid providing this jar at run time and to run the job by giving the location of the .py file alone. How can I do that? Is it somehow possible to specify this jar within the .py file itself?

I tried the approach below to provide the jar in the .py file itself, but it doesn't seem to work.

from pyspark.sql import SparkSession

# Attempt to pull in the BigQuery connector jar at session creation time.
spark = SparkSession.builder.master('yarn') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
    .appName('df-to-bq-sample').enableHiveSupport().getOrCreate()

Can anyone suggest a way to do this? I do not want to use the CLI for it. Thank you!

1 Answer

First of all, the mentioned jar (the spark-bigquery connector) is required when reading from or writing to BigQuery. If you don't want to add it at job submission, you can install the BigQuery connector jar at cluster creation using the connectors init action, like this:

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \
    --metadata spark-bigquery-connector-version=0.24.2
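With the connector installed on the cluster this way, you can submit the job by pointing at the .py file alone, and the session builder no longer needs a spark.jars setting. A minimal sketch of what the session setup could then look like (reusing the app name from the question):

from pyspark.sql import SparkSession

# The BigQuery connector was installed at cluster creation via the init action,
# so no spark.jars configuration is needed here.
spark = SparkSession.builder \
    .master('yarn') \
    .appName('df-to-bq-sample') \
    .enableHiveSupport() \
    .getOrCreate()

The connector version is pinned once through the spark-bigquery-connector-version metadata key, so the dependency is resolved per cluster rather than per job.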