Read from BigQuery into Spark in an efficient way?


When using the BigQuery Connector to read data from BigQuery, I found that it first copies all the data to Google Cloud Storage and then reads it into Spark in parallel. When reading a big table, the copy stage takes a very long time. Is there a more efficient way to read data from BigQuery into Spark?

Another question: reading from BigQuery consists of two stages (copying to GCS, then reading from GCS in parallel). Is the copy stage affected by the Spark cluster size, or does it take a fixed amount of time?
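
For context, here is a rough sketch of the read path being described, using the Hadoop BigQuery connector from PySpark (the project id and bucket name are placeholders). The connector first exports the table to the configured GCS path, then Spark reads the exported files in parallel:

    from pyspark import SparkContext

    sc = SparkContext()

    conf = {
        "mapred.bq.project.id": "my-project",            # placeholder project id
        "mapred.bq.gcs.bucket": "my-staging-bucket",     # placeholder staging bucket
        "mapred.bq.temp.gcs.path": "gs://my-staging-bucket/bq-tmp",
        "mapred.bq.input.project.id": "bigquery-public-data",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    # The export-to-GCS stage happens inside this call, before the parallel read.
    table_rdd = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf)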

3 Answers

BEST ANSWER

Maybe a Googler will correct me, but AFAIK that's the only way. This is because under the hood it also uses the BigQuery Connector for Hadoop, which, according to the docs:

The BigQuery connector for Hadoop downloads data into your Google Cloud Storage bucket before running a Hadoop job.

As a side note, this is also true when using Dataflow - it too performs an export of BigQuery table(s) to GCS first and then reads them in parallel.

As for whether the copy stage (which is essentially a BigQuery export job) is influenced by your Spark cluster size, or takes a fixed time: neither. The export runs on BigQuery's own resources, not your Spark cluster, so cluster size has no effect on it, and its duration is nondeterministic rather than fixed.
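
To illustrate, the copy stage is the same kind of job you could trigger yourself with the BigQuery Python client. A minimal sketch, with a placeholder bucket name (the export runs entirely on BigQuery's side):

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application default credentials

    # Export the table to GCS as newline-delimited JSON, which is roughly
    # what the connector does behind the scenes before Spark reads the files.
    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)
    extract_job = client.extract_table(
        "bigquery-public-data.samples.shakespeare",    # source table
        "gs://my-staging-bucket/shakespeare-*.json",   # placeholder bucket
        job_config=job_config)
    extract_job.result()  # blocks until the export finishes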

ANSWER

I strongly suggest verifying whether you really need to move the data from BigQuery storage into the Spark engine at all.
BigQuery comes with its own compute and storage capabilities, so consider leveraging native BigQuery compute: it is free if you are on a fixed-slot billing model, and it is in no way inferior to Spark's computation capabilities. If your Spark pipelines do more than ingestion, prefer to move the pre-aggregation, enrichment, and ETL directly into BigQuery; it will perform better, cost less, and be easier to manage. BigQuery is a serverless service, so you don't need to predict the number of nodes required to process the data if volumes change abruptly.
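
As a sketch of that approach with the spark-bigquery-connector: push the aggregation down into BigQuery and read only the (much smaller) result into Spark. The query and dataset names here are placeholders; materializationDataset must be an existing dataset the connector can write temporary results to:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-pushdown").getOrCreate()

    # Required for query-based reads: the connector materializes the query
    # result into a temporary table in this dataset before reading it.
    spark.conf.set("viewsEnabled", "true")
    spark.conf.set("materializationDataset", "tmp_dataset")  # placeholder

    # BigQuery does the heavy lifting; Spark only receives the aggregated rows.
    df = (spark.read.format("bigquery")
          .option("query", """
              SELECT word, SUM(word_count) AS total
              FROM `bigquery-public-data.samples.shakespeare`
              GROUP BY word
          """)
          .load())
    df.show()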

Another downside of Spark here is cost:

  1. Storage API usage adds a lot of cost when you are working with large datasets (Dataproc/Dataflow use the Storage API to read data from BigQuery).
  2. Dataproc node cost.
  3. Dataproc service cost.
  4. Optionally, any reserved BigQuery slot cost is wasted, as you won't be using the slots.

ANSWER

The spark-bigquery-connector uses the BigQuery Storage API, which is super fast.
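
A minimal read with the connector (the connector jar must be on the classpath, e.g. via --packages); it streams rows directly over the Storage API, with no GCS staging step:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-read").getOrCreate()

    # Reads directly over the BigQuery Storage API -- no export to GCS first.
    df = (spark.read.format("bigquery")
          .option("table", "bigquery-public-data.samples.shakespeare")
          .load())
    df.printSchema()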