Read from BigQuery into Spark in an efficient way?


When using the BigQuery Connector to read data from BigQuery, I found that it first copies all the data to Google Cloud Storage and then reads it into Spark in parallel. When reading a big table, the copy stage takes a very long time. Is there a more efficient way to read data from BigQuery into Spark?

Another question: reading from BigQuery consists of two stages (copying to GCS, then reading from GCS in parallel). Is the copy stage affected by the Spark cluster size, or does it take a fixed amount of time?
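
For context, here is a rough sketch of the read path being described, using the Hadoop BigQuery connector from PySpark (the project id and bucket name are placeholders). The connector first exports the table to the configured GCS path, then Spark reads the exported files in parallel:

    from pyspark import SparkContext

    sc = SparkContext()

    conf = {
        "mapred.bq.project.id": "my-project",            # placeholder project id
        "mapred.bq.gcs.bucket": "my-staging-bucket",     # placeholder staging bucket
        "mapred.bq.temp.gcs.path": "gs://my-staging-bucket/bq-tmp",
        "mapred.bq.input.project.id": "bigquery-public-data",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    # The export-to-GCS stage happens inside this call, before the parallel read.
    table_rdd = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf)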

3 Answers

BEST ANSWER

Maybe a Googler will correct me, but AFAIK that's the only way. This is because under the hood it also uses the BigQuery Connector for Hadoop, which, according to the docs:

The BigQuery connector for Hadoop downloads data into your Google Cloud Storage bucket before running a Hadoop job.

As a side note, this is also true when using Dataflow - it too performs an export of BigQuery table(s) to GCS first and then reads them in parallel.

As for whether the copy stage (which is essentially a BigQuery export job) is influenced by your Spark cluster size, or takes a fixed time: neither. The export runs on BigQuery's own resources, not your Spark cluster, so cluster size has no effect on it, and its duration is nondeterministic rather than fixed.
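
To illustrate, the copy stage is the same kind of job you could trigger yourself with the BigQuery Python client. A minimal sketch, with a placeholder bucket name (the export runs entirely on BigQuery's side):

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application default credentials

    # Export the table to GCS as newline-delimited JSON, which is roughly
    # what the connector does behind the scenes before Spark reads the files.
    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)
    extract_job = client.extract_table(
        "bigquery-public-data.samples.shakespeare",    # source table
        "gs://my-staging-bucket/shakespeare-*.json",   # placeholder bucket
        job_config=job_config)
    extract_job.result()  # blocks until the export finishes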

ANSWER

I strongly suggest verifying whether you really need to move the data from BigQuery storage into the Spark engine at all.
BigQuery comes with its own compute and storage capabilities, so consider leveraging native BigQuery compute: it is free if you are on a fixed-slot billing model, and it is in no way inferior to Spark's computation capabilities. If your Spark pipelines do more than ingestion, prefer to move the pre-aggregation, enrichment, and ETL directly into BigQuery; it will perform better, cost less, and be easier to manage. BigQuery is a serverless service, so you don't need to predict the number of nodes required to process the data if volumes change abruptly.
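
As a sketch of that approach with the spark-bigquery-connector: push the aggregation down into BigQuery and read only the (much smaller) result into Spark. The query and dataset names here are placeholders; materializationDataset must be an existing dataset the connector can write temporary results to:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-pushdown").getOrCreate()

    # Required for query-based reads: the connector materializes the query
    # result into a temporary table in this dataset before reading it.
    spark.conf.set("viewsEnabled", "true")
    spark.conf.set("materializationDataset", "tmp_dataset")  # placeholder

    # BigQuery does the heavy lifting; Spark only receives the aggregated rows.
    df = (spark.read.format("bigquery")
          .option("query", """
              SELECT word, SUM(word_count) AS total
              FROM `bigquery-public-data.samples.shakespeare`
              GROUP BY word
          """)
          .load())
    df.show()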

Another downside of Spark here is cost:

  1. Storage API usage adds a lot of cost when you are working with large datasets (Dataproc/Dataflow use the Storage API to read data from BigQuery).
  2. Dataproc node cost.
  3. Dataproc service cost.
  4. Optionally, any reserved BigQuery slot cost is wasted, as you won't be using the slots.

ANSWER

The spark-bigquery-connector uses the BigQuery Storage API, which is super fast.
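
A minimal read with the connector (the connector jar must be on the classpath, e.g. via --packages); it streams rows directly over the Storage API, with no GCS staging step:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-read").getOrCreate()

    # Reads directly over the BigQuery Storage API -- no export to GCS first.
    df = (spark.read.format("bigquery")
          .option("table", "bigquery-public-data.samples.shakespeare")
          .load())
    df.printSchema()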