I'm pretty new to Databricks. I have it running on Google Cloud.
I have been able to connect to BigQuery, where I have my tables, and bring in some data using SparkR:
library(SparkR)
sparkR.session()
df <- loadDF(
  path = "project_id.dataset_id.table_id",
  source = "bigquery",
  header = "true",
  inferSchema = "true"
)
This works and returns a SparkDataFrame:
df
SparkDataFrame[variable:string, position:bigint, label:string, measurement_level:string, wave_id:bigint, country_id:bigint]
I'm trying to do it using sparklyr but something is not working:
library(sparklyr)
sc <- spark_connect(method = "databricks")
df <- spark_read_bigquery(
  sc,
  name = "my_table",
  projectId = project_id,
  datasetId = dataset_id,
  tableId = table_id
)
This returns an error:
com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException:
Unable to provision, see the following errors:
1) Error in custom provider, java.lang.NullPointerException: null value in
entry: bq.staging_dataset.gcs_bucket=null at
com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:65)
while locating com.google.cloud.spark.bigquery.SparkBigQueryConfig
1 error
Run `sparklyr::spark_last_error()` to see the full Spark error (multiple lines)
To use the previous style of error message set
`options("sparklyr.simple.errors" = TRUE)`
I can see that bq.staging_dataset.gcs_bucket=null. Does that mean it is not finding the bucket? It worked with SparkR, though.
Am I doing something wrong?
There are two things you have to do for your sparklyr connection to work:
1. Add library(sparkbq) to your code. You might have to install it, because it does not come preinstalled by default on Databricks clusters.
2. Set your connection defaults with sparkbq::bigquery_defaults(). Your error says that you didn't specify a gcsBucket to store temporary files. You will also need to specify a serviceAccountKeyFile, a datasetLocation ("US" or "EU") and a type ("direct" and "avro" are preferable).

You probably are not a GCP admin, so go to the GCP console and navigate to Buckets, which you can do by typing in the search bar. There you can select a bucket and copy its name. Analogously, navigate to Service Accounts, click on an account and copy/paste its ID.
Your code should look like this:
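This is only a sketch based on the steps above: the bucket name, service account key file path, and project/dataset/table IDs are placeholders for your own values, and billingProjectId is just the GCP project that gets billed for the queries.

# install.packages("sparkbq")  # if it is not already installed on the cluster
library(sparklyr)
library(sparkbq)

# Set the BigQuery defaults once per session. gcsBucket is the staging bucket
# the connector uses for temporary files -- this is what was missing in your setup.
bigquery_defaults(
  billingProjectId      = "my-gcp-project",                    # placeholder
  gcsBucket             = "my-staging-bucket",                 # placeholder: bucket name from the GCP console
  datasetLocation       = "US",                                # or "EU", depending on where your dataset lives
  serviceAccountKeyFile = "/dbfs/keys/my-service-account.json",# placeholder path to the key file
  type                  = "direct"                             # "direct" or "avro" are preferable
)

sc <- spark_connect(method = "databricks")

df <- spark_read_bigquery(
  sc,
  name = "my_table",
  projectId = "my-gcp-project",  # placeholder
  datasetId = "my_dataset",      # placeholder
  tableId = "my_table"           # placeholder
)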
Finally, in the docs of spark_read_bigquery() you can find a similar example with an explanation of the arguments passed to bigquery_defaults().
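For a quick sanity check after the read, spark_read_bigquery() returns a sparklyr table you can use with dplyr; something like this, assuming the df from above:

library(dplyr)

df %>%
  head(10) %>%  # translated to a LIMIT on the Spark side
  collect()     # brings the rows back into a local R data frame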