When using bigrquery from R, the tidyverse API asks for authentication, which, once provided, allows BigQuery to be accessed from the R session. Downloading results can be time-consuming, and if the OAuth token expires mid-transfer the download fails:
Error: Invalid Credentials [authError]
Example code
# install.packages("tidyverse"); install.packages('bigrquery')
Sys.setenv(BIGQUERY_TEST_PROJECT="your-gcp-project-id")
library(tidyverse)
library(dbplyr)
library(bigrquery)
billing <- bq_test_project()
connection <- dbConnect(
  bigrquery::bigquery(),
  project = "your-gcp-project-id",
  dataset = "dataset-name",
  billing = billing
)
mytable <- tbl(connection, "mytable")
mytable %>%
  # -- some heavy dplyr wrangling --
  collect()
Which returns
Running job [/] 20s
Complete
Billed 400GB
Downloading 230,000,000 rows
Downloading data [=====--------] 28% ETA: 24h
but after some time
Error: Invalid Credentials [authError]
Question
How can the length of time before the OAuth token expires be increased?
Not a complete answer, but the details of my research so far to save people from retracing my steps.
Timeout does not appear controllable via dbplyr
dbConnect accepts both the driver and arguments to pass to the driver (documentation): dbConnect(drv, ...)

timeout: some connection types accept a timeout argument. This issue has an example using Cassandra: con <- dbConnect(odbc::odbc(), "Cassandra (DSN)", timeout = 10)

bigquery: the documentation lists the following arguments (project, dataset, billing, page_size, quiet, use_legacy_sql, bigint) and notes that other arguments are currently ignored.

So given the above, it seems unlikely that the timeout can be controlled by R via dbplyr, DBI or the connection.

Splitting the query into multiple shorter queries
While not the OP's preference (the comments make that clear), this is still a potential solution. I use an approach of filtering based on a unique ID column, with a wrapper function to reduce some of the additional clutter.
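Something along these lines (a sketch only: mytable, id_column and the chunking parameters are illustrative names, and it assumes the libraries and connection object from the example code above):

# Build a lazy query restricted to one slice of the ID range, so that each
# slice can be downloaded in a separate, shorter query.
# mytable, id_column, chunk_index and chunk_size are illustrative names.
get_chunk <- function(connection, chunk_index, chunk_size) {
  lower <- (chunk_index - 1) * chunk_size
  upper <- chunk_index * chunk_size

  tbl(connection, "mytable") %>%
    filter(id_column >= !!lower, id_column < !!upper)
    # -- some heavy dplyr wrangling would go here, as in the question --
}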
We then iterate as follows:
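(Again only a sketch; DEVELOPER_MODE and the chunk settings are illustrative of the pattern rather than my exact values.)

DEVELOPER_MODE <- TRUE

# Use only a couple of chunks while developing; the full set otherwise.
num_chunks <- if (DEVELOPER_MODE) 2 else 100
chunk_size <- 1e6   # width of the ID range covered by each chunk

results <- vector("list", num_chunks)
for (ii in seq_len(num_chunks)) {
  # Each collect() downloads one modest chunk, so no single transfer
  # runs long enough for the OAuth token to expire mid-download.
  results[[ii]] <- get_chunk(connection, ii, chunk_size) %>% collect()
}

full_results <- bind_rows(results)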
I set DEVELOPER_MODE to TRUE when testing/developing, and to FALSE when I want the entire thing to run.

Other avenues to consider
Consider how heavy the -- heavy dplyr wrangling here -- step is. Because dbplyr does not always translate to very efficient SQL code, in my work on SQL Server, saving intermediate tables has cut hours off my runtimes. Given that downloading 10GB should be much faster than several hours, the bottleneck could be BigQuery conducting all the wrangling on the fly (and the initial 20-second execution only reflecting lazy evaluation). This link suggests there is a six-hour limit on the duration of a single execution.
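For instance, something like the following might help by materialising the wrangled result on the BigQuery side before downloading it. This is a sketch only: I have not verified how compute() behaves against the bigrquery backend, and the table name is illustrative.

# Materialise the wrangled query as a table in BigQuery, then download from
# that table, so the download is not also paying for on-the-fly wrangling.
# Assumes compute() is supported by the bigrquery/dbplyr backend.
staged <- mytable %>%
  # -- heavy dplyr wrangling here --
  compute(name = "my_intermediate_table", temporary = FALSE)

staged %>% collect()

If that works, the expensive wrangling and the long download are decoupled, and a failed download can be retried without re-running the query.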