dplyr syntax for arrow to sum columns specified in a variable

Working in R, I would like Arrow to sum a set of columns specified in a variable.

library(arrow) 
library(dplyr)

example_data = InMemoryDataset$create(data.frame(a1 = c(1,2,3), b2=c(4,5,6), c3=c(7,8,9)))
cols_to_sum = c('a1','b2','c3')

Arrow is capable of doing this:

example_data %>% mutate(computed_sum = a1+b2+c3)  %>% compute()

#Succeeds

However, I would like to pass the variable rather than name the columns explicitly. The dplyr syntax I'd usually use for this does not work with Arrow:

example_data %>% 
  mutate(computed_sum = rowSums(across(all_of(cols_to_sum))))  %>% 
  compute()

#Error: Expression rowSums(across(all_of(cols_to_sum))) not supported in Arrow
#Call collect() first to pull data into R.

Rebuilding the literal expression as a string and running it through parse() and eval() does work, but it seems like a cumbersome workaround for what should be a common operation:

temp_expression =  parse( text=paste(cols_to_sum, collapse = '+') )
example_data %>% 
  mutate(computed_sum = eval(temp_expression) )  %>% 
  compute()

#Succeeds

However, the same approach fails when the expression is built inline, without the explicit temporary variable:

example_data %>% 
  mutate(computed_sum = eval( parse( text=paste(cols_to_sum, collapse = '+') ) ) )  %>% 
  compute()

#Error: Expression eval(parse(text = paste(cols_to_sum, collapse = "+"))) not supported in Arrow                                                                               
#Call collect() first to pull data into R. 

What is the correct/best/intended way to use Arrow's R interface to specify recursive computations (e.g., sum) on columns listed in a variable? Do I need to build strings and eval() them to make this happen?

Non-Arrow solutions won't work for me. I am working with data far too large for memory, distributed as hive-partitioned parquets and accessed by Arrow's open_dataset().

Best answer:

I'm not sure why, but if you wrap the reduction in a function (named or anonymous), Arrow will accept it. The sum is written here with Reduce, and the column names are spliced in as symbols with !!!syms():

library(arrow) 
library(dplyr)

example_data = InMemoryDataset$create(data.frame(a1 = c(1,2,3), b2=c(4,5,6), c3=c(7,8,9)))
cols_to_sum = c('a1','b2','c3')

f <- function(...) Reduce(`+`, list(...))

example_data %>%
  mutate(computed_sum = f(!!!syms(cols_to_sum))) %>%
  collect()
#>   a1 b2 c3 computed_sum
#> 1  1  4  7           12
#> 2  2  5  8           15
#> 3  3  6  9           18

# calling directly errors out
example_data %>% mutate(computed_sum = Reduce(`+`, syms(cols_to_sum)))
#> Error: Expression Reduce(`+`, syms(cols_to_sum)) not supported in Arrow
#> Call collect() first to pull data into R.

# anonymous functions do work
example_data %>% mutate(computed_sum = (function(...) Reduce(`+`, list(...)))(!!!syms(cols_to_sum)))
#> InMemoryDataset (query)
#> a1: double
#> b2: double
#> c3: double
#> computed_sum: double (add_checked(add_checked(a1, b2), c3))
#> 
#> See $.data for the source Arrow object
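
Another option, offered as a sketch rather than a confirmed "intended" way: build the whole sum as an unevaluated call in plain R first, then inject it into mutate() with !!. This avoids both the string round-trip through parse() and the wrapper function, and Arrow should only ever see ordinary column arithmetic. The name sum_expr is made up for this example, and it assumes every entry in cols_to_sum is a valid column name in the dataset.

```r
library(arrow)
library(dplyr)

example_data <- InMemoryDataset$create(
  data.frame(a1 = c(1, 2, 3), b2 = c(4, 5, 6), c3 = c(7, 8, 9))
)
cols_to_sum <- c("a1", "b2", "c3")

# Fold the column symbols into a single unevaluated call: (a1 + b2) + c3.
# Nothing is computed here; sum_expr is just an R language object.
sum_expr <- Reduce(function(x, y) call("+", x, y), syms(cols_to_sum))

# Inject the finished expression with !!, so the query is the same as
# writing mutate(computed_sum = a1 + b2 + c3) by hand.
result <- example_data %>%
  mutate(computed_sum = !!sum_expr) %>%
  collect()
```

Printing sum_expr before running the query is a quick way to check what will be sent to Arrow. On a dataset that does not fit in memory you would presumably finish with compute() or write_dataset() instead of collect().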