Working in R, I would like Arrow to sum a set of columns specified in a variable.
library(arrow)
library(dplyr)
example_data = InMemoryDataset$create(data.frame(a1 = c(1,2,3), b2=c(4,5,6), c3=c(7,8,9)))
cols_to_sum = c('a1','b2','c3')
Arrow is capable of doing this:
example_data %>% mutate(computed_sum = a1+b2+c3) %>% compute()
#Succeeds
However, I would like to pass the variable rather than name the columns explicitly. The dplyr syntax I'd usually use for this does not work with Arrow:
example_data %>%
  mutate(computed_sum = rowSums(across(all_of(cols_to_sum)))) %>%
  compute()
#Error: Expression rowSums(across(all_of(cols_to_sum))) not supported in Arrow
#Call collect() first to pull data into R.
Reconstructing the literal input string with parse() and eval() does work, but it seems like a cumbersome workaround for what should be a common operation:
temp_expression = parse( text=paste(cols_to_sum, collapse = '+') )
example_data %>%
  mutate(computed_sum = eval(temp_expression)) %>%
  compute()
#Succeeds
However, the same approach fails when the expression is built inline instead of being stored in a temporary variable first:
example_data %>%
  mutate(computed_sum = eval(parse(text = paste(cols_to_sum, collapse = '+')))) %>%
  compute()
#Error: Expression eval(parse(text = paste(cols_to_sum, collapse = "+"))) not supported in Arrow
#Call collect() first to pull data into R.
What is the correct/best/intended way to use Arrow's R interface to specify recursive computations (e.g., sum) on columns listed in a variable? Do I need to build strings and eval() them to make this happen?
Non-Arrow solutions won't work for me. I am working with data far too large for memory, distributed as hive-partitioned parquets and accessed by Arrow's open_dataset().
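One route that avoids pasting and parsing strings is to build the call a1 + b2 + c3 programmatically and splice it into mutate() with the !! injection operator, so that Arrow only ever sees plain arithmetic. The following is a minimal sketch rather than a documented idiom: it assumes that arrow's dplyr verbs honour !! injection (they capture their arguments as quosures), and sum_expr is a name introduced here purely for illustration.

```r
library(arrow)
library(dplyr)

example_data <- InMemoryDataset$create(
  data.frame(a1 = c(1, 2, 3), b2 = c(4, 5, 6), c3 = c(7, 8, 9))
)
cols_to_sum <- c("a1", "b2", "c3")

# Turn each column name into a symbol, then fold the symbols into one
# nested call: (a1 + b2) + c3. Nothing is evaluated at this point.
# sum_expr is a hypothetical name used only for this sketch.
sum_expr <- Reduce(
  function(x, y) call("+", x, y),
  lapply(cols_to_sum, as.name)
)

# Assumption: !! splices the prebuilt call into the mutate() expression
# before Arrow translates it, so Arrow receives `a1 + b2 + c3`.
result <- example_data %>%
  mutate(computed_sum = !!sum_expr) %>%
  compute()
```

Since the expression is built once from cols_to_sum and the work stays inside Arrow, the same pattern should in principle carry over to a dataset opened with open_dataset(). Swapping "+" for another binary operator that Arrow supports would change the reduction accordingly.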
I'm not sure why, but if you wrap the recursive code in a function (named or anonymous), Arrow will let you run it (it can also be written more simply with Reduce):