data.table's GForce - Apply multiple functions to multiple columns (with optional arguments)


My goal is to apply multiple functions to multiple columns AND to have GForce turned on.

Say I have the data.table below:

library(data.table)

df <- data.table(fruit = c('a', 'a', 'a', 'b')
                 , revenue = 1:4
                 , profit = c(2,NA,4,5)
                 ); df

   fruit revenue profit
1:     a       1      2
2:     a       2     NA
3:     a       3      4
4:     b       4      5

and I wanted to apply multiple functions to multiple columns (all except fruit):

# functions
y <- \(i) {c(min(i, na.rm = T)
             , max(i, na.rm = T)
             )
           }

# apply
df[, lapply(.SD, y)
   , fruit
   , verbose = T
   ]

Finding groups using forderv ... forder.c received 4 rows and 1 columns
0.000s elapsed (0.000s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
lapply optimization changed j from 'lapply(.SD, y)' to 'list(y(revenue), y(profit))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... 
  memcpy contiguous groups took 0.000s for 2 groups
  eval(j) took 0.012s for 2 calls
0.020s elapsed (0.020s cpu) 

   fruit revenue profit
1:     a       1      2
2:     a       3      4
3:     b       4      5
4:     b       4      5

Now, the above works! However, notice it said (GForce FALSE). So GForce was NOT on.

I think this is because, as Waldi pointed out, GForce is NOT used when the function is wrapped in an anonymous function such as \(i) sum(i). So I tried the version below, which leaves min and max bare and passes na.rm = T only through lapply:

# functions
z <- \(i) {c(min
             , max
              )
           }

# apply
df[, lapply(.SD, z, na.rm = T)
   , fruit
   , verbose = T
   ]

Finding groups using forderv ... forder.c received 4 rows and 1 columns
0.000s elapsed (0.000s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
lapply optimization changed j from 'lapply(.SD, z, na.rm = T)' to 'list(z(revenue, na.rm = T), z(profit, na.rm = T))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... Error in z(revenue, na.rm = T) : unused argument (na.rm = T)

This time the call fails with the error shown above: Error in z(revenue, na.rm = T) : unused argument (na.rm = T)

Any help would be much appreciated

Roland On BEST ANSWER

From help("gforce"):

Expressions in j which contain only the functions min, max, mean, median, var, sd, sum, prod, first, last, head, tail (for example, DT[, list(mean(x), median(x), min(y), max(y)), by=z]), they are very effectively optimised using what we call GForce. These functions are automatically replaced with a corresponding GForce version with pattern g*, e.g., prod becomes gprod.

You are obviously not passing an expression containing these functions. They are hidden (to data.table's gforce optimization) inside the y function.
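To make the contrast concrete, here is a minimal sketch on the question's data. It assumes the verbose log contains the "GForce TRUE" / "GForce FALSE" text shown in the question's output; `wrapped_min` and `uses_gforce` are hypothetical helper names introduced only for this illustration.

```r
library(data.table)

dt <- data.table(fruit   = c("a", "a", "a", "b"),
                 revenue = 1:4,
                 profit  = c(2, NA, 4, 5))

# Hypothetical wrapper: min() is hidden from the optimiser inside it
wrapped_min <- function(i) min(i, na.rm = TRUE)

# Capture the verbose log of each call; the assignment inside
# capture.output() still creates res_wrapped / res_direct here
log_wrapped <- capture.output(
  res_wrapped <- dt[, lapply(.SD, wrapped_min), by = fruit, verbose = TRUE]
)
log_direct <- capture.output(
  res_direct <- dt[, lapply(.SD, min, na.rm = TRUE), by = fruit, verbose = TRUE]
)

# Hypothetical helper: TRUE if the log reports that GForce was used
uses_gforce <- function(log) any(grepl("GForce TRUE", log, fixed = TRUE))

uses_gforce(log_wrapped)        # wrapper call
uses_gforce(log_direct)         # direct call
all.equal(res_wrapped, res_direct)
```

Both calls should return the same values; only the direct call names min in a form the optimiser can recognise.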

I would do this:

res <- df[, 
          c(lapply(.SD, min, na.rm = TRUE), lapply(.SD, max, na.rm = TRUE)), 
          by = fruit,
          verbose = T
]
#Finding groups using forderv ... forder.c received 4 rows and 1 columns
#0.000s elapsed (0.000s cpu) 
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#lapply optimization changed j from 'c(lapply(.SD, min, na.rm = TRUE), lapply(.SD, max, na.rm = TRUE))' to 'list(min(revenue, na.rm = TRUE), min(profit, na.rm = TRUE), max(revenue, na.rm = TRUE), max(profit, na.rm = TRUE))'
#GForce optimized j to 'list(gmin(revenue, na.rm = TRUE), gmin(profit, na.rm = TRUE), gmax(revenue, na.rm = TRUE), gmax(profit, na.rm = TRUE))' (see ?GForce)
#Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#gforce assign high and low took 0.000
#gforce eval took 0.000
#0.000s elapsed (0.000s cpu) 

setnames(res, -1, paste(names(res)[-1], 
                        rep(c("min", "max"), each = ncol(df) - 1), 
                        sep = "."))


res <- melt(res, measure.vars = measure(eco, fun, sep = "."))
#Warning message:
#In melt.data.table(res, measure.vars = measure(eco, fun, sep = ".")) :
#  'measure.vars' [revenue.min, profit.min, revenue.max, profit.max, ...] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'double'. All measure variables not of type 'double' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.

dcast(res, fruit + fun ~ eco)
#Key: <fruit, fun>
#    fruit    fun profit revenue
#   <char> <char>  <num>   <num>
#1:      a    max      4       3
#2:      a    min      2       1
#3:      b    max      5       4
#4:      b    min      5       4

The warning is because of the different column types in df ("integer" and "double"). Ensure they are identical to avoid it.
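One way to make the types identical, sketched here under the assumption that storing revenue as double is acceptable for your data, is to coerce the value columns before aggregating. The name `value_cols` is introduced only for this example.

```r
library(data.table)

dt <- data.table(fruit   = c("a", "a", "a", "b"),
                 revenue = 1:4,               # integer
                 profit  = c(2, NA, 4, 5))    # double

# Every column except the grouping column
value_cols <- setdiff(names(dt), "fruit")

# Coerce the value columns to double in place
dt[, (value_cols) := lapply(.SD, as.numeric), .SDcols = value_cols]

# Check the storage type of each value column
col_types <- vapply(dt[, value_cols, with = FALSE], typeof, character(1))
col_types
```

With both columns stored as double, the aggregated min and max columns share one type, so melt has nothing to coerce.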

thelatemail On

The only relatively simple suggestion I can give is to not try to do it inside a single df[] call, but rather make two separate calls to allow the optimisation to work. E.g.:

## bigger data example
df <- data.table(
    fruit = rep(1:2e6, each=2)
  , revenue = 1:4
  , profit = c(2,NA,4,5)
)

rbind(
    df[, lapply(.SD, min, na.rm=TRUE), by=fruit, verbose=TRUE],
    df[, lapply(.SD, max, na.rm=TRUE), by=fruit, verbose=TRUE]
)[order(fruit)]
##Making each group and running j (GForce TRUE) ... 
##Making each group and running j (GForce TRUE) ...
## About 0.13s total elapsed according to system.time()

y <- function(i) {
    c(min(i, na.rm = T),
      max(i, na.rm = T))
}

# apply
df[
  , lapply(.SD, y)
  , fruit
  , verbose = T
]
##Making each group and running j (GForce FALSE) ... 
##3.760s elapsed (3.770s cpu)
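The two-call approach stacks the min rows and the max rows without marking which is which. A possible variation, shown on the question's small data, uses rbindlist() with its idcol argument to label each block; the column name "fun" is an arbitrary choice for this sketch.

```r
library(data.table)

dt <- data.table(fruit   = c("a", "a", "a", "b"),
                 revenue = 1:4,
                 profit  = c(2, NA, 4, 5))

# Each call names min / max directly, so it stays eligible for GForce;
# the list names become the values of the id column
res <- rbindlist(
  list(
    min = dt[, lapply(.SD, min, na.rm = TRUE), by = fruit],
    max = dt[, lapply(.SD, max, na.rm = TRUE), by = fruit]
  ),
  idcol = "fun"
)

# Sort by group, then by function label
setorderv(res, c("fruit", "fun"))
res
```

This keeps each aggregation in its own call and adds only a cheap bind and sort on the already-reduced results.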