I read about the collapse package recently and tried to translate the following data.table code to collapse to see if it's faster in real-world examples.
Here's my data.table code:
library(data.table)
library(nycflights13)
data("flights")
flights_DT <- as.data.table(flights)
val_var <- "arr_delay"
id_var <- "carrier"
by <- c("month", "day")
flights_DT[
  j = list(agg_val_var = sum(abs(get(val_var)), na.rm = TRUE)),
  keyby = c(id_var, by)
][
  i = order(-agg_val_var),
  j = list(value_share = cumsum(agg_val_var) / sum(agg_val_var)),
  keyby = by
][
  j = .SD[2L],
  keyby = by
][
  order(-value_share)
]
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286
Created on 2021-03-11 by the reprex package (v1.0.0)
I managed to translate the first data.table call, but struggled later on.
It would be great to see how collapse would be used to handle this use case.
I think it only makes sense to translate data.table code to collapse if

1. you've come up with an arcane expression in data.table to do some complex statistical task it is not good at (such as weighted aggregation, computing quantiles or the mode by groups, lagging / differencing an irregular panel, grouped centering or linear / polynomial fitting),
2. you actually don't need the data.table object but would much rather work with vectors / matrices / data.frames / tibbles,
3. you want to write a statistical program and would much prefer standard evaluation programming over NS eval and data.table syntax, or
4. collapse is indeed substantially faster for your specific application.

Now to the specific code you have provided. It mixes standard and non-standard evaluation (e.g. through the use of get()), which is something collapse is not very good at. I'll give you 3 solutions ranging from full NS eval to full standard eval base R style programming.

Created on 2021-03-12 by the reprex package (v0.3.0)
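The first solution depends on how missing values are treated when sorting and summing. As a small illustration only (this is not the answer's own code), base R's order() accepts the same na.last = NA value, and the snippet below compares sum() with collapse's fsum() on an all-missing vector. The vector names x and all_missing are made up for the example.

```r
library(collapse)

x <- c(3, NA, 1)

# Default ordering keeps the missing value and puts it last.
sorted_keep_na <- x[order(-x)]

# na.last = NA drops the missing value from the ordering altogether.
sorted_drop_na <- x[order(-x, na.last = NA)]

all_missing <- c(NA_real_, NA_real_)

# Base R: removing every value leaves an empty sum, which is 0.
base_total <- sum(all_missing, na.rm = TRUE)

# collapse: an all-missing input stays NA, as the note below describes.
collapse_total <- fsum(all_missing)

print(sorted_keep_na)
print(sorted_drop_na)
print(base_total)
print(collapse_total)
```

With a sum-then-sort pipeline this means that groups with only missing values would otherwise survive as NA rows, so they have to be dropped explicitly at the ordering step.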
Note the use of na.last = NA, which actually removes cases where agg_val_var is missing. This is needed here because fsum(NA) is NA and not 0 like sum(NA, na.rm = TRUE).

Now the hybrid example, which is probably closest to the code you provided:
Note here that I used frename at the end to give the result column the name you wanted, as you cannot mix standard and non-standard eval in the same function in collapse.

Finally, a big advantage of collapse is that you can use it for pretty low-level programming:
I refer you to the blog post on programming with collapse for a more interesting example of how this can benefit the development of statistical code.

Now for the evaluation, I wrapped these solutions in functions, where DT() is the data.table code you provided, run with 2 threads on a Windows machine. This checks equality:

Now the benchmark:
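A rough sketch of what such a harness might look like, using only the data.table code from the question as DT(). This is not the original benchmark: the microbenchmark package, the function arguments and the number of repetitions are assumptions, and the collapse wrappers would be added as further named expressions.

```r
library(data.table)
library(nycflights13)
library(microbenchmark)  # assumption: any timing package would do

setDTthreads(2L)  # the answer mentions 2 threads

flights_DT <- as.data.table(flights)

# DT() wraps the data.table pipeline from the question unchanged;
# the argument names and defaults are chosen for this sketch.
DT <- function(data = flights_DT, val_var = "arr_delay",
               id_var = "carrier", by = c("month", "day")) {
  data[
    j = list(agg_val_var = sum(abs(get(val_var)), na.rm = TRUE)),
    keyby = c(id_var, by)
  ][
    i = order(-agg_val_var),
    j = list(value_share = cumsum(agg_val_var) / sum(agg_val_var)),
    keyby = by
  ][
    j = .SD[2L],
    keyby = by
  ][
    order(-value_share)
  ]
}

res <- DT()

# Equality check: other candidates would be compared against res the same way,
# e.g. all.equal(res, other_solution()) for a hypothetical other_solution().
stopifnot(isTRUE(all.equal(res, DT())))

# Timing: add further candidates as extra named arguments.
timings <- microbenchmark(DT = DT(), times = 10L)
print(timings)
```

Wrapping each candidate in a zero-argument call keeps the comparison fair, since every solution then starts from the same input object.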