Speeding up a for loop for a large dataframe


I have a very large dataframe and my goal is to list the cumulative USD by user ID. The dataframe looks like this, but it is much larger:

dt<-sample(seq(as.Date("2013-01-01"),as.Date("2013-05-01"),by="days"),10)
s<-c(rep(5252525,5),rep(1313131,5))
usd<-round(rnorm(10,100),2)
money<-data.frame(dt,s,usd)
money<-money[order(money$dt),]
money$Cumulative<-NA
users<-unique(money$s)

I started with a for loop, but it was very slow:

for (i in 1:length(users)){
    temp=which(money$s==users[i])
    money$Cumulative[temp]=cumsum(money$usd[temp])
}

I read on StackOverflow that I could use data.table to improve overall speed, and this helped somewhat:

library(data.table)
money<-data.table(money)
setkey(money,s)

for (i in 1:length(users)){
    temp=which(money$s==users[i])
    money$Cumulative[temp]=cumsum(money$usd[temp])
}

I'd like to make this calculation even faster. What should I do next?

2 Answers

Joshua Ulrich (Best Answer)

Since money is already ordered by the dt column, you can just use ave:

money$Cumulative <- ave(money$usd, money$s, FUN=cumsum)

Or you can use data.table:

moneyDT <- as.data.table(money[,1:3])
moneyDT[,cumulative := cumsum(usd), by=s]
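In more recent versions of data.table you can also skip the copy that as.data.table() makes and convert the data frame in place with setDT(), assigning the running total by reference in one step. A minimal sketch, reproducing the question's sample data (with a set.seed() added so the output is reproducible):

```r
library(data.table)

# Recreate the question's sample data
set.seed(1)
dt  <- sample(seq(as.Date("2013-01-01"), as.Date("2013-05-01"), by = "days"), 10)
s   <- c(rep(5252525, 5), rep(1313131, 5))
usd <- round(rnorm(10, 100), 2)
money <- data.frame(dt, s, usd)
money <- money[order(money$dt), ]

# Convert in place, then compute the per-user cumulative sum by reference
setDT(money)[, Cumulative := cumsum(usd), by = s]
```

Because setDT() modifies money in place and := assigns by reference, no copy of the (large) table is made at any point.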

garborg

It sounds like you're looking for the data.table option.

Using Joshua's proposed methods, and pointing out that benchmarking on toy data can be misleading (the methods perform similarly on the minimal dataset lil_money, but not on the more realistic dataset money):

Results

Unit: microseconds
              expr     min      lq  median      uq       max neval
 useAve(lil_money) 694.269 730.491 741.358 756.753 13687.951  1000
  useBy(lil_money) 709.664 748.603 759.470 775.770  5341.338  1000

Unit: milliseconds
          expr       min        lq    median        uq       max neval
 useAve(money) 3940.8970 3966.0950 4002.4319 4090.3967 4145.2672    10
  useBy(money)  105.7129  106.5789  109.6566  117.2939  122.1414    10

Identical output: TRUE

Code

require(microbenchmark)
require(data.table)

start <- as.Date("2001-01-01")
money <- CJ(s=1:1e4, dt=start + 0:1e3)[, usd := runif(.N)]

lil_money <- money[s < 10 & dt < start + 10]

useAve <- function(DT) { DT[, cum_ave := ave(usd, s, FUN=cumsum)] }
useBy <- function(DT) { DT[, cum_by := cumsum(usd), by=s] }

print(microbenchmark(useAve(lil_money), useBy(lil_money),  times=1000))
print(microbenchmark(useAve(money), useBy(money),  times=10))

cat("Identical output:", identical(money$cum_ave, money$cum_by), "\n")