How do I generate a rolling mean with grouped data? Here's the data set:
set.seed(31)
dd <- matrix(sample(1:20, 30, replace = TRUE), ncol = 3)
Add a group identifier and sort by it:
du <- sample(1:4, 10, replace = TRUE)
d  <- cbind(du, dd)
d  <- d[order(d[, 1]), ]
This gives the rolling mean but ignores group boundaries:
library(zoo)

d_roll_mean <- apply(d[, 2:4], 2, function(x) {
  rollapply(zoo(x), 3, mean, partial = TRUE, align = 'right')
})
This gives the results below:
# cbind(d,d_roll_mean)
# [1,] 1 3 3 12 3.000000 3.000000 12.000000
# [2,] 2 10 13 8 6.500000 8.000000 10.000000
# [3,] 2 17 2 17 10.000000 6.000000 12.333333
# [4,] 3 14 6 3 13.666667 7.000000 9.333333
# [5,] 3 6 20 1 12.333333 9.333333 7.000000
# [6,] 3 1 16 19 7.000000 14.000000 7.666667
# [7,] 3 19 2 11 8.666667 12.666667 10.333333
# [8,] 4 12 1 9 10.666667 6.333333 13.000000
# [9,] 4 10 13 12 13.666667 5.333333 10.666667
# [10,] 4 8 20 7 10.000000 11.333333 9.333333
Here's the goal: the rolling mean computed within each group, restarting at each group boundary.
# Desired
# [1,] 1 3 3 12 3.000000 3.000000 12.000000
# [2,] 2 10 13 8 10.000000 13.000000 8.000000
# [3,] 2 17 2 17 13.500000 7.500000 12.500000
# [4,] 3 14 6 3 14.000000 6.000000 3.000000
# [5,] 3 6 20 1 10.000000 13.000000 2.000000
# [6,] 3 1 16 19 7.000000 14.000000 7.666667
# [7,] 3 19 2 11 8.666667 12.666667 10.333333
# [8,] 4 12 1 9 12.000000 1.000000 9.000000
# [9,] 4 10 13 12 11.000000 7.000000 10.500000
# [10,] 4 8 20 7 10.000000 8.000000 9.333333
This is close, but it generates a list keyed by group instead of a matrix:
doApply <- function(x) {
  apply(x, 2, function(y) {
    rollapply(zoo(y), 3, mean, partial = TRUE, align = 'right')
  })
}

d2_roll_mean <- by(d[, 2:4], d[, 1], doApply)
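If a matrix is wanted, the list can be collapsed with do.call over rbind (the same step the answer below points out). A minimal sketch, relying on by() returning groups in the same sorted order as d (d2_mat is a name chosen here for illustration):

# Stack the per-group results and re-attach the group IDs in front.
d2_mat <- do.call(rbind, d2_roll_mean)
cbind(d[, 1], d2_mat)    # should match the desired output above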
Several answers were offered to the question; here's how they compare in execution time.
set.seed(31)
nrow <- 20000
ncol <- 600
nun  <- 350
nValues <- 20

dd <- matrix(sample(1:nValues, nrow * ncol, replace = TRUE), ncol = ncol)
du <- sample(1:nun, nrow, replace = TRUE)
d  <- cbind(du, dd)
d  <- d[order(d[, 1]), ]
library(zoo)

doApply <- function(x) {
  apply(x, 2, function(y) {
    rollapply(zoo(y), 3, mean, partial = TRUE, align = 'right')
  })
}
library(data.table)
library(caTools)

fun1 <- function(d) { by(d[, -1], d[, 1], doApply) }

fun2 <- function(d) {
  DT <- data.table(d, key = 'du')
  DT[, lapply(.SD, function(y) runmean(y, 3, alg = 'fast', align = 'right')),
     by = du]
}
system.time(d2_roll_mean <- fun1(d))
system.time(d2_roll_mean2 <- fun2(d))
The timing indicates the data.table/runmean approach is about 10 times faster than by() with rollapply:
         user  system  elapsed
fun1 1048.910   0.378 1049.158
fun2  107.296   0.097  107.392
all.equal doesn't report equality, but by inspection the results seem the same:
d2a <- do.call(rbind, d2_roll_mean)
d2b <- cbind(1, d2a)    # dummy group column of 1s, not the real du values
d2c <- data.table(d2b)
setnames(d2c, names(d2c), names(d2_roll_mean2))
all.equal(d2c, d2_roll_mean2)
The output of all.equal is:
[1] "Attributes: < Length mismatch: comparison on first 1 components >"
[2] "Component “du”: Mean relative difference: 175.6631"
When the approach above was applied to the actual data, the following error was generated:
Error in `[<-`(`*tmp*`, (k2 + 1):n, , value = 2) :
subscript out of bounds
This error was the result of some groups having too few rows (fewer than the window width). Those rows were removed, and the process worked. Ref: How to drop factors that have fewer than n members
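A minimal sketch of that filtering step (grp_sizes and big_enough are names chosen here; the cutoff of 3 matches the window width used above):

# Keep only groups with at least 3 rows, so runmean's window fits.
grp_sizes  <- table(d[, 1])
big_enough <- as.integer(names(grp_sizes)[grp_sizes >= 3])
d          <- d[d[, 1] %in% big_enough, , drop = FALSE]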
The only thing missing is a do.call(rbind, d2_roll_mean), then adding the original data back.

EDIT: I ran this through system.time() for a bigger example, and it does take its sweet time: by() and apply() are not the fastest functions. It may actually be faster to walk through the columns using a for loop and doing this by brute force, relying on the fact that d is sorted by ID.
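A brute-force sketch of that idea (roll_by_group is a hypothetical helper name; it assumes d is sorted by the ID in column 1 and defaults to the window width of 3 used throughout):

roll_by_group <- function(d, k = 3) {
  id  <- d[, 1]
  x   <- d[, -1, drop = FALSE]
  out <- matrix(NA_real_, nrow(x), ncol(x))
  start <- 1                            # first row of the current group
  for (i in seq_len(nrow(x))) {
    if (i > 1 && id[i] != id[i - 1]) start <- i
    lo <- max(start, i - k + 1)         # clip the window at the group start
    out[i, ] <- colMeans(x[lo:i, , drop = FALSE])
  }
  out
}

cbind(d[, 1], roll_by_group(d))         # should reproduce the desired output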