Speeding up calculation across columns

I have a few moderately large data frames and need to do a calculation across different columns of the data; for example, I want to compare column i in one data frame with column i - 1 in another. I currently use a for loop. The calculation involves an element-wise comparison of each pair of values, so it is somewhat slow: I take each column of data, turn it into a matrix and compare it with the transpose of itself (with some additional complications). In my application, where the data have about 100 columns and 3000 rows, this currently takes about 95 seconds, and I am looking for ways to make it more efficient.

If I were comparing the SAME column of each data frame I would try mapply, but because I need to compare different columns I don't see how this could work. The current code is something like this:

d1 <- as.data.frame(matrix(rnorm(100000), nrow=1000))
d2 <- as.data.frame(matrix(rnorm(100000), nrow=1000))

r <- list()
ptm2 <- proc.time()
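for(i in 2:100){
  # indicator matrix: element [j, k] is TRUE when d1[j, i] > 0
  # (naming it t masks base::t, but the call t(...) below still finds the function)
  t <- matrix(0 + d1[,i] > 0,1000,1000)
  # element [j, k] is d1[j, i] * d2[k, i-1]
  u <- matrix(d1[,i],1000,1000)*t(matrix(d2[,i-1],1000,1000))
  r[[i]] <- t * u
}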
proc.time() - ptm2

This takes about 3 seconds on my computer; as mentioned the actual calculation is a bit more complicated than this MWE suggests. Obviously one could also improve efficiency in the calculation itself but I am looking for a solution to the 'compare column i to column i-1' issue.
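On the side issue of the calculation itself: for the simplified example above only, each iteration appears to reduce to a single outer product, because element [j, k] of the result is d1[j, i] * d2[k, i-1] when d1[j, i] > 0 and zero otherwise. The sketch below checks that on small random vectors; x, y, n, loop_res and fast_res are illustrative names, and the shortcut may not carry over to the real, more complicated calculation.

```r
set.seed(42)
n <- 200       # illustrative size, smaller than the 1000 rows in the example
x <- rnorm(n)  # stands in for d1[, i]
y <- rnorm(n)  # stands in for d2[, i - 1]

# Calculation as written in the loop above
pos <- matrix(x > 0, n, n)
loop_res <- pos * (matrix(x, n, n) * t(matrix(y, n, n)))

# Same values via an outer product: set non-positive x to zero first
fast_res <- outer(pmax(x, 0), y)

# Compare the two results
all.equal(loop_res, fast_res)
```

This builds one n x n matrix per column pair instead of three, which is where any saving would come from.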

ThetaFC On

Based on your example, you can use mapply if you first align the columns of d1 and d2 so that the columns being compared sit at the same position: columns 2 to 100 of d1 against columns 1 to 99 of d2. In the timings below it is only marginally faster than the for loop (1.75 versus 1.79 seconds elapsed), so parallel computing would be a better way to achieve speed gains.

d1 <- as.data.frame(matrix(rnorm(100000), nrow=1000))
d2 <- as.data.frame(matrix(rnorm(100000), nrow=1000))

r <- list()
ptm2 <- proc.time()
for(i in 2:100){
  t <- matrix(0 + d1[,i] > 0,1000,1000)
  u <- matrix(d1[,i],1000,1000)*t(matrix(d2[,i-1],1000,1000))
  r[[i]] <- t * u
}
proc.time() - ptm2
#user  system elapsed 
#0.90    0.87    1.79 
#select the last 99 columns of d1 and the first 99 columns of d2 so the compared pairs line up
#mapply loops across the columns because a data.frame is simply a list of vectors of equal length
#(d1[,2:100] is already a data.frame, so as.data.frame() is redundant here but harmless)
d1_99 <- as.data.frame(d1[,2:100])
d2_99 <- as.data.frame(d2[,1:99])
ptm3 <- proc.time()
r_test <- mapply(function(x, y) {
  t <- matrix(x > 0, 1000, 1000) #didn't understand why you were adding 0 in your example; 0 + x > 0 parses as (0 + x) > 0, so it has no effect
  u <- matrix(x,1000,1000)*t(matrix(y,1000,1000))
  t * u
}, x=d1_99, y=d2_99, SIMPLIFY = FALSE)
proc.time() - ptm3
#user  system elapsed 
#0.91    0.83    1.75 
class(r_test)
#[1] "list"
length(r_test)
#[1] 99
#test for equality
all.equal(r[[2]], r_test[[1]])
#[1] TRUE
all.equal(r[[100]], r_test[[99]])
#[1] TRUE
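
The parallel route mentioned above could look roughly like the following. This is a minimal sketch, assuming a Unix-like system where mcmapply from the base parallel package can fork workers; on Windows it falls back to one core. pair_calc, x_cols, y_cols and n_cores are illustrative names, and the sizes are kept small so the example runs quickly.

```r
library(parallel)

set.seed(1)
n <- 100  # rows; kept small here, the example above uses 1000
p <- 8    # columns; the example above uses 100
d1 <- as.data.frame(matrix(rnorm(n * p), nrow = n))
d2 <- as.data.frame(matrix(rnorm(n * p), nrow = n))

# pair_calc is a hypothetical helper wrapping the per-pair calculation
pair_calc <- function(x, y) {
  m <- length(x)
  pos <- matrix(x > 0, m, m)
  pos * (matrix(x, m, m) * t(matrix(y, m, m)))
}

# Align column i of d1 with column i - 1 of d2
x_cols <- d1[, 2:p]
y_cols <- d2[, 1:(p - 1)]

# Forking is unavailable on Windows, where mc.cores must be 1
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

serial_res <- mapply(pair_calc, x_cols, y_cols, SIMPLIFY = FALSE)
par_res <- mcmapply(pair_calc, x_cols, y_cols,
                    SIMPLIFY = FALSE, mc.cores = n_cores)

# Compare the serial and parallel results
all.equal(serial_res, par_res)
```

Each worker has to send its full matrix back to the main process, so with 1000 x 1000 results the transfer cost may eat into the gain; it is worth timing on the real data before committing to this.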