I have a few moderately large data frames and need to do a calculation across different columns of the data; for example, I want to compare column i in one data frame with column i - 1 in another. I currently use a for loop. The calculation involves an element-wise comparison of each pair of values, so it is somewhat slow: e.g. I take each column of data, turn it into a matrix and compare it with the transpose of itself (with some additional complications). In my application (in which the data have about 100 columns and 3000 rows) this currently takes about 95 seconds. I am looking for ways to make this more efficient. If I were comparing the SAME column of each data frame I would try using mapply, but because I need to make comparisons across different columns I don't see how this could work. The current code is something like this:
d1 <- as.data.frame(matrix(rnorm(100000), nrow=1000))
d2 <- as.data.frame(matrix(rnorm(100000), nrow=1000))
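r <- vector("list", 100)  # preallocate; r[[1]] stays NULL because the loop starts at 2
ptm2 <- proc.time()
for(i in 2:100){
  # indicator matrix: row j is 1 where d1[j, i] > 0 (named pos so it does not mask base::t)
  pos <- matrix(0 + (d1[,i] > 0),1000,1000)
  # element [j, k] is d1[j, i] * d2[k, i-1]
  u <- matrix(d1[,i],1000,1000)*t(matrix(d2[,i-1],1000,1000))
  r[[i]] <- pos * u
}
proc.time() - ptm2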
This takes about 3 seconds on my computer; as mentioned the actual calculation is a bit more complicated than this MWE suggests. Obviously one could also improve efficiency in the calculation itself but I am looking for a solution to the 'compare column i to column i-1' issue.
Based on your example, if you align the columns of d1 and d2 ahead of time according to which ones are being compared, you can use mapply. It appears to be only marginally faster than the loop, so parallel computing would be a better way to achieve speed gains.
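A minimal sketch of the aligned mapply approach, assuming the simplified calculation from the question. The helper name pair_fun and the smaller dimensions (200 rows, 10 columns) are illustrative choices, not part of the original code. outer(x, y) is used here as a shorthand for the matrix(x) * t(matrix(y)) product, since both give x[j] * y[k] in position [j, k].

```r
set.seed(1)
n <- 200  # fewer rows than the question's 1000, just to keep the demo quick
d1 <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
d2 <- as.data.frame(matrix(rnorm(n * 10), nrow = n))

# Align the columns up front: column i of d1 is paired with column i - 1 of d2.
a <- d1[, -1]          # columns 2..k of d1
b <- d2[, -ncol(d2)]   # columns 1..(k - 1) of d2

# pair_fun is a hypothetical stand-in for the real per-pair calculation.
pair_fun <- function(x, y) {
  # outer(x, y)[j, k] is x[j] * y[k]; (x > 0) is recycled down the rows,
  # so row j is zeroed wherever x[j] <= 0.
  (x > 0) * outer(x, y)
}

# Data frames are lists of columns, so mapply walks a and b in parallel.
r <- mapply(pair_fun, a, b, SIMPLIFY = FALSE)
```

Here r[[1]] corresponds to r[[2]] in the loop version, because the aligned list starts at the first compared pair. If the platform supports forking, parallel::mcmapply takes the same arguments plus mc.cores and may be worth trying as a drop-in replacement, though whether it helps depends on the cost of the real calculation.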