I realized that computing mutual information on a data frame containing NA values with R's infotheo package does not raise an error but silently yields incorrect results. The problem is described in more detail here. I now have a mathematically correct solution which removes only pairwise incomplete cases instead of incomplete cases across all columns, but its performance on large data sets is catastrophic. I guess the nested for loop is what causes the long compute times. Does anyone have an idea how to improve the performance of the code below?
library(infotheo)
v1 <- c(1,2,3,4,5,NA,NA,NA,NA,NA)
v2 <- c(1,NA,3,NA,5,NA,7,NA,9,NA)
v3 <- c(NA,2,3,NA,NA,6,7,NA,7,NA)
v4 <- c(NA,NA,NA,NA,NA,6,7,8,9,10)
df <- cbind.data.frame(v1,v2,v3,v4)
ColPairMap <- function(df) {
  # Square result table, one row and one column per variable in df
  t <- data.frame(matrix(ncol = ncol(df), nrow = ncol(df)))
  colnames(t) <- colnames(df)
  rownames(t) <- colnames(df)
  for (j in 1:ncol(df)) {
    for (i in 1:ncol(df)) {
      # Rows where both columns i and j are observed (pairwise deletion)
      ok <- complete.cases(df[, c(i, j)])
      if (any(ok)) {
        t[j, i] <- natstobits(mutinformation(df[ok, j], df[ok, i]))
      } else {
        t[j, i] <- 0
      }
    }
  }
  return(t)
}
ColPairMap(df)
Thanks in advance!
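One observation that may help: mutual information is symmetric, so only half of the column pairs need to be evaluated, and the number of jointly observed rows per pair can be counted in one vectorized step. The following is a minimal sketch of that idea, not a tested replacement. `ColPairMapSym` and its `pair_fun` argument are hypothetical names introduced here for illustration, and the sketch assumes `pair_fun(x, y)` equals `pair_fun(y, x)`. A stand-in statistic is used so the block runs with base R only.

```r
# Toy data from the question
v1 <- c(1,2,3,4,5,NA,NA,NA,NA,NA)
v2 <- c(1,NA,3,NA,5,NA,7,NA,9,NA)
v3 <- c(NA,2,3,NA,NA,6,7,NA,7,NA)
v4 <- c(NA,NA,NA,NA,NA,6,7,8,9,10)
df <- cbind.data.frame(v1, v2, v3, v4)

# Hypothetical helper (not part of infotheo): applies pair_fun to every
# column pair using only the rows where both columns are observed.
# Assumes pair_fun is symmetric, so only pairs with i >= j are computed.
ColPairMapSym <- function(df, pair_fun) {
  obs <- !is.na(df)             # logical matrix, TRUE = observed
  n_pair <- crossprod(obs * 1)  # jointly observed row counts per pair
  res <- matrix(0, ncol(df), ncol(df),
                dimnames = list(colnames(df), colnames(df)))
  for (j in seq_len(ncol(df))) {
    for (i in j:ncol(df)) {
      if (n_pair[j, i] > 0) {
        ok <- obs[, i] & obs[, j]
        # Fill both halves of the table from a single evaluation
        res[j, i] <- res[i, j] <- pair_fun(df[ok, j], df[ok, i])
      }
    }
  }
  as.data.frame(res)
}

# Stand-in pairwise statistic so the sketch runs without extra packages:
# the number of rows actually used for each pair.
n_used <- ColPairMapSym(df, function(x, y) length(x))
print(n_used)
```

With infotheo loaded, passing `function(x, y) natstobits(mutinformation(x, y))` as `pair_fun` should give the same table as `ColPairMap(df)` with roughly half the calls to `mutinformation`.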
I found a tweak which does not help for toy data sets such as df above, but for real-world data sets, especially when executed on proper hardware, I've seen examples where it reduces a 2.5 hour compute time to 14 minutes! The code below is a complete copy-and-pasteable example which incorporates Rui's solution using a nested for loop and, building on this idea, another solution using a nested 'foreach' loop that parallelizes the task across 75% of the available cores. You can control the size of the data set, and consequently the compute time, by adjusting n.