How to permute correlations between several pairs of variables in R?

118 Views Asked by At

I would like to compute the correlation between pairs of variables from two data frames, ideally using cor.test, so I can compare the p value from using either the estimate or the statistic. My real data (rather than practice dummy data) has 20 samples and 40 variables in df1 and 20 samples and 345 variables in df2.

Some examples where they have looked at permutating correlations between two variables use a for loop, https://bookdown.org/curleyjp0/psy317l_guides5/permutation-testing.html#correlation-coefficient-permutation-tests and https://dgarcia-eu.github.io/SocialDataScience/5_SocialNetworkPhenomena/056_PermutationTests/PermutationTests, which works if only having two variables/vectors.

I'm attempting to use foreach to permute each of my y variables in df2 1,000 times, and then calculate the p value by looking at the number of times the permuted statistic >= original statistic / number of permutations. I've been practicing with 10 permutations and a smaller dataset to see if I can get my code working

df <- as.data.frame(matrix(round(runif(n=12*12, min=1, max=20), 0), ncol=12))
colnames(df) <- paste0("a", ncol(df))
df2 <- as.data.frame(matrix(round(runif(n=12*20, min=1, max=20), 0), ncol=20))
colnames(df2) <- paste("b", ncol(df2) )
comp <- expand.grid(colnames(df1), colnames(df2))


res <-  foreach(i=1:nrow(comp), .combine=rbind, .packages=c("magrittr", "dplyr")) %dopar% {
perm <- \(){
N <- 10
samp <- sample(nrow(df2))
cor.test(df[, comp[i,1]], df2[samp, comp[i,2]], method="spearman", exact=FALSE)[c("estimate")]
}
perm_stat <- replicate(N, perm())
orig_stat <- cor.test(df[, comp[i,1]], df2[, comp[i,2]], method="spearman", exact=FALSE)[c("estimate")] 
(sum(abs(perm_stat) >= abs(orig_stat)) + 1)/N 
   }

  res2 <- as.data.frame(unlist(res)) %>% dplyr::mutate(Var1=comp[,1], Var2=comp[,2])

If I hash out everything bar orig_stat, it runs to get the coefficient/estimate of the correlations between the variables in df and df2. But is not obviously carrying out the permutation. Shall I attempt to use a double loop via foreach or look at an alternative approach such as parallelised version of Map e.g mcmapply.

   ##double loop format 
   out <- foreach(j=1:ncol(df1), .combine = 'rbind', .packages=c("magrittr", "dplyr")) %:%
      foreach(i = 1:ncol(df2), .combine = 'c') %dopar% {
       a <- cor.test(df1[,j], df2[,i], method = "spearman")$p.value
     }
1

There are 1 best solutions below

7
jblood94 On

There seems to be some confusion here about the use of a permutation test. cor.test is already giving the information you need--no permutations required.

Say, for example, cor.test(x, y, method = "spearman") returns a p-value of 0.05. For a two-sided test using Spearman's rho statistic, the interpretation is that Spearman's rank correlation coefficient between x and y[sample(length(y))] will be further from 0 than the coefficient between x and y approximately 5% of the time.

Similarly, if we repeatedly called cor.test(x, y[sample(length(y))], method = "spearman"), we would see a p-value less than 0.05 approximately 5% of the time. Demonstrating:

set.seed(293885308L)
x <- runif(20)
y <- runif(20)
correl <- cor(x, y, method = "spearman")

# a single `cor.test`
(p <- cor.test(x, y, method = "spearman")$p.value)
#> [1] 0.04984582

# a permutation test on top of `cor.test`
mean(replicate(1e5, cor.test(x, y[sample(20)], method = "spearman")$p.value) <= p)
#> [1] 0.04982

# a permutation test instead of `cor.test`
mean(abs(replicate(1e6, cor(x, y[sample(20)], method = "spearman"))) >= correl)
#> [1] 0.04973

For the relatively small datasets described in the question, we can quickly and easily get the p-value of all pairwise column combinations:

df <- as.data.frame(matrix(runif(20*40), 20, 40))
df2 <- as.data.frame(matrix(runif(20*345), 20, 345))

microbenchmark::microbenchmark(
  cor.test = mapply(\(y) sapply(df, \(x) cor.test(x, y, method = "spearman")$p.value), df2),
  times = 10
)
#> Unit: seconds
#>      expr      min       lq     mean   median       uq      max neval
#>  cor.test 1.739906 1.753571 1.818344 1.773926 1.841813 1.989383    10