Pairwise correlation based on effect size values with missing values of some variables

Question

73 Views Asked by SOF_helps At 14 August 2025 at 08:44

I have data with seven variables and I want to calculate pairwise correlations, along with the significance level of each correlation. The data are the effect sizes of a treatment on these seven variables. A negative value indicates an inhibiting effect and a positive value a promoting effect; the larger the absolute value, the stronger the effect on that variable. The data also contain a large number of missing values, so for each pair of variables I want R to skip any observation where one or both values are missing.

Here is a sample dataset:

set.seed(123)

# Create the dataset with effect sizes and missing values
mydata <- data.frame(
  Var1 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var2 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var3 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var4 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var5 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var6 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var7 = sample(c(-20:14, NA), 200, replace = TRUE)
)

# Set more than 50% missing values in each column
for (col in 1:7) {
  missing_indices <- sample(1:200, size = 101)
  mydata[missing_indices, col] <- NA
}

My question is: is it possible to calculate pairwise correlations, along with their significance levels, using effect size values in this case?

**DaveArmstrong** · Answer 1

You can use cor.test(), which uses only the observations that are complete for the pair of variables being tested. Unlike cor(), cor.test() works on only one x and one y at a time. In the code below, I use outer() to run through all pairs of columns. I do this for both the correlation (stored in r) and its p-value (stored in p). First, we make the data according to your specification.

set.seed(123)

# Create the dataset with effect sizes and missing values
mydata <- data.frame(
  Var1 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var2 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var3 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var4 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var5 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var6 = sample(c(-20:14, NA), 200, replace = TRUE),
  Var7 = sample(c(-20:14, NA), 200, replace = TRUE)
)

# Set more than 50% missing values in each column
for (col in 1:7) {
  missing_indices <- sample(1:200, size = 101)
  mydata[missing_indices, col] <- NA
}

Next, we can compute the correlations and their p-values.

r <- outer(
  1:ncol(mydata),
  1:ncol(mydata),
  Vectorize(function(x, y) cor.test(mydata[, x], mydata[, y])$estimate)
)
p <- outer(
  1:ncol(mydata),
  1:ncol(mydata),
  Vectorize(function(x, y) cor.test(mydata[, x], mydata[, y])$p.value)
)

round(r, 2)
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
#> [1,]  1.00  0.03 -0.07 -0.01  0.15 -0.05 -0.09
#> [2,]  0.03  1.00  0.16  0.19 -0.05 -0.11 -0.07
#> [3,] -0.07  0.16  1.00 -0.37  0.04 -0.24 -0.17
#> [4,] -0.01  0.19 -0.37  1.00  0.18  0.18 -0.04
#> [5,]  0.15 -0.05  0.04  0.18  1.00  0.24 -0.14
#> [6,] -0.05 -0.11 -0.24  0.18  0.24  1.00  0.13
#> [7,] -0.09 -0.07 -0.17 -0.04 -0.14  0.13  1.00
round(p, 2)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 0.00 0.83 0.65 0.94 0.30 0.72 0.55
#> [2,] 0.83 0.00 0.33 0.19 0.78 0.46 0.63
#> [3,] 0.65 0.33 0.00 0.02 0.78 0.11 0.33
#> [4,] 0.94 0.19 0.02 0.00 0.21 0.25 0.81
#> [5,] 0.30 0.78 0.78 0.21 0.00 0.10 0.36
#> [6,] 0.72 0.46 0.11 0.25 0.10 0.00 0.38
#> [7,] 0.55 0.63 0.33 0.81 0.36 0.38 0.00

Created on 2023-06-28 with reprex v2.0.2
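The same kind of result can be had with one cor.test() call per pair instead of two, and with the variable names kept on the output. The following is only a minimal sketch under assumptions: the data frame toy is invented for illustration (it is not the OP's data), and the diagonal of p is left as NA rather than 0 because no test is run for a variable against itself.

```r
# Toy data, invented for illustration only (not the OP's data)
set.seed(1)
toy <- data.frame(
  a = rnorm(30),
  b = rnorm(30),
  c = rnorm(30)
)
toy$a[1:5] <- NA   # a few missing values in two of the columns
toy$b[4:9] <- NA

vars <- names(toy)
k <- length(vars)

# Empty matrices labelled with the variable names
r <- matrix(NA_real_, k, k, dimnames = list(vars, vars))
p <- r

# Visit each unordered pair once; cor.test() drops incomplete pairs itself
for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    ct <- cor.test(toy[[i]], toy[[j]])
    r[i, j] <- r[j, i] <- unname(ct$estimate)
    p[i, j] <- p[j, i] <- ct$p.value
  }
}
diag(r) <- 1  # a variable correlates perfectly with itself; diag(p) stays NA

round(r, 2)
round(p, 2)
```

Filling both triangles from a single test halves the number of cor.test() calls compared with running outer() twice, at the cost of an explicit loop.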

The question you asked is not entirely one of mechanics. As you can see, the operations for calculating pairwise correlations with p-values are not all that difficult. The other part of your question is whether it makes sense to calculate these values on variables that contain effect sizes. I suppose that if the observations are the same (i.e., the rows represent the same observation for each effect size), then it could make sense to do this. If you're looking for more advice about that part of your question, you would likely be better off posting it on Cross Validated.

Edit: Errors in correlations

In the comments, the OP reported that some pairs were producing errors. The problem with the actual use-case data (not the example data) is that there are two pairs of variables with only two observations where both values are present. This triggers an error in cor.test(). To solve this, you could wrap the call in try() to catch the error and return a missing value instead. The code will still print the error messages, but the errors are absorbed by try(), so the computation continues. The resulting r and p matrices will have missing values for the pairs of variables with too few observations to calculate the correlation.
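To see which pairs are affected before running any tests, one option is to count the rows where both variables are observed. This is a minimal sketch on an invented data frame toy (not the OP's data). The threshold of 3 reflects the default Pearson method of cor.test(), which stops with "not enough finite observations" below that; other methods may have a different minimum.

```r
# Toy data, invented for illustration only (not the OP's data)
toy <- data.frame(
  x = c(1.2, NA, 3.1, 4.0, NA, 2.2, 5.0),
  y = c(NA, 0.5, 2.9, NA, 1.1, 2.0, 4.1),
  z = c(0.3, 1.8, NA, 2.4, 3.3, NA, NA)
)

# Entry [i, j] counts the rows where both column i and column j are observed
n_pair <- crossprod(!is.na(toy))
n_pair

# Pairs of distinct variables with fewer than 3 complete observations
too_few <- which(n_pair < 3 & upper.tri(n_pair), arr.ind = TRUE)
short_pairs <- data.frame(
  var1 = rownames(n_pair)[too_few[, "row"]],
  var2 = colnames(n_pair)[too_few[, "col"]],
  n    = n_pair[too_few]
)
short_pairs
```

The diagonal of n_pair gives the number of non-missing values per variable, which is also useful context when reading the p-values. The actual data and the try()-based version follow.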

mydata <- structure(list(SOM = c(NA, -2.87, NA, NA, NA, 21.6, 6.04, NA, 
NA, -14.17, -0.77, 25.21, 3.3, 39.99, 225.37, 11.01, 2.85, 3.59, 
3.3, 3.91, 8.23, NA, NA, NA, NA, NA, NA, NA, NA, NA, -2.6, NA, 
NA, NA, NA, NA, NA, 1.09, -0.79, NA, NA, NA, NA, NA, NA, NA, 
-1.59, -2.18, -4.63, NA, NA, 1267.92, 3.21, NA, -2.28, NA, 3.64, 
4.63, NA, NA, NA, NA, NA, NA, NA, NA, 11.71, NA, NA, NA, NA, 
NA, -0.2, NA, NA, NA, NA, NA, NA, 5.57, 11.61, NA, 67.13, 84.4, 
NA, NA, NA), SOC = c(NA, NA, 1.39, 0.8, 0.4, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3.33, 1.99, -0.75, 
91.54, 11.06, 49.46, NA, NA, NA, NA, -0.3, 3.33, 2.64, 5.97, 
1.99, 15.42, NA, NA, NA, -0.68, 0, 0, -0.23, 0.23, NA, NA, NA, 
NA, 0.25, 0.48, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
3.89, 7.89, NA, NA, NA, NA, NA, NA), MBC = c(NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.92, 1.4, 
1.45, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5.81, 3.44, 
7.56, 1.36, 10.57, NA, NA, NA, NA, 0.68, -0.05, -0.02, 0.33, 
NA, NA, NA, NA, 1.71, 0, NA, NA, 6.15, NA, NA, NA, NA, NA, NA, 
6.4, 2.89, 3.62, -0.5, -1.49, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, 15.6, NA, NA, NA, NA, 3.63, NA, NA), 
aN = c(-0.05, 10.67, 1.45, 1.96, 4.25, 7.76, 9.06, NA, 2.51, 
NA, -3.44, 6.6, 2.61, 18.26, 258.89, -54.22, NA, 2.56, 6.51, 
3.82, NA, NA, NA, NA, 17.01, 9.15, 23.48, 5.22, -5.41, -9.54, 
-2.06, 3.34, -1.54, 3.58, 1.26, 3.48, 6.52, 1.14, 6.31, NA, 
NA, NA, NA, NA, NA, NA, -0.95, -3.5, -1.11, -26.92, -1.72, 
NA, 0.09, NA, NA, NA, NA, NA, 1.28, NA, NA, NA, NA, NA, NA, 
NA, 7.31, 0.16, 0.08, 3.04, 0.74, -0.41, 0.14, 4.91, 2.7, 
0.73, 4.41, -0.15, NA, NA, NA, NA, 14.93, 23.42, NA, NA, 
NA), SoilP = c(-0.54, -9.39, 0.3, 1.85, NA, 61.46, 39.68, 
NA, 0.94, -31.07, 4.25, 11.3, 10.45, 23.09, 136.85, 796.34, 
-6.65, NA, NA, NA, 0.64, NA, NA, NA, 1.14, 1.04, 4.51, 5.36, 
9.88, -2.68, -2.45, 0.84, 12.28, 6.69, 14.72, 0, 23.48, 2.5, 
-0.32, 1.01, -1.49, 7.7, 6.52, 0.63, -0.17, 2.07, -0.29, 
0.91, -2, NA, NA, 1.11, NA, NA, 93.33, NA, 8.36, 8.06, 1.24, 
2.98, NA, NA, NA, NA, NA, -0.9, 15.87, -1.08, 0.09, -1.86, 
-0.96, -0.77, 42.5, 1.15, 0.02, 2.11, 0.25, -0.06, NA, NA, 
NA, -1.69, 109.66, 161.08, NA, -1.24, -2.22), SoilK = c(-0.3, 
2.93, NA, NA, NA, 68.38, 24.01, NA, -1.31, -4.74, 1.84, 1.05, 
0.21, 0.49, 928.42, 397.98, -14.39, NA, NA, NA, 1.61, NA, 
NA, NA, -27.11, NA, -2.95, -4.83, -0.93, -7.12, -0.1, NA, 
1.21, 2.67, 2.14, 1.02, 5.46, 0.74, 3.05, 1.67, -0.22, NA, 
NA, NA, NA, -1.95, -0.5, -0.22, -0.27, NA, NA, 0.48, NA, 
NA, NA, NA, NA, NA, 0.79, NA, NA, NA, NA, NA, NA, 2.83, 57.68, 
-2.28, 0.14, 0.07, -0.17, -0.13, 6.99, 1.96, 2.02, 2.82, 
7.12, 2.13, NA, NA, NA, NA, 65.01, 132.41, NA, NA, NA), Ureases = c(NA, 
NA, -1.32, 0.49, NA, -1.74, 0, 13.84, NA, 9.09, NA, NA, NA, 
NA, NA, NA, NA, 1.89, 1.66, 3.16, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 6.09, 10.87, 40.6, 2.92, 
NA, NA, NA, NA, NA, NA, NA, 1.05, 0.35, 4, NA, NA, 4.89, 
NA, 3.61, NA, 15.26, NA, NA, NA, NA, 0.03, 1.76, 3.51, 2.55, 
1.63, NA, 3.04, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, -1.2, 6.4, 4.84, NA, NA), ALP = c(NA, NA, 
NA, NA, NA, -0.51, -0.55, 26.8, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 9.13, 11.3, 0.9, 1.69, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, 2.72, NA, 3.26, NA, 8.27, 
18.11, 27.81, NA, 7.39, NA, NA, NA, NA, NA, NA, 4.71, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 4.4, NA, NA, NA, 
NA, NA, NA, NA, NA), pH = c(-0.56, 17.82, NA, NA, 2.53, 0.6, 
-0.46, NA, 2.4, 9.59, 1.42, NA, NA, NA, 2.54, -0.92, 8.7, 
2.69, 4.57, 3.38, 1.65, 0.43, 2.04, 3.47, NA, NA, NA, NA, 
NA, NA, -2.72, 10.42, -0.98, -1.12, -1.74, NA, NA, NA, NA, 
NA, -0.51, 4.38, -0.87, 1.71, 0, NA, -2.1, -1.6, -0.31, 5.02, 
2.8, -5.68, 11.76, NA, 0, NA, -7.85, 2.18, -0.44, NA, NA, 
NA, NA, NA, NA, 10.12, NA, 1.92, 1.2, 0.57, 0.9, 0.91, NA, 
-0.89, -1.75, -0.9, -1.13, -0.47, 2.67, NA, NA, 2.09, NA, 
0.19, NA, 0.33, 0)), row.names = c(NA, 87L), class = "data.frame")

r <- outer(
  1:ncol(mydata),
  1:ncol(mydata),
  Vectorize(function(x, y) {
    tr <- try(cor.test(mydata[, x], mydata[, y]))
    if(inherits(tr, "try-error")){
      NA
    }else{
      tr$estimate
    }
  }))
round(r, 2)
#>        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]
#>  [1,]  1.00    NA  0.96  0.91 -0.03  0.09 -0.02 -0.16 -0.31
#>  [2,]    NA  1.00  0.68  0.62 -0.03 -0.90  0.85    NA -0.51
#>  [3,]  0.96  0.68  1.00  0.24  0.92  0.77  0.55  0.24 -0.50
#>  [4,]  0.91  0.62  0.24  1.00 -0.06  0.80 -0.10 -0.20  0.07
#>  [5,] -0.03 -0.03  0.92 -0.06  1.00  0.53 -0.15 -0.31 -0.10
#>  [6,]  0.09 -0.90  0.77  0.80  0.53  1.00 -0.23 -0.34  0.01
#>  [7,] -0.02  0.85  0.55 -0.10 -0.15 -0.23  1.00  0.16  0.39
#>  [8,] -0.16    NA  0.24 -0.20 -0.31 -0.34  0.16  1.00 -0.04
#>  [9,] -0.31 -0.51 -0.50  0.07 -0.10  0.01  0.39 -0.04  1.00

p <- outer(
  1:ncol(mydata),
  1:ncol(mydata),
  Vectorize(function(x, y) {
    tr <- try(cor.test(mydata[, x], mydata[, y]))
    if(inherits(tr, "try-error")){
      NA
    }else{
      tr$p.value
    }
  }))

round(p, 2)
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#>  [1,] 0.00   NA 0.04 0.00 0.87 0.69 0.95 0.70 0.17
#>  [2,]   NA 0.00 0.01 0.02 0.93 0.00 0.15   NA 0.05
#>  [3,] 0.04 0.01 0.00 0.50 0.00 0.13 0.06 0.85 0.09
#>  [4,] 0.00 0.02 0.50 0.00 0.69 0.00 0.71 0.66 0.70
#>  [5,] 0.87 0.93 0.00 0.69 0.00 0.00 0.57 0.36 0.51
#>  [6,] 0.69 0.00 0.13 0.00 0.00 0.00 0.43 0.41 0.96
#>  [7,] 0.95 0.15 0.06 0.71 0.57 0.43 0.00 0.65 0.24
#>  [8,] 0.70   NA 0.85 0.66 0.36 0.41 0.65 0.00 0.93
#>  [9,] 0.17 0.05 0.09 0.70 0.51 0.96 0.24 0.93 0.00

Created on 2023-06-28 with reprex v2.0.2
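A variation on the same idea, offered only as a sketch: tryCatch() can replace try() so that no error messages are printed, and the estimate, p-value and pair count can be returned together in a long table. The helper name safe_cor and the data frame toy are hypothetical and invented for illustration; they are not part of base R or of the OP's data.

```r
# Toy data, invented for illustration only (not the OP's data)
toy <- data.frame(
  x = c(1.2, NA, 3.1, 4.0, NA, 2.2, 5.0),
  y = c(NA, 0.5, 2.9, NA, 1.1, 2.0, 4.1),
  z = c(0.3, 1.8, NA, 2.4, 3.3, NA, NA)
)

# safe_cor() is a hypothetical helper, not part of base R.
# It returns the estimate, p-value and pair count, with NA for the
# estimate and p-value if cor.test() raises an error.
safe_cor <- function(a, b) {
  n <- sum(complete.cases(a, b))
  tryCatch({
    ct <- cor.test(a, b)
    c(r = unname(ct$estimate), p = ct$p.value, n = n)
  }, error = function(e) c(r = NA_real_, p = NA_real_, n = n))
}

# One row per unordered pair of columns
pair_names <- t(combn(names(toy), 2))
res <- t(apply(pair_names, 1, function(v) safe_cor(toy[[v[1]]], toy[[v[2]]])))
result <- data.frame(var1 = pair_names[, 1], var2 = pair_names[, 2], res)
result
```

Keeping n next to each estimate makes it easier to judge how much weight a given correlation can bear, since with this much missingness some pairs rest on only a handful of rows.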
