Extract correlations of subsets of genes based on a key -> value data frame


I have two data frames. The first contains a 1484 x 1484 gene-gene correlation matrix, where each cell holds the correlation between genes i and j. The second holds key -> value information mapping each complex to its member proteins, and looks like this:

                       Complex            Protein_ID
1                      BCL6-HDAC4 complex       Bcl6
125                    BCL6-HDAC5 complex      Hdac5
249                    BCL6-HDAC7 complex       Bcl6
373 Multisubunit ACTR coactivator complex      Ep300
497                   Condensin I complex       Smc2
621                                BLOC-3       Hps4

I want to extract from my matrix the correlations between genes that belong to the same complex and store them in a new data frame that holds, per complex, the gene-gene correlation values. Ideally it would look like this:

#this is a simulated data.frame

                    Complex                                Correlation values
                    BCL6-HDAC4 complex                     0.64
                    BCL6-HDAC4 complex                     -0.25
                    Multisubunit ACTR coactivator complex  0.31
                    Multisubunit ACTR coactivator complex  0.30

Any ideas on how I can get there?

Best answer, by Tobo:
library(data.table) # >= v1.15.0 (provides let(), an alias for `:=`)

df <-
  melt(data.table(cors),                    # matrix to long data.frame
       variable.name = "i",
       value.name = "cor"
  )[, let(i = as.integer(i), j = rowid(i))  # cols for i and j
  ][i < j                                   # keep distinct correlations
  ][, Complex := lkps$Complex[i]            # look up Complex for i
  ][Complex == lkps$Complex[j]]             # keep if Complex for j is same
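
For readers without a recent data.table, the same positional pairing can be sketched in base R. This is an illustrative equivalent, not the code above: `toy_cors` and `toy_lkps` are hypothetical stand-ins, and row k of the lookup is assumed to describe row/column k of the matrix.

```r
# Hypothetical toy inputs: 4 genes; row k of toy_lkps describes
# row/column k of toy_cors (an assumption of this sketch).
toy_cors <- matrix(c( 1.0,  0.5, -0.2,  0.1,
                      0.5,  1.0,  0.3, -0.4,
                     -0.2,  0.3,  1.0,  0.6,
                      0.1, -0.4,  0.6,  1.0), nrow = 4)
toy_lkps <- data.frame(Complex = c("A", "A", "B", "B"))

# All (row, col) positions below the diagonal, so each gene pair appears once.
idx <- which(lower.tri(toy_cors), arr.ind = TRUE)

# Keep only pairs whose two genes map to the same complex.
same <- toy_lkps$Complex[idx[, "row"]] == toy_lkps$Complex[idx[, "col"]]
idx  <- idx[same, , drop = FALSE]

res <- data.frame(
  Complex   = toy_lkps$Complex[idx[, "col"]],
  i         = idx[, "col"],   # smaller index of the pair (i < j)
  j         = idx[, "row"],
  cor       = toy_cors[idx],  # matrix indexing by (row, col) pairs
  row.names = NULL
)
print(res)
```

Indexing the matrix with a two-column (row, col) matrix pulls out exactly one value per pair, which avoids reshaping the full matrix to long form first.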

Example data (10 genes, 3 groups, only showing first 6 cols of correlation matrix):

set.seed(1)
n_genes <- 10
cors <- cor(matrix(rnorm(n_genes * 50), nrow = 50, ncol = n_genes))
lkps <- data.frame(
  Complex = sample(c("Complex A", "Complex B", "Complex C"), n_genes, replace = TRUE),
  Protein_ID = replicate(n_genes, paste0(sample(c(letters, LETTERS), 4, replace = TRUE), collapse = "")))

> cors
             [,1]         [,2]         [,3]        [,4]         [,5]        [,6]
 [1,]  1.00000000 -0.039087178  0.026287227 -0.27185574  0.013674895 -0.11933102
 [2,] -0.03908718  1.000000000  0.003552006 -0.02391178  0.039833039  0.02218480
 [3,]  0.02628723  0.003552006  1.000000000  0.21648782  0.127791868  0.12197135
 [4,] -0.27185574 -0.023911775  0.216487818  1.00000000 -0.082713154 -0.24277681
 [5,]  0.01367489  0.039833039  0.127791868 -0.08271315  1.000000000  0.09888519
 [6,] -0.11933102  0.022184800  0.121971345 -0.24277681  0.098885194  1.00000000
 [7,]  0.19468192  0.006755358 -0.074116195  0.12591453  0.184806771 -0.14283941
 [8,] -0.14785348 -0.255064246 -0.054761988 -0.03252786  0.004459162  0.03851846
 [9,]  0.02336706  0.198299294  0.069506207  0.14657036  0.183043022 -0.10887799
[10,] -0.36678892  0.240101899  0.031648477  0.17387651  0.131315992 -0.12944992

> lkps
     Complex Protein_ID
1  Complex C       jMXs
2  Complex C       ruTw
3  Complex A       zoCU
4  Complex C       PCev
5  Complex A       aWvm
6  Complex B       vfRO
7  Complex A       GxvG
8  Complex B       jSsh
9  Complex B       lkpQ
10 Complex B       ufxz

Result:

            cor     i     j   Complex
          <num> <int> <int>    <char>
 1: -0.03908718     1     2 Complex C
 2: -0.27185574     1     4 Complex C
 3: -0.02391178     2     4 Complex C
 4:  0.12779187     3     5 Complex A
 5: -0.07411620     3     7 Complex A
 6:  0.18480677     5     7 Complex A
 7:  0.03851846     6     8 Complex B
 8: -0.10887799     6     9 Complex B
 9: -0.12944992     6    10 Complex B
10: -0.05267148     8     9 Complex B
11:  0.04892611     8    10 Complex B
12:  0.18778267     9    10 Complex B
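
The approach above pairs genes by position, so it relies on row k of `lkps` describing row/column k of `cors`. If the correlation matrix instead carries gene names as row and column names, and a gene can appear in more than one complex (as Bcl6 does in the question), a per-complex lookup by name may be easier. The sketch below assumes the matrix dimnames match `Protein_ID`; the helper `complex_cors()`, the partner genes, and all values are hypothetical and only for illustration.

```r
# Hypothetical named correlation matrix; gene names and values are made up.
genes <- c("Bcl6", "Hdac4", "Ep300", "Ncoa3")
cors <- matrix(c(1.00, 0.64, 0.10, 0.20,
                 0.64, 1.00, 0.05, 0.15,
                 0.10, 0.05, 1.00, 0.31,
                 0.20, 0.15, 0.31, 1.00),
               nrow = 4, dimnames = list(genes, genes))
lkps <- data.frame(
  Complex    = c("BCL6-HDAC4 complex", "BCL6-HDAC4 complex",
                 "Multisubunit ACTR coactivator complex",
                 "Multisubunit ACTR coactivator complex"),
  Protein_ID = genes
)

# Hypothetical helper: for each complex, subset the matrix to its member
# genes (by name) and return each within-complex pair once.
complex_cors <- function(cors, lkps) {
  members <- split(lkps$Protein_ID, lkps$Complex)
  out <- lapply(names(members), function(cx) {
    g <- intersect(unique(members[[cx]]), rownames(cors))  # genes present in matrix
    if (length(g) < 2) return(NULL)                        # no pair to report
    sub <- cors[g, g, drop = FALSE]
    idx <- which(upper.tri(sub), arr.ind = TRUE)           # each pair once
    data.frame(Complex = cx,
               gene_i  = g[idx[, "row"]],
               gene_j  = g[idx[, "col"]],
               cor     = sub[idx],
               row.names = NULL)
  })
  do.call(rbind, out)  # NULL entries (complexes with < 2 genes) are dropped
}

res <- complex_cors(cors, lkps)
print(res)
```

Genes listed in the lookup but missing from the matrix are silently skipped here; whether that is acceptable depends on the data.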