Sum of all pairwise row products as a two way matrix

797 Views Asked by At

I'm looking at some high throughput gene data and doing a type of correlation analysis based on Bayesian statistics. One of the things I need to do is to find every pairwise combination of products in the dataset and find the sum of each resultant row.

So for example, for the high throughput dataset matrix Dataset

(Dataset <- structure(list(`Condition 1` = c(1L, 3L, 2L, 2L), `Condition 2` = c(2L, 1L, 7L, 2L), `Condition 3` = c(4L, 1L, 2L, 5L)), .Names = c("Condition 1", "Condition 2", "Condition 3"), class = "data.frame", row.names = c("Gene A", "Gene B", "Gene C", "Gene D")))
       Condition 1 Condition 2   Condition 3
Gene A           1           2             4
Gene B           3           1             1
Gene C           2           7             2
Gene D           2           2             5

First I want to multiply every possible pair of rows together to get the following matrix called Comb:

              Condition 1 Condition 2 Condition 3
Gene A Gene A           1           4           9
Gene A Gene B           3           2           4
Gene A Gene C           2          14           8
Gene A Gene D           2           4          20
Gene B Gene B           9           1           1
Gene B Gene C           6           7           2
Gene B Gene D           6           2           5
Gene C Gene C           4          49           4
Gene C Gene D           4          14          10
Gene D Gene D           4           4          25

After I want to find the row sums for each product and get the sums in the form of a matrix (which I will call CombSums):

            Gene A       Gene B      Gene C      Gene D 
Gene A          NA           10          24          26
Gene B          10           NA          15          13
Gene C          24           15          NA          28
Gene D          26           13          28          NA

When I tried to do it, the best I could come up with was

combs <- combn(seq_len(nrow(Dataset)), 2)
Comb <- Dataset[combs[1,], ] * Dataset[combs[2,], ]
rownames(Comb) <- apply(combn(rownames(Comb), 2), 2, paste, collapse = " ")
CombSums <- rowSums(Comb)

Which would gives me the sums as a list, such as below:

                    [1,]
Gene A Gene B       10
Gene A Gene C       24 
Gene A Gene D       26 
Gene B Gene C       15
Gene B Gene D       13
Gene C Gene D       28

Unfortunately, I want it as a two-way matrix and not a list so this doesn't quite work, so if anyone could suggest a way to get the sums as a matrix, that would be a great help.

3

There are 3 best solutions below

3
On BEST ANSWER

If speed is an important factor (e.g. if you're processing a huge matrix), you might find an Rcpp implementation helpful. This only fills the upper triangular portion of the matrix.

library(Rcpp)
cppFunction(
 "NumericMatrix josilberRcpp(NumericMatrix x) {
   const int nr = x.nrow();
   const int nc = x.ncol();
   NumericMatrix y(nr, nr);
   for (int col=0; col < nc; ++col) {
    for (int i=0; i < nr; ++i) {
      for (int j=i; j < nr; ++j) {
        y(i, j) += x(i, col) * x(j, col);
      }
    }
   }
   return y;
}")
josilberRcpp(as.matrix(Dataset))
#      [,1] [,2] [,3] [,4]
# [1,]   21    9   24   26
# [2,]    0   11   15   13
# [3,]    0    0   57   28
# [4,]    0    0    0   33

Benchmarking is provided in my other answer. Note that the benchmarking does not include the compile time using cppFunction, which can be quite significant. Therefore this implementation is probably only useful for very large inputs or when you need to use this function many times.

7
On

You can do this by computing the pairwise products for each column in your original data frame with lapply and outer, and then you can add all those pairwise products together with Reduce and +.

Reduce("+", lapply(dat, function(x) outer(x, x)))
#      [,1] [,2] [,3] [,4]
# [1,]   21    9   24   26
# [2,]    9   11   15   13
# [3,]   24   15   57   28
# [4,]   26   13   28   33

A variation on that theme that is less memory intensive (because it doesn't need to store each column's matrix at the same time) but more typing would be:

ret <- outer(dat[,1], dat[,1])
for (i in 2:ncol(dat)) ret <- ret + outer(dat[,i], dat[,i])
ret
#      [,1] [,2] [,3] [,4]
# [1,]   21    9   24   26
# [2,]    9   11   15   13
# [3,]   24   15   57   28
# [4,]   26   13   28   33

Here's a benchmark of the approaches proposed so far on a 100 x 100 data frame:

# Larger dataset
set.seed(144)
dat <- as.data.frame(matrix(rnorm(10000), nrow=100))

josilber <- function(dat) Reduce("+", lapply(dat, function(x) outer(x, x)))
josilber2 <- function(dat) {
  ret <- outer(dat[,1], dat[,1])
  for (i in 2:ncol(dat)) ret <- ret + outer(dat[,i], dat[,i])
  ret
}
frank <- function(DF) {
  mat   <- as.matrix(DF)
  pairs <- combn(1:nrow(DF),2)
  vals  <- rowSums(mat[pairs[1,],]*mat[pairs[2,],])
  res   <- matrix(,nrow(DF),nrow(DF),dimnames=list(rownames(DF),rownames(DF)))
  res[lower.tri(res)] <- vals
  res
}

library(microbenchmark)
microbenchmark(josilber(dat), josilber2(dat), josilberRcpp(as.matrix(dat)), frank(dat))
# Unit: microseconds
#                          expr       min        lq      mean    median        uq       max neval
#                 josilber(dat)  6867.499 45437.277 45506.731 46372.576 47549.834 85494.063   100
#                josilber2(dat)  6831.692  7982.539 10245.459  9109.023 10883.965 50612.600   100
#  josilberRcpp(as.matrix(dat))   989.592  1112.316  1290.617  1204.388  1483.638  2384.348   100
#                    frank(dat) 13043.912 53369.804 52488.997 53921.402 54855.583 62566.730   100
2
On

Using combn, you can avoid doing redundant calculations:

mat   <- as.matrix(DF)

pairs <- combn(1:nrow(DF),2)

vals  <- rowSums(mat[pairs[1,],]*mat[pairs[2,],])
res   <- matrix(,nrow(DF),nrow(DF),dimnames=list(rownames(DF),rownames(DF)))
res[lower.tri(res)] <- vals

#       GeneA GeneB GeneC GeneD
# GeneA    NA    NA    NA    NA
# GeneB     9    NA    NA    NA
# GeneC    24    15    NA    NA
# GeneD    26    13    28    NA

Your Comb matrix is the intermediate result mat[pairs[1,],]*mat[pairs[2,],].


The entire calculation can be done inside combn, alternately:

vals <- combn(rownames(DF),2,FUN=function(x)sum(apply(DF[x,],2,prod)))

As @josilber pointed out in a comment below, this is incredibly slow, however.


Data:

DF <- read.table(header=TRUE,text="Condition1 Condition2   Condition3
GeneA           1           2             4
GeneB           3           1             1
GeneC           2           7             2
GeneD           2           2             5")