I'm looking at some high throughput gene data and doing a type of correlation analysis based on Bayesian statistics. One of the things I need to do is to find every pairwise combination of products in the dataset and find the sum of each resultant row.
So for example, for the high throughput dataset matrix Dataset
(Dataset <- structure(list(`Condition 1` = c(1L, 3L, 2L, 2L), `Condition 2` = c(2L, 1L, 7L, 2L), `Condition 3` = c(4L, 1L, 2L, 5L)), .Names = c("Condition 1", "Condition 2", "Condition 3"), class = "data.frame", row.names = c("Gene A", "Gene B", "Gene C", "Gene D")))
Condition 1 Condition 2 Condition 3
Gene A 1 2 4
Gene B 3 1 1
Gene C 2 7 2
Gene D 2 2 5
First I want to multiply every possible pair of rows together to get the following matrix called Comb
:
Condition 1 Condition 2 Condition 3
Gene A Gene A 1 4 9
Gene A Gene B 3 2 4
Gene A Gene C 2 14 8
Gene A Gene D 2 4 20
Gene B Gene B 9 1 1
Gene B Gene C 6 7 2
Gene B Gene D 6 2 5
Gene C Gene C 4 49 4
Gene C Gene D 4 14 10
Gene D Gene D 4 4 25
After I want to find the row sums for each product and get the sums in the form of a matrix (which I will call CombSums
):
Gene A Gene B Gene C Gene D
Gene A NA 10 24 26
Gene B 10 NA 15 13
Gene C 24 15 NA 28
Gene D 26 13 28 NA
When I tried to do it, the best I could come up with was
combs <- combn(seq_len(nrow(Dataset)), 2)
Comb <- Dataset[combs[1,], ] * Dataset[combs[2,], ]
rownames(Comb) <- apply(combn(rownames(Comb), 2), 2, paste, collapse = " ")
CombSums <- rowSums(Comb)
Which would gives me the sums as a list, such as below:
[1,]
Gene A Gene B 10
Gene A Gene C 24
Gene A Gene D 26
Gene B Gene C 15
Gene B Gene D 13
Gene C Gene D 28
Unfortunately, I want it as a two-way matrix and not a list so this doesn't quite work, so if anyone could suggest a way to get the sums as a matrix, that would be a great help.
If speed is an important factor (e.g. if you're processing a huge matrix), you might find an Rcpp implementation helpful. This only fills the upper triangular portion of the matrix.
Benchmarking is provided in my other answer. Note that the benchmarking does not include the compile time using
cppFunction
, which can be quite significant. Therefore this implementation is probably only useful for very large inputs or when you need to use this function many times.