Aggregate observations across samples by rownames (dplyr) in R

371 Views Asked by At

Aloha,

I am trying to get the total counts for each row name in my sample matrix. For some reason, I have tried both rowsum and then converting to a data frame and using dplyr::group_by but they are giving errors. Here is a subset of example data:

mat = matrix(c(0,1,2,3,4), nrow=3, ncol = 5)
rownames(mat) <- c("CHO", "NO", "O")
colnames(mat) <-  c("sample_1", "sample_2", "sample_3", "sample_4", "sample_5")`

I would like to have a resulting data frame with the formula name, then the sum of observations across samples and the percent of samples formula was observed in overall.

It seems easy enough but I have tried all different combinations aggregating the data with no avail and would be very appreciative of some guidance.

2

There are 2 best solutions below

0
On BEST ANSWER

We may need only rowSums

rowSums(mat)

If there are duplicate rownames (in the example data, the rownames are unique), then we use rowsum with group specified as the rownames

rowsum(mat, row.names(mat))

and then we use rowSums on top of that

rowSums(rowsum(mat, row.names(mat)))
0
On

The answer from @akrun does exactly what the poster asks, but I find myself in a similar but slightly different situation a lot, but with a dataframe that has what would be duplicate rownames in a matrix, so these are now values in a column instead (called "compound" below), like this:

set.seed(2347813)
df <- data.frame(matrix(sample(c(0,1,2,3,4,5,6,7,8,9), 
                               size=30, replace=T), nrow=6, ncol=5)) 
colnames(df) <-  c("sample_1", "sample_2", "sample_3", "sample_4", "sample_5")
df$compound <- c("CHO", "NO", "O", "CHO", "NO", "O") 

which looks like:

  sample_1 sample_2 sample_3 sample_4 sample_5 compound
1        0        1        4        9        1      CHO
2        3        8        3        0        5       NO
3        8        9        7        1        7        O
4        8        2        9        7        7      CHO
5        3        8        9        0        5       NO
6        6        1        6        7        5        O

then I want to summarize the data in various ways but I want to group it by compound, and use piping (%>% below) from dplyr/tidyverse.
like the original question, if we want the total by compound we would do:

df %>%
  group_by(compound) %>%
  summarize(total=sum(c_across(starts_with("sample"))))

which would give us:

  compound total
  <chr>    <dbl>
1 CHO         48
2 NO          44
3 O           57

but I think the best part is the ability to do multiple summarizing operations at once. let's say we wanted the total, mean, standard deviation, and the mean of only sample_1 through sample_3, we can get that all in one piped command:

df %>%
  group_by(compound) %>%
  summarize(total=sum(c_across(starts_with("sample"))), 
            grand_mean=mean(c_across(starts_with("sample"))),
            sd=sd(c_across(starts_with("sample"))), 
            mean_13=mean(c_across(sample_1:sample_3))) # mean of sample 1-3

which gives us:

# A tibble: 3 x 5
  compound total grand_mean    sd mean_13
  <chr>    <dbl>      <dbl> <dbl>   <dbl>
1 CHO         48        4.8  3.58    4   
2 NO          44        4.4  3.20    5.67
3 O           57        5.7  2.71    6.17

combining piping (%>%), group_by, mutate, and the new versions of across (above I used c_across) you can get a lot done in one go.