I am currently working on a scientific project where each sample has two or three replicates.
The data (we'll call the dataframe 'raw') looks something along the lines of:
| SampleID | Var1 | Var2 | Var3 |...
| Unknown 1 | NA | 12 | NA |
| Unknown 1 | NA | 19 | 23 |
| Unknown 2 | NA | NA | 50 |
| Unknown 2 | 18.5 | NA | 32 |
| Unknown 2 | 8.12 | NA | 20 |
...
Where unknown 1 has two replicates and unknown 2 has three replicates. The NAs are for when values could not be obtained since the signal was too strong/too low.
I want to create a new dataframe called 'averages' that contains the averages of the replicates for each group (Sample ID) over all the Var# columns so that it looks more like this:
Averages of each sample over different variables
| SampleID | Var1 | Var2 | Var3 |...
| Unknown 1 | NA | 15.5 | 23 |
| Unknown 2 | 26.62 | NA | 51 |
...
What I am struggling with is handling the NAs the way I want to when calculating the averages. In the new averages dataframe, I still want it to show:
'NA' for samples where ALL replicates were 'NA'. If I use any function that omits NAs when calculating the means, any sample with replicates that contain ALL NAs in at least one variable are not added to the new averages dataframe, which is not what I want. I still want to see all the samples listed in the new dataframe, just with their replicate/triplicate values combined into averages.
In the case where (1 of 2) or (2 of 3) replicates contained NA values, I want those NA values to be omitted. For instance, in the table above, 'Unknown 2' has triplicate values and in 'Var1', there is one triplicate value that is NA whereas the other triplicate values are "18.5" and "8.12". Thus, I want the NA value to be omitted and the mean calculation to instead look like (18.5 + 8.12)/2 even though there is supposed to be 3 replicate values.
#Approach 1 - worked only for one column at a time, omits the samples whose replicates ALL have NA values
averages <- aggregate(raw$'Var1', list(raw$SampleID), FUN=mean, na.rm=TRUE)
#Approach 2 - did not not omit the NAs in mean calculation
averages <- aggregate(.~SampleID, raw, mean, na.rm=TRUE)
#Approach 3 - #tried putting the colnames into a list, then calling the list in the following code. Same issue with samples being omitted if they have replicates with NA values.
avgcolnames <- colnames(raw[,sapply(results,is.numeric)])
averages <- aggregate(cbind(avgcolnames) ~ SampleID, data = raw,
FUN = mean, na.rm = TRUE)
#Approach 4 - # worked on only on one column at a time and have to specify new column names
averages <- raw %>%
group_by(SampleID) %>%
summarize(Var1 = mean(Var1, na.rm = TRUE))
Approach #4 is along the way, next step is to make it generic, which we can do with
tidy-selectfunctionality and an anonymous function:Here we select all
numericcolumns withacross(where(is.numeric), <fn>)and the~ mean(.x, na.rm=TRUE)is an anonymous function where.xis the argument and computes the mean withNAs excluded.