Edited: How to write Loop for PostHoc Tukey test and multiple comparision

764 Views Asked by At

This is my sample of a big data matrix & and each column has named with multiple information and separated by an underscore.

structure(list(Gene = c("AGI4120.1_UBQ", "AGI570.1_Acin"), WT_Tissue_0T_1 = c(0.886461437, 1.093164915), WT_Tissue_0T_2 = c(1.075140682, 1.229862834), WT_Tissue_0T_3 = c(0.632903012, 1.094003128), WT_Tissue_1T_1 = c(0.883151274, 1.26322126), WT_Tissue_1T_2 = c(1.005627276, 0.962729188), WT_Tissue_1T_3 = c(0.87123469, 0.968078993), WT_Tissue_3T_1 = c(0.723601456, 0.633890322), WT_Tissue_3T_2 = c(0.392585237, 0.534819363), WT_Tissue_3T_3 = c(0.640185369, 1.021934772), WT_Tissue_5T_1 = c(0.720291294, 0.589244505), WT_Tissue_5T_2 = c(0.362131744, 0.475251717), WT_Tissue_5T_3 = c(0.549486925, 0.618177919), mut1_Tissue_0T_1 = c(1.464415756, 1.130533457), mut1_Tissue_0T_2 = c(1.01489573, 1.114915728), mut1_Tissue_0T_3 = c(1.171797418, 1.399956009), mut1_Tissue_1T_1 = c(0.927507448, 1.231911575), mut1_Tissue_1T_2 = c(1.089705396, 1.256782289 ), mut1_Tissue_1T_3 = c(0.993048659, 0.999044465), mut1_Tissue_3T_1 = c(1.000993049, 1.103486794), mut1_Tissue_3T_2 = c(1.062562066, 0.883617224 ), mut1_Tissue_3T_3 = c(1.037404833, 0.851875438), mut1_Tissue_5T_1 = c(0.730883813, 0.437440083), mut1_Tissue_5T_2 = c(0.480635551, 0.298762126 ), mut1_Tissue_5T_3 = c(0.85468388, 0.614923997)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list( cols = list(Gene = structure(list(), class = c("collector_character", "collector")), WT_Tissue_0T_1 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_0T_2 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_0T_3 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_1T_1 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_1T_2 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_1T_3 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_3T_1 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_3T_2 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_3T_3 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_5T_1 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_5T_2 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_5T_3 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_0T_1 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_0T_2 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_0T_3 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_1T_1 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_1T_2 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_1T_3 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_3T_1 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_3T_2 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_3T_3 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_5T_1 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_5T_2 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_5T_3 = structure(list(), class = c("collector_double", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

I want to follow a Tukey Test and plot bar charts for each Gene (Response vs. Time; filled by both the genotypes) with multiple comparisons letters.

Syntax

df1 <- df %>% 
         gather(var, response, WT_Tissue_0T_1:mut1_Tissue_5T_3) %>% 
         separate(var, c("Genotype", "Tissue", "Time"), sep = "_") %>% 
         arrange(desc(Gene))
df2 <- df1 %>% 
         group_by(`Gene`,Genotype,Tissue,Time) %>% 
         mutate(Response=mean(response),n=n(),se=sd(response)/sqrt(n))

Two-way ANOVA

fit1 <- aov(Response ~ Genotype*Time, df2) 

Hereafter, I want to go proceed Tukey Test (Multiple Comparison), for e.g. Gene "AGI4120.1_UBQ", plot Response vs. Time, and see how each genotype (WT & mut1) behave in each Time point (0T, 1T, 3T, and 5T)? if the response is significantly different or not and denote by letters in the plot.

As below lsmeans syntax combine all the Genes to one and give output, how can I make it to loop for each gene separately (i.e. "AGI4120.1_UBQ", "AGI570.1_Acin") and get letters to show statistically distinct groups (aka "compact letter display")

 lsmeans(fit1, pairwise ~ Genotype | Time) 

My final aim is to plot each gene like this below graph and denote significance letters.

df2$genotype <- factor(df2$genotype, levels = c("WT","mut1")) colours <- c('#336600','#ffcc00')library(ggplot2)ggplot(df2,aes( x=Time, y=Response, fill=Genotype))+geom_bar(stat='identity', position='dodge')+scale_fill_manual(values=colours)+geom_errorbar(aes(ymin=average_measure-se, ymax=average_measure+se)+facet_wrap(~`Gene`)+labs(x='Time', y='Response')

Expecting Graph for denoting compact letter display

I would appreciate your kind help, if possible.

1

There are 1 best solutions below

6
On

There are a number of issues with your code. I'd go so far as to say it's not really an appropriate post for StackOverflow, as the problems here are diverse and not really extendable beyond this specific constellation of bugs and syntax issues. But I'll post a few suggestions as an answer to get you started - hope they help.

1. lsmeans():
This function expects a fitted model (like from lm()), or a ref.grid object. But you are passing it a data frame, without any of the regression properties it needs to compute least-squares means. (Think: How does lsmeans() know what the dependent variable should be, when you ask for pairwise comparison with Genes as the independent variable?) Check out the Using lsmeans documentation for more details.

Based on your data, I'd say you probably want to run a multi-level regression, using something like the lme4 package, with Gene and Genotype and maybe Time as nested grouping levels.

But for demonstration I'll keep it simple with lm(). Passing a fitted regression object to lsmeans() works as intended:

fit <- lm(Response ~ Gene + Genotype + Time, data=df2)

lsmeans(fit, pairwise ~ Gene)[[2]] 

# Output
contrast                        estimate        SE df t.ratio p.value
AGI4120.1_UBQ - AGI570.1_Acin -0.0515123 0.0299492 42   -1.72  0.0928

Results are averaged over the levels of: Genotype, Time 

2. ggplot():
You haven't defined colours or average_measure in the code you've provided; calling these undeclared variables will cause failure.

Structurally, I would suggest you use df1 and allow ggplot to do the grouping, rather than group_by into df2. Then use stat="summary" and fun.y="mean" to accomplish the summarise() computations you did in df2. Doing this also allows you to use the mean_se function for your error bars. Like this:

ggplot(df1,aes( x=Time, y=response, fill=Genotype))+ 
  geom_bar(stat='summary', fun.y='mean', position=position_dodge(0.9))+
    stat_summary(fun.data = mean_se, geom = "errorbar", 
                 color="gray60", width=.1, position=position_dodge(0.9)) +
  scale_fill_manual(values=c("steelblue","orange"))+
  facet_wrap(~`Gene`)+ 
  labs(x='Time', y='Response')

facet plot

Finally, note that your use of separate() in df1 will throw a warning, although not an error, as there is an extra value created by splitting on sep="_". You can avoid that (if it is causing confusion) by adding a level to capture the final value (which appears to be a time index):

separate(var, c("Genotype", "Tissue", "Time", "Index"), sep = "_")