Apply FDR correction on a large number of outcome variables


I'm using R (not version 4 yet, ahah). I was advised to use FDR correction on my linear models. I have >200 participants, 140 continuous outcome variables, and each outcome variable is tested on the same 4 predictors. So all the models are Y ~ x1 + x2 + x3 + x4, for all 140 variables, where x1 is the predictor I'm interested in and the others (x2, x3, x4) are just there to control for their effects on Y. How do I apply the FDR? What do I have to correct for? Do I have to correct for all 140 outcome variables? Or only for the 4 predictors? If you could explain the process and how to decide what to correct for in FDR, that would be really good, as I'm struggling to understand it. Thank you very much for the help, Best

You need to correct for the 140 tests between each predictor and the outcomes, applying the FDR separately for each predictor. We can try an example where x1 has an effect on responses y1 to y30 and no effect on the others, whereas x2, x3, and x4 have no effect on any response. First, the data:

set.seed(111)
# 200 observations of 4 uniform predictors
X = matrix(runif(200*4), ncol=4)
colnames(X) = paste0("x", 1:4)
# 140 normally distributed responses, pure noise to start
Y = matrix(rnorm(140*200), ncol=140)
colnames(Y) = paste0("y", 1:140)
# add a true effect of x1 to the first 30 responses
Y[,1:30] = 1.5*X[,1] + Y[,1:30]

It's convenient to use broom to tidy the results. We can fit a multi-response linear model in one call; each column of Y is regressed separately on the predictors. The tidied output looks like:

library(broom)
library(dplyr)
# one lm() call fits all 140 regressions Y[,i] ~ x1 + x2 + x3 + x4
model = lm(Y ~ X)
tidy(model)
# A tibble: 700 x 6
   response term        estimate std.error statistic    p.value
   <chr>    <chr>          <dbl>     <dbl>     <dbl>      <dbl>
 1 y1       (Intercept)   0.288      0.268     1.07  0.285     
 2 y1       Xx1           1.22       0.248     4.93  0.00000178
 3 y1       Xx2          -0.356      0.251    -1.42  0.158 
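
If you want to convince yourself that this is equivalent to fitting each outcome separately, here is a minimal sketch, using the X and Y simulated above, that loops over the columns of Y; the loop is purely illustrative:

# Illustrative check: fit one lm() per outcome and collect the x1 rows;
# these should match the "Xx1" rows in tidy(model) above.
per_outcome = lapply(1:ncol(Y), function(i) {
  fit = lm(Y[, i] ~ X)
  coef(summary(fit))["Xx1", , drop = FALSE]
})
head(do.call(rbind, per_outcome), 3)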

Now we clean up the term names, drop the intercept, group by predictor (term), and apply the FDR across the 140 responses using p.adjust, where "BH" stands for Benjamini-Hochberg:

adjusted = tidy(model) %>%
  mutate(term = gsub("X", "", term)) %>%      # strip the "X" prefix from term names
  filter(term != "(Intercept)") %>%           # we only correct the predictor tests
  group_by(term) %>%                          # FDR is applied within each predictor,
  mutate(padj = p.adjust(p.value, "BH")) %>%  # across the 140 responses
  ungroup()
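
If you want to see what p.adjust with "BH" is actually doing, here is a minimal sketch of the Benjamini-Hochberg step-up adjustment written out by hand (bh_by_hand is just an illustrative name, not a library function):

# Illustrative re-implementation of the BH adjustment:
# sort the m p-values, scale each by m/rank, then enforce monotonicity
# by taking the running minimum from the largest p-value downwards.
bh_by_hand = function(p) {
  m = length(p)
  o = order(p)
  adj = pmin(p[o] * m / seq_len(m), 1)
  adj = rev(cummin(rev(adj)))
  adj[order(o)]  # back to the original order
}

p_x1 = adjusted$p.value[adjusted$term == "x1"]
all.equal(bh_by_hand(p_x1), p.adjust(p_x1, "BH"))  # should be TRUE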

Before we look at the FDR results, we can think about multiple testing like this: if a predictor has no effect on any of the responses and you run 140 tests, you expect around 0.05 * 140 = 7 of the tests to give you a p-value below 0.05 just by chance. We can check, for each predictor, how many tests have p < 0.05:

adjusted %>% group_by(term) %>% summarize(sig=sum(p.value<0.05))
# A tibble: 4 x 2
  term    sig
  <chr> <int>
1 x1       36
2 x2        7
3 x3        6
4 x4        7

What does the p-value distribution look like? You can see that x1 bucks the trend above, and we can visualize this by plotting the p-value distributions:

library(ggplot2)
adjusted %>%
  ggplot(aes(x = p.value)) + geom_histogram() +
  facet_wrap(~term) + theme_bw()

(Figure: histograms of raw p-values, one panel per predictor)

For x2, x3, and x4, which we simulated under the null (no effect on any response), you can see the p-values follow an even (uniform) distribution.
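
As a quick sanity check (just an illustration, not something you need for the analysis), we can test the null predictors' raw p-values against a uniform distribution:

# Under the null, raw p-values should be roughly Uniform(0,1)
adjusted %>%
  filter(term != "x1") %>%
  group_by(term) %>%
  summarize(ks_p = ks.test(p.value, "punif")$p.value)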

If we simply use a raw cutoff of 0.05, the roughly 7 hits each for x2, x3, and x4 are all false positives, while most of the hits for x1 are real. FDR essentially corrects for this expected distribution of p-values under the null, and we can check how many tests are significant at a 5% FDR:

adjusted %>% group_by(term) %>% summarize(sig=sum(padj<0.05))
# A tibble: 4 x 2
  term    sig
  <chr> <int>
1 x1       31
2 x2        0
3 x3        0
4 x4        0

So we get no hits for x2, x3, x4, which have no effect, while x1, which we simulated with 30 true effects, gives 31 hits. You can also check out this video, which explains in greater detail how the above works.
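
Finally, if you want to pull out which responses survive the 5% FDR for the predictor you care about, a filter like this (a sketch using the adjusted tibble from above) gives you the list:

# Responses significant for x1 at 5% FDR, sorted by adjusted p-value
adjusted %>%
  filter(term == "x1", padj < 0.05) %>%
  arrange(padj) %>%
  select(response, estimate, p.value, padj)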