Why TukeyHSD test keeps returning NA for a linear model in R?

21 Views Asked by At

I have a dataframe with four columns: BMI, AGE, SEX and Geno. Geno is a factor variable has 4 levels(A|A,A|G,G|A,ref:G|G). I would like to fit a linear regression model use BMI~AGE + SEX + Geno, and would do a TukeyHSD test for a pairwise comparison between different levels of Geno. The lm model shows Geno is significant, but the TukeyHSD test keeps returning NA. Did I do something wrong or TukeyHSD is not suitable for linear model with both numeric and factor variables? I'm a new to the post-hoc test. If Tukey test is not a good option, any other method I can use? Thanks.

the data frame looks like:

       RSID Indiv_ID Indiv_Geno SEX AGE      BMI
1 rs1121980    23102        A|A   1  90 34.59993
2 rs1121980      158        A|A   1  58 26.39891
3 rs1121980     9790        A|A   2  90 29.05149
4 rs1121980    22320        G|G   2  73 23.62653
5 rs1121980    10441        G|A   1  59 32.26947
6 rs1121980    10376        G|G   2  90 26.81053

My code is as below:

fto_geno_pheno = na.omit(fto_geno_pheno)

snp_1 = fto_geno_pheno %>%
  filter(RSID == "rs1121980") 

sapply(snp_1, class)

snp_1$SEX = factor(snp_1$SEX)

snp_1$Indiv_Geno = factor(snp_1$Indiv_Geno)

snp_1$Indiv_Geno = relevel(snp_1$Indiv_Geno, ref = "G|G")

lm = lm(snp_1$BMI ~ snp_1$SEX + snp_1$AGE + snp_1$Indiv_Geno, snp_1)

av = aov(lm)

summary(av)
summary(lm)

TukeyHSD(x = av, which = "Indiv_Geno", conf.level = 0.95)

lm model looks like below:

                   Df Sum Sq Mean Sq F value   Pr(>F)    
snp_1$SEX           1   3412    3412 108.399  < 2e-16 ***
snp_1$AGE           1    845     845  26.851 2.25e-07 ***
snp_1$Indiv_Geno    3    770     257   8.157 2.02e-05 ***
Residuals        8424 265171      31                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = snp_1$BMI ~ snp_1$SEX + snp_1$AGE + snp_1$Indiv_Geno, 
    data = snp_1)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.657  -3.885  -0.833   2.892  34.548 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         29.847764   0.285073 104.702  < 2e-16 ***
snp_1$SEX2          -1.248649   0.122730 -10.174  < 2e-16 ***
snp_1$AGE           -0.020984   0.004077  -5.147 2.71e-07 ***
snp_1$Indiv_GenoA|A  0.851648   0.178004   4.784 1.74e-06 ***
snp_1$Indiv_GenoA|G  0.145647   0.162781   0.895   0.3710    
snp_1$Indiv_GenoG|A  0.329719   0.165168   1.996   0.0459 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.611 on 8424 degrees of freedom
Multiple R-squared:  0.01861,   Adjusted R-squared:  0.01802 
F-statistic: 31.94 on 5 and 8424 DF,  p-value: < 2.2e-16
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = lm)

$<NA>
     diff lwr upr p adj

Warning messages:
1: In replications(paste("~", xx), data = mf) :
  non-factors ignored: snp_1$AGE
2: In qtukey(conf.level, length(means), x$df.residual) : NaNs produced```


  [1]: https://i.stack.imgur.com/PSdQF.png
0

There are 0 best solutions below