If/For loop: If shapiro-Wilk is >0.05, perform one-way ANOVA, if <0.05 perform Kruskal-wallis in r

307 Views Asked by At

Programmatically select/run group comparisons (incl post-hoc)

I would like to perform a loop in r, where if the p value of a shaprio-wilk test is less than 0.05, It performs a kruskall-wallis test, but if it is greater than 0.05, a one-way ANOVA is performed. Extending this, if the result of the kruskall-wallis or one-way ANOVA p, value i s < 0.05, a post-hoc test is performed. Would this be possible in r?

#Example columns
Label <- c("blue", "red", "green", "blue", "red", "green","blue", "red", "green","blue", "red", "green","blue", "red", "green")
n <- c(10, 223, 12890, 34, 78, 1902, 34, 211, 1007,209, 330, 90446, 801, 1029, 9011)

#make example data frame
data <- data.frame(Label,n)

#shapiro-wilk test for normality
shapiro.test(data$n)

#if p-value for shapiro-wilk is >0.05, perform one-way ANOVA
OneWay <- lm(n ~ Label, data = data) 
anova(OneWay)
#if one-way ANOVA is < 0.05, perform post-hoc
library(mosaic)
TukeyHSD(OneWay)

#If shapiro-wilk < 0.05, perform Krushkall-wallis
kruskal <- kruskal.test(data$n, data$Label) #dependent variable followed by the category (predictor) variable
#if krushkall-wallis is < 0.05, perform post-hoc
library(mosaic)
TukeyHSD(kruskal)
3

There are 3 best solutions below

0
On BEST ANSWER

Thanks to the above answers, particularly @TarJae, who helped get the ball rolling. For transparency and to hopefully help others, I am posting the resulting code that filters whether data is normally distributed, and has an equal variance. If either of these conditions are violated, the appropriate ANOVA/Welch ANOVA/Kruskal-Wallis test is performed. It produces the resulting post-hoc tests as well.

#Example columns
Label <- c("blue", "red", "green", "blue", "red", "green","blue", "red", "green","blue", "red", "green","blue", "red", "green")
n <- c(10, 223, 12890, 34, 78, 1902, 34, 211, 1007,209, 330, 90446, 801, 1029, 9011) #use this data to demonstrate non-parametric test
n <- rnorm(15, mean = 30, sd = 1) #use this data to demonstrate normal data

#make example data frame
data <- data.frame(Label,n)

# ################# following statements follow this pattern
# if(condition1) {        
#   if(condition2) {     
#     code if both pass
#   } else {            
#     code if 1 passes, 2 fails
#   }
# } else {            
#   code if 1 fails
# }
# ##########
shapiro <- shapiro.test(data$n)


#Homogeneity of variance testing
#If it is normally distributed perform bartlett test
if(shapiro$p.value > 0.05) {
  library(tidyverse)
  bart <- bartlett.test(n ~ Label, data=data)
} 

#If it is normally distributed and homogeneity of variance is equal, perform one-way ANOVA. If normally distributed, but unequal variance, perform Welch ANOVA, if it is not normally distributed, perform Fligner-Killeen’s homogeneity of variance test which is more appropriate for non-normal data
if ((shapiro$p.value > 0.05) && exists("bart")){
  if(bart$p.value > 0.05) { 
    OneWay <- lm(n ~ Label, data = data) 
    oneway <- anova(OneWay) #normal distibution and equal varaince
  } else  {
    welch <- oneway.test(n ~ Label, data = data) #Welch ANOVA, normal distribution, unequal variance
  }
} else {
  fligner <- fligner.test(n ~ Label, data = data)
   #if not normal distribution, perform fligner-killen homogeneity of variance test
} 

#If data is normally distributed, and Fligner homogeneity is equal, perform one-way ANOVA. If data is normally distributed but unequal variance, perform Welch ANOVA, if not normal distribution perform kruskal wallis test.
if((shapiro$p.value > 0.05) && exists("fligner")) {
  if(fligner$p.value < 0.05) {
    welch <- oneway.test(n ~ Label, data = data) #Welch ANOVA
    welch
  } else {
    OneWay <- lm(n ~ Label, data = data) 
    oneway <- anova(OneWay)
  }
  } else {
    kruskal <- kruskal.test(n ~ Label, data = data )
  }



#post hoc testing for appropriate comparison tests

if(exists("oneway")) {
  f <- summary(OneWay)$fstatistic
  oneway.pvalue <- unname(pf(f[1],f[2],f[3],lower.tail=F))
  oneway.pvalue
}

if(exists("oneway.pvalue") && oneway.pvalue< 0.05){
  oneway_ph <- TukeyHSD(oneway)
  print(oneway_ph)
}

if(exists("welch") && welch$p.value <0.05) {
  welch_ph <- pairwise.t.test(data$n, data$Label, p.adjust.method = "bonferroni")
  print(welch_ph)
}


if(exists("kruskal") && kruskal$p.value < 0.05) {
  library(FSA)
  kruskal_ph <- dunnTest(n ~ as.factor(Label),
                data=data,
                method="bonferroni")
  print(kruskal_ph)
}

2
On

Here you go:

We can do it with a simple if else statetment: The challenge is to access the p-value of the different tests (all are shown in the code (for me the overall p-value of lm was the most challenging).

Note: You are asking for a TukeyHSD test if krushkall-wallis is < 0.05 in your code. In this case TukeyHSD is not appropriate as it is performed after ANOVA. After kruskal wallis test there are other posthoc tests, one of is the Bonferroni (= Dunn Procedure) which can be used in ANOVA and Kruskal Wallis in equally sample sizes.

#Example columns
Label <- c("blue", "red", "green", "blue", "red", "green","blue", "red", "green","blue", "red", "green","blue", "red", "green")
n <- c(10, 223, 12890, 34, 78, 1902, 34, 211, 1007,209, 330, 90446, 801, 1029, 9011)

data <- data.frame(Label,n)


# perform shapiro wilk test
shap <- shapiro.test(data$n)

# 1. if p-value for shapiro-wilk is > 0.05, perform one-way ANOVA

if(shap$p.value > 0.05){
  OneWay <- lm(n ~ Label, data = data) 
  anova(OneWay)
}

###############################################################

# 2. if one-way ANOVA is < 0.05, perform post-hoc

# get overall p-value of model
f <- summary(OneWay)$fstatistic
oneway.pvalue <- unname(pf(f[1],f[2],f[3],lower.tail=F))
oneway.pvalue

# [1] 0.2083276

if(oneway.pvalue < 0.05){
  TukeyHSD(OneWay)
}

###############################################################

# 3. If shapiro-wilk < 0.05, perform Krushkall-wallis

if(shap$p.value < 0.05){
  kt <- kruskal.test(n ~ Label, data = data )
  kt
} 

# Kruskal-Wallis rank sum test

# data:  n by Label
# Kruskal-Wallis chi-squared = 9.9377, df = 2, p-value = 0.006951

###############################################################

# 4. if krushkall-wallis is < 0.05, perform post-hoc
if(kt$p.value < 0.05){
  #install.packages("FSA")
  library(FSA)
  DT = dunnTest(n ~ Label,
                data=data,
                method="bh")
  DT
}

# DT
# Dunn (1964) Kruskal-Wallis multiple comparison
# p-values adjusted with the Benjamini-Hochberg method.

# Comparison         Z     P.unadj       P.adj
# 1 blue - green -3.114051 0.001845373 0.005536119
# 2   blue - red -1.132382 0.257473719 0.257473719
# 3  green - red  1.981669 0.047516285 0.071274427

1
On

One approach

  • load libraries and prepare sample data:
library(dplyr)
library(mosaic)
library(tidyr)

df <- data.frame(Label = c("blue", "red", "green", "blue", "red", "green","blue", "red",
                           "green","blue", "red", "green","blue", "red", "green"),
                 n = c(10, 223, 12890, 34, 78, 1902, 34, 211, 1007,209, 330,
                       90446, 801, 1029, 9011)
                 )

The following exploits the advantage that a dataframe's "cells" cannot only hold atomic values but also lists (which makes the respective column a "list column"). You can thus also store, e.g. an linear model's output in a cell, if you wrap it in a list. Take care, though, to extract the list item before mutateing it (hence the various [[1]] in below code.

df |>
  summarise(is_normal = shapiro.test(n)[["p.value"]] > .5,
            lin_mod = list(lm(n ~ Label)),
            p_between_groups = ifelse(is_normal,
                                      anova(lin_mod[[1]])[["Pr(>F)"]][1],
                                      kruskal.test(n ~  Label)[["p.value"]]                                      
                                      )
            ) |>
  mutate(test_type = ifelse(is_normal, "anova", "kruskal"),
         p_tukey = ifelse(is_normal & test_type == "anova",
                          list(TukeyHSD(lin_mod[[1]])[["Label"]][, "p adj"]),
                          NA
                          )
         ) |>
  unnest_longer(p_tukey)

output:

+ # A tibble: 3 x 6
    p_shapiro lin_mod p_between_groups test_between p_tukey p_tukey_id
        <dbl> <list>             <dbl> <chr>          <dbl> <chr>     
1 0.000000459 <lm>             0.00695 kruskal        0.265 green-blue
2 0.000000459 <lm>             0.00695 kruskal        1.00  red-blue  
3 0.000000459 <lm>             0.00695 kruskal        0.270 red-green  

(note that I forced Tukey's test which wouldn't be triggered by your sample data)