Perform a Kruskal-Wallis test for a large number of combinations in a data frame


I have a data frame df in R with 50 unique combinations of A and B. For each combination of A and B, I want to perform a Kruskal-Wallis test: kruskal.test(D ~ C, data = df)

I want to find out for which combinations of A and B the null hypothesis is rejected.

How can I do this without writing a separate test for each combination? A sample of my data is below.

A     B       C     D
mix1 size1    1     0.2
mix1 size1    2     0.15
mix1 size1    3     0.22
mix1 size1    4     0.215
mix2 size1    1     0.2
mix2 size1    2     0.15
mix2 size1    3     0.2
mix2 size1    4     0.15
mix2 size2    1     0.21
mix2 size2    2     0.11
mix2 size2    3     0.23
mix2 size2    4     0.615
...
mix22 size1    1     0.01
mix22 size1    2     0.18
mix22 size1    3     0.7
mix22 size1    4     0.17

My expected output is a data frame or table with the p-value from the Kruskal-Wallis test for each combination of A and B.

A     B    P
mix1 size1 0.005
mix2 size1 0.211

Perhaps with something from the *apply family?


There are 2 best solutions below

Jonathan V. Solórzano On BEST ANSWER

Here's a tidyverse approach using dplyr and rstatix. I used the data posted by @jay.sf.

library(dplyr)
library(rstatix)

df |>
  group_by(A,B) |>
  kruskal_test(D~C)

# A tibble: 4 × 8
#  A     B     .y.       n statistic    df     p method        
#* <chr> <chr> <chr> <int>     <dbl> <int> <dbl> <chr>         
#1 mix1  size1 D         4       3       3 0.392 Kruskal-Wallis
#2 mix2  size1 D         4       3       3 0.392 Kruskal-Wallis
#3 mix2  size2 D         4       3       3 0.392 Kruskal-Wallis
#4 mix22 size1 D         4       3       3 0.392 Kruskal-Wallis
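If only the A, B, P table from the question is wanted, a variant with plain dplyr and base kruskal.test might look like the sketch below. The 0.05 threshold and the column name reject are assumptions made for illustration; they are not part of the question.

```r
library(dplyr)

# Same 16 sample rows as in the other answer, named df as in the question
df <- data.frame(
  A = rep(c("mix1", "mix2", "mix2", "mix22"), each = 4),
  B = rep(c("size1", "size1", "size2", "size1"), each = 4),
  C = rep(1:4, times = 4),
  D = c(0.2, 0.15, 0.22, 0.215, 0.2, 0.15, 0.2, 0.15,
        0.21, 0.11, 0.23, 0.615, 0.01, 0.18, 0.7, 0.17)
)

alpha <- 0.05  # assumed significance level

res <- df |>
  group_by(A, B) |>
  # one row per combination: p-value of the Kruskal-Wallis test of D by C
  summarise(P = kruskal.test(x = D, g = C)$p.value, .groups = "drop") |>
  # flag the combinations for which the null hypothesis is rejected
  mutate(reject = P < alpha)

res
```

On the sample rows each level of C has a single observation per group, so every P comes out at about 0.392, matching the rstatix output above, and no combination is flagged.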
jay.sf On

Using interaction in by.

by(dat, with(dat, interaction(A, B)), \(x) {
  with(x, kruskal.test(D, C))[c('statistic', 'parameter', 'p.value')]
}) |> do.call(what='rbind')
#             statistic parameter p.value
# mix1.size1  2379      3         0      
# mix2.size1  2620      3         0      
# mix22.size1 2460      3         0      
# mix2.size2  2537      3         0      

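Data (a 16-row sample; on these rows alone every group gives a statistic of 3 and p ≈ 0.392, so the output above presumably came from a larger data set):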

dat <- structure(list(A = c("mix1", "mix1", "mix1", "mix1", "mix2", 
"mix2", "mix2", "mix2", "mix2", "mix2", "mix2", "mix2", "mix22", 
"mix22", "mix22", "mix22"), B = c("size1", "size1", "size1", 
"size1", "size1", "size1", "size1", "size1", "size2", "size2", 
"size2", "size2", "size1", "size1", "size1", "size1"), C = c(1L, 
2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), 
    D = c(0.2, 0.15, 0.22, 0.215, 0.2, 0.15, 0.2, 0.15, 0.21, 
    0.11, 0.23, 0.615, 0.01, 0.18, 0.7, 0.17)), class = "data.frame", row.names = c(NA, 
-16L))
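To get the A, B, P layout asked for in the question from base R alone, one possibility is split plus lapply, as sketched below. The data are rebuilt inline so the block runs on its own; the name res is only illustrative.

```r
# Same 16 rows as the data above, rebuilt compactly
dat <- data.frame(
  A = rep(c("mix1", "mix2", "mix2", "mix22"), each = 4),
  B = rep(c("size1", "size1", "size2", "size1"), each = 4),
  C = rep(1:4, times = 4),
  D = c(0.2, 0.15, 0.22, 0.215, 0.2, 0.15, 0.2, 0.15,
        0.21, 0.11, 0.23, 0.615, 0.01, 0.18, 0.7, 0.17)
)

# One data frame per combination of A and B that actually occurs;
# drop = TRUE discards empty combinations such as mix1/size2
groups <- split(dat, list(dat$A, dat$B), drop = TRUE)

# Run the test on each piece and keep the keys plus the p-value
res <- do.call(rbind, lapply(groups, function(x) {
  data.frame(A = x$A[1], B = x$B[1],
             P = kruskal.test(D ~ C, data = x)$p.value)
}))
rownames(res) <- NULL

res
```

The result is an ordinary data frame with one row per combination, so the combinations that reject the null hypothesis at a chosen level can be picked out with something like res[res$P < 0.05, ].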