Normality test for multi-grouped data in R

2.1k Views Asked by At

I'm trying to run a normality test over my data in R. My dataset is a data frame formed by 4 columns of characters and one column with numeric values. At the moment, I'm using the Rstatix package in R and other types of statistical tests are working well like wilcox_test() and kruskal_test(), but when I try to run shapiro_test(), it doesn't work, giving the following error:

data %>% group_by(treatment,chase,measure) %>% shapiro_test(value)
x
+-<error/dplyr:::mutate_error>
| Problem with `mutate()` input `data`.
| x Must group by variables found in `.data`.
| * Column `variable` is not found.
| i Input `data` is `map(.data$data, .f, ...)`.
\-<error/rlang_error>
  Must group by variables found in `.data`.
  * Column `variable` is not found.
Backtrace:
  1. dplyr::group_by(., treatment, chase, measure)
  2. rstatix::shapiro_test(., value)
 33. rstatix:::.f(.x[[i]], ...)
 11. dplyr::group_by(., variable)
 43. dplyr::group_by_prepare(.data, ..., .add = .add)

My data set is the following:

    groups treatment chase measure     value
1   uncoated   control    30  colocA 17.912954
2   uncoated   control    30  colocA 16.806409
3   uncoated   control    30  colocA 20.322467
4   uncoated   control    30  colocA 15.953959
5   uncoated   control    30  colocA 22.566408
6   uncoated   control    30  colocA 17.780975
7   uncoated   control    30  colocA 19.764265
8   uncoated   control    30  colocA 16.928500
9   uncoated   control    30  colocA 22.931763
10  uncoated   control    30  colocA 18.101085
11  uncoated   control    30  distCC  1.159298
12  uncoated   control    30  distCC  1.174931
13  uncoated   control    30  distCC  1.190449
14  uncoated   control    30  distCC  1.265717
15  uncoated   control    30  distCC  1.103845
16  uncoated   control    30  distCC  1.125344
17  uncoated   control    30  distCC  1.290703
18  uncoated   control    30  distCC  1.172462
19  uncoated   control    30  distCC  1.065353
20  uncoated   control    30  distCC  1.048523
21    coated   control    30  colocA  6.062000
22    coated   control    30  colocA  9.370714
23    coated   control    30  colocA 12.898769
24    coated   control    30  colocA 20.398458
25    coated   control    30  colocA 11.174150
26    coated   control    30  colocA 17.574250
27    coated   control    30  colocA 12.481857
28    coated   control    30  colocA 21.565250
29    coated   control    30  colocA 21.743409
30    coated   control    30  colocA 12.699600
31    coated   control    30  distCC  4.317260
32    coated   control    30  distCC  4.263914
33    coated   control    30  distCC  5.136013
34    coated   control    30  distCC  3.142906
35    coated   control    30  distCC  2.617590
36    coated   control    30  distCC  4.149614
37    coated   control    30  distCC  4.995551
38    coated   control    30  distCC  3.851803
39    coated   control    30  distCC  4.606119
40    coated   control    30  distCC  2.820326

Thank you in advance.

2

There are 2 best solutions below

0
On BEST ANSWER

Here is a way with stats::shapiro.test.

library(dplyr)
library(broom)

data %>%
  group_by(treatment, chase, measure) %>% 
  do(tidy(shapiro.test(.$value)))
## A tibble: 2 x 6
## Groups:   treatment, chase, measure [2]
#  treatment chase measure statistic p.value method                     
#  <chr>     <int> <chr>       <dbl>   <dbl> <chr>                      
#1 control      30 colocA      0.940 0.244   Shapiro-Wilk normality test
#2 control      30 distCC      0.811 0.00128 Shapiro-Wilk normality test
0
On

We could also wrap the output in a list in summarise and unnest it

library(dplyr)
library(tidyr)
library(broom)
dat %>% 
    group_by(treatment, chase, measure) %>%
    summarise(out = list(shapiro.test(value) %>% tidy), .groups = 'drop') %>%
    unnest(c(out))
# A tibble: 2 x 6
#  treatment chase measure statistic p.value method                     
#  <chr>     <int> <chr>       <dbl>   <dbl> <chr>                      
#1 control      30 colocA      0.940 0.244   Shapiro-Wilk normality test
#2 control      30 distCC      0.811 0.00128 Shapiro-Wilk normality test
 

data

dat <- structure(list(groups = c("uncoated", "uncoated", "uncoated", 
"uncoated", "uncoated", "uncoated", "uncoated", "uncoated", "uncoated", 
"uncoated", "uncoated", "uncoated", "uncoated", "uncoated", "uncoated", 
"uncoated", "uncoated", "uncoated", "uncoated", "uncoated", "coated", 
"coated", "coated", "coated", "coated", "coated", "coated", "coated", 
"coated", "coated", "coated", "coated", "coated", "coated", "coated", 
"coated", "coated", "coated", "coated", "coated"), treatment = c("control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control", "control", "control", "control", 
"control", "control", "control"), chase = c(30L, 30L, 30L, 30L, 
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L), measure = c("colocA", 
"colocA", "colocA", "colocA", "colocA", "colocA", "colocA", "colocA", 
"colocA", "colocA", "distCC", "distCC", "distCC", "distCC", "distCC", 
"distCC", "distCC", "distCC", "distCC", "distCC", "colocA", "colocA", 
"colocA", "colocA", "colocA", "colocA", "colocA", "colocA", "colocA", 
"colocA", "distCC", "distCC", "distCC", "distCC", "distCC", "distCC", 
"distCC", "distCC", "distCC", "distCC"), value = c(17.912954, 
16.806409, 20.322467, 15.953959, 22.566408, 17.780975, 19.764265, 
16.9285, 22.931763, 18.101085, 1.159298, 1.174931, 1.190449, 
1.265717, 1.103845, 1.125344, 1.290703, 1.172462, 1.065353, 1.048523, 
6.062, 9.370714, 12.898769, 20.398458, 11.17415, 17.57425, 12.481857, 
21.56525, 21.743409, 12.6996, 4.31726, 4.263914, 5.136013, 3.142906, 
2.61759, 4.149614, 4.995551, 3.851803, 4.606119, 2.820326)),
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
"25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", 
"36", "37", "38", "39", "40"))