Significance tests across more than two levels of a categorical variable in R


I'm trying to determine whether the frequencies of a categorical variable with 8 levels differ significantly between two groups. In this case, people in two groups were asked for their favorite color, with 8 choices. I want to know whether the proportion of people in Group 1 picking a given color differs significantly from the proportion in Group 2 picking the same color.

For example, 64.2% of Group 1 picked Orange compared to 53% of Group 2. Is this difference significant? Here is a frequency table produced with tabpct():

tabpct(all_data$Colors, all_data$Group, graph = FALSE)
Column percent 
                         all_data$Group
all_data$Colors         Grp 1   %     Grp 2   %
           Red          3    (1.3)    2    (1.0)
           Blue         19   (8.4)    10   (5.0)
           Yellow       1    (0.4)    2    (1.0)
           Green        4    (1.8)    5    (2.5)
           Purple       1    (0.4)    2    (1.0)
           Orange       145  (64.2)   106  (53.0)
           Pink         1    (0.4)    1    (0.5)
           Brown        52   (23.0)   72   (36.0)
           Total        226  (100)    200  (100)

I'm sure there is a simpler way, but I can't seem to figure it out. Any help would be appreciated!

I've tried fitting an ANOVA model and running a TukeyHSD test on it, but I get the following error even though the data contain no NA, NaN, Inf, or 0 values:

ColorComp <- aov(Color ~ Group, data = all_data)
TukeyHSD(ColorComp)

> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
> NA/NaN/Inf in 'y'
> In addition: Warning message:
> In storage.mode(v) <- "double" : NAs introduced by coercion

I have also tried regression and got the same error.


There are 2 best solutions below

IRTFM On

Testing individual color differences is not statistically valid unless there is some a priori reason that makes just that color the focus of the analysis.
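If per-color comparisons are wanted anyway, one cautious option is to treat them as exploratory and correct for multiple testing. The following is only a sketch under that assumption: it runs one 2x2 Fisher test per color (that color versus all others) and applies a Holm adjustment. The object names grp1, grp2, p_raw and p_holm are made up for illustration; the counts are copied from the question's table.

```r
# Counts copied from the question's table (illustrative object names)
grp1 <- c(Red = 3, Blue = 19, Yellow = 1, Green = 4,
          Purple = 1, Orange = 145, Pink = 1, Brown = 52)
grp2 <- c(Red = 2, Blue = 10, Yellow = 2, Green = 5,
          Purple = 2, Orange = 106, Pink = 1, Brown = 72)

# One 2x2 Fisher test per color: this color vs. all other colors, by group
p_raw <- sapply(names(grp1), function(col) {
  tab <- matrix(c(grp1[[col]], sum(grp1) - grp1[[col]],
                  grp2[[col]], sum(grp2) - grp2[[col]]),
                nrow = 2)
  fisher.test(tab)$p.value
})

# Holm adjustment controls the family-wise error rate across the 8 tests
p_holm <- p.adjust(p_raw, method = "holm")

round(cbind(raw = p_raw, holm = p_holm), 4)
```

The adjusted p-values are the ones to read when no single color was singled out in advance.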

The Fisher test using a Monte-Carlo simulation gives only borderline, suggestive evidence of a difference between the two distributions:

read.table(text=txt, head=TRUE)
  Colors Grp1     X. Grp2   X..1
1    Red    3  (1.3)    2  (1.0)
2   Blue   19  (8.4)   10  (5.0)
3 Yellow    1  (0.4)    2  (1.0)
4  Green    4  (1.8)    5  (2.5)
5 Purple    1  (0.4)    2  (1.0)
6 Orange  145 (64.2)  106 (53.0)
7   Pink    1  (0.4)    1  (0.5)
8  Brown   52 (23.0)   72 (36.0)
> dat <-read.table(text=txt, head=TRUE)
> fisher.test(dat[c(2,4)])

    Fisher's Exact Test for Count Data

data:  dat[c(2, 4)]
p-value = 0.06452
alternative hypothesis: two.sided

A chi-square test can be run, but its validity is doubtful here because several cells have expected counts below 5, which is what triggers the warning:

chisq.test(dat[c(2,4)])

    Pearson's Chi-squared test

data:  dat[c(2, 4)]
X-squared = 11.512, df = 7, p-value = 0.1178

Warning message:
In chisq.test(dat[c(2, 4)]) : Chi-squared approximation may be incorrect
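To see where that warning comes from, the expected counts under independence can be computed by hand. This is a minimal sketch: the matrix name counts is an assumption made for illustration, the values are copied from the question's table, and the final call shows how a Monte Carlo p-value can be requested explicitly from fisher.test() via simulate.p.value.

```r
# Counts copied from the question's table (rows = colors, columns = groups)
counts <- matrix(c(3, 19, 1, 4, 1, 145, 1, 52,
                   2, 10, 2, 5, 2, 106, 1, 72),
                 ncol = 2,
                 dimnames = list(c("Red", "Blue", "Yellow", "Green",
                                   "Purple", "Orange", "Pink", "Brown"),
                                 c("Grp1", "Grp2")))

# Expected count per cell: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 2)

# How many cells fall below the usual rule-of-thumb threshold of 5
sum(expected < 5)

# Fisher's test with a simulated p-value; set.seed() makes the run repeatable
set.seed(1)
fisher.test(counts, simulate.p.value = TRUE, B = 10000)
```

Because the p-value is simulated, it will vary slightly with the seed and the number of replicates B.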

DaveArmstrong On

Here's the result using simulate.p.value in chisq.test():

mat <- matrix(c(  3,   2,
                 19,  10,
                  1,   2,
                  4,   5,
                  1,   2,
                145, 106,
                  1,   1,
                 52,  72), ncol = 2, byrow = TRUE)
colnames(mat) <- c("Grp1", "Grp2")
rownames(mat) <- c("Red", "Blue", "Yellow", "Green", "Purple", "Orange", "Pink", "Brown")
mat
#>        Grp1 Grp2
#> Red       3    2
#> Blue     19   10
#> Yellow    1    2
#> Green     4    5
#> Purple    1    2
#> Orange  145  106
#> Pink      1    1
#> Brown    52   72

chisq.test(mat, simulate.p.value=TRUE, B=10000)
#> 
#>  Pearson's Chi-squared test with simulated p-value (based on 10000
#>  replicates)
#> 
#> data:  mat
#> X-squared = 11.512, df = NA, p-value = 0.09839

Created on 2023-11-07 with reprex v2.0.2
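As a possible follow-up, the object returned by chisq.test() also carries standardized residuals (stdres), which give a rough indication of which colors contribute most to the overall statistic. This is a sketch for exploration rather than a formal per-color test; the seed value and the name res are arbitrary choices made for illustration.

```r
# Same table as above, rebuilt so the snippet runs on its own
mat <- matrix(c(3, 2, 19, 10, 1, 2, 4, 5, 1, 2, 145, 106, 1, 1, 52, 72),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("Red", "Blue", "Yellow", "Green",
                                "Purple", "Orange", "Pink", "Brown"),
                              c("Grp1", "Grp2")))

set.seed(123)  # arbitrary seed so the simulated p-value is repeatable
res <- chisq.test(mat, simulate.p.value = TRUE, B = 10000)

# Standardized residuals: large absolute values flag the cells that
# deviate most from what independence would predict
round(res$stdres, 2)
```

With only two groups, the two columns of residuals mirror each other in sign, so reading one column is enough.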