Is a multinomial logistic regression the appropriate "test" for this situation?

I have two columns in my dataset. y is the dependent variable and is categorical with three levels (unordered levels A, B and C) and x is the numeric independent variable. The example below illustrates the situation, but my actual dataset is larger, with over 1000 rows.

+------+---+
|  x   | y |
+------+---+
| 5.93 | A |
| 4.46 | A |
| 4.63 | A |
| 5.07 | A |
| 5.71 | A |
| 6.81 | B |
| 6.45 | B |
| 6.07 | B |
| 7.26 | C |
| 8.24 | C |
| 6.25 | C |
| 7.34 | C |
| 7.17 | C |
+------+---+

My null hypothesis is that the proportions of A, B and C in column y are independent of the x values. That is, the probability of observing A, B or C is the same at every value of x. The alternative hypothesis is that these proportions depend on x.

I am looking for a statistical test for this.

I am wondering if performing a multinomial logistic regression and assessing the significance of the coefficients is a reasonable way to go, or if there is a better test.

Best answer

If you need a formal hypothesis test, multinomial logistic regression is most likely the way to go. The other option is to discretize your continuous variable into bins and then test for an association between the bins and your categories, although binning discards information and the result depends on where you place the cut points.
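As a rough sketch of the binning alternative, the snippet below cuts x into three bins at its tertiles and tests the resulting contingency table. The choice of three quantile-based bins, the names x_bin, tab and assoc, and the use of Fisher's exact test are assumptions made for this tiny example; with over 1000 rows, chisq.test(tab) would be the more usual choice.

```r
# Toy data copied from the question
df <- data.frame(
  x = c(5.93, 4.46, 4.63, 5.07, 5.71, 6.81, 6.45,
        6.07, 7.26, 8.24, 6.25, 7.34, 7.17),
  y = factor(rep(c("A", "B", "C"), times = c(5, 3, 5)))
)

# Discretize x into three bins at its tertiles (an arbitrary choice)
breaks <- quantile(df$x, probs = c(0, 1/3, 2/3, 1))
df$x_bin <- cut(df$x, breaks = breaks, include.lowest = TRUE)

# Cross-tabulate bins against categories
tab <- table(df$x_bin, df$y)
print(tab)

# Fisher's exact test, used here because the cell counts are very small
assoc <- fisher.test(tab)
print(assoc$p.value)
```

A small p-value here suggests the category proportions differ between bins, but the conclusion can change with the number and position of the bins.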

You can check this post's accepted answer for how to test each coefficient separately. The downside of that approach is that you need to set one level of y as the reference, so each test only compares one other level against that reference.

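Since your null is that "proportions of A, B and C in column y are independent of the x values", you can instead test your model against a null model that contains only the intercepts. This is normally done with a likelihood ratio test, which assesses all coefficients for x at once and does not depend on the choice of reference level. Below is how to do it in R: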

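# Toy data from the question, in dput() form
df = structure(list(x = c(5.93, 4.46, 4.63, 5.07, 5.71, 6.81, 6.45, 
6.07, 7.26, 8.24, 6.25, 7.34, 7.17), y = structure(c(1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor")), class = "data.frame", row.names = 0:12)

library(car)   # provides Anova()
library(nnet)  # provides multinom()

# Fit the multinomial logistic regression of y on x
fit = multinom(y ~ x, data = df)
# Likelihood ratio test for the x term (the model is refitted without x)
Anova(fit)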

# weights:  6 (2 variable)
initial  value 14.281960 
final  value 13.954126 
converged
Analysis of Deviance Table (Type II tests)

Response: y
  LR Chisq Df Pr(>Chisq)    
x   18.717  2  8.624e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
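In the output above, the likelihood ratio statistic for x is 18.717 on 2 degrees of freedom (p = 8.624e-05), so the null hypothesis that the proportions do not depend on x is rejected for this toy data. The trace lines with "6 (2 variable)" weights appear to come from the intercept-only model that Anova() refits for the comparison.

The same test can be written out by hand, which makes the comparison against the null model explicit. This is a minimal sketch rather than what Anova() does internally; the names fit_full, fit_null, lr_stat, lr_df and p_value are made up for illustration.

```r
library(nnet)

# Toy data copied from the question
df <- data.frame(
  x = c(5.93, 4.46, 4.63, 5.07, 5.71, 6.81, 6.45,
        6.07, 7.26, 8.24, 6.25, 7.34, 7.17),
  y = factor(rep(c("A", "B", "C"), times = c(5, 3, 5)))
)

# Full model with x, and null model with intercepts only
fit_full <- multinom(y ~ x, data = df, trace = FALSE)
fit_null <- multinom(y ~ 1, data = df, trace = FALSE)

# Likelihood ratio statistic: twice the gain in log-likelihood
lr_stat <- as.numeric(2 * (logLik(fit_full) - logLik(fit_null)))

# Degrees of freedom: difference in the number of estimated parameters
lr_df <- attr(logLik(fit_full), "df") - attr(logLik(fit_null), "df")

# Upper-tail chi-squared probability gives the p-value
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(LR = lr_stat, df = lr_df, p = p_value))
```

Calling anova(fit_null, fit_full) should report the same comparison in table form.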