Factorial Anova in R

I am having trouble understanding the summary of a factorial ANOVA in R. I don't understand why I get 2 Df for only the first variable. A, B, C and D all have 3 levels, so as I understand it I should get 2 Df for each of them and for their interactions. Please help me fix the code or understand the results.

P.S. Where can I find the list of options for summary()? I saw an example that removed the * after the significance level, and I want to see what options I have.

Thank you in advance

Here is the complete data set I have:

 Runs I  A  B  C  D AB  E AD BC  F  G  H  J  K B1 B2     y
1     1 1 -1 -1 -1 -1  1  1  1  1  1  1 -1 -1 -1 -1  1 190.9
2     2 1  1 -1 -1 -1 -1 -1 -1  1  1  1  1  1  1 -1 -1 436.2
3     3 1 -1  1 -1 -1 -1  1  1 -1 -1  1  1  1 -1  1 -1 480.3
4     4 1  1  1 -1 -1  1 -1 -1 -1 -1  1 -1 -1  1  1  1 406.3
5     5 1 -1 -1  1 -1  1 -1  1 -1  1 -1  1 -1  1  1 -1 212.9
6     6 1  1 -1  1 -1 -1  1 -1 -1  1 -1 -1  1 -1  1  1 478.7
7     7 1 -1  1  1 -1 -1 -1  1  1 -1 -1 -1  1  1 -1  1 396.5
8     8 1  1  1  1 -1  1  1 -1  1 -1 -1  1 -1 -1 -1 -1 349.7
9     9 1 -1 -1 -1  1  1  1 -1  1 -1 -1 -1  1  1  1 -1 119.7
10   10 1  1 -1 -1  1 -1 -1  1  1 -1 -1  1 -1 -1  1  1 372.2
11   11 1 -1  1 -1  1 -1  1 -1 -1  1 -1  1 -1  1 -1  1 411.6
12   12 1  1  1 -1  1  1 -1  1 -1  1 -1 -1  1 -1 -1 -1 382.8
13   13 1 -1 -1  1  1  1 -1 -1 -1 -1  1  1  1 -1 -1  1 161.2
14   14 1  1 -1  1  1 -1  1  1 -1 -1  1 -1 -1  1 -1 -1 424.3
15   15 1 -1  1  1  1 -1 -1 -1  1  1  1 -1 -1 -1  1 -1 322.8
16   16 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 302.1
17   17 1  0  0  0  0  0  0  0  0  0  0  0 -1  1  0  0 302.4
18   18 1  0  0  0  0  0  0  0  0  0  0  0  1 -1  0  0 318.2
19   19 1  0  0  0  0  0  0  0  0  0  0  0 -1  1  0  0 332.8

> data
###Factors
> A
 [1] -1 1  -1 1  -1 1  -1 1  -1 1  -1 1  -1 1  -1 1  0  0  0 
Levels: -1 0 1
> B
 [1] -1 -1 1  1  -1 -1 1  1  -1 -1 1  1  -1 -1 1  1  0  0  0 
Levels: -1 0 1
> C
 [1] -1 -1 -1 -1 1  1  1  1  -1 -1 -1 -1 1  1  1  1  0  0  0 
Levels: -1 0 1
> D
 [1] -1 -1 -1 -1 -1 -1 -1 -1 1  1  1  1  1  1  1  1  0  0  0 
Levels: -1 0 1

####Response variable
> data$y
 [1] 190.9 436.2 480.3 406.3 212.9 478.7 396.5 349.7 119.7 372.2 411.6 382.8 161.2 424.3 322.8 302.1 302.4 318.2
[19] 332.8

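# Convert the four design columns to factors (levels -1, 0, 1)
A <- as.factor(data$A)
B <- as.factor(data$B)
C <- as.factor(data$C)
D <- as.factor(data$D)

# Main-effects model with C listed first, then its ANOVA table
out3 <- lm(data$y ~ C + B + A + D)
fit1 <- aov(out3)
summary(fit1)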

> summary(fit1)
            Df Sum Sq Mean Sq F value Pr(>F)  
C            2   2743    1372   0.170 0.8456  
B            1  26896   26896   3.332 0.0910 .
A            1  45839   45839   5.679 0.0331 *
D            1  12928   12928   1.602 0.2279  
Residuals   13 104934    8072

The same ANOVA with the variables in a different order (B + A + D + C):

> summary(fit1)
            Df Sum Sq Mean Sq F value Pr(>F)  
B            2  28199   14100   1.747 0.2129  
A            1  45839   45839   5.679 0.0331 *
D            1  12928   12928   1.602 0.2279  
C            1   1440    1440   0.178 0.6796  
Residuals   13 104934    8072

If I run the ANOVA with only 2 levels (excluding 0 for all variables, i.e. using only rows [1:16], since the last 3 runs are at the "0" level), it comes out fine: I get 1 Df for every variable (every row except Residuals).
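For reference, the two-level check described above can be sketched roughly as follows. The design columns and response are retyped from the table, and the names d, d16, fit16 and df16 are placeholders chosen for this sketch, not part of the original code.

```r
# Design columns and response retyped from the data table above
A <- c(rep(c(-1, 1), times = 8), 0, 0, 0)
B <- c(rep(rep(c(-1, 1), each = 2), times = 4), 0, 0, 0)
C <- c(rep(rep(c(-1, 1), each = 4), times = 2), 0, 0, 0)
D <- c(rep(c(-1, 1), each = 8), 0, 0, 0)
y <- c(190.9, 436.2, 480.3, 406.3, 212.9, 478.7, 396.5, 349.7,
       119.7, 372.2, 411.6, 382.8, 161.2, 424.3, 322.8, 302.1,
       302.4, 318.2, 332.8)
d <- data.frame(A, B, C, D, y)

# Keep only the 16 factorial runs, then convert to factors:
# the 0 level never occurs in these rows, so each factor has 2 levels
d16 <- d[1:16, ]
d16[c("A", "B", "C", "D")] <- lapply(d16[c("A", "B", "C", "D")], factor)

fit16 <- aov(y ~ C + B + A + D, data = d16)
summary(fit16)

# Degrees of freedom for the rows C, B, A, D, Residuals
df16 <- summary(fit16)[[1]]$Df
```

Here df16 holds the Df column of the table, which makes it easy to compare against the 19-run fit.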

Best answer

At first I could not see how the degrees of freedom could come out wrong, but the cause turned out to be a simple one that is easy to overlook. Here is what I found:

data <- read.table(header=TRUE,text='Runs I  A  B  C  D AB  E AD BC  F  G  H  J  K B1 B2     y
1     1 1 -1 -1 -1 -1  1  1  1  1  1  1 -1 -1 -1 -1  1 190.9
2     2 1  1 -1 -1 -1 -1 -1 -1  1  1  1  1  1  1 -1 -1 436.2
3     3 1 -1  1 -1 -1 -1  1  1 -1 -1  1  1  1 -1  1 -1 480.3
4     4 1  1  1 -1 -1  1 -1 -1 -1 -1  1 -1 -1  1  1  1 406.3
5     5 1 -1 -1  1 -1  1 -1  1 -1  1 -1  1 -1  1  1 -1 212.9
6     6 1  1 -1  1 -1 -1  1 -1 -1  1 -1 -1  1 -1  1  1 478.7
7     7 1 -1  1  1 -1 -1 -1  1  1 -1 -1 -1  1  1 -1  1 396.5
8     8 1  1  1  1 -1  1  1 -1  1 -1 -1  1 -1 -1 -1 -1 349.7
9     9 1 -1 -1 -1  1  1  1 -1  1 -1 -1 -1  1  1  1 -1 119.7
10   10 1  1 -1 -1  1 -1 -1  1  1 -1 -1  1 -1 -1  1  1 372.2
11   11 1 -1  1 -1  1 -1  1 -1 -1  1 -1  1 -1  1 -1  1 411.6
12   12 1  1  1 -1  1  1 -1  1 -1  1 -1 -1  1 -1 -1 -1 382.8
13   13 1 -1 -1  1  1  1 -1 -1 -1 -1  1  1  1 -1 -1  1 161.2
14   14 1  1 -1  1  1 -1  1  1 -1 -1  1 -1 -1  1 -1 -1 424.3
15   15 1 -1  1  1  1 -1 -1 -1  1  1  1 -1 -1 -1  1 -1 322.8
16   16 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 302.1
17   17 1  0  0  0  0  0  0  0  0  0  0  0 -1  1  0  0 302.4
18   18 1  0  0  0  0  0  0  0  0  0  0  0  1 -1  0  0 318.2
19   19 1  0  0  0  0  0  0  0  0  0  0  0 -1  1  0  0 332.8')

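# Treat the design columns as factors so model.matrix() creates dummy columns
data[c("A","B","C","D")] <- lapply(data[c("A","B","C","D")], factor)

a.dummies <- model.matrix(~A, data=data)
b.dummies <- model.matrix(~B, data=data)
c.dummies <- model.matrix(~C, data=data)
d.dummies <- model.matrix(~D, data=data)

# Drop each intercept column and bind the dummies side by side
a<-cbind(a.dummies[,-1],b.dummies[,-1])
b<-cbind(c.dummies[,-1],d.dummies[,-1])
all<-cbind(a,b)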

I took the liberty of creating the dummy variables myself so I could check them one by one, and the problem revealed itself in a simple correlation table:

cor(all)

           A0         A1         B0         B1         C0         C1         D0         D1
A0  1.0000000 -0.3692745  1.0000000 -0.3692745  1.0000000 -0.3692745  1.0000000 -0.3692745
A1 -0.3692745  1.0000000 -0.3692745  0.1363636 -0.3692745  0.1363636 -0.3692745  0.1363636
B0  1.0000000 -0.3692745  1.0000000 -0.3692745  1.0000000 -0.3692745  1.0000000 -0.3692745
B1 -0.3692745  0.1363636 -0.3692745  1.0000000 -0.3692745  0.1363636 -0.3692745  0.1363636
C0  1.0000000 -0.3692745  1.0000000 -0.3692745  1.0000000 -0.3692745  1.0000000 -0.3692745
C1 -0.3692745  0.1363636 -0.3692745  0.1363636 -0.3692745  1.0000000 -0.3692745  0.1363636
D0  1.0000000 -0.3692745  1.0000000 -0.3692745  1.0000000 -0.3692745  1.0000000 -0.3692745
D1 -0.3692745  0.1363636 -0.3692745  0.1363636 -0.3692745  0.1363636 -0.3692745  1.0000000

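The lm function (like many other model-fitting functions) drops any column of the design matrix that is an exact linear combination of earlier columns; a duplicate column, i.e. a correlation of exactly 1, is the simplest case. Here the three center-point runs (17 to 19) have every factor at 0 at the same time, so the dummies A0, B0, C0 and D0 are the same column. In your first model C is listed first, so C0 is kept and B0, A0 and D0 are removed, which effectively reduces A, B and D to 2 levels each. Therefore C keeps 2 degrees of freedom while A, B and D get 1 each. If you change the order, whichever factor is listed first keeps the 2 Df, which is exactly what you see when B comes first.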
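One way to confirm this directly is to look at which coefficients lm reports as NA. This is a minimal sketch: the design columns are retyped from the table in the question, and the names d, fit_c_first, fit_b_first, dropped, df_c_first and df_b_first are my own placeholders.

```r
# Design columns and response retyped from the question's data table
A <- c(rep(c(-1, 1), times = 8), 0, 0, 0)
B <- c(rep(rep(c(-1, 1), each = 2), times = 4), 0, 0, 0)
C <- c(rep(rep(c(-1, 1), each = 4), times = 2), 0, 0, 0)
D <- c(rep(c(-1, 1), each = 8), 0, 0, 0)
y <- c(190.9, 436.2, 480.3, 406.3, 212.9, 478.7, 396.5, 349.7,
       119.7, 372.2, 411.6, 382.8, 161.2, 424.3, 322.8, 302.1,
       302.4, 318.2, 332.8)
d <- data.frame(A = factor(A), B = factor(B), C = factor(C),
                D = factor(D), y = y)

# Model with C listed first, as in the question
fit_c_first <- lm(y ~ C + B + A + D, data = d)
coef(fit_c_first)   # coefficients of dropped (aliased) columns are NA

# Names of the dummy columns that lm could not estimate
dropped <- names(which(is.na(coef(fit_c_first))))
dropped

# Same terms with B listed first
fit_b_first <- lm(y ~ B + A + D + C, data = d)

# Degrees of freedom per term; the last entry is Residuals
df_c_first <- anova(fit_c_first)$Df
df_b_first <- anova(fit_b_first)$Df
```

In both fits only the first factor in the formula keeps its 0-level dummy, so the Df column has the same pattern in either order. Calling alias() on the fitted model is another way to list the dependent columns.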

Mystery solved!!!