Regarding the I() term in linear regression modeling in R using lm


I once saw a linear model fitting written as follows:

lm(formula = Ozone ~ Solar.R + Wind + Temp + I(Wind^2) + I(Temp^2) + 
I(Wind * Temp) + I(Wind * Temp^2) + I(Temp * Wind^2) + I(Temp^2 * 
Wind^2), data = airquality)

I am not sure what I() means here. For example, what does I(Wind * Temp^2) do? Can I write it as Wind:Temp^2?


The I() notation in R's formula syntax means 'as is': the expression inside I() is evaluated arithmetically instead of being interpreted as formula operators. For example, I(a+b) adds the sum a+b as a single predictor in the lm model, whereas a plain a + b would add a and b as two separate predictors. In your case, I(Wind * Temp^2) means: include as a predictor variable the product of Wind and Temp squared. I() is needed because operators such as +, *, : and ^ have special meanings inside a formula.
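
One way to convince yourself of this is to compute the product by hand and compare the two fits. This is only a sketch on the built-in airquality data; the column name WindTemp2 and the object names are made up for the illustration.

```r
# airquality ships with base R (datasets package)
aq <- airquality

# Hypothetical helper column: the same product that I(Wind * Temp^2) computes
aq$WindTemp2 <- aq$Wind * aq$Temp^2

# Fit once with I() in the formula and once with the precomputed column
fit_I   <- lm(Ozone ~ I(Wind * Temp^2), data = aq)
fit_col <- lm(Ozone ~ WindTemp2, data = aq)

# Compare the estimates, ignoring the differing coefficient names
same <- all.equal(unname(coef(fit_I)), unname(coef(fit_col)))
print(same)
```

If I() works as described, the two fits should agree; only the coefficient labels differ.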

For more info page 2 here explains it in full detail.

Hope this is clear!

UPDATE: I just want to add Hong Ooi's very good comment on this:

I(Wind * Temp^2) is not the same as Wind:Temp^2

The ^n operator in formula syntax means 'include these variables and all interactions up to n-way'. For example, Y ~ (X + Z + W)^2 is equivalent to Y ~ X + Z + W + X:Z + X:W + Z:W.
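
You can ask R how it expands a formula without fitting anything. A small sketch using terms() from the stats package; X, Z, W and Y are just symbols here, so no data is needed.

```r
# Expand the formula and list the terms R would actually put in the model
f <- Y ~ (X + Z + W)^2
labels_sq <- attr(terms(f), "term.labels")
print(labels_sq)
```

The printed labels should be the three main effects plus the three two-way interactions listed above.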

So, in our case, Wind:Temp^2 means just Wind:Temp, because ^2 applied to the single variable Temp expands to Temp alone.

Small illustration:

Y <- runif(100)
X1 <- runif(100)
X2 <- runif(100)
df <- data.frame(Y,X1,X2)

b <- lm(Y ~ X1:X2^2, data = df)
summary(b)

Call:
lm(formula = Y ~ X1:X2^2, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4802 -0.2490 -0.0173  0.2345  0.5066 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.45126    0.04794   9.413 2.28e-15 ***
X1:X2        0.08991    0.13414   0.670    0.504    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2965 on 98 degrees of freedom
Multiple R-squared:  0.004563,  Adjusted R-squared:  -0.005594 
F-statistic: 0.4493 on 1 and 98 DF,  p-value: 0.5043
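
To see the contrast with I() side by side, the term labels can be compared directly. Again, this is only a sketch; the object names are made up, and the formulas are inspected with terms() without being fitted.

```r
# Without I(), ^2 is a formula operator and the term collapses to X1:X2
lab_plain <- attr(terms(Y ~ X1:X2^2), "term.labels")

# With I(), the square is treated as ordinary arithmetic
lab_inner <- attr(terms(Y ~ X1:I(X2^2)), "term.labels")
lab_whole <- attr(terms(Y ~ I(X1 * X2^2)), "term.labels")

print(lab_plain)
print(lab_inner)
print(lab_whole)
```

For numeric predictors, the last two forms should produce the same model column (X1 times X2 squared), just under different names, while the first form never squares anything.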