I am trying to run a car::vif()
test in R to check for multicollinearity. However, when I run the code
reg.model1 <- log(Price2) ~ Detached.house + Semi.detached.house +
Attached.houses + Apartment +
Stock.apartment + Housing.cooperative + Sole.owner + Age +
BRA + Bedrooms + Balcony + Lotsize + Sentrum + Alna + Vestre.Aker +
Nordstrand + Marka + Ullern + Østensjø + Søndre.Nordstrand + Stovner +
Nordre.Aker + Bjerke + Grorud + Gamle.Oslo + St..Hanshaugen +
Grünerløkka + Sagene + Frogner
reg1 <- lm(formula = reg.model1, data = Data)
vif(reg1)
I receive this error in the console:
Error in vif.default(reg1) : there are aliased coefficients in the model.
From what I have read, this means that something in the model is highly correlated. When I look at the correlation matrix, the only thing that is highly correlated is the dependent variable, Price. But I also read somewhere that a highly correlated dependent variable is okay. I also found that BRA has a correlation of 0.8, so I tried running the model again without it, but I still get the same error. Does anyone know what the problem could be, or what I could try to do differently?
This is telling you that some set(s) of predictors is/are perfectly (multi)collinear; if you looked at
coef(reg1)
you would see at least one NA
value, and if you ran summary(reg1)
you would see the message (n not defined because of singularities) (for some n >= 1). Examining the pairwise correlations of the predictor variables is not enough, because if you have (e.g.) predictors A, B, C where (the absolute values of) none of the pairwise correlations are exactly 1, they can still be multicollinear. (Probably the most common case is where A, B, C are dummy variables that describe a mutually exclusive and complete set of possibilities [i.e. for each observation exactly one of A, B, C is 1 and the other two are 0]. I strongly suspect that this is what's going on with your last 16 or so variables, which seem to be boroughs of Oslo ...)
Checking to see which coefficients of the regression are
NA
(as suggested by @Axeman) can suggest where the problem is; this answer explains how you can use model.matrix()
and caret::findLinearCombos()
to figure out exactly which sets of predictors are causing the problem. (If all of your predictors are simple numeric variables, you can skip model.matrix()
.)

If your problem is indeed caused by including a dummy variable for every possible geographic region, the simplest/best solution is to include geographic region (borough) in the model as a factor: if you do this, R will automatically generate a set of dummies/contrasts, but it will leave one dummy out automatically to avoid this kind of problem. If you later want to go back and get predicted values for every borough, you can use tools from the
emmeans
or effects
packages.
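A sketch of the model.matrix()/findLinearCombos() diagnostic, using simulated data in place of your Data (it assumes the caret package is installed):

```r
set.seed(1)
d <- data.frame(g = sample(c("A", "B", "C"), 50, replace = TRUE))

# Recreate the dummy-variable trap: the three indicators sum to 1
d$A <- as.numeric(d$g == "A")
d$B <- as.numeric(d$g == "B")
d$C <- as.numeric(d$g == "C")
d$y <- rnorm(50)

fit <- lm(y ~ A + B + C, data = d)
X <- model.matrix(fit)       # expand the fitted model into a numeric matrix

combos <- caret::findLinearCombos(X)
combos$linearCombos          # each element is one set of linearly dependent columns
colnames(X)[combos$remove]   # columns you could drop to break the dependency
```

In your case you would run this on model.matrix(reg1) and inspect which of your borough dummies show up in the flagged sets.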
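And a sketch of the factor-based fix; the borough names and effect sizes below are made up, and the emmeans call is commented out since it needs that package:

```r
set.seed(1)
d <- data.frame(
  borough = factor(sample(c("Sentrum", "Alna", "Ullern"), 80, replace = TRUE)),
  BRA     = runif(80, 40, 200)
)
d$logPrice <- 14 + 0.01 * d$BRA + 0.1 * as.numeric(d$borough) + rnorm(80, sd = 0.1)

# One factor instead of a separate dummy column per borough: R builds the
# dummies itself and leaves out a reference level, so nothing is aliased
fit <- lm(logPrice ~ BRA + borough, data = d)
coef(fit)     # no NA values, so car::vif(fit) will now run

# Predicted means for every borough, reference level included:
# emmeans::emmeans(fit, ~ borough)
```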