Multicollinearity test with car::vif


I am trying to run car::vif() in R to test for multicollinearity. However, when I run this code:

library(car)  # provides vif()

reg.model1 <- log(Price2) ~ Detached.house + Semi.detached.house +
  Attached.houses + Apartment +
  Stock.apartment + Housing.cooperative + Sole.owner + Age +
  BRA + Bedrooms + Balcony + Lotsize + Sentrum + Alna + Vestre.Aker +
  Nordstrand + Marka + Ullern + Østensjø + Søndre.Nordstrand + Stovner +
  Nordre.Aker + Bjerke + Grorud + Gamle.Oslo + St..Hanshaugen +
  Grünerløkka + Sagene + Frogner
reg1 <- lm(formula = reg.model1, data = Data)
vif(reg1)

I receive this error in the console:

Error in vif.default(reg1) : there are aliased coefficients in the model.

From what I have read, this means that some variables in the model are highly correlated. When I look at the correlation matrix, the only variable with high correlations is the dependent variable, Price, and I have read that this is not a problem for the dependent variable. I also found that BRA has a correlation of 0.8, so I ran the model again without it, but I still get the same error. Does anyone know what the problem could be, or what I could try differently?

There are 2 answers below

Best answer

This error tells you that one or more sets of predictors are perfectly (multi)collinear. If you look at coef(reg1) you will see at least one NA value, and if you run summary(reg1) you will see the message

([n] not defined because of singularities)

(for some n>=1). Examining the pairwise correlations of the predictor variables is not enough: predictors A, B, C can be perfectly multicollinear as a set even when none of their pairwise correlations is exactly 1 in absolute value. Probably the most common case is where A, B, C are dummy variables that describe a mutually exclusive and complete set of possibilities, i.e. for each observation exactly one of A, B, C is 1 and the other two are 0. Together with the intercept, such a set is linearly dependent. I strongly suspect that this is what's going on with your last 16 or so variables, which seem to be boroughs of Oslo ...
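To make the point about pairwise correlations concrete, here is a minimal sketch in base R. The data frame `toy` and its columns are made up for illustration; they are not the asker's data.

```r
# Hypothetical toy data: every observation belongs to exactly one of A, B, C.
group <- rep(c("A", "B", "C"), each = 30)
toy <- data.frame(
  A = as.numeric(group == "A"),
  B = as.numeric(group == "B"),
  C = as.numeric(group == "C")
)
set.seed(1)
toy$y <- 1 + 2 * toy$A - toy$B + rnorm(90)

# No pairwise correlation between the dummies is anywhere near +/-1 ...
pairwise <- cor(toy[, c("A", "B", "C")])
max_abs_offdiag <- max(abs(pairwise[upper.tri(pairwise)]))
print(max_abs_offdiag)

# ... yet A + B + C equals the intercept column in every row,
# so lm() cannot estimate all three dummy coefficients.
fit <- lm(y ~ A + B + C, data = toy)
print(coef(fit))  # one coefficient comes back as NA
```

The NA coefficient is the "aliased" one that vif() complains about, even though the correlation matrix looks harmless.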

Checking which coefficients of the regression are NA (as suggested by @Axeman) can indicate where the problem is; a related answer explains how you can use model.matrix() and caret::findLinearCombos() to figure out exactly which sets of predictors are causing the problem. (If all of your predictors are simple numeric variables you can skip model.matrix().)
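As a rough sketch of that diagnosis, using only base R (the `toy` data and its column names are hypothetical, and alias() is swapped in here for caret::findLinearCombos() so that no extra package is needed):

```r
# Hypothetical toy data: 'north', 'south', 'east' are exhaustive dummies.
region <- rep(c("north", "south", "east"), times = 20)
set.seed(2)
toy <- data.frame(
  north = as.numeric(region == "north"),
  south = as.numeric(region == "south"),
  east  = as.numeric(region == "east"),
  size  = runif(60, 30, 150)
)
toy$y <- 10 + 0.05 * toy$size + 0.5 * toy$north + rnorm(60)

fit <- lm(y ~ size + north + south + east, data = toy)

# 1. Coefficients that lm() could not estimate come back as NA.
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# 2. alias() reports how each aliased term depends on the other columns.
dependencies <- alias(fit)$Complete
print(dependencies)

# 3. The design matrix has fewer independent columns than columns.
X <- model.matrix(fit)
rank_deficit <- ncol(X) - qr(X)$rank
print(rank_deficit)
```

If caret is installed, caret::findLinearCombos(X) on the same model matrix should point at the same columns.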

If your problem is indeed caused by including a dummy variable for every possible geographic region, the simplest/best solution is to include geographic region (borough) in the model as a factor: if you do this, R will automatically generate a set of dummies/contrasts, but it will leave one dummy out automatically to avoid this kind of problem. If you later want to go back and get predicted values for every borough, you can use tools from the emmeans or effects packages.
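A minimal sketch of the factor approach, assuming the data currently hold one 0/1 column per borough. The `toy` data frame, the `log_price` and `borough` columns, and the choice of three boroughs are made up for illustration:

```r
# Hypothetical toy data with one 0/1 column per borough
# (borough names borrowed from the question).
boroughs <- c("Frogner", "Sagene", "Grorud")
toy <- data.frame(
  Frogner = rep(c(1, 0, 0), times = 20),
  Sagene  = rep(c(0, 1, 0), times = 20),
  Grorud  = rep(c(0, 0, 1), times = 20)
)
set.seed(3)
toy$Age <- runif(60, 0, 100)
toy$log_price <- 15 + 0.3 * toy$Frogner - 0.002 * toy$Age + rnorm(60, sd = 0.1)

# Collapse the dummies into a single factor: for each row,
# pick the name of the column that holds the 1.
dummy_matrix <- as.matrix(toy[, boroughs])
toy$borough <- factor(boroughs[max.col(dummy_matrix, ties.method = "first")],
                      levels = boroughs)

# lm() expands the factor into dummies and leaves out the first level
# (here Frogner) as the baseline, so nothing is aliased.
fit <- lm(log_price ~ Age + borough, data = toy)
print(coef(fit))

# With car installed, vif() should now run without the aliasing error.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(fit))
}
```

The baseline level can be changed with relevel() if a different reference borough is more natural.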

Another answer

I searched around for solutions, since I couldn't solve the problem from the answers alone. The answers did, however, help me understand it better. The solution turned out to be as simple as putting a minus instead of a plus in front of one dummy variable in each group. This was my original code, as posted in the question:

reg.model1 <- log(Price2) ~ Detached.house + Semi.detached.house +
  Attached.houses + Apartment +
  Stock.apartment + Housing.cooperative + Sole.owner + Age +
  BRA + Bedrooms + Balcony + Lotsize + Sentrum + Alna + Vestre.Aker +
  Nordstrand + Marka + Ullern + Østensjø + Søndre.Nordstrand + Stovner +
  Nordre.Aker + Bjerke + Grorud + Gamle.Oslo + St..Hanshaugen +
  Grünerløkka + Sagene + Frogner
reg1 <- lm(formula = reg.model1, data = Data)
vif(reg1)

To solve my issue, I simply had to change my code to:

# One dummy per group is dropped with '-': Apartment, Sole.owner, Frogner
reg.model1 <- log(Price2) ~ Detached.house + Semi.detached.house +
  Attached.houses - Apartment +
  Stock.apartment + Housing.cooperative - Sole.owner + Age +
  BRA + Bedrooms + Balcony + Lotsize + Sentrum + Alna + Vestre.Aker +
  Nordstrand + Marka + Ullern + Østensjø + Søndre.Nordstrand + Stovner +
  Nordre.Aker + Bjerke + Grorud + Gamle.Oslo + St..Hanshaugen +
  Grünerløkka + Sagene - Frogner
reg1 <- lm(formula = reg.model1, data = Data)
vif(reg1)

As you can see, I have 3 sets of dummies, and to make sure perfect multicollinearity doesn't occur I have to remove one dummy from each set. I removed Apartment for the type of home, Sole.owner for the type of ownership, and Frogner for the district. This website explains the problem and the solution much better and more simply than I can: https://www.learndatasci.com/glossary/dummy-variable-trap/
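For anyone wondering what the minus actually does: in an R formula, `- x` removes the term x from the model rather than subtracting anything, so writing `- Apartment` should be equivalent to simply leaving Apartment out. A small check, using shortened hypothetical formulas with names borrowed from the code above:

```r
# Shortened, hypothetical versions of the formula above.
with_minus <- log(Price2) ~ Attached.houses - Apartment + Age
left_out   <- log(Price2) ~ Attached.houses + Age

# terms() only parses the formula, so no data set is needed here.
labels_minus <- attr(terms(with_minus), "term.labels")
labels_out   <- attr(terms(left_out), "term.labels")

print(labels_minus)
print(labels_out)
print(identical(labels_minus, labels_out))
```

Both formulas end up with the same model terms, so the dropped dummy becomes the baseline category for its group.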