How to test multicollinearity between percentage-based predictors


I want to test which predictor variables in my dataset are likely responsible for variation in newt occurrence and abundance between ponds. The response variable 'newt occurrence and abundance' has 3 categories (0 = absent, 1 = rare, 2 = abundant), so I plan to use multinomial logistic regression. Before doing so, I wanted to test for multicollinearity between the predictor variables, but ran into a problem.

I have one binary predictor, 'Fish presence' (1 = present, 0 = absent); 8 percentage-based predictors ('Openwater%', 'Emergentvegetation%', 'Submergvegetation%', 'Floatingvegetation%', 'Grassland%', 'Shrubland%', 'Forest%' and 'Farmland%'); and 8 continuous variables ('Conductivity' and 7 chemical concentrations: NO2, NO3, PO4, BrO3, F, K and Al).

My problem is that when I use VIFs (variance inflation factors) to assess the degree of collinearity, all percentage-based predictors appear to be collinear (their VIF values are far above 5). But when I look at the correlation matrix, only two of them are correlated (|r| > 0.60)...

I used the following code:

# VIF to test for multicollinearity
library(car)  # assumption: vif() comes from the car package

# No family is given, so glm() fits a Gaussian (linear) model here
multicol_adults_verw <- glm(Newts$Adult_newt_present ~ Newts$Fish_present
                     + Newts$Conductivity
                     + Newts$`Openwater%` + Newts$`Emergentvegetation%` + Newts$`Submergvegetation%`
                     + Newts$`Floatingvegetation%` + Newts$`Grassland%` + Newts$`Shrubland%` + Newts$`Forest%`
                     + Newts$`Farmland%`
                     + Newts$NO2 + Newts$NO3 + Newts$PO4 + Newts$BrO3 + Newts$F + Newts$K + Newts$Al)

vif(multicol_adults_verw)

# This was the result
 Newts$Fish_present          Newts$Conductivity          Newts$`Openwater%` 
                   1.293867                    1.810617                  280.805856 
Newts$`Emergentvegetation%`  Newts$`Submergvegetation%` Newts$`Floatingvegetation%` 
                 127.825638                   57.055433                  173.368349 
         Newts$`Grassland%`          Newts$`Shrubland%`             Newts$`Forest%` 
                1218.120231                  676.756897                  446.102839 
          Newts$`Farmland%`                   Newts$NO2                   Newts$NO3 
                 471.319920                    1.837806                    1.109595 
                  Newts$PO4                  Newts$BrO3                     Newts$F 
                   2.345582                    1.406849                    1.396751 
                    Newts$K                    Newts$Al 
                   2.621439                    1.724767 
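
For context on why these numbers can differ from a correlation matrix: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors together, so it reflects multivariate dependence and not only pairwise correlation. Below is a minimal sketch on simulated data, not on the Newts dataset. The data frame `sim`, its columns `a`, `b`, `c`, and the noise level are all hypothetical, and the shares are constructed to sum to roughly 100 per row as an assumption for illustration.

```r
set.seed(1)
n <- 50

# Hypothetical compositional data: three shares that sum to 100 per row
raw <- matrix(runif(n * 3), ncol = 3)
sim <- as.data.frame(100 * raw / rowSums(raw))
names(sim) <- c("a", "b", "c")

# Small measurement noise so the columns are not exactly linearly dependent
sim$c <- sim$c + rnorm(n, sd = 0.5)

# VIF of 'a' by hand: 1 / (1 - R^2) from regressing 'a' on the other predictors
r2_a  <- summary(lm(a ~ b + c, data = sim))$r.squared
vif_a <- 1 / (1 - r2_a)

# Largest absolute pairwise correlation (off-diagonal) for comparison
cors        <- cor(sim)
max_abs_cor <- max(abs(cors[upper.tri(cors)]))

vif_a        # very large
max_abs_cor  # moderate
```

In this toy setup no single pair of columns is strongly correlated, yet each column is almost fully predictable from the other two combined, which is the situation a VIF flags.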

# Then I made a correlation matrix of only the percentage-based predictors

percentage_variables <- Newts[, 9:16]  # columns 9 to 16 hold the eight percentage-based predictors
percentage_variables
cor(percentage_variables,method = 'spearman')

# This was the result
                     Openwater% Emergentvegetation% Submergvegetation% Floatingvegetation%  Grassland%
Openwater%           1.00000000         -0.43541792        -0.40267003        -0.653890921 -0.21347632
Emergentvegetation% -0.43541792          1.00000000         0.11738954        -0.136158732  0.03574042
Submergvegetation%  -0.40267003          0.11738954         1.00000000         0.137795069  0.22785820
Floatingvegetation% -0.65389092         -0.13615873         0.13779507         1.000000000  0.10829576
Grassland%          -0.21347632          0.03574042         0.22785820         0.108295759  1.00000000
Shrubland%          -0.06501971          0.11712903        -0.12484373        -0.009161974 -0.54978000
Forest%              0.15317819         -0.11078578        -0.22185727         0.072672985 -0.51942932
Farmland%            0.17044214         -0.03061237        -0.02857237        -0.155229861 -0.43197789
                      Shrubland%     Forest%   Farmland%
Openwater%          -0.065019710  0.15317819  0.17044214
Emergentvegetation%  0.117129032 -0.11078578 -0.03061237
Submergvegetation%  -0.124843726 -0.22185727 -0.02857237
Floatingvegetation% -0.009161974  0.07267298 -0.15522986
Grassland%          -0.549780002 -0.51942932 -0.43197789
Shrubland%           1.000000000  0.20730179 -0.16762573
Forest%              0.207301788  1.00000000 -0.13299785
Farmland%           -0.167625729 -0.13299785  1.00000000

# It seems that only 'Floatingvegetation%' and 'Openwater%' are correlated (negatively, which makes sense)
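
One check that might be relevant here is whether the percentages within a group add up to a fixed total for every pond. This assumes, without confirmation from the data shown, that the four pond-cover percentages and the four land-use percentages are each shares of a whole. If they are, any one column is nearly determined by the other three in its group, which pairwise correlations would not reveal. A rough sketch with a hypothetical helper `closure_range()` and made-up values (the numbers below are not from the Newts dataset):

```r
# Hypothetical helper: range (min, max) of the row sums over a set of columns
closure_range <- function(df, cols) {
  range(rowSums(df[, cols, drop = FALSE]))
}

# Made-up example rows using the land-use column names from the question
toy <- data.frame(
  `Grassland%` = c(40, 10, 70),
  `Shrubland%` = c(20, 30, 10),
  `Forest%`    = c(30, 50,  5),
  `Farmland%`  = c(10, 10, 15),
  check.names = FALSE  # keep the '%' in the column names
)

land_cols <- c("Grassland%", "Shrubland%", "Forest%", "Farmland%")
land_sums <- closure_range(toy, land_cols)
land_sums  # min and max both at 100 would indicate a closed composition
```

On the real data the same call would take `Newts` and the relevant column names in place of `toy` and `land_cols`.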

Is there maybe anything I should do to the percentages before testing for multicollinearity or am I just interpreting it incorrectly? Thank you in advance for your help!
