I want to test which predictor variables in my dataset are likely responsible for variation in newt occurrence and abundance between ponds. The response variable 'newt occurrence and abundance' has 3 categories (0 = absent, 1 = rare, 2 = abundant), so I plan to use multinomial logistic regression. Before fitting it, I wanted to check for multicollinearity between the predictor variables, but I ran into a problem.

I have one binary predictor, 'Fish presence' (1 = present, 0 = absent); 8 percentage-based predictors ('Openwater%', 'Emergentvegetation%', 'Submergvegetation%', 'Floatingvegetation%', 'Grassland%', 'Shrubland%', 'Forest%' and 'Farmland%'); and 8 continuous variables ('Conductivity' plus 7 concentrations: NO2, NO3, PO4, BrO3, F, K and Al). When I use VIFs (variance inflation factors) to assess the degree of collinearity, all percentage-based predictors appear to be collinear, because their VIF values are far above 5. But when I look at the correlation matrix, only two of them are correlated (|r| > 0.60).
I used the following code:
# VIF to test for multicollinearity
library(car)  # assumption: vif() comes from the car package
# Note: glm() without a family argument fits a Gaussian (ordinary linear) model
multicol_adults_verw <- glm(Newts$Adult_newt_present ~ Newts$Fish_present
    + Newts$Conductivity
    + Newts$`Openwater%` + Newts$`Emergentvegetation%` + Newts$`Submergvegetation%`
    + Newts$`Floatingvegetation%` + Newts$`Grassland%` + Newts$`Shrubland%` + Newts$`Forest%`
    + Newts$`Farmland%`
    + Newts$NO2 + Newts$NO3 + Newts$PO4 + Newts$BrO3 + Newts$F + Newts$K + Newts$Al)
vif(multicol_adults_verw)
# This was the result
Predictor                            VIF
Newts$Fish_present              1.293867
Newts$Conductivity              1.810617
Newts$`Openwater%`            280.805856
Newts$`Emergentvegetation%`   127.825638
Newts$`Submergvegetation%`     57.055433
Newts$`Floatingvegetation%`   173.368349
Newts$`Grassland%`           1218.120231
Newts$`Shrubland%`            676.756897
Newts$`Forest%`               446.102839
Newts$`Farmland%`             471.319920
Newts$NO2                       1.837806
Newts$NO3                       1.109595
Newts$PO4                       2.345582
Newts$BrO3                      1.406849
Newts$F                         1.396751
Newts$K                         2.621439
Newts$Al                        1.724767
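For reference, and assuming vif() here is car::vif applied to single-coefficient terms, each reported value follows the standard definition below, where R_j^2 comes from regressing predictor j on all of the other predictors together. A VIF therefore measures joint linear dependence among several predictors, which pairwise correlations can only partly reveal. The second line is a worked check on the largest value above and on the usual threshold of 5.

```latex
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j}
\]
\[
\mathrm{VIF}_{\text{Grassland\%}} = 1218.12 \;\Rightarrow\; R^2 \approx 0.9992,
\qquad
\mathrm{VIF} = 5 \;\Rightarrow\; R^2 = 0.80
\]
```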
# Then I made a correlation matrix of only the percentage-based predictors
percentage_variables <- Newts[, 9:16]  # columns 9 to 16 hold the percentage predictors
percentage_variables
cor(percentage_variables,method = 'spearman')
# This was the result
                     Openwater% Emergentvegetation% Submergvegetation% Floatingvegetation%  Grassland%
Openwater%           1.00000000         -0.43541792        -0.40267003        -0.653890921 -0.21347632
Emergentvegetation% -0.43541792          1.00000000         0.11738954        -0.136158732  0.03574042
Submergvegetation%  -0.40267003          0.11738954         1.00000000         0.137795069  0.22785820
Floatingvegetation% -0.65389092         -0.13615873         0.13779507         1.000000000  0.10829576
Grassland%          -0.21347632          0.03574042         0.22785820         0.108295759  1.00000000
Shrubland%          -0.06501971          0.11712903        -0.12484373        -0.009161974 -0.54978000
Forest%              0.15317819         -0.11078578        -0.22185727         0.072672985 -0.51942932
Farmland%            0.17044214         -0.03061237        -0.02857237        -0.155229861 -0.43197789
                      Shrubland%     Forest%   Farmland%
Openwater%          -0.065019710  0.15317819  0.17044214
Emergentvegetation%  0.117129032 -0.11078578 -0.03061237
Submergvegetation%  -0.124843726 -0.22185727 -0.02857237
Floatingvegetation% -0.009161974  0.07267298 -0.15522986
Grassland%          -0.549780002 -0.51942932 -0.43197789
Shrubland%           1.000000000  0.20730179 -0.16762573
Forest%              0.207301788  1.00000000 -0.13299785
Farmland%           -0.167625729 -0.13299785  1.00000000
# It seems that only 'Floatingvegetation%' and 'Openwater%' are correlated (rho = -0.65; negatively, which makes sense)
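The gap between the two diagnostics can be reproduced on simulated data. The sketch below is only an illustration and rests on an assumption the post does not state: that the four pond-cover percentages (and likewise the four land-use percentages) add up to roughly 100 for each pond. The `cover` data frame and its column names are made up, and nothing here uses the Newts data. VIFs are computed by hand as 1 / (1 - R^2), regressing each column on the other three.

```r
# Simulated illustration only; none of these numbers come from the Newts data.
set.seed(1)
n <- 500

# Four hypothetical cover classes built so that each row sums to 100.
raw <- matrix(rexp(n * 4), ncol = 4)
cover <- 100 * raw / rowSums(raw)

# Small noise so the rows sum to roughly, not exactly, 100
# (an exact sum would make the columns perfectly collinear).
cover <- cover + matrix(rnorm(n * 4, sd = 0.5), ncol = 4)
colnames(cover) <- c("open", "emergent", "submerged", "floating")
cover <- as.data.frame(cover)

# Largest absolute pairwise Spearman correlation (off-diagonal only).
rho <- cor(cover, method = "spearman")
max_abs_rho <- max(abs(rho[upper.tri(rho)]))

# VIF by hand: regress each column on all the others, then 1 / (1 - R^2).
manual_vif <- sapply(names(cover), function(v) {
  f <- reformulate(setdiff(names(cover), v), response = v)
  r2 <- summary(lm(f, data = cover))$r.squared
  1 / (1 - r2)
})

print(round(max_abs_rho, 2))     # pairwise correlations stay moderate
print(round(manual_vif, 1))      # VIFs are nevertheless very large
print(summary(rowSums(cover)))   # row sums cluster around 100
```

If the same assumption holds for the real data, a check such as `summary(rowSums(Newts[, 9:12]))` should show sums close to 100. The column positions are inferred from the order of the correlation matrix and may need adjusting.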
Is there anything I should do to the percentages before testing for multicollinearity, or am I just interpreting the results incorrectly? Thank you in advance for your help!