Python VIF returns infinity values for dummy variables


In the stroke prediction dataset, I've created dummy variables for all the categorical variables, e.g. gender_Male and gender_Female, smoking_status_smokes and smoking_status_unknown, and so on. To check for multicollinearity among all the variables (numerical and dummy), I've applied the variance inflation factor (VIF) function from statsmodels:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictors only: every column except the target
X = new_df.loc[:, new_df.columns != 'stroke']

vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
# One VIF per predictor column, computed against all the other columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data

The output that I get is below:

    feature                         VIF
0   age                             2.836394
1   hypertension                    1.111484
2   heart_disease                   1.113943
3   avg_glucose_level               1.107552
4   bmi                             1.342729
5   gender_Female                   inf
6   gender_Male                     inf
7   ever_married_No                 inf
8   ever_married_Yes                inf
9   work_type_Govt_job              inf
10  work_type_Never_worked          inf
11  work_type_Private               inf
12  work_type_Self-employed         inf
13  work_type_children              inf
14  Residence_type_Rural            inf
15  Residence_type_Urban            inf
16  smoking_status_formerly smoked  inf
17  smoking_status_never smoked     inf
18  smoking_status_smokes           inf

Can somebody please explain why the VIFs of the dummy variables are infinite? Is there a better way to check for multicollinearity? Thanks.
