In the stroke prediction dataset, I've created dummy variables for all the categorical variables, e.g. gender_Male and gender_Female, smoking_status_smokes and smoking_status_unknown, and so on. To check for multicollinearity across all the variables (numerical and dummy), I've applied the variance_inflation_factor function from statsmodels:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Predictors only: everything except the target column 'stroke'
X = new_df.drop(columns="stroke")
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
# One VIF per predictor column, computed against all the other predictors
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data
The output that I get is below:
    feature                         VIF
0   age                             2.836394
1   hypertension                    1.111484
2   heart_disease                   1.113943
3   avg_glucose_level               1.107552
4   bmi                             1.342729
5   gender_Female                   inf
6   gender_Male                     inf
7   ever_married_No                 inf
8   ever_married_Yes                inf
9   work_type_Govt_job              inf
10  work_type_Never_worked          inf
11  work_type_Private               inf
12  work_type_Self-employed         inf
13  work_type_children              inf
14  Residence_type_Rural            inf
15  Residence_type_Urban            inf
16  smoking_status_formerly smoked  inf
17  smoking_status_never smoked     inf
18  smoking_status_smokes           inf
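The same pattern seems to show up on a small made-up frame, so it may not be specific to the stroke data. Below is a minimal sketch, assuming synthetic data: the frame `toy`, its columns, and the helper `vif_table` are all hypothetical and not part of my actual code. It compares keeping every dummy level with dropping one level per category (`drop_first=True`) and adding an explicit constant column.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in for the real data (hypothetical columns, random values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.normal(50, 15, n),
    "gender": rng.choice(["Female", "Male"], n),
    "residence": rng.choice(["Rural", "Urban"], n),
})

def vif_table(X):
    """Return one VIF per column of X as a Series indexed by column name."""
    X = X.astype(float)  # get_dummies may return bool columns
    return pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )

# All dummy levels kept: within each category the dummies sum to 1,
# so every dummy is an exact linear combination of the others.
full = vif_table(pd.get_dummies(toy))

# One level dropped per category, plus an explicit intercept column.
reduced = vif_table(pd.get_dummies(toy, drop_first=True).assign(const=1.0))

print(full)
print(reduced.drop("const"))
```

On this toy frame the dummy columns in `full` come out infinite (or extremely large), while the values in `reduced` are finite, which makes me suspect that keeping every level of each category is involved.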
Can somebody please explain why the VIFs of the dummy variables are infinite? Is there a better way to check for multicollinearity? Thanks.