I want to calculate the VIF (variance inflation factor) for a very large dataset: 3000 samples and 5000 features.
The standard way of doing this is very slow:
# approach 1
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

cols = 5000
rows = 3000
np.random.seed(2)
df = pd.DataFrame(index=range(rows), data=np.random.randn(rows, cols), columns=[f'F{x}' for x in range(cols)])
X = add_constant(df)  # add an intercept column named 'const'
# one OLS fit per column; this loop is the slow part
pd.Series([variance_inflation_factor(X.values, i)
           for i in range(X.shape[1])],
          index=X.columns)
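The matrix inversion approach is fast, but it does not work because of precision issues:
# approach 2
# rowvar=False: features are in columns; the constant column is left out because its correlation is undefined
corr_matrix = np.corrcoef(df.values, rowvar=False)
inv_mat = np.linalg.inv(corr_matrix)
vif = np.diag(inv_mat)  # the diagonal of the inverse correlation matrix holds the VIFs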
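As a sanity check on approach 2, here is a minimal sketch on a much smaller problem (300 rows, 20 features, sizes picked only for illustration) showing that the diagonal of the inverse correlation matrix should match the regression-based definition VIF_j = 1 / (1 - R_j^2). The helper names `vif_by_regression` and `vif_by_inverse` are made up for this example, and the regression is done with plain NumPy least squares rather than statsmodels.

```python
import numpy as np

def vif_by_regression(data):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    n, p = data.shape
    vifs = np.empty(p)
    for j in range(p):
        y = data[:, j]
        others = np.delete(data, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

def vif_by_inverse(data):
    """Diagonal of the inverse of the feature correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)  # columns are the variables
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(2)
small = rng.standard_normal((300, 20))
small[:, 1] += 0.8 * small[:, 0]  # inject some collinearity between F0 and F1

slow = vif_by_regression(small)
fast = vif_by_inverse(small)
print(np.allclose(slow, fast))
```

This only holds when there are more rows than features. With 3000 rows and 5000 features the sample correlation matrix has rank at most 2999, so it is singular and `np.linalg.inv` cannot return a meaningful result; that may be what shows up as the "precision issue" here, rather than floating-point error alone.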
Is there a way to make approach 1 faster, or to fix the precision issue in approach 2?