I want to perform variable selection using Lasso regression, since I am not sure how many lagged variables of X still have an effect on my target y. However, both the resulting model and the set of coefficients shrunk to zero change with the number of input variables.
For example, I have n = 295 observations. If I run LassoCV on 10 lagged input variables, the 5th lag is set to 0. If I use only 8 lagged input variables, the 4th lag may turn out to be 0 instead. So my variable selection depends on the number of variables I start with, which makes me think the result can't be trusted. What am I doing wrong?
Since in my application n > p, I don't think this is due to multiple minima of the Lasso criterion. I do get ConvergenceWarnings very often, so I increased the number of iterations and the tolerance. I am not very familiar with the duality gap, but it does seem very large. Maybe the error lies here?
ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 31069.631120879843, tolerance: 12362.796494481028 model = cd_fast.enet_coordinate_descent_gram(
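One thing I am unsure about is whether unscaled columns contribute to the large duality gap, since I understand coordinate descent can struggle when features are on very different scales. Here is a minimal, self-contained sketch of standardizing the columns before the fit (the X below is synthetic, not my data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic matrix with columns on wildly different scales,
# just to illustrate the transform.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(295, 3))

# Standardize each column to mean 0, standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # each column ~0
print(X_scaled.std(axis=0).round(6))   # each column ~1
```

If scaling were the issue, LassoCV would then be fit on X_scaled instead of the raw X_train.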
My code in Python:

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

n_lambs = 50

cv = TimeSeriesSplit(
    n_splits=5,
    gap=0
)

model = LassoCV(
    alphas=np.logspace(-4, 2, n_lambs),
    fit_intercept=True,
    cv=cv,
    n_jobs=-1,
    max_iter=1000000,
    tol=0.001
)
fit = model.fit(X_train, y_train)
In this code, I vary X_train to contain different numbers of lags.
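For concreteness, this is roughly how I build the lagged design matrix (make_lags and the toy series below are illustrative stand-ins for my actual preprocessing):

```python
import numpy as np

def make_lags(x, n_lags):
    """Build a design matrix whose columns are x lagged 1..n_lags steps.

    Row t contains [x[t-1], x[t-2], ..., x[t-n_lags]], so the first
    n_lags observations are dropped along with the matching targets.
    """
    X = np.column_stack(
        [x[n_lags - k : len(x) - k] for k in range(1, n_lags + 1)]
    )
    y = x[n_lags:]
    return X, y

x = np.arange(12, dtype=float)        # toy series 0, 1, ..., 11
X_train, y_train = make_lags(x, n_lags=3)
print(X_train[0], y_train[0])         # [2. 1. 0.] 3.0
```

Changing n_lags here is what I mean by varying the number of input variables.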
(By the way, any suggestions for other variable-selection methods would be appreciated too.)