I am familiar with GridSearchCV in sklearn and have used it to grid search over parameters with non-time-series data, but I'm not sure how to do this when one of the parameters I want to optimize is the size of the training window for my LASSO regression. I have market data where each data point corresponds to an hour-long interval, and I am trying to fit a LASSO model that trains over some lookback window (say the previous 60 hours) and forecasts volatility for the next hour. I want to optimize both this lookback (i.e. training) window and the LASSO regularization penalty.
My current approach is to hard-code the model to step through time: given a training window, forecast the next hour, then shift the window forward one hour and forecast again, and so on. I grid search over a set of LASSO penalty parameters with a fixed training window, select the optimal lambda, then grid search with that fixed lambda over a set of training window sizes. However, this is very inefficient and might also miss the best pair of parameters, since the two are tuned separately rather than jointly.
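To make the stepping procedure concrete, here is a minimal sketch of the rolling one-step-ahead forecast I described (assuming X and y are NumPy arrays of features and targets; rolling_forecast, window, and alpha are names I've made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

def rolling_forecast(X, y, window, alpha):
    """One-step-ahead forecasts from a Lasso refit on each sliding window."""
    preds = []
    for t in range(window, len(y)):
        model = Lasso(alpha=alpha)
        # train only on the most recent `window` observations
        model.fit(X[t - window:t], y[t - window:t])
        # forecast the next hour from the current feature row
        preds.append(model.predict(X[t:t + 1])[0])
    return np.array(preds)
```

Each forecast at time t only ever sees data strictly before t, which is what keeps this free of lookahead bias.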
What I have so far is this:
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

scores0 = []
param_search = {'alpha': np.logspace(-4, 0, 15)}
X = df.iloc[:, :-1]
Y = df.iloc[:, -1]
btscv = BlockingTimeSeriesSplit(n_splits=200)
for i in range(30):
    model = Lasso()
    finder0 = GridSearchCV(
        estimator=model,
        param_grid=param_search,
        scoring='r2',
        n_jobs=4,
        cv=btscv,
        verbose=1,
        pre_dispatch=8,
        error_score=-999,
        return_train_score=True
    )
    finder0.fit(X, Y)
    best_params0 = finder0.best_params_
    best_score0 = round(finder0.best_score_, 4)
    scores0.append(best_score0)
However, this only finds the optimal alpha for Lasso with n_splits fixed at 200 in my blocked time-series split. I want to find the optimal alpha AND n_splits jointly, since I am doing a rolling LASSO regression as my model.
Does anyone have experience doing this without introducing data leakage in the cross-validation? Thanks
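For reference, the brute-force fallback I'm trying to avoid is just nesting the window search around the alpha search, scoring every pair on rolling one-step-ahead forecasts (a self-contained sketch, assuming X and y are arrays; best_window_and_alpha is a hypothetical name, and this only uses past data for each forecast, so it should be leakage-free, just slow):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

def best_window_and_alpha(X, y, windows, alphas):
    """Score every (window, alpha) pair on rolling one-step-ahead
    forecasts and return the best pair by out-of-sample R^2."""
    results = {}
    for window in windows:
        for alpha in alphas:
            preds, actuals = [], []
            for t in range(window, len(y)):
                model = Lasso(alpha=alpha)
                # refit on the trailing window only; no future data used
                model.fit(X[t - window:t], y[t - window:t])
                preds.append(model.predict(X[t:t + 1])[0])
                actuals.append(y[t])
            results[(window, alpha)] = r2_score(actuals, preds)
    best_pair = max(results, key=results.get)
    return best_pair, results
```

This refits len(windows) * len(alphas) * (T - window) models, which is exactly the inefficiency I'd like a cleaner GridSearchCV-style solution for.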