I'm trying to figure out how grid search works, but I can't understand why the scores reported for the individual splits don't match the scores I evaluate explicitly on those same splits. Let me be concrete with an example.
My cross-validation scheme is:
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3, test_size=1)
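(For reference, here is a minimal sketch of the splits this produces, assuming a hypothetical sample of 8 rows: each test fold is the single observation right after the training window.)

import numpy as np

X_demo = np.arange(8).reshape(-1, 1)  # hypothetical stand-in data
for train_idx, test_idx in tscv.split(X_demo):
    print(train_idx, test_idx)
# [0 1 2 3 4] [5]
# [0 1 2 3 4 5] [6]
# [0 1 2 3 4 5 6] [7]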
Let [y|X] be my sample, with y of shape (n_samples, 2) and X of shape (n_samples, 10), both pandas DataFrames.
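(My real data isn't shareable; a purely hypothetical stand-in with the same shapes, just so the snippets below run end to end, would be:)

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 10)))  # 50 rows is an arbitrary placeholder
y = pd.DataFrame(rng.normal(size=(50, 2)))

If I define: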
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

clf_Lasso = GridSearchCV(estimator=Lasso(),
                         param_grid={'alpha': [10]},
                         refit=True,
                         cv=tscv,
                         return_train_score=True,
                         scoring='neg_mean_squared_error')
model_Lasso = clf_Lasso.fit(X, y)
import pandas as pd

grid_search_scores_Lasso = pd.DataFrame(model_Lasso.cv_results_)[
    ['param_alpha', 'split0_train_score', 'split1_train_score', 'split2_train_score']]
I expect the last line to return a pandas DataFrame with a single row: the mean squared errors (negated, given the neg_mean_squared_error scorer) evaluated on each of my three splits for alpha = 10.
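To double-check which columns are available at all, the cv_results_ dict can be inspected directly (besides the train scores it also carries per-split test scores, fit times, etc.):

print(sorted(model_Lasso.cv_results_.keys()))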
I then run:
from sklearn.metrics import mean_squared_error

mse_Lasso = []
for train, test in tscv.split(X):
    Xcv, ycv = X.iloc[train], y.iloc[train]          # training window
    Xcv_test, ycv_test = X.iloc[test], y.iloc[test]  # held-out observation
    tmp = Lasso(alpha=10).fit(Xcv, ycv)
    mse_Lasso.append(mean_squared_error(ycv_test, tmp.predict(Xcv_test)))
I expect mse_Lasso to be a list containing the same values as the first row of the previous DataFrame.
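The comparison itself is just a matter of printing both objects:

print(grid_search_scores_Lasso)  # per-split scores as reported by GridSearchCV
print(mse_Lasso)                 # MSE computed by hand on the same splits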
What I actually get, in the first case, is:
param_alpha split0_train_score split1_train_score split2_train_score
0 10 -8.075127 -8.073908 -8.067685
and:
[10.227336344351109, 12.195915550359423, 16.63612266112668]
in the second one... What am I doing wrong?
Please help...
PS: if I run the grid search over multiple values of alpha and select the best one, the two approaches give the same predictions.
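(For completeness, the multi-alpha run from the PS looks like this; the grid of alphas is a hypothetical example, only the list of values changes from the code above.)

import numpy as np

clf_multi = GridSearchCV(estimator=Lasso(),
                         param_grid={'alpha': [0.1, 1, 10]},  # hypothetical grid
                         refit=True,
                         cv=tscv,
                         return_train_score=True,
                         scoring='neg_mean_squared_error')
model_multi = clf_multi.fit(X, y)
print(model_multi.best_params_)     # best alpha, depends on the data
pred_grid = model_multi.predict(X)  # prediction from the refit best model
pred_manual = Lasso(alpha=model_multi.best_params_['alpha']).fit(X, y).predict(X)
print(np.allclose(pred_grid, pred_manual))  # True: the two predictions match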