I'm trying to follow up on the information given here about the interaction between SMOTE and GridSearchCV, in particular the need to avoid oversampling the validation folds.
I'm using Pipeline from imblearn.pipeline.
I'm trying to set up a Pipeline that still tests (and therefore cross-validates) several sampling methods.
Is this approach correct, or is there a risk of data leakage / oversampling of the validation folds?
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# 'ftwo' scorer: an F2 score (fbeta with beta=2), weighting recall over precision
scorer = make_scorer(fbeta_score, beta=2)

model = Pipeline([
    ('preprocessor', StandardScaler()),
    ('sampling', SMOTE()),
    ('clf', RandomForestClassifier())
])

grid_space = [{'preprocessor': [StandardScaler(), MinMaxScaler(), None],
               'sampling': [SMOTE(), None],
               'clf': [RandomForestClassifier(n_jobs=4)],  # actual estimator
               'clf__n_estimators': [100],
               'clf__max_depth': [10, None],
               'clf__class_weight': [None, 'balanced'],
               }]

grid = GridSearchCV(model,
                    grid_space,
                    scoring={'roc_auc': 'roc_auc', 'ftwo': scorer},
                    refit='roc_auc')

model_grid = grid.fit(X, y)  # X, y: my training data
model_grid.cv_results_