I'm trying to follow up on the information given here about the interaction between SMOTE and GridSearchCV, in particular the need to avoid oversampling the validation folds.
I'm using Pipeline from imblearn.pipeline.
I'm trying to set up a Pipeline that still tests (and therefore cross-validates) several sampling methods.
Is this approach correct, or is there a risk of data leakage / oversampling of the validation folds?
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# 'ftwo' scorer: an F2 score (fbeta with beta=2), weighting recall over precision
scorer = make_scorer(fbeta_score, beta=2)

model = Pipeline([
    ('preprocessor', StandardScaler()),
    ('sampling', SMOTE()),
    ('clf', RandomForestClassifier())
])

grid_space = [{'preprocessor': [StandardScaler(), MinMaxScaler(), None],
               'sampling': [SMOTE(), None],
               'clf': [RandomForestClassifier(n_jobs=4)],  # actual estimator
               'clf__n_estimators': [100],
               'clf__max_depth': [10, None],
               'clf__class_weight': [None, 'balanced'],
               }]

grid = GridSearchCV(model,
                    grid_space,
                    scoring={'roc_auc': 'roc_auc', 'ftwo': scorer},
                    refit='roc_auc')

model_grid = grid.fit(X, y)  # X, y: my training data
model_grid.cv_results_