I tried to add the SelectKBest
step to my pipeline to improve my existing base model.
Without this new step, the model gave me the following results:
- Best test negative MSE of the base model: -62.60
- Best test R2 of the base model: 0.607
However, with this new step added to my pipeline, I ended up with much worse performance:
- Best test negative MSE with SelectKBest: -132.29
- Best test R2 with SelectKBest: 0.175
Moreover, a warning message appears repeatedly:
RuntimeWarning: divide by zero encountered in true_divide
f = msb / msw
Here is my code:
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

numeric_features = ['AGE_2019', 'Inhabitants']
categorical_features = ['familty_type', 'studying', 'Job_42', 'sex', 'DEGREE',
                        'Activity_type', 'Nom de la commune', 'city_type', 'DEP',
                        'INSEE', 'Nom du département', 'reg', 'Nom de la région']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())  # Scale numeric features to [0, 1]
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # One binary column per category
])
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features)
])

# Creation of the pipeline
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', DecisionTreeRegressor(random_state=0)),
    ('selector', SelectKBest(f_classif, k=7))
])

# Creation of the grid of parameters
dt_params = {'model__min_samples_split': [2, 5] + list(range(10, 250, 5))}
cv_folds = KFold(n_splits=5, shuffle=True, random_state=0)
grid_piped = GridSearchCV(pipe, dt_params, cv=cv_folds, n_jobs=-1,
                          scoring=['neg_mean_squared_error', 'r2'], refit='r2')

# Fitting our model
grid_piped.fit(df_X_train, df_Y_train)
I think there is a misinterpretation of
SelectKBest
. That estimator performs feature selection before a model is fitted (see the user guide), so in a pipeline it belongs before the model step, not after it. Removing features often reduces variance, but it can also hurt performance. Here's a minimal example showing worse performance when using
SelectKBest
:
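A minimal sketch of such a comparison (my own synthetic example using `make_regression`, with arbitrarily chosen parameters, not data from the original post). It places `SelectKBest` before the model and uses `f_regression`, the F-test for continuous targets; note that `f_classif` assumes classification labels, which is a likely source of the divide-by-zero warning when it is fed a continuous target:

```python
# Compare a DecisionTreeRegressor with and without SelectKBest on
# synthetic regression data (sketch; dataset parameters are arbitrary).
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# 20 features, of which 10 actually drive the target.
X, y = make_regression(n_samples=500, n_features=20, n_informative=10,
                       noise=10.0, random_state=0)

base = Pipeline([
    ('model', DecisionTreeRegressor(random_state=0)),
])
selected = Pipeline([
    # The selector comes BEFORE the model, and uses f_regression
    # because the target is continuous.
    ('selector', SelectKBest(f_regression, k=7)),
    ('model', DecisionTreeRegressor(random_state=0)),
])

# Keeping only k=7 of the 10 informative features discards signal,
# so the selected pipeline can score worse here.
print('R2 without SelectKBest:', cross_val_score(base, X, y, scoring='r2').mean())
print('R2 with SelectKBest:   ', cross_val_score(selected, X, y, scoring='r2').mean())
```

The same idea applies to your pipeline: move the `selector` step between `preprocessor` and `model`.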