Divide by zero encountered in true_divide f = msb / msw with SelectKBest


I tried to add the SelectKBest step to my pipeline to improve my existing model.

Without this new step, the model gave me the following results:

  • Best test negative MSE of the base model: -62.60
  • Best test R2 of the base model: 0.607

However, with this new step added to the pipeline, the performance turned out much worse:

  • Best test negative MSE with SelectKBest: -132.29
  • Best test R2 with SelectKBest: 0.175

Moreover, the following warning appears repeatedly:

RuntimeWarning: divide by zero encountered in true_divide
  f = msb / msw

Here is my code:

from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

numeric_features = ['AGE_2019', 'Inhabitants']
categorical_features = ['familty_type','studying','Job_42','sex','DEGREE', 'Activity_type', 'Nom de la commune', 'city_type', 'DEP', 'INSEE', 'Nom du département', 'reg', 'Nom de la région']

numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median'))
        ,('scaler', MinMaxScaler()) # Scale the numeric features to [0, 1]
])

categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant',fill_value='missing'))       
        ,('encoder', OneHotEncoder(handle_unknown='ignore')) # Create binary indicator variables for the categorical features
])

preprocessor = ColumnTransformer(
    transformers=[
    ('numeric', numeric_transformer, numeric_features)
    ,('categorical', categorical_transformer, categorical_features)
]) 

# Creation of the pipeline 

pipe = Pipeline([
    ('preprocessor', preprocessor),     
    ('model', DecisionTreeRegressor(random_state=0)),
    ('selector',SelectKBest(f_classif, k=7))
])

# Creation of the grid of parameters

dt_params = {'model__min_samples_split': [2, 5] + list(range(10, 250,5))}

cv_folds = KFold(n_splits=5, shuffle=True, random_state=0)

grid_piped = GridSearchCV(pipe, dt_params, cv=cv_folds, n_jobs=-1, scoring=['neg_mean_squared_error', 'r2'], refit='r2')
    
# Fitting our model

grid_piped.fit(df_X_train,df_Y_train)

There is 1 best solution below


I think there is a misinterpretation of SelectKBest: that estimator is meant to perform feature selection before a model is fitted, not after it (see the user guide).
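
As a minimal sketch of that intended usage (toy regression data; f_regression is the F-test scorer for a continuous target), the selector is fitted on the features and target and reduces X to k columns before any model sees them:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_regression(n_samples=100, n_features=20, random_state=0)

# Score every column against the continuous target and keep the 7 best.
selector = SelectKBest(f_regression, k=7).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)                            # (100, 7)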

Removing features often reduces variance, but may also hurt performance. Here's a minimal example showing worse performance when using SelectKBest:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(k=7),                             # ← Remove features
    DecisionTreeRegressor(max_depth=4),
).fit(X_train, y_train)
print(pipe.score(X_test, y_test))                 # -0.186

pipe2 = make_pipeline(
    StandardScaler(),
    DecisionTreeRegressor(max_depth=4),           # ← Train with all features
).fit(X_train, y_train)
print(pipe2.score(X_test, y_test))                # 0.537
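
Applied to the pipeline from the question, a sketch of that ordering (assuming the preprocessor, data and grid defined above) would place the selector between the preprocessing and the model, and would score features with f_regression rather than f_classif since the target is continuous; f_classif applied to a continuous target is also what typically triggers the f = msb / msw divide-by-zero warning:

from sklearn.feature_selection import SelectKBest, f_regression

pipe = Pipeline([
    ('preprocessor', preprocessor),                   # impute / scale / one-hot encode
    ('selector', SelectKBest(f_regression, k=7)),     # select features on the encoded matrix
    ('model', DecisionTreeRegressor(random_state=0)), # fit the regressor on the selected features
])

Note that after one-hot encoding there are far more than 7 columns, so k=7 is an aggressive cut; it may be worth tuning k as well (e.g. adding 'selector__k' to the parameter grid).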