LGBM custom evaluation metric


I want to optimize parameters for my multiclass classification LGBM model with RandomizedSearchCV by using a custom scoring function. This custom scoring function needs additional data (defined by the scoring_info_cols) that must not be used for training, however it is needed for calculating the score.

I have my features_train dataframe, which has all the features that must be used for training plus the additional data that is needed for calculating the score, and my target_train series. Similarly, I have features_val and target_val.

dropper = ColumnTransformer(
    [("drop", "drop", scoring_info_cols)], remainder="passthrough"
)
lgbm_model = Pipeline([
    ("drop_scoring_info", dropper),
    ("lgbm", lgb.LGBMClassifier(early_stopping_rounds=20)),
])

random_search = RandomizedSearchCV(
    lgbm_model,
    param_distributions=param_dist,
    cv=5,
    scoring=custom_scoring_function,
    n_iter=100,
    random_state=42,
    n_jobs=1
)
random_result = random_search.fit(
    features_train, target_train, lgbm__eval_set=[(features_val, target_val)],
    lgbm__eval_metric=my_eval_metric
)

When I call random_search.fit, I get the error: ValueError: Length of feature_name(25) and num_feature(7) don't match

where 7 is exactly the number of features I want to train the model on, whereas 25 is the 7 features plus the 18 additional scoring_info columns. It seems that dropper is being applied to the training features, but not to the validation features.

Any idea on how to get around this?

This is the full traceback:

  File "C:\ProgramData\Anaconda3\lib\contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "~\python\venv\lib\site-packages\sklearn\_config.py", line 353, in config_context
    yield
  File "~\python\venv\lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "~\python\venv\lib\site-packages\sklearn\model_selection\_search.py", line 898, in fit
    self._run_search(evaluate_candidates)
  File "~\python\venv\lib\site-packages\sklearn\model_selection\_search.py", line 1806, in _run_search
    evaluate_candidates(
  File "~\python\venv\lib\site-packages\sklearn\model_selection\_search.py", line 875, in evaluate_candidates
    _warn_or_raise_about_fit_failures(out, self.error_score)
  File "~\python\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 414, in _warn_or_raise_about_fit_failures
    raise ValueError(all_fits_failed_message)
ValueError: 
All the 500 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
500 fits failed with the following error:
Traceback (most recent call last):
  File "~\python\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "~\venv\lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "~\python\venv\lib\site-packages\sklearn\pipeline.py", line 420, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "~\python\venv\lib\site-packages\lightgbm\sklearn.py", line 1142, in fit
    super().fit(
  File "~\python\venv\lib\site-packages\lightgbm\sklearn.py", line 842, in fit
    self._Booster = train(
  File "~\python\venv\lib\site-packages\lightgbm\engine.py", line 249, in train
    booster.add_valid(valid_set, name_valid_set)
  File "~\python\venv\lib\site-packages\lightgbm\basic.py", line 3468, in add_valid
    data.construct()._handle))
  File "~\python\venv\lib\site-packages\lightgbm\basic.py", line 2175, in construct
    self._lazy_init(data=self.data, label=self.label, reference=self.reference,
  File "~\python\venv\lib\site-packages\lightgbm\basic.py", line 1895, in _lazy_init
    return self.set_feature_name(feature_name)
  File "~\python\venv\lib\site-packages\lightgbm\basic.py", line 2567, in set_feature_name
    raise ValueError(f"Length of feature_name({len(feature_name)}) and num_feature({self.num_feature()}) don't match")
ValueError: Length of feature_name(25) and num_feature(7) don't match

Answer by Wilmer E. Henao

The problem is in how Pipeline and RandomizedSearchCV work together. Your ColumnTransformer removes the extra columns before the rest of the data is sent to LGBMClassifier, but this transformation is not applied to the validation data you pass in lgbm__eval_set: step-prefixed fit parameters are forwarded to the final estimator as-is, without going through the earlier pipeline steps.
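A minimal sketch of that forwarding behaviour (RecordingClassifier is a made-up toy estimator, not part of your setup): the transformer's output reaches the final step's fit, but the eval_set keyword arrives exactly as you passed it.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

class RecordingClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator that records the shapes it sees at fit time."""
    def fit(self, X, y, eval_set=None):
        self.n_features_seen_ = X.shape[1]
        self.eval_n_features_seen_ = eval_set[0][0].shape[1] if eval_set else None
        return self

# 3 "real" features plus 2 extra columns to be dropped (indices 3 and 4)
X = np.random.rand(10, 5)
y = np.array([0, 1] * 5)

pipe = Pipeline([
    ("drop", ColumnTransformer([("drop", "drop", [3, 4])], remainder="passthrough")),
    ("clf", RecordingClassifier()),
])
pipe.fit(X, y, clf__eval_set=[(X, y)])

print(pipe.named_steps["clf"].n_features_seen_)       # 3: transformer was applied
print(pipe.named_steps["clf"].eval_n_features_seen_)  # 5: eval_set was passed through raw
```

This is exactly the 25-vs-7 mismatch from your traceback, just with smaller numbers.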

Here are two ways you can solve this.

Method 1: Manually Drop Columns for Validation

You can manually remove the scoring_info_cols from your features_val DataFrame before calling fit:

features_val_dropped = features_val.drop(scoring_info_cols, axis=1)

random_result = random_search.fit(
    features_train, target_train, 
    lgbm__eval_set=[(features_val_dropped, target_val)], 
    lgbm__eval_metric=my_eval_metric
)
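To be safe, you can sanity-check that the manually dropped frame lines up with what the dropper will emit for the training data. A quick sketch with stand-in data (column names here are invented, and your real frame has 18 scoring columns rather than 2):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's data
scoring_info_cols = ["score_a", "score_b"]
feature_cols = [f"f{i}" for i in range(7)]
features_val = pd.DataFrame(
    np.random.default_rng(0).random((5, 9)),
    columns=feature_cols + scoring_info_cols,
)

features_val_dropped = features_val.drop(scoring_info_cols, axis=1)

# The dropped frame should have exactly the model's feature columns left
print(features_val_dropped.shape[1])                       # 7
print(list(features_val_dropped.columns) == feature_cols)  # True
```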

Method 2: Custom fit Method Inside Pipeline

You can subclass Pipeline and override its fit method so that it also drops those columns from the validation set:

from sklearn.pipeline import Pipeline

class CustomPipeline(Pipeline):
    def fit(self, X, y=None, **fit_params):
        if 'lgbm__eval_set' in fit_params:
            eval_set = fit_params['lgbm__eval_set']
            new_eval_set = [(eval_set[0][0].drop(scoring_info_cols, axis=1), eval_set[0][1])]
            fit_params['lgbm__eval_set'] = new_eval_set
        return super().fit(X, y, **fit_params)

lgbm_model = CustomPipeline([
    ("drop_scoring_info", dropper),
    ("lgbm", lgb.LGBMClassifier(early_stopping_rounds=20))
])

random_result = random_search.fit(
    features_train, target_train, 
    lgbm__eval_set=[(features_val, target_val)], 
    lgbm__eval_metric=my_eval_metric
)
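One thing worth knowing here: RandomizedSearchCV clones the estimator for every candidate fit, and a Pipeline subclass survives sklearn's clone as long as it keeps the parent constructor signature (which CustomPipeline above does). A quick sketch with stock sklearn estimators, not your model:

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class CustomPipeline(Pipeline):
    """Same constructor as Pipeline, so clone() can rebuild it."""
    pass

pipe = CustomPipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
cloned = clone(pipe)
print(type(cloned).__name__)  # CustomPipeline -- the subclass is preserved
```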

I'm pretty sure either way will solve the problem :)