Discrepancy between lgb.cv and cross_val_score results in multiclass classification with LightGBM

I expect similar cross-validation results when using lgb.cv and cross_val_score, but they vary significantly:

import lightgbm as lgb
import pandas as pd
from sklearn import datasets
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score

from typing import Any, Dict, List


def log_loss_scorer(clf, X, y):
    # Callable scorer with the (estimator, X, y) signature expected by
    # cross_val_score; returns the multiclass log loss (lower is better).
    y_pred = clf.predict_proba(X)
    return log_loss(y, y_pred)


iris = datasets.load_iris()
features = pd.DataFrame(columns=["f1", "f2", "f3", "f4"], data=iris.data)
target = pd.Series(iris.target, name="target")
# 1) Native API
dataset = lgb.Dataset(features, target, feature_name=list(features.columns), free_raw_data=False)

native_params: Dict[str, Any] = {
    "objective": "multiclass", "boosting_type": "gbdt", "learning_rate": 0.05, "num_class": 3, "seed": 41
}
cv_logloss_native: float = lgb.cv(
    native_params, dataset, num_boost_round=1000, nfold=5, metrics="multi_logloss", seed=41, stratified=False,
    shuffle=False
)['valid multi_logloss-mean'][-1]

# 2) ScikitLearn API
model_scikit = lgb.LGBMClassifier(
    objective="multiclass", boosting_type="gbdt", learning_rate=0.05, n_estimators=1000, random_state=41
)
cv_logloss_scikit_list: List[float] = cross_val_score(
    model_scikit, features, target, scoring=log_loss_scorer
)
cv_logloss_scikit: float = sum(cv_logloss_scikit_list) / len(cv_logloss_scikit_list)
print(f"Native logloss CV {cv_logloss_native}; Scikit logloss CV train {cv_logloss_scikit}")

I get a score of 0.8803800291063604 with the native API and 0.37528027519836027 with the scikit-learn API. I have tried other metrics, and the two methods still give very different results. Is there a specific reason for this discrepancy, and how can I align the results of the two methods?

EDIT: As suggested by @DataJanitor, I disabled the built-in multi_logloss metric in the native API and implemented my own:

def log_loss_custom_metric(y_pred, data: lgb.Dataset):
    # feval convention: return (eval_name, eval_result, is_higher_better).
    # Log loss is a lower-is-better metric, so the last element is False.
    y_true = data.get_label()
    loss_value = log_loss(y_true, y_pred)
    return "custom_logloss", loss_value, False

I passed it to the native API via the feval argument:

cv_logloss_native: float = lgb.cv(
    native_params, dataset, num_boost_round=1000, nfold=5,
    feval=log_loss_custom_metric, shuffle=True
)["valid custom_logloss-mean"][-1]

However, the results still differ considerably (0.58 for the native API versus 0.37 for the scikit-learn API).

The code above is fully reproducible, as it uses the iris dataset. It would be great if someone could match the two scores and pinpoint the exact source of the discrepancy.

1 Answer

I see several potential sources for the difference:

Stratification: your native lgb.cv call sets stratified=False. Because the iris data is sorted by class, the resulting contiguous folds are heavily imbalanced: each validation fold contains only one or two of the three classes. scikit-learn's cross_val_score, in contrast, uses StratifiedKFold for classifiers, so every fold keeps the 50/50/50 class proportions.
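
You can check this directly (a quick snippet I added, reusing the target series from the question) by rebuilding the contiguous five-fold split that lgb.cv produces with stratified=False and shuffle=False:

# Which classes land in each contiguous validation fold
# (150 class-sorted rows -> 5 chunks of 30)?
fold_classes = [sorted(target.iloc[i:i + 30].unique().tolist()) for i in range(0, 150, 30)]
print(fold_classes)  # [[0], [0, 1], [1], [1, 2], [2]]

Three of the five validation folds contain a single class, and the corresponding training folds are imbalanced in the opposite direction, which inflates the averaged log loss.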

Shuffling: you've set shuffle=False in the native API, which preserves the sorted class order and is exactly what produces the skewed folds above. Note that cross_val_score does not shuffle by default either (StratifiedKFold defaults to shuffle=False), but stratification makes the row order mostly irrelevant there.

Custom scorer: your scikit-learn scorer computes sklearn's multiclass log loss, while the native run uses LightGBM's own multi_logloss implementation. The two should agree closely, but small numerical differences (for example in how near-zero probabilities are clipped) are possible. The practical way to rule out the fold-related causes is to hand both APIs exactly the same splits, as in the sketch below.
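
A minimal sketch of that alignment (assuming a recent LightGBM, e.g. 4.x, where feval receives the per-class probabilities for a built-in multiclass objective; it reuses native_params, dataset, features, target, model_scikit, log_loss_scorer, and log_loss_custom_metric from the question):

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Materialize one set of stratified splits and hand the same
# (train_idx, valid_idx) pairs to both APIs.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=41)
splits = list(skf.split(features, target))

cv_native = lgb.cv(
    native_params, dataset, num_boost_round=1000,
    folds=splits,                  # lgb.cv accepts an iterable of index pairs
    feval=log_loss_custom_metric,
)["valid custom_logloss-mean"][-1]

cv_scikit = cross_val_score(
    model_scikit, features, target,
    cv=splits,                     # the exact same index pairs
    scoring=log_loss_scorer,
).mean()

print(f"native: {cv_native:.5f}  scikit: {cv_scikit:.5f}")

With identical folds and the same hyperparameters on both sides, any remaining gap should come down to the metric-implementation details in the last point.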