Optuna score vs cross_val_score?


The accuracy score reported by Optuna and the score from cross_val_score were different. Why does this happen, and which score should I choose? I used the hyperparameters that Optuna found when calling cross_val_score.

import lightgbm as lgb
import optuna
from sklearn.model_selection import KFold, cross_val_score

def objective_lgb(trial):
    # Sample a hyperparameter configuration for this trial.
    num_leaves = trial.suggest_int("num_leaves", 2, 1000)
    max_depth = trial.suggest_int("max_depth", 2, 100)
    learning_rate = trial.suggest_float("learning_rate", 0.001, 1)
    n_estimators = trial.suggest_int("n_estimators", 100, 2000)
    min_child_samples = trial.suggest_int("min_child_samples", 3, 1000)
    subsample = trial.suggest_float("subsample", 0.000001, 1)
    colsample_bytree = trial.suggest_float("colsample_bytree", 0.00000001, 1)
    reg_alpha = trial.suggest_float("reg_alpha", 0, 400)
    reg_lambda = trial.suggest_float("reg_lambda", 0, 400)
    importance_type = trial.suggest_categorical("importance_type", ["split", "gain"])

    lgb_clf = lgb.LGBMClassifier(random_state=1,
                                 objective="multiclass",
                                 num_class=3,
                                 importance_type=importance_type,
                                 num_leaves=num_leaves,
                                 max_depth=max_depth,
                                 learning_rate=learning_rate,
                                 n_estimators=n_estimators,
                                 min_child_samples=min_child_samples,
                                 subsample=subsample,
                                 colsample_bytree=colsample_bytree,
                                 reg_alpha=reg_alpha,
                                 reg_lambda=reg_lambda)

    # 10-fold CV accuracy on the training data; Optuna maximizes this value.
    score = cross_val_score(lgb_clf, train_x, train_y, n_jobs=-1,
                            cv=KFold(n_splits=10, shuffle=True, random_state=1),
                            scoring="accuracy")
    return score.mean()

lgb_study = optuna.create_study(direction="maximize")
lgb_study.optimize(objective_lgb, n_trials=1500)

lgb_trial = lgb_study.best_trial
print("accuracy:", lgb_trial.value)
print()
print("Best params:", lgb_trial.params)
=========================================================
def light_check(x, y, params):
    # Re-evaluate with the tuned hyperparameters and the same CV splitter.
    model = lgb.LGBMClassifier(**params)  # was LGBMClassifier(): params went unused
    scores = cross_val_score(model, x, y, n_jobs=-1,
                             cv=KFold(n_splits=10, shuffle=True, random_state=1))
    return scores, scores.mean()

light_check(x, y, {'num_leaves': 230, 'max_depth': 53, 'learning_rate': 0.04037430031226232, 'n_estimators': 1143, 'min_child_samples': 381, 'subsample': 0.12985990464862135, 'colsample_bytree': 0.8914118949904919, 'reg_alpha': 31.869348047391053, 'reg_lambda': 17.45653692887209, 'importance_type': 'split'})
There are 2 best solutions below.

Answer 1

From what I can see, you are using train_x and train_y in the Optuna objective, while in light_check you are passing x and y. Assuming you made a train/test split in some code that isn't shown, the dataset Optuna sees is smaller, so you get a different number.
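
As a sanity check, evaluating with exactly the same data and the same CV splitter as the objective should reproduce Optuna's best value (up to any nondeterminism in LightGBM's multithreading). A minimal sketch, assuming the train_x, train_y, and lgb_study names from the question:

import lightgbm as lgb
from sklearn.model_selection import KFold, cross_val_score

# Rebuild the model from the best trial, adding the fixed settings
# (random_state, objective, num_class) used inside objective_lgb.
model = lgb.LGBMClassifier(random_state=1, objective="multiclass",
                           num_class=3, **lgb_study.best_trial.params)

# Same data and same splitter as the objective, so the mean accuracy
# should match lgb_study.best_value.
scores = cross_val_score(model, train_x, train_y, scoring="accuracy",
                         cv=KFold(n_splits=10, shuffle=True, random_state=1),
                         n_jobs=-1)
print(scores.mean(), lgb_study.best_value)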

Answer 2

Optuna reports the value returned by your objective function as the accuracy score, which corresponds to mean_score in your code. Also keep in mind that during cross-validation you must give the model the training data, which you did correctly in the objective; in the light_check function, however, you passed the entire dataset to the model instead.

The correct approach for the final evaluation is to use a portion of the data that was set aside as a test set from the start. Validation data is for model selection and tuning, while the test data is reserved for the final evaluation.
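
A rough sketch of that workflow (the names X and y for the full dataset are assumptions here, and the 80/20 split is just an example):

import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Hold out a test set before any tuning.
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# ... run the Optuna study on train_x / train_y as in the question ...

# Final evaluation: refit on the full training set with the best params,
# then score once on the untouched test set.
best_model = lgb.LGBMClassifier(random_state=1, objective="multiclass",
                                num_class=3, **lgb_study.best_trial.params)
best_model.fit(train_x, train_y)
print("test accuracy:", best_model.score(test_x, test_y))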

For a better understanding, see the following notebook, which demonstrates how to tune hyperparameters with Optuna; it walks through model analysis and evaluation in more detail.

https://www.kaggle.com/code/amir9473/tuning-hyperparameters-ml-classification-acu-94