XGBClassifier gives different results on similar environments of different machines


I trained an XGBoost classifier model using Grid-Search with the below params:

params = {
    'max_depth':[5,6],
    'min_child_weight': [1, 2, 3],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

xgb = XGBClassifier(device="cuda",learning_rate=0.02, n_estimators=1000, objective='binary:logistic', verbosity=0, tree_method="gpu_hist")

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

grid_search = GridSearchCV(estimator=xgb, param_grid=params, scoring='roc_auc', n_jobs=-1, cv=skf.split(x_train,y_train), verbose=100, return_train_score=True)

grid_search.fit(x_train, y_train)

And then I saved the best model as below:

import joblib
joblib.dump(grid_search.best_estimator_, 'xgboost_grid_search.joblib')

When I load the model again, predict_proba gives different results. This is how I load the model to get predictions:

import joblib
model = joblib.load("xgboost_grid_search.joblib")
model.predict_proba(x_test)

Here, x_train and x_test contain numerical features; y_train and y_test are binary labels (either 0 or 1).

After reading through quite a few blogs, articles, and Stack Overflow answers, I have made sure the conditions below are met in both environments:

 1. Same Python version - 3.11.5
 2. Same joblib and xgboost pip versions - 1.2.0 and 2.0.0 respectively
 3. Same ordering of features in x_test as in x_train and model.feature_names_in_
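To double-check conditions like these, it can help to print the same environment report on both machines and diff the output. Below is a stdlib-only sketch; the package list is an assumption and should be adjusted to whatever the model actually depends on:

```python
import sys
from importlib.metadata import version, PackageNotFoundError

def report_environment(packages=("xgboost", "joblib", "scikit-learn", "numpy")):
    """Collect the Python and package versions so two machines can be diffed."""
    info = {"python": sys.version.split()[0]}
    for pkg in packages:
        try:
            info[pkg] = version(pkg)
        except PackageNotFoundError:
            info[pkg] = "not installed"
    return info

print(report_environment())
```

Running this on both the Mac and the Ubuntu box and comparing the two dictionaries line by line makes version drift immediately visible.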

One difference I should point out: the two environments run different operating systems (macOS on an M1 vs. Ubuntu). I am not sure whether this matters.

Any help is appreciated and please let me know if I am doing something wrong.

Thanks in advance!

1 answer below
To reproduce the results, you also need to set the random seed, i.e. the random_state argument of xgboost.XGBClassifier:

params = {
    'max_depth':[5,6],
    'min_child_weight': [1, 2, 3],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

xgb = XGBClassifier(device="cuda",
                    learning_rate=0.02,
                    n_estimators=1000,
                    objective='binary:logistic',
                    verbosity=0,
                    tree_method="hist",  # "gpu_hist" is deprecated in xgboost 2.0; device="cuda" already selects the GPU
                    random_state=1001)   # HERE! pin the seed

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

grid_search = GridSearchCV(estimator=xgb, param_grid=params, scoring='roc_auc', n_jobs=-1, cv=skf.split(x_train,y_train), verbose=100, return_train_score=True)

grid_search.fit(x_train, y_train)