How can I control subsampling such that xgb.cv and cross_validate produce the same results?

xgb.cv and sklearn.model_selection.cross_validate do not produce the same mean train/test error, even though I set the same seed/random_state and make sure both methods use the same folds. The code at the bottom reproduces my issue. (Early stopping is off by default.)

I found that this issue is caused by the subsample parameter (both methods produce the same result if this parameter is set to 1), but I cannot find a way to make both methods subsample in the same way. In addition to setting seed/random_state as shown in the code at the bottom, I also tried explicitly adding:

import random
random.seed(1)
np.random.seed(1)

at the beginning of my file, but this does not resolve the issue either. Any ideas?
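
For completeness, here is a minimal check that both methods really do receive identical folds; with shuffle=False the StratifiedKFold split is deterministic and depends only on the order of y:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.randn(100, 20)
y = np.random.randint(0, 2, 100)

# shuffle=False makes the split deterministic, so two independent
# splitters yield exactly the same train/test indices -- these are the
# folds that both xgb.cv (folds=...) and cross_validate (cv=...) see.
splits_a = list(StratifiedKFold(5, shuffle=False).split(X, y))
splits_b = list(StratifiedKFold(5, shuffle=False).split(X, y))
for (tr_a, te_a), (tr_b, te_b) in zip(splits_a, splits_b):
    assert np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)
print('folds are identical')

Full reproduction: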

import numpy as np
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import cross_validate, StratifiedKFold

X = np.random.randn(100, 20)
y = np.random.randint(0, 2, 100)
dtrain = xgb.DMatrix(X, label=y)

params = {'eta': 0.3,
          'max_depth': 4,
          'gamma': 0.1,
          'silent': 1,
          'objective': 'binary:logistic',
          'seed': 1,
          'subsample': 0.8
         }

# Note: random_state has no effect when shuffle=False (recent scikit-learn
# versions even raise an error for this combination); the split itself is
# deterministic, so both methods still receive identical folds.
cv_results = xgb.cv(params, dtrain, num_boost_round=99, seed=1,
                    folds=StratifiedKFold(5, shuffle=False, random_state=1),
                    early_stopping_rounds=10)
print(cv_results, '\n')

xgbc = XGBClassifier(learning_rate=0.3,
                     max_depth=4,
                     gamma=0.1,
                     silent=1,
                     objective='binary:logistic',
                     subsample=0.8,
                     random_state=1,
                     n_estimators=len(cv_results))  # one tree per round kept by xgb.cv
scores = cross_validate(xgbc, X, y,
                        cv=StratifiedKFold(5, shuffle=False, random_state=1),
                        return_train_score=True)
print('train-error-mean = {}   test-error-mean = {}'.format(
    1 - scores['train_score'].mean(), 1 - scores['test_score'].mean()))

Output:

   train-error-mean  train-error-std  test-error-mean  test-error-std
0          0.214981         0.030880         0.519173        0.129533
1          0.140039         0.018552         0.549549        0.034696
2          0.105100         0.017420         0.510501        0.040517
3          0.092474         0.012587         0.450977        0.075866 

train-error-mean = 0.06994061572120636   test-error-mean = 0.4706015037593986

Output with subsample set to 1:

   train-error-mean  train-error-std  test-error-mean  test-error-std
0          0.180043         0.013266         0.491504        0.093246
1          0.117381         0.021328         0.488070        0.097733
2          0.074972         0.030605         0.530075        0.091446
3          0.044907         0.032232         0.519073        0.130802
4          0.032438         0.021816         0.481027        0.080622 

train-error-mean = 0.032438271604938285   test-error-mean = 0.4810275689223057

1 Answer:

I know this for sure in the case of LightGBM, and from a quick look at the XGBoost code (here) it seems to behave similarly, so I assume the answer applies here as well.

The trick is in the early stopping. The native xgb.cv picks a single stopping iteration at which the mean CV score (or something similar to the mean, I forget by now :) ) reaches a plateau, while in sklearn cross-validation the models in each fold are trained independently, so early stopping happens at a different iteration in each fold.
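
To make the difference concrete, here is a minimal sketch reusing the question's setup (the per-fold loop uses xgb.train directly; it illustrates the two stopping schemes, not xgboost's internals). Each fold stops at its own best_iteration, while xgb.cv keeps one shared round count derived from the mean test metric:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

X = np.random.randn(100, 20)
y = np.random.randint(0, 2, 100)
params = {'eta': 0.3, 'max_depth': 4, 'gamma': 0.1, 'eval_metric': 'error',
          'objective': 'binary:logistic', 'seed': 1, 'subsample': 0.8}

# Per-fold early stopping: each fold monitors its own test score
# and stops on its own schedule.
for train_idx, test_idx in StratifiedKFold(5, shuffle=False).split(X, y):
    dtr = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dte = xgb.DMatrix(X[test_idx], label=y[test_idx])
    bst = xgb.train(params, dtr, num_boost_round=99,
                    evals=[(dte, 'test')], early_stopping_rounds=10,
                    verbose_eval=False)
    print('fold stops at iteration', bst.best_iteration)

# Aggregated early stopping: xgb.cv stops every fold at the same round,
# chosen from the mean test metric across all folds.
cv_results = xgb.cv(params, xgb.DMatrix(X, label=y), num_boost_round=99,
                    folds=StratifiedKFold(5, shuffle=False),
                    early_stopping_rounds=10)
print('xgb.cv keeps', len(cv_results), 'rounds for all folds')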

So, if you want identical results, disable early stopping (which is problematic, as you can over- or under-fit without being aware of it). If you want to use early stopping, there is no way to get identical results, due to the difference in implementations.
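
In code, the first option amounts to fixing the round count on both sides; here is a sketch based on the question's snippet (n_rounds is an arbitrary value chosen for illustration, and, as the question observes, subsample < 1 can still cause differences because the two implementations draw row samples from different RNG streams, so exact agreement should only be expected with subsample=1):

import numpy as np
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import cross_validate, StratifiedKFold

X = np.random.randn(100, 20)
y = np.random.randint(0, 2, 100)
dtrain = xgb.DMatrix(X, label=y)
params = {'eta': 0.3, 'max_depth': 4, 'gamma': 0.1, 'eval_metric': 'error',
          'objective': 'binary:logistic', 'seed': 1, 'subsample': 0.8}

n_rounds = 50  # fixed round count so neither side stops early

# No early_stopping_rounds: xgb.cv runs all n_rounds iterations.
cv_results = xgb.cv(params, dtrain, num_boost_round=n_rounds, seed=1,
                    folds=StratifiedKFold(5, shuffle=False))

# The sklearn side builds the same fixed number of trees in every fold.
xgbc = XGBClassifier(learning_rate=0.3, max_depth=4, gamma=0.1,
                     objective='binary:logistic', subsample=0.8,
                     random_state=1, n_estimators=n_rounds)
scores = cross_validate(xgbc, X, y,
                        cv=StratifiedKFold(5, shuffle=False),
                        return_train_score=True)

print('xgb.cv test error: ', cv_results['test-error-mean'].iloc[-1])
print('sklearn test error:', 1 - scores['test_score'].mean())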