Different score every time I run sklearn model with random_state set


I'm trying to determine why every time I rerun a model I obtain a slightly different score. I've defined:

# numpy seed (don't know if needed, but figured it couldn't hurt)
np.random.seed(42)
# Also tried re-seeding every time I ran the `cross_val_predict()` block, but that didn't work either

# cross-validator with random_state set
cv5 = KFold(n_splits=5, random_state=42, shuffle=True)

# scoring as RMSE of natural logs (to match Kaggle competition I'm trying)
def custom_scorer(actual, predicted):
    # RMSLE: root mean squared error of the log1p-transformed values
    actual = np.log1p(actual)
    predicted = np.log1p(predicted)
    return np.sqrt(np.mean(np.square(actual - predicted)))
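(As a side note, a seeded KFold by itself is reproducible: calling split() on it repeatedly yields the same folds. A minimal sketch on toy data, not your actual Xtrain:)

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # toy data, 10 samples

cv_a = KFold(n_splits=5, random_state=42, shuffle=True)
cv_b = KFold(n_splits=5, random_state=42, shuffle=True)

folds_a = [test.tolist() for _, test in cv_a.split(X)]
folds_b = [test.tolist() for _, test in cv_b.split(X)]
assert folds_a == folds_b  # same seed -> identical folds every call
```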

Then I ran this once with cv=cv5:

# Running GridSearchCV
rf_test = RandomForestRegressor(n_jobs=-1)
params = {'max_depth': [20, 30, 40], 'n_estimators': [500], 'max_features': [100, 140, 160]}
gsCV = GridSearchCV(estimator=rf_test, param_grid=params, cv=cv5, n_jobs=-1, verbose=1)
gsCV.fit(Xtrain,ytrain)
print(gsCV.best_estimator_)

After running that to get gsCV.best_estimator_, I rerun this several times, and get slightly different scores each time:

rf_test = gsCV.best_estimator_
rf_test.random_state=42
ypred = cross_val_predict(rf_test, Xtrain, ytrain, cv=cv2)
custom_scorer(np.expm1(ytrain),np.expm1(ypred))
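(For reference, here is a self-contained sketch on synthetic data, standing in for Xtrain/ytrain, where I'd expect bitwise-identical predictions across runs: cv seeded, estimator seeded, and n_jobs=1 to rule out parallel execution as a variable.)

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

# synthetic stand-in for Xtrain/ytrain
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)

cv = KFold(n_splits=5, random_state=42, shuffle=True)
rf = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=1)

p1 = cross_val_predict(rf, X, y, cv=cv)
p2 = cross_val_predict(rf, X, y, cv=cv)
assert np.array_equal(p1, p2)  # bitwise-identical predictions
```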

Example of (extremely small) score differences:

0.13200993923446158
0.13200993923446164
0.13200993923446153
0.13200993923446161

I'm trying to set seeds so I get the same score every time for the same model, in order to be able to compare different models. In Kaggle competitions very small differences in scores seem to matter (although admittedly not this small), but I'd just like to understand why. Does it have something to do with rounding in my machine when performing calculations? Any help is greatly appreciated!

Edit: I had originally forgotten the line rf_test.random_state=42, which caused much larger score disparities, but even with this line included I still see minuscule differences.
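(On the rounding question: floating-point addition is not associative, so anything that changes the order in which terms are summed, such as parallel reductions under n_jobs=-1, can shift a result in its last few bits, which matches the magnitude of the differences above. A quick illustration:)

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
assert left != right  # same terms, different grouping, different last bits
```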


There are 2 best solutions below


You are using cv2 while testing out your RandomForestRegressor. Have you set its random seed as well? Otherwise the splits used while testing your regressor will differ between runs.
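(To illustrate, since cv2 itself isn't shown in the question: a shuffled KFold without a random_state draws from the global RNG, so repeated split() calls produce different folds.)

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(200).reshape(100, 2)
cv = KFold(n_splits=5, shuffle=True)  # shuffle=True but no random_state

s1 = [test.tolist() for _, test in cv.split(X)]
s2 = [test.tolist() for _, test in cv.split(X)]
# with 100 samples these fold assignments will (almost surely) differ
assert s1 != s2
```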


A random forest is an ensemble of decision trees, and it uses randomness to choose the splits (and feature subsets) of those trees. It is very unlikely that you will get identical forests across two runs unless the estimator's random_state is fixed; I think that is where your slight variation comes from.
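(Assuming that is the cause, fixing random_state at construction time should make two separately built forests agree exactly. A sketch on synthetic data:)

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=80, n_features=4, random_state=0)

preds = []
for _ in range(2):
    # same seed -> same bootstrap samples and same split choices
    rf = RandomForestRegressor(n_estimators=25, random_state=42, n_jobs=1)
    rf.fit(X, y)
    preds.append(rf.predict(X))

assert np.array_equal(preds[0], preds[1])  # identical trees, identical output
```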