I have run into unexpected behaviour with the estimator returned by RandomizedSearchCV:

I am searching for the best parameters for a random forest. When I determine the accuracy with the resulting best estimator, I get different results compared to training a new random forest with the best parameters from the randomized search. Why is that?

Here is a code example for the randomized search (with just a few iterations):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Parameter distributions to sample from
n_estimators = np.linspace(start=100, stop=2500, num=11, dtype=int)
max_features = ['sqrt', None, 0.2, 0.4]
max_depth = [10, 20, 50, 75, 100, 125, 150]
min_samples_split = [2, 5, 8, 11]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
criterion = ['gini', 'entropy']
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion': criterion}
rf_base = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator=rf_base, param_distributions=random_grid,
                               n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1)

# gd is a custom data-loading helper (not shown here)
training_features, training_labels = gd.get_data(1)
test_features, test_labels = gd.get_data(2)

rf_random.fit(training_features, training_labels)
print("The best estimator: ", rf_random.best_estimator_)
print("The best score: ", rf_random.best_score_)

print('Training Accuracy: ', rf_random.score(training_features, training_labels))
print('Test Accuracy: ', rf_random.score(test_features, test_labels))

This returns, for example, n_estimators=1780, min_samples_split=11, min_samples_leaf=2, max_features=0.2, max_depth=20, criterion='entropy', bootstrap=False and a test accuracy of 0.8417.

But when I train a new model with these parameters, I get a test accuracy of, for example, 0.8339. The code looks like this:

training_features, training_labels = gd.get_data(1)
test_features, test_labels = gd.get_data(2)

rf = RandomForestClassifier()
rf.set_params(n_estimators=1780, min_samples_split=11, min_samples_leaf=2,
              max_features=0.2, max_depth=20, criterion='entropy', bootstrap=False)

rf.fit(training_features, training_labels)
print('Training Accuracy: ', rf.score(training_features, training_labels))
print('Test Accuracy: ', rf.score(test_features, test_labels))

1 Answer


The solution is to set random_state to the same value in both cases (it was missing for the new estimator). Without it, each RandomForestClassifier is built with a different source of randomness (bootstrap samples, feature subsets), so two forests with identical hyperparameters can still end up with slightly different accuracies.
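
For example, a minimal sketch of both fits with a fixed seed, re-using the names from the question (the value 42 is arbitrary):

# Fix the seed of the forest itself, not only of the search
rf_base = RandomForestClassifier(random_state=42)
rf_random = RandomizedSearchCV(estimator=rf_base, param_distributions=random_grid,
                               n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1)
rf_random.fit(training_features, training_labels)

# Rebuild the model with the same parameters and the same seed;
# best_params_ can be unpacked straight into the constructor
rf = RandomForestClassifier(random_state=42, **rf_random.best_params_)
rf.fit(training_features, training_labels)

# Both scores should now agree: with refit=True (the default), best_estimator_
# is refit on the full training set, just like rf, and with the same randomness
print('Search test accuracy: ', rf_random.score(test_features, test_labels))
print('Rebuilt test accuracy:', rf.score(test_features, test_labels))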