Why is RandomForestClassifier of scikit-learn not deterministic in explicit setup?

I would like to know why RandomForestClassifiers I create in Python using scikit-learn produce different results when I repeat learning on the same data set. Can someone explain this to me please?

The relevant part of my code is this:

from sklearn.ensemble import RandomForestClassifier as RFC
RFC(n_estimators=100, max_features=None, criterion="entropy", bootstrap=False)

This setup should actually prevent any randomness, right? With bootstrap=False every base learner is grown on the same data instances (the complete data set), and with max_features=None every feature is considered for every split.

One explanation I got is this: maybe the algorithm keeps some randomness in the order in which the features are drawn from the set of all features, e.g.:

  • Suppose there are two features, f1 and f2
  • For a split in node 1 the algorithm draws f1 first and f2 second when evaluating candidate splits
  • Both features might have the same split quality, so f1 is picked because it was drawn first
  • For a split in node 2 the algorithm might just as well draw f2 first and f1 second, so the resulting model can differ from other models created by the same algorithm
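
The tie-breaking scenario in the list above can be sketched in a few lines of plain Python. This is only a toy illustration, not scikit-learn's actual code: the function choose_split, the feature names, and the quality value 0.5 are all made up for the example, and the assumption is that a tie is won by whichever feature is visited first.

```python
def choose_split(visit_order, quality):
    """Return the first feature in visit_order that reaches the best quality."""
    best_feature, best_quality = None, float("-inf")
    for feature in visit_order:
        # Strict comparison: on a tie, the feature visited earlier is kept.
        if quality[feature] > best_quality:
            best_feature, best_quality = feature, quality[feature]
    return best_feature

# Hypothetical case: both features give exactly the same split quality.
quality = {"f1": 0.5, "f2": 0.5}

node1 = choose_split(["f1", "f2"], quality)  # f1 is visited first
node2 = choose_split(["f2", "f1"], quality)  # f2 is visited first
print(node1, node2)
```

Under that assumption the two nodes end up splitting on different features even though the data and the split qualities are identical, which would be enough to make two trained forests differ.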

Can anyone give reliable information?

There is 1 best solution below

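No, that set-up will not guarantee reproducible results. You've described exactly the case that occurs with most RF implementations: the order in which features are tried at each split, and therefore how ties between equally good splits are broken, depends on the seed of the random number generator.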

To control that, look in the documentation to find which random-number package your ML library uses. You can likely import that package and pass a fixed value to its seed function. If you want to try for a quick solution, try

import random
random.seed(<value>)

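... where <value> is any constant of your choice; I recommend an integer or string you like. I suggested this as a quick experiment in case your RF package uses the Python random package. Note, however, that scikit-learn draws its random numbers from NumPy rather than from Python's random module, so seeding random alone will not make RandomForestClassifier reproducible. Pass an integer as the random_state argument of the classifier, or seed NumPy's global generator with numpy.random.seed, instead.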

When no seed is given, most packages seed themselves from the system time or from operating-system entropy, precisely so that you get varying results from run to run.
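
For scikit-learn specifically, a minimal sketch of fixing the seed might look like the following. It uses the estimator's random_state parameter with the settings from the question. The synthetic data from make_classification, the helper name fit_and_predict, and the seed value 42 are assumptions made only for this illustration, not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; substitute your own data set).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def fit_and_predict(seed):
    """Train a forest with the question's settings plus a fixed seed."""
    clf = RFC(n_estimators=100, max_features=None, criterion="entropy",
              bootstrap=False, random_state=seed)
    clf.fit(X_train, y_train)
    # Held-out probabilities expose differences between trained forests.
    return clf.predict_proba(X_test)

a = fit_and_predict(42)
b = fit_and_predict(42)
print(np.array_equal(a, b))  # same seed, same data: identical forests expected
```

Leaving random_state at its default of None makes the estimator fall back on NumPy's global random state, which is why repeated runs can disagree.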