Huge variance for RandomForestRegressor models


The experiment is the following:

  1. train a RandomForestRegressor (RFR) on a training set of 15k rows
  2. get predictions on 8k test rows; save the predictions as y_hat0
  3. remove 1 random row from the training set and retrain the RFR
  4. save the predictions from the newly trained model as y_hat1
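The steps above can be sketched roughly as follows. This uses synthetic data and scaled-down sizes (2k train / 1k test rows, 100 trees) so it runs quickly; the data, sizes, and `fit_predict` helper are stand-ins, not the original setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical synthetic data standing in for the real train/test sets
X_train = rng.normal(size=(2000, 5))
y_train = X_train @ rng.normal(size=5) + rng.normal(scale=0.1, size=2000)
X_test = rng.normal(size=(1000, 5))

def fit_predict(X, y):
    # Same hyperparameters as in the question, fewer trees for speed
    model = RandomForestRegressor(n_estimators=100, min_samples_split=55,
                                  random_state=0, n_jobs=-1)
    model.fit(X, y)
    return model.predict(X_test)

# Step 1-2: fit on the full training set
y_hat0 = fit_predict(X_train, y_train)

# Step 3: drop one random training row and refit
drop = rng.integers(len(X_train))
mask = np.ones(len(X_train), dtype=bool)
mask[drop] = False
y_hat1 = fit_predict(X_train[mask], y_train[mask])

# Step 4 + comparison: relative differences between the two prediction vectors
rel_diff = np.abs(y_hat1 - y_hat0) / np.abs(y_hat0)
print(np.percentile(rel_diff, [50, 90, 99]))
```

Note that even with `random_state` fixed, dropping a row changes the bootstrap samples each tree draws, so the two forests are not built on identical subsamples.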

Comparing y_hat0 and y_hat1 shows large discrepancies in model output:

  • median diff: 1.8%
  • 90th percentile diff: 4.3%
  • 99th percentile diff: 6.7%

These are the results obtained with n_estimators=350 and min_samples_split=55.
Going up to n_estimators=2000 and min_samples_split=200 gives more stable results, but at a huge computational cost (roughly 6× the fitting time):

  • median diff: 0.5%
  • 90th percentile diff: 1.2%
  • 99th percentile diff: 1.9%

So I am wondering: should removing just 1 row out of 15k have such a dramatic impact on model outputs? I thought random forests were more robust than that to slight changes in the training data.

Any thoughts appreciated.
