The experiment is the following (a rough reproduction sketch follows the list):
- train a random forest regressor (RFR) on 15k training rows
- get predictions on 8k test rows and save them as y_hat0
- remove 1 random row from the training set and retrain the RFR
- save the predictions from the newly trained model as y_hat1
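Roughly, the setup looks like this. The data below is a synthetic stand-in (make_regression), and the random_state values and the row-dropping logic are just illustrative choices, not my actual pipeline:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in for the real data: 15k training rows, 8k test rows
X, y = make_regression(n_samples=23_000, n_features=20, noise=10.0, random_state=0)
X_train, y_train = X[:15_000], y[:15_000]
X_test = X[15_000:]

params = dict(n_estimators=350, min_samples_split=55, random_state=0, n_jobs=-1)

# fit on the full training set and predict
rf0 = RandomForestRegressor(**params).fit(X_train, y_train)
y_hat0 = rf0.predict(X_test)

# drop one random training row and refit with identical hyperparameters
drop = np.random.default_rng(0).integers(len(X_train))
keep = np.ones(len(X_train), dtype=bool)
keep[drop] = False
rf1 = RandomForestRegressor(**params).fit(X_train[keep], y_train[keep])
y_hat1 = rf1.predict(X_test)
```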
Comparing y_hat0 and y_hat1 shows large discrepancies in model output (the diff metric is sketched after the list):
- median diff: 1.8%
- 90th percentile diff: 4.3%
- 99th percentile diff: 6.7%
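The percentile summaries come from something like the following (taking the absolute relative change with respect to y_hat0 as the "diff"; that normalization is an illustrative choice):

```python
# per-row diff between the two prediction vectors, summarized at a few percentiles
rel_diff = np.abs(y_hat1 - y_hat0) / np.abs(y_hat0)
for p in (50, 90, 99):
    print(f"{p}th percentile diff: {np.percentile(rel_diff, p):.1%}")
```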
These are the results obtained with n_estimators=350 and min_samples_split=55.
Going up to n_estimators=2000 and min_samples_split=200 gives better agreement, but at a huge computational cost (roughly 6x the fitting time; see the timing snippet after the list):
- median diff: 0.5%
- 90th percentile diff: 1.2%
- 99th percentile diff: 1.9%
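In the sketch's terms the heavier configuration is just a parameter swap, and timing both fits shows the cost (absolute numbers on the synthetic data will of course differ from my real run):

```python
import time  # continuing from the sketch above (X_train, y_train, RandomForestRegressor)

for n_est, mss in [(350, 55), (2000, 200)]:
    cfg = dict(n_estimators=n_est, min_samples_split=mss, random_state=0, n_jobs=-1)
    t0 = time.perf_counter()
    RandomForestRegressor(**cfg).fit(X_train, y_train)
    print(f"n_estimators={n_est}, min_samples_split={mss}: fit in {time.perf_counter() - t0:.1f}s")
```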
So I am wondering: should removing just 1 row out of 15k really have such a dramatic impact on the model's outputs? I thought random forests were more robust than that to small changes in the training data.
Any thoughts appreciated