I am very new to machine learning. I am trying to explore fitting random forests with the ranger library in R. My dependent variable is continuous - so it would be a regression tree (and not just classification). Upon trying out the functions, I have noticed that there seems to be a discrepancy between ranger and predict ranger. The following lines result in different predictions in results and results_alternative:
rf_reg <- ranger(formula = y ~ ., data = training_df)
results <- rf_reg$predictions
results_alternative <- predict(rf_reg, data = training_df)$predictions
Could anybody please explain why there is a discrepancy and what is causing it? Which one is correct? I have tried it with classification on iris data and that seemed to give the same results. Many thanks!
The predictions in `rf_reg$predictions` are based only on the out-of-bag samples (as stated in the "Value" section of `?ranger`): each training observation is predicted only by the trees that did not see it during training. On the other hand, `predict.ranger` is based on all samples, so every tree contributes to every prediction, including the trees that were trained on that observation.

To demonstrate this, first let's train a RF model (using mtcars as some example data). We use the `keep.inbag = TRUE` argument, so that we will know which samples were in-bag versus out-of-bag for each tree.

Now we generate the predictions using three methods. The first two are the same as in the question. We also add a third predict method where we specify `predict.all = TRUE`, which will give us separate predictions for all trees. That will allow us to take averages of the individual tree predictions according to whether the observation was in or out of bag.

Now we can check

- whether `results.pred` is identical to the average across all trees from `results.all`; and
- whether `results.rf` is identical to the average from `results.all` across only those trees for which the observation was out of bag.

I demonstrate here for the first sample (i.e. the first row in mtcars).
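A minimal sketch of these steps, using the variable names from the text above. The seed, the formula `mpg ~ .`, and the helper name `oob_1` are assumptions chosen for illustration, and `all.equal()` is used instead of `identical()` to allow for floating-point rounding.

```r
library(ranger)

set.seed(42)  # arbitrary seed (assumption), only for reproducibility

# Train a regression forest and keep the per-tree in-bag counts
rf_reg <- ranger(mpg ~ ., data = mtcars, keep.inbag = TRUE)

# 1. Predictions stored in the fitted object (out-of-bag)
results.rf <- rf_reg$predictions
# 2. Predictions from predict() on the training data (all trees)
results.pred <- predict(rf_reg, data = mtcars)$predictions
# 3. One prediction per tree: matrix with rows = samples, columns = trees
results.all <- predict(rf_reg, data = mtcars, predict.all = TRUE)$predictions

# For each tree, was the first row of mtcars out of bag?
oob_1 <- sapply(rf_reg$inbag.counts, function(counts) counts[1] == 0)

# predict() for row 1 versus the mean over all trees
check_pred <- all.equal(results.pred[1], mean(results.all[1, ]))
# stored prediction for row 1 versus the mean over out-of-bag trees only
check_oob <- all.equal(results.rf[1], mean(results.all[1, oob_1]))

print(check_pred)
print(check_oob)
```

If the explanation above holds, both comparisons should come back `TRUE`, while `results.rf[1]` and `results.pred[1]` themselves generally differ.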
We can do the same check for all rows too:
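A sketch of the all-rows version, written to be runnable on its own. The seed and the helper names `oob_mask`, `mean_all_trees`, and `mean_oob_trees` are assumptions for illustration; the comparison again uses `all.equal()` rather than `identical()` to tolerate rounding.

```r
library(ranger)

set.seed(42)  # arbitrary seed (assumption), only for reproducibility

rf_reg <- ranger(mpg ~ ., data = mtcars, keep.inbag = TRUE)

results.rf   <- rf_reg$predictions
results.pred <- predict(rf_reg, data = mtcars)$predictions
results.all  <- predict(rf_reg, data = mtcars, predict.all = TRUE)$predictions

# Samples x trees logical matrix: TRUE where the sample was out of bag
oob_mask <- do.call(cbind, rf_reg$inbag.counts) == 0

# Mean over all trees, per sample
mean_all_trees <- rowMeans(results.all)
# Mean over only the out-of-bag trees, per sample
mean_oob_trees <- rowSums(results.all * oob_mask) / rowSums(oob_mask)

check_pred_all <- all.equal(as.numeric(results.pred), as.numeric(mean_all_trees))
check_oob_all  <- all.equal(as.numeric(results.rf), as.numeric(mean_oob_trees))

print(check_pred_all)
print(check_oob_all)
```

A sample that happens to be in-bag for every tree would have no out-of-bag prediction at all; with the default number of trees that is very unlikely, but it is worth keeping in mind for tiny forests.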