Why does ranger predict give different numbers when re-applied to training data?

Question


196 Views Asked by Jhonny At 19 February 2023 at 18:20

I am very new to machine learning and am exploring random forests with the ranger library in R. My dependent variable is continuous, so this is a regression forest rather than a classification one. While trying out the functions, I noticed a discrepancy between the predictions stored in the fitted ranger object and those returned by predict(). The following lines give different values in results and results_alternative:

rf_reg <- ranger(formula = y ~ ., data = training_df)

results <- rf_reg$predictions
results_alternative <- predict(rf_reg, data = training_df)$predictions

Could anybody please explain why there is a discrepancy and what is causing it? Which one is correct? I have tried it with classification on iris data and that seemed to give the same results. Many thanks!


**dww** · Answer 1 · 2024-02-28T16:52:10.150000

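The predictions in rf_reg$predictions are out-of-bag predictions (as stated in the "Value" section of ?ranger): for each training observation, only the trees that did not see that observation during fitting are averaged. predict.ranger, on the other hand, averages over all trees, including those that were trained on the observation.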
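One quick way to see that the stored predictions are the out-of-bag ones is to compare them with the OOB error that ranger reports for the model. The sketch below is only an illustration: it uses mtcars as stand-in data, an arbitrary seed, and relies on the documented behaviour that prediction.error is the OOB mean squared error for regression.

```r
library(ranger)

set.seed(1)  # arbitrary seed, only so the run is repeatable
fit <- ranger(mpg ~ ., data = mtcars)

oob_pred   <- fit$predictions                          # out-of-bag predictions
train_pred <- predict(fit, data = mtcars)$predictions  # all trees, in-sample

# Mean squared error of each set of predictions against the observed response
oob_mse   <- mean((mtcars$mpg - oob_pred)^2, na.rm = TRUE)
train_mse <- mean((mtcars$mpg - train_pred)^2)

# The OOB MSE should match the error stored in the fitted object
all.equal(oob_mse, fit$prediction.error)

# Compare the two error estimates side by side
c(oob = oob_mse, in_sample = train_mse)
```

The in-sample error is typically the smaller of the two, because every observation is scored partly by trees that were trained on it. That is why the out-of-bag values are usually the more honest estimate of performance on new data.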

To demonstrate this, first let's train an RF model (using mtcars as example data). We set keep.inbag = TRUE so that we know, for each tree, which samples were in-bag and which were out-of-bag.

library(ranger)
rf = ranger(formula = mpg ~ ., data = mtcars, keep.inbag = TRUE)

Now we generate the predictions using three methods. The first two are the same as in the question. We also add a third predict method where we specify predict.all = TRUE, which will give us separate predictions for all trees. That will allow us to take averages of the individual tree predictions according to whether the observation was in or out of bag.

results.rf   = rf$predictions   # based on out-of-bag samples
results.pred = predict(rf, data = mtcars)$predictions #  based on all samples
results.all  = predict(rf, data = mtcars, predict.all = TRUE)$predictions # has all trees and all samples

Now we can check:

whether results.pred is identical to the average of results.all across all trees; and
whether results.rf is identical to the average of results.all across only those trees for which the sample was out-of-bag.

I demonstrate here for the first sample (i.e. the first row in mtcars).

inbag.counts = sapply(rf$inbag.counts, \(x) x[1]) # how often sample 1 was drawn for each tree
oob = (inbag.counts == 0) # logical vector, TRUE for trees in which sample 1 was out-of-bag

all.equal(mean(results.all[1,    ]),  results.pred[1])
# [1] TRUE
all.equal(mean(results.all[1, oob]),  results.rf  [1])
# [1] TRUE

We can do the same check for all rows too:

for (i in seq_len(nrow(mtcars))) {
  inbag.counts = sapply(rf$inbag.counts, \(x) x[i]) # in-bag count of row i per tree
  oob = (inbag.counts == 0)                         # trees where row i was out-of-bag
  print(all.equal(mean(results.all[i,    ]),  results.pred[i]))
  print(all.equal(mean(results.all[i, oob]),  results.rf  [i]))
}
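The same check can also be written without a loop. This is a minimal sketch, assuming that rf$inbag.counts is a list with one count vector per tree (which is what keep.inbag = TRUE returns) and that predict.all = TRUE yields a samples-by-trees matrix for regression. The names per_tree, inbag, all_mean and oob_mean are introduced here for illustration only.

```r
library(ranger)

set.seed(1)  # arbitrary seed, only so the run is repeatable
rf <- ranger(mpg ~ ., data = mtcars, keep.inbag = TRUE)

# One column per tree, one row per sample
per_tree <- predict(rf, data = mtcars, predict.all = TRUE)$predictions

# Bind the per-tree in-bag counts into a samples-by-trees matrix
inbag <- do.call(cbind, rf$inbag.counts)
oob   <- inbag == 0  # TRUE where a sample was out-of-bag for a tree

# Average over all trees, and over out-of-bag trees only
all_mean <- rowMeans(per_tree)
oob_mean <- rowSums(per_tree * oob) / rowSums(oob)

all.equal(all_mean, predict(rf, data = mtcars)$predictions)
all.equal(oob_mean, rf$predictions)
```

With very few trees, a sample might never be out-of-bag, in which case rowSums(oob) would be zero for that row and the division would not give a usable value. With the default number of trees this is unlikely for a data set of this size.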
