For each observation in the data frame used to train a random forest, there is a set of trees (roughly one third of the forest, under default bootstrap sampling) for which that observation was not in-bag. I would like to get a measure of spread of these out-of-bag, tree-level predictions at each observation, ideally by retrieving a prediction from each such tree.
Is there a way to do this for random forest models fit using the ranger package in R?
library(ranger)
data("iris")
# Hold out 20% of the rows to use later as "new" data
set.seed(42)
iris_train <- sample(1:nrow(iris), size = floor(nrow(iris) * 0.8))
new_data <- setdiff(1:nrow(iris), iris_train)
rf <- ranger::ranger(
  formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
  data = iris[iris_train, ]
)
# OOB predictions (average only):
rf$predictions
Note that for new data, tree-level predictions can be retrieved from a ranger model using predict.ranger(..., predict.all = TRUE). I do not see an equivalent option for returning tree-level predictions that are in-sample but out-of-bag.
# New data predictions (all trees): for a regression forest,
# p$predictions is a matrix with one row per observation and one column per tree
p <- predict(rf, iris[new_data, ], predict.all = TRUE)
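For new data, the per-observation spread across trees can then be computed directly, e.g.:

apply(p$predictions, 1, sd)  # per-observation SD across all trees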
The way to do this is to set keep.inbag = TRUE when fitting the random forest. The resulting inbag.counts element is a list that gives, for each tree, a vector of how many times each training observation was used to grow that tree. We can use this to "mask" the all-tree predictions back onto the training set, keeping each tree's prediction only where the observation was out-of-bag for that tree. As a check, the row-wise average of the masked matrix should reproduce rf$predictions.
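Putting that together, a sketch against the same iris fit as above (names like p_oob and oob_sd are just illustrative):

library(ranger)
data("iris")
set.seed(42)
iris_train <- sample(1:nrow(iris), size = floor(nrow(iris) * 0.8))

# Refit with keep.inbag = TRUE so that rf$inbag.counts is populated
rf <- ranger(
  Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
  data = iris[iris_train, ],
  keep.inbag = TRUE
)

# Tree-level predictions on the training data:
# an n x num.trees matrix for a regression forest
p <- predict(rf, iris[iris_train, ], predict.all = TRUE)$predictions

# inbag.counts is a list with one element per tree;
# bind it into an n x num.trees matrix of in-bag counts
inbag <- do.call(cbind, rf$inbag.counts)

# Mask in-bag predictions: keep a tree's prediction only where
# the observation was out-of-bag for that tree
p_oob <- p
p_oob[inbag > 0] <- NA

# Per-observation spread of the OOB tree-level predictions
# (NA only if an observation happened to be in-bag in every tree)
oob_sd <- apply(p_oob, 1, sd, na.rm = TRUE)

# Check: row means of the masked matrix should match rf$predictions
all.equal(rowMeans(p_oob, na.rm = TRUE), rf$predictions)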