Access which number of trees had the lowest error when running random forest


I am following this example and I want to change one part of the code from:

# default RF model
m1 <- randomForest(
  formula = Sale_Price ~ .,
  data    = ames_train
)

# number of trees with lowest MSE
btree <- which.min(m1$mse)

to its ranger-based equivalent. The issue is that ranger doesn't give direct access to the number of trees with the lowest MSE. How can I calculate that number and store it in a variable (which I call btree)?

library(rsample)      # data splitting 
library(randomForest) # basic implementation
library(ranger)       # a faster implementation of randomForest

set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# for reproducibility
set.seed(123)

# default RF model
m1 <- randomForest(
  formula = Sale_Price ~ .,
  data    = ames_train
)

# the equivalent in ranger
m1 <- ranger(
      formula = Sale_Price ~ .,
      data    = ames_train
    )

# number of trees with lowest MSE (randomForest package)
btree <- which.min(m1$mse)

Based on the ranger documentation:

prediction.error: Overall out-of-bag prediction error. For classification this is accuracy (proportion of misclassified observations), for probability estimation the Brier score, for regression the mean squared error and for survival one minus Harrell's C-index.

So if I do:

  m1 <- ranger(
    formula = Sale_Price ~ .,
    data    = ames_train
  )
  
  # number of trees with highest r2
  btree = which.max(m1$prediction.error)
  print(btree)

The result is:

[1] 1

which obviously is not right.

Best answer (by rw2)

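I don't think there is a way to get this directly from the ranger output, because prediction.error is a single number for the whole forest rather than one value per tree count. But you could predict with the first i trees for each i and calculate the error yourself. For example: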

m1 <- ranger(
  formula = Sale_Price ~ .,
  data    = ames_train,
  keep.inbag = TRUE, 
  write.forest = TRUE 
)

num_trees <- m1$num.trees
mse <- numeric(num_trees)

# MSE on the training data using only the first i trees
# (note: this is in-sample error, not the out-of-bag error stored in randomForest's m1$mse)
for (i in seq_len(num_trees)) {
  pred <- predict(m1,
                  data = ames_train,
                  num.trees = i)$predictions
  mse[i] <- mean((pred - ames_train$Sale_Price)^2)
}

# number of trees with the lowest MSE
btree <- which.min(mse)
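
One caveat: the loop above scores each forest on the same rows it was trained on, whereas randomForest's m1$mse is an out-of-bag (OOB) estimate, so the two curves are not directly comparable. Below is a rough sketch of an OOB version. It assumes that fit$inbag.counts (available when keep.inbag = TRUE) lines up row for row with ames_train, and that predict(..., predict.all = TRUE) returns one column per tree for regression. The names fit, tree_pred, is_oob, cum_pred, cum_n, oob_pred and oob_mse are my own, not part of ranger.

```r
library(rsample)
library(ranger)

# Same split as in the question
set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7)
ames_train <- training(ames_split)

fit <- ranger(
  formula    = Sale_Price ~ .,
  data       = ames_train,
  keep.inbag = TRUE,   # needed to know which rows each tree did not see
  seed       = 123
)

# One column of predictions per tree (n_obs x num.trees)
tree_pred <- predict(fit, data = ames_train, predict.all = TRUE)$predictions

# TRUE where an observation was NOT drawn into a tree's bootstrap sample
is_oob <- do.call(cbind, fit$inbag.counts) == 0

# Per observation: running sum of OOB predictions and running count of OOB trees
tree_pred[!is_oob] <- 0
cum_pred <- t(apply(tree_pred, 1, cumsum))
cum_n    <- t(apply(is_oob, 1, cumsum))

# OOB prediction from the first k trees (NaN while a row has no OOB tree yet)
oob_pred <- cum_pred / cum_n

# OOB MSE after 1, 2, ..., num.trees trees, ignoring rows without an OOB prediction
oob_mse <- colMeans((oob_pred - ames_train$Sale_Price)^2, na.rm = TRUE)

# number of trees with the lowest OOB MSE
btree <- which.min(oob_mse)
```

As a sanity check, the last element of oob_mse should be close to fit$prediction.error, since both are OOB mean squared errors for the full forest.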