How to plot OOB error vs. Number of trees using ranger?


I want to find the optimal number of trees for a random forest by plotting the OOB error vs. the number of trees and seeing at which point the error plateaus. However, since my problem involves text mining, my training data is a sparse matrix (a dgCMatrix). This means I cannot use the randomForest package to train my model, as randomForest does not support sparse matrices. Instead I have to use the ranger package, but ranger does not report the OOB error as a function of the number of trees. I have tried converting my sparse matrix to a data frame of dimension 90,000 by 5,500 to run in randomForest, but it takes a very long time even with parallel execution and I do not have that computing capacity.

So my questions are:

  1. How can I plot the OOB error vs. Number of trees using ranger?

  2. What are other methods to convert a sparse matrix into a data frame? So far I have tried

    train_matrix <- as.data.frame(as.matrix(train_dtm))

  3. What are some ways to reduce the runtime of randomForest on the converted data frame?

  4. Are there other ways to determine the optimal number of trees without plotting the OOB error vs. Number of trees if the above fails?

Would appreciate any help if possible. Thanks!
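(For question 2: a minimal sketch of one alternative, converting the dgCMatrix in row chunks so only one dense chunk exists at a time. `train_dtm` here is a small random stand-in for the real matrix, and the chunk size of 200 is an arbitrary choice.)

```r
library(Matrix)

# Small reproducible sparse matrix standing in for train_dtm
train_dtm <- rsparsematrix(nrow = 1000, ncol = 50, density = 0.01)

# Convert in row chunks so only one dense chunk is held at a time
chunk_size <- 200
chunks <- split(seq_len(nrow(train_dtm)),
                ceiling(seq_len(nrow(train_dtm)) / chunk_size))
train_df <- do.call(rbind, lapply(chunks, function(idx) {
  as.data.frame(as.matrix(train_dtm[idx, , drop = FALSE]))
}))
```

Note that the final data frame is still fully dense; chunking only avoids the memory peak of holding the complete sparse and dense copies side by side during conversion, so it may not be enough for a 90,000 x 5,500 matrix.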

1 Answer
I ran into a similar issue and ended up taking the poor man's approach (this only answers your first question):

library(ranger)

# sample data
# install.packages("AmesHousing")
d <- AmesHousing::make_ames()

# candidate numbers of trees: 1, 11, 21, ..., 491, 501
nt <- seq(1, 501, 10)

oob_mse <- vector("numeric", length(nt))

# fit a separate forest for each tree count and record its OOB error
for (i in seq_along(nt)) {
  rf <- ranger(Sale_Price ~ ., d, num.trees = nt[i], write.forest = FALSE)
  oob_mse[i] <- rf$prediction.error  # OOB MSE for regression
}

plot(x = nt, y = oob_mse, col = "red", type = "l",
     xlab = "Number of trees", ylab = "OOB MSE")

I don't know if there is an "optimal" number of trees, but building more trees than needed can slow down your predictions considerably, particularly when doing partial dependence plots. That's the only reason I did this.
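A sketch of an alternative that avoids refitting a forest for every point on the curve: fit once with `keep.inbag = TRUE` and reconstruct the running OOB error from per-tree predictions (`predict(..., predict.all = TRUE)` returns one prediction column per tree). Shown on a built-in dataset for the regression case; whether this is faster than refitting depends on your data size.

```r
library(ranger)

d <- mtcars
y <- d$mpg

# Fit once with the maximum number of trees, keeping in-bag counts
rf <- ranger(mpg ~ ., d, num.trees = 200, keep.inbag = TRUE)

# n x num.trees matrix of per-tree predictions on the training data
pred_all <- predict(rf, d, predict.all = TRUE)$predictions

# n x num.trees logical matrix: TRUE where observation i is OOB for tree j
oob <- sapply(rf$inbag.counts, function(cnt) cnt == 0)

nt <- seq(10, 200, 10)
oob_mse <- sapply(nt, function(t) {
  # average each observation's predictions over its OOB trees among the first t
  num <- rowSums(pred_all[, 1:t] * oob[, 1:t])
  den <- rowSums(oob[, 1:t])
  # na.rm drops observations that are in-bag for all of the first t trees
  mean((num / den - y)^2, na.rm = TRUE)
})

plot(nt, oob_mse, type = "l", col = "red",
     xlab = "Number of trees", ylab = "OOB MSE")
```

The curve will not match the refitting loop exactly (different random forests), but it shows the same plateau behavior from a single fit.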