When I use ranger for a classification model and treeInfo()
to extract a tree, I see that sometimes a split results in two identical terminal nodes.
Is this expected behaviour? Why does it make sense to introduce a split where the final nodes are the same?
From this question, I take that the prediction
variable could be the majority class (albeit for python and another random forest implementation). The ranger ?treeInfo
documentation says it should be the predicted class.
MWE
library(ranger)
data <- iris
data$is_versicolor <- factor(data$Species == "versicolor")
data$Species <- NULL
rf <- ranger(is_versicolor ~ ., data = data,
num.trees = 1, # no need for many trees in this example
max.depth = 3, # keep depth at an understandable level
seed = 1351, replace = FALSE)
treeInfo(rf, 1)
#> nodeID leftChild rightChild splitvarID splitvarName splitval terminal prediction
#> 1 0 1 2 2 Petal.Length 2.60 FALSE <NA>
#> 2 1 NA NA NA <NA> NA TRUE FALSE
#> 3 2 3 4 3 Petal.Width 1.75 FALSE <NA>
#> 4 3 5 6 2 Petal.Length 4.95 FALSE <NA>
#> 5 4 7 8 0 Sepal.Length 5.95 FALSE <NA>
#> 6 5 NA NA NA <NA> NA TRUE TRUE
#> 7 6 NA NA NA <NA> NA TRUE TRUE
#> 8 7 NA NA NA <NA> NA TRUE FALSE
#> 9 8 NA NA NA <NA> NA TRUE FALSE
In this example, the last four rows (final nodes with nodeID 5 and 6, as well as 7 and 8) have the prediction TRUE
and FALSE
.
Graphically this would look like this
I think I found a (partial) answer to the issue, namely the
mtry
andmin.node.size
arguments and their functionality.As the random forest chooses only
mtry
variables at each split, the final split might take only variables into account, which do not split the data in a way that results in a maximum gini difference (or whatever metric was chosen) but still in each final node, a given class might prevail.Playing around with
mtry
andmin.node.size
can change this. But we still might get splits with the same results.