I am using rpart to run a regression tree analysis within the caret package, with the oneSE option as the selection function. When I do, I often end up with a model with zero splits, which suggests that no model at all is better than any model. Should this be happening?
Here's an example:
# set training controls (note: the argument is "number", not "num")
tc <- trainControl(method="repeatedcv", number=10, repeats=100, selectionFunction="oneSE")
# run the model
mod <- train(yvar ~ ., data=dat, method="rpart", trControl=tc)
# it runs.....
# look at the cptable of the final model
printcp(mod$finalModel)
Here's the model output:
> mod
No pre-processing
Resampling: Cross-Validation (10 fold, repeated 100 times)
Summary of sample sizes: 81, 79, 80, 80, 80, 80, ...
Resampling results across tuning parameters:
  cp      RMSE   Rsquared  RMSE SD  Rsquared SD
  0.0245  0.128  0.207     0.0559   0.23
  0.0615  0.127  0.226     0.0553   0.241
  0.224   0.123  0.193     0.0534   0.195
RMSE was used to select the optimal model using the one SE rule.
The final value used for the model was cp = 0.224.
Here's the output of printcp:
Variables actually used in tree construction:
character(0)
Root node error: 1.4931/89 = 0.016777
n= 89
CP nsplit rel error
1 0.22357 0 1
However, if I just run the model directly in rpart, I can see the larger, unpruned tree that was trimmed to the supposedly more parsimonious model above:
unpruned = rpart(yvar ~., data=dat)
printcp(unpruned)
Regression tree:
rpart(formula = yvar ~ ., data = dat)
Variables actually used in tree construction:
[1] c.n.ratio Fe.ppm K.ppm Mg.ppm NO3.ppm
Root node error: 1.4931/89 = 0.016777
n= 89
CP nsplit rel error xerror xstd
1 0.223571 0 1.00000 1.0192 0.37045
2 0.061508 2 0.55286 1.1144 0.33607
3 0.024537 3 0.49135 1.1886 0.38081
4 0.010539 4 0.46681 1.1941 0.38055
5 0.010000 6 0.44574 1.2193 0.38000
Caret [I think] is trying to find the simplest tree whose RMSE is within one standard error of the model with the lowest RMSE. This is similar to the 1-SE approach advocated in Venables and Ripley. In this case, it seems to get stuck picking the model with no splits, even though that model has no explanatory power.
Is this right? Is this OK? It seems there should be a rule to prevent selection of a model with no splits.
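If it helps to see the rule in action, caret exports the oneSE() selection function itself, so it can be applied directly to a results table like the one above. This is a sketch; the RMSESD column name and the num argument (the total number of resamples) are my assumptions about how train() invokes it internally:

```r
library(caret)

# Resampling results copied from the question: cp, mean RMSE, and RMSE SD
res <- data.frame(cp     = c(0.0245, 0.0615, 0.224),
                  RMSE   = c(0.128, 0.127, 0.123),
                  RMSESD = c(0.0559, 0.0553, 0.0534))

# oneSE() returns the row index of the simplest candidate model whose
# RMSE falls within one standard error of the lowest RMSE; num is the
# number of resamples (10 folds x 100 repeats = 1000)
oneSE(res, metric = "RMSE", num = 1000, maximize = FALSE)
```

With these numbers the one-SE band around the best RMSE is narrow, so only the cp = 0.224 row qualifies, which matches the zero-split model train() selected.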
Try eliminating
selectionFunction="oneSE"
from the call. That should select the cp value with the smallest observed error. In doing so, there is some potential for "optimization bias" from picking the minimum observed RMSE, but I have found that bias to be small in practice.
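For example (a sketch, assuming the same dat and yvar as in the question), dropping selectionFunction falls back to the default "best" rule:

```r
library(caret)

# Default selectionFunction is "best": pick the cp with the lowest RMSE
tc  <- trainControl(method = "repeatedcv", number = 10, repeats = 100)
mod <- train(yvar ~ ., data = dat, method = "rpart", trControl = tc)
mod$bestTune   # the cp value chosen by the "best" rule
```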
Max