rpart models collapse to zero splits in caret


I am using rpart to run a regression tree analysis within the caret package, with the oneSE option as the selection function. When I do, I often end up with a model with zero splits. In effect, the procedure concludes that a tree with no splits (an intercept-only model) is preferable to any tree with splits. Should this be happening?

Here's an example:

# set training controls: 10-fold CV repeated 100 times, one-SE model selection
# (note: the trainControl argument is `number`, not `num`)
tc <- trainControl(method = "repeatedcv", number = 10, repeats = 100,
                   selectionFunction = "oneSE")

# run the model
mod <- train(yvar ~ ., data = dat, method = "rpart", trControl = tc)

# it runs.....
# look at the cptable of the final model
printcp(mod$finalModel)

Here's the model output:

> mod
No pre-processing
Resampling: Cross-Validation (10 fold, repeated 100 times) 

Summary of sample sizes: 81, 79, 80, 80, 80, 80, ... 

Resampling results across tuning parameters:

  cp      RMSE   Rsquared  RMSE SD  Rsquared SD
  0.0245  0.128  0.207     0.0559   0.23       
  0.0615  0.127  0.226     0.0553   0.241      
  0.224   0.123  0.193     0.0534   0.195      

RMSE was used to select the optimal model using the one SE rule.
The final value used for the model was cp = 0.224.

Here's the output of printcp:

Variables actually used in tree construction:

character(0)
Root node error: 1.4931/89 = 0.016777

n= 89

        CP nsplit rel error
1  0.22357      0         1

However, if I just run the model directly in rpart, I can see the larger, unpruned tree that was trimmed to the supposedly more parsimonious model above:

unpruned <- rpart(yvar ~ ., data = dat)
printcp(unpruned)

Regression tree:
rpart(formula = yvar ~ ., data = dat)

Variables actually used in tree construction:
[1] c.n.ratio Fe.ppm    K.ppm     Mg.ppm    NO3.ppm  

Root node error: 1.4931/89 = 0.016777

n= 89 

    CP nsplit rel error xerror    xstd
1 0.223571      0   1.00000 1.0192 0.37045
2 0.061508      2   0.55286 1.1144 0.33607
3 0.024537      3   0.49135 1.1886 0.38081
4 0.010539      4   0.46681 1.1941 0.38055
5 0.010000      6   0.44574 1.2193 0.38000

Caret [I think] is trying to find the smallest tree whose RMSE is within one standard error of the model with the lowest RMSE. This is similar to the 1-SE approach advocated in Venables and Ripley. In this case, it seems to get stuck picking the model with no splits, even though that model has no explanatory power.
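For what it's worth, the selection can be reproduced by hand from the resampling table above. This is a rough sketch in base R, assuming the rule is: take the model with the lowest RMSE, add one standard error of its resampled RMSE, and keep the simplest model (largest cp) whose RMSE stays within that bound. The exact formula caret applies internally may differ in detail.

one_se_pick <- function(cp, rmse, rmse_sd, n_resamples) {
  best   <- which.min(rmse)                              # numerically best model
  bound  <- rmse[best] + rmse_sd[best] / sqrt(n_resamples)  # best RMSE + one SE
  within <- which(rmse <= bound)                         # models inside the bound
  cp[within][which.max(cp[within])]                      # largest cp = simplest tree
}

# Values from the resampling table above (10-fold CV repeated 100 times = 1000 resamples)
one_se_pick(cp          = c(0.0245, 0.0615, 0.224),
            rmse        = c(0.128,  0.127,  0.123),
            rmse_sd     = c(0.0559, 0.0553, 0.0534),
            n_resamples = 1000)
# returns 0.224, matching caret's choice of the stump

Here the lowest RMSE already belongs to the largest cp, so the one-SE bound admits only that model and the no-split tree wins outright.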

Is this right? Is this OK? It seems there should be a rule to prevent selection of a model with no splits.

1 Answer


Try eliminating selectionFunction="oneSE".

That should select the cp value with the smallest resampled error outright. In doing so, there is some potential for "optimization bias" from picking the minimum observed RMSE, but I have found that to be small in practice.
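Dropping the oneSE option means caret falls back to its default selection function, "best". A minimal self-contained sketch with simulated data (the data frame and column names here are made up for illustration; the original 89-row data set is not shown):

library(caret)   # also requires the rpart package to be installed

set.seed(1)
# simulated stand-in for the original data
dat <- data.frame(yvar = rnorm(89), x1 = rnorm(89), x2 = rnorm(89))

# no selectionFunction argument: defaults to "best"
tc  <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
mod <- train(yvar ~ ., data = dat, method = "rpart", trControl = tc)

mod$bestTune   # the cp value with the lowest resampled RMSE

With random noise as the response, the chosen tree may still be a stump, but the selection is now driven by the minimum RMSE rather than a parsimony bound.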

Max