Random forest with k-fold cross-validation using the caret package in R (best AUC)


I have a problem that I haven't been able to solve for many days: I cannot understand exactly how the trainControl function of the caret package works in R. I need to cross-validate a random forest (10-fold) and thought that caret would automatically pick one test set at a time (10% of my dataset, 10 times, 10 different test sets) and validate the random forest on it, training on the remaining 90% of the dataset. Yet all the tutorials on the web pass a training set to the train function and a separate test set to the predict function... why?

Specifically, I need to classify a binary outcome against 5 mixed variables, working stepwise (starting from a pair of variables), and study specificity, sensitivity, accuracy and AUC. I then need to keep the variables that give the best model (best AUC), add another variable to the model, and repeat the iteration.

Can someone kindly explain how this works once and for all? Are there examples of cross-validation done both with a manual loop and automatically with caret?

I would prefer to learn a manual loop, so that I can reuse it for various analyses. I would be very grateful.

Thanks a lot.


There are 2 answers below.

Answer 1

Here are some thoughts:

Feature Engineering

Instead of applying the model over every feature of your df, I suggest applying feature selection based on your data.

This table could help you choose a test for selecting features: to compare the response variable (the binary category) against other categorical variables, use the chi-squared test; if those variables are continuous instead, you could use an ANOVA test.

###############################################################
#                Categorical           Continuous             #
#  Categorical   Chi-Squared Test      ANOVA                  #
#  Continuous    ANOVA                 Pearson Correlation    #
###############################################################

Of course you could use another non-parametric statistical test such as Kruskal-Wallis; however, I would need to see the structure of your df to give a final opinion on which test to use.
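As a rough sketch of that kind of univariate screening in base R (df, y, x_cat and x_num are placeholder names of mine, not taken from the question):

# df is a data.frame with a binary outcome y, a categorical predictor x_cat
# and a continuous predictor x_num (all placeholder names)
chisq.test(table(df$y, df$x_cat))      # categorical outcome vs. categorical predictor
summary(aov(x_num ~ y, data = df))     # categorical outcome vs. continuous predictor (ANOVA)
kruskal.test(x_num ~ y, data = df)     # non-parametric alternative to the ANOVA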

Train Set vs. Validation Set

A: The training set is used for the model to learn, i.e. to estimate its parameters, while the test set is used only to assess model performance; that's why the training set goes to the train function and the test set to the predict function.

For more info you can dive in here: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
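If it helps, here is a rough sketch of that split in caret's own idiom; df and the outcome column y are placeholders of mine, not from the question:

library(caret)

set.seed(111)
idx      = createDataPartition(df$y, p = 0.9, list = FALSE)  # stratified 90/10 split
trainset = df[idx, ]    # passed to train(); cross-validation happens inside this set
testset  = df[-idx, ]   # held back; used only with predict() to assess performance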

Model Building:

For the model building you could use the h2o.ai package. Here's the documentation: https://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.randomForest.html

This function has some parameters that you can use to set up your model, and it already handles the task of reporting specificity, sensitivity, accuracy, recall, F1-score and AUC.

stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE",
    "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error",
    "custom", "custom_increasing")

# This parameter lets you choose "AUC" as the metric used for early stopping
# (it takes effect together with stopping_rounds).

nfolds = 0
fold_assignment = c("AUTO", "Random", "Modulo", "Stratified")

# These parameters let you choose the number of folds (set nfolds to e.g. 10) and how rows
# are assigned to them, so the model is cross-validated with no need to iterate manually.

keep_cross_validation_models = TRUE
keep_cross_validation_predictions = TRUE
keep_cross_validation_fold_assignment = TRUE

# These parameters keep the per-fold models, predictions and fold assignments,
# so you can inspect the accuracy of each individual fold.
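For instance, a minimal sketch of such a call (the data frame df, the outcome column "y" and the particular settings are placeholders of mine, not taken from the question):

library(h2o)
h2o.init()

hf = as.h2o(df)                      # df is a placeholder for your data.frame
hf$y = h2o.asfactor(hf$y)            # make sure the binary outcome is treated as a class

model = h2o.randomForest(
  x = setdiff(names(hf), "y"),       # all other columns as predictors
  y = "y",
  training_frame = hf,
  nfolds = 10,                       # 10-fold cross-validation, no manual loop needed
  fold_assignment = "Stratified",
  keep_cross_validation_predictions = TRUE,
  seed = 111
)

perf = h2o.performance(model, xval = TRUE)   # metrics computed on the CV hold-outs
h2o.auc(perf)                                # cross-validated AUC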
Answer 2

We can break this down into parts:

the caret package could automatically pick one test set at a time (10% of my dataset, 10 times, 10 different test sets) and validate the random forest on it, training on the remaining 90% of the dataset.

Yes, this will be done within the training set you provide. For example, as in most tutorials you have seen, we can set trainControl to return all the resampling results using returnResamp = "all", which makes this clear:

library(caret)

set.seed(111)
dat = iris
dat$Species = factor(ifelse(dat$Species == "versicolor", "v", "o"))
ix = sample(nrow(dat), 100)
trainset = dat[ix, ]
testset = dat[-ix, ]

trcontrol = trainControl(method = "cv", number = 10, savePredictions = TRUE,
                         classProbs = TRUE, summaryFunction = twoClassSummary,
                         returnResamp = "all")

model = train(Species ~ ., data = trainset, method = "rf",
              trControl = trcontrol, metric = "ROC")

model$resample

         ROC      Sens Spec mtry Resample
1  1.0000000 0.8333333 1.00    2   Fold01
2  1.0000000 0.8333333 1.00    3   Fold01
3  1.0000000 0.8333333 1.00    4   Fold01
4  1.0000000 1.0000000 0.75    2   Fold02
5  1.0000000 1.0000000 0.75    3   Fold02
6  1.0000000 1.0000000 0.75    4   Fold02

You can see that, for every held-out fold, the ROC (here the AUC, the area under the ROC curve) is calculated. In total, each value of the mtry parameter is evaluated on 10 folds (with the other folds used for training). caret then takes the average and selects the model with the best AUC:

Random Forest 

100 samples
  4 predictor
  2 classes: 'o', 'v' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 90, 90, 89, 90, 90, 90, ... 
Resampling results across tuning parameters:

  mtry  ROC        Sens       Spec
  2     1.0000000  0.9523810  0.95
  3     1.0000000  0.9523810  0.95
  4     0.9958333  0.9357143  0.95

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
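To see where these averages come from, you can reproduce them from the per-fold results stored in the model object above:

aggregate(cbind(ROC, Sens, Spec) ~ mtry, data = model$resample, FUN = mean)

model$bestTune   # the mtry value caret finally selected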

Now the held-out fold (within the training set) might be small, and ideally, once we have a rough idea of the hyperparameters to use, we want to check the predictive performance of our model on a larger unseen set.

Hence we retain another portion of the data to check this. It is normally called the test or validation set, like the one created at the start of this example, and we can check it:

confusionMatrix(predict(model, testset), testset$Species)
Confusion Matrix and Statistics
Confusion Matrix and Statistics

          Reference
Prediction  o  v
         o 37  0
         v  2 11

So in this case we see that the model predicts unseen data pretty well. How well that generalizes really depends on your data and how likely your model is to overfit.
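Finally, since the question also asks for a manual loop: below is a minimal sketch of a hand-rolled 10-fold cross-validation on the same toy data, computing the per-fold AUC with the pROC package (pROC and the details of this loop are my own suggestion, not part of the answers above):

library(caret)         # createFolds
library(randomForest)  # the random forest itself
library(pROC)          # roc() / auc()

set.seed(111)
dat = iris
dat$Species = factor(ifelse(dat$Species == "versicolor", "v", "o"))

folds = createFolds(dat$Species, k = 10)   # list of 10 sets of test-row indices
aucs = sapply(folds, function(test_idx) {
  train_dat = dat[-test_idx, ]             # ~90% of the data for training
  test_dat  = dat[test_idx, ]              # the held-out ~10%
  rf = randomForest(Species ~ ., data = train_dat)
  p  = predict(rf, newdata = test_dat, type = "prob")[, "v"]
  as.numeric(auc(roc(test_dat$Species, p, levels = c("o", "v"), direction = "<")))
})
mean(aucs)   # cross-validated AUC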