I have a fairly serious problem that I haven't been able to solve for many days! I cannot understand exactly how the trainControl function of the caret package works in R. I need to cross-validate (10-fold) a random forest, and I thought caret could automatically pick just one test set (10% of my dataset) at a time (10 times, 10 different test sets), validate the random forest on it, and train on the remaining 90% of the dataset. Yet all the tutorials on the web pass the training set to the train function and the test set to the predict function... but why?!
Specifically, I need to classify a binary category with respect to 5 mixed variables (stepwise, starting from a pair of variables) and study specificity, sensitivity, accuracy and AUC. Then I need to choose the best variables for the model (best AUC), add another variable to the model, and repeat the iteration.
Can someone kindly explain to me how it works once and for all? And are there examples of repeated cross-validation done both with a manual loop and automatically with caret?
I would prefer to learn a manual loop, so that I can reuse it for various analyses. I would be very grateful to you.
Thanks a lot.
Here are some thoughts:
Feature Engineering
Instead of applying the model over every feature in your data frame, I suggest applying feature selection based on your data;
A simple guideline for selecting the features: if you wish to compare the response variable (binary category) against other categorical variables, use the chi-squared test; if instead those variables are continuous, you could use an ANOVA test.
Of course you could use other non-parametric statistical tests such as Kruskal-Wallis; however, I would need to see your data frame structure in order to give a final opinion on which test to use.
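For illustration, here is a minimal sketch of that screening in base R. It assumes a data frame `df` with a binary factor response `y`, one categorical predictor `cat_var` and one continuous predictor `num_var`; all of these names are placeholders you would swap for your own columns:

```r
# Toy data frame standing in for your real data (placeholder names)
set.seed(42)
df <- data.frame(
  y       = factor(sample(c("no", "yes"), 200, replace = TRUE)),
  cat_var = factor(sample(LETTERS[1:3], 200, replace = TRUE)),
  num_var = rnorm(200)
)

# Categorical predictor vs. binary response: chi-squared test
chisq.test(table(df$y, df$cat_var))

# Continuous predictor vs. binary response: one-way ANOVA
summary(aov(num_var ~ y, data = df))

# Non-parametric alternative: Kruskal-Wallis
kruskal.test(num_var ~ y, data = df)
```

You would keep the predictors whose tests suggest an association with the response and feed only those into the model.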
Train Set vs. Validation Set
A: The training set is used to let the model learn, i.e. to estimate its parameters, while the test set is used only to assess model performance; that's why the training set goes into the train function and the test set into the predict function.
For more info you can dive in here: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
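To make the connection to your question concrete, here is a minimal sketch of 10-fold cross-validation handled automatically by caret. `df` and the column name `y` are placeholders; `twoClassSummary` needs a factor response with valid level names and class probabilities turned on:

```r
library(caret)

set.seed(123)
ctrl <- trainControl(
  method          = "cv",            # k-fold cross-validation
  number          = 10,              # caret builds the 10 train/test splits for you
  classProbs      = TRUE,            # needed for ROC/AUC
  summaryFunction = twoClassSummary  # reports ROC (AUC), Sens, Spec
)

rf_fit <- train(
  y ~ .,                 # binary outcome vs. the candidate predictors
  data      = df,
  method    = "rf",      # random forest (uses the randomForest package)
  metric    = "ROC",     # pick the tuning result with the best AUC
  trControl = ctrl
)

rf_fit$results   # mean ROC, sensitivity and specificity across the 10 folds
```

And the same idea written as a manual loop, which is roughly what caret does under the hood. The level name `"yes"` and the use of pROC for the AUC are my assumptions, not something caret imposes:

```r
library(caret)          # for createFolds
library(randomForest)
library(pROC)

folds <- createFolds(df$y, k = 10)   # 10 disjoint test-index sets
aucs <- sapply(folds, function(test_idx) {
  train_df <- df[-test_idx, ]        # 90% used to fit the model
  test_df  <- df[test_idx, ]         # 10% held out, used only with predict()
  fit  <- randomForest(y ~ ., data = train_df)
  prob <- predict(fit, test_df, type = "prob")[, "yes"]
  as.numeric(auc(roc(test_df$y, prob, quiet = TRUE)))
})
mean(aucs)   # cross-validated AUC
```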
Model Building:
For the model building, you could use the h2o.ai package. Here's the documentation: https://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.randomForest.html. This function has parameters you can use to set up your model, and it already handles the task of reporting specificity, sensitivity, accuracy, recall, F1-score and AUC.
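A minimal sketch of that route, again assuming a data frame `df` with a binary response column `y` (placeholder names); the nfolds argument makes h2o run the 10-fold cross-validation for you:

```r
library(h2o)
h2o.init()

hf <- as.h2o(df)
hf$y <- as.factor(hf$y)   # make sure the response is treated as categorical

rf <- h2o.randomForest(
  x = setdiff(names(hf), "y"),  # predictor columns
  y = "y",                      # binary response
  training_frame = hf,
  nfolds = 10,                  # 10-fold cross-validation
  seed = 1234
)

h2o.performance(rf, xval = TRUE)  # cross-validated metrics (AUC, confusion matrix, ...)
h2o.auc(rf, xval = TRUE)          # cross-validated AUC only
```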