I've been using caret::createDataPartition()
to split my data in a stratified way. Now I'm trying another approach that I found here on Stack Overflow, splitstackshape::stratified()
, and the reason I'm interested in it is that it allows stratifying on features that I choose manually, which is very handy.
I have a problem with splitting the data:
library(splitstackshape)
set.seed(40)
Train = stratified(Data, c('age','gender','treatment_1','treatment_2','cancers'), 0.75)
This produces the train set, but how do I get the test set? I tried the createDataPartition
way:
INDEX = stratified(Data, c('age','gender','treatment_1','treatment_2','cancers'), 0.75)
Train = Data[INDEX , ]
Test = Data[-INDEX ,]
But that doesn't work, because stratified
returns the actual training data, not a vector of row indices.
So how do I get the test data using this function? Thanks!
If you add a unique sequential row identifier to the data, you can use it to extract the rows that were not selected for the training data frame. We'll use
mtcars
for a reproducible example.
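A sketch of that approach (the original snippet wasn't preserved, so the grouping variables cyl and am here are illustrative, not the answerer's exact choice):

```r
library(splitstackshape)

# use mtcars and add a unique sequential row identifier
data <- mtcars
data$rowId <- seq_len(nrow(data))

set.seed(40)
# stratify on cyl and am (illustrative grouping variables)
train <- stratified(data, c("cyl", "am"), 0.75)

head(train)
```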
Next level of detail...
The
stratified()
function extracts a set of rows based on the by groups passed to the function. By adding a rowId
field, we can track the observations that are included in the training data. We then use the extract operator to create the test data frame, negating the selection with the ! operator:
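Repeating the setup so the snippet is self-contained (cyl and am remain illustrative grouping variables), the test rows are the ones whose rowId does not appear in the training data:

```r
library(splitstackshape)

data <- mtcars
data$rowId <- seq_len(nrow(data))

set.seed(40)
train <- stratified(data, c("cyl", "am"), 0.75)

# rows NOT selected for training form the test set
test <- data[!(data$rowId %in% train$rowId), ]
```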
Finally, we count the number of rows to be included in the test data frame; given the selection criteria, this should equal 32 - 19, or 13:
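With the illustrative grouping used here, the exact split may differ from the 19/13 of the original example, but the two counts always sum to nrow(mtcars) because every row lands in exactly one partition:

```r
library(splitstackshape)

data <- mtcars
data$rowId <- seq_len(nrow(data))

set.seed(40)
train <- stratified(data, c("cyl", "am"), 0.75)
test  <- data[!(data$rowId %in% train$rowId), ]

nrow(mtcars)              # 32
nrow(train) + nrow(test)  # 32: train and test partition the data
```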
Comparison to the bothSets argument
Another answer noted that the
stratified()
function includes an argument, bothSets
, that generates a list containing both the sampled data and the remaining data. We can demonstrate the equivalence of the two approaches as follows.
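A self-contained sketch (again stratifying on the illustrative variables cyl and am); because both calls use the same seed, they draw the same sample, so the partitions should agree:

```r
library(splitstackshape)

data <- mtcars
data$rowId <- seq_len(nrow(data))

# approach 1: rowId bookkeeping
set.seed(40)
train1 <- stratified(data, c("cyl", "am"), 0.75)
test1  <- data[!(data$rowId %in% train1$rowId), ]

# approach 2: bothSets = TRUE returns a two-element list
set.seed(40)
sets   <- stratified(data, c("cyl", "am"), 0.75, bothSets = TRUE)
train2 <- sets[[1]]  # the sampled rows
test2  <- sets[[2]]  # the remaining rows

# same seed, same sample: the two approaches select the same rows
setequal(train1$rowId, train2$rowId)
setequal(test1$rowId,  test2$rowId)
```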
A Final Comment
It's important to note that
caret::createDataPartition()
is typically used to split the data according to values of the dependent variable, so that the training
and test
partitions have relatively equal representation across values of the dependent variable. In contrast,
stratified()
partitions according to combinations of one or more features, i.e. the independent variables. Partitioning on independent variables has the potential to introduce variability in the distribution of the dependent variable across the training and test partitions. That is, the distribution of dependent-variable values in the training partition may differ significantly from the distribution in the test partition.
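One quick way to check for that kind of drift is to compare the dependent variable's distribution across the two partitions. Here am stands in as a hypothetical dependent variable while we stratify on other features:

```r
library(splitstackshape)

data <- mtcars
data$rowId <- seq_len(nrow(data))

set.seed(40)
# stratify on independent variables only (illustrative choice)
train <- stratified(data, c("cyl", "gear"), 0.75)
test  <- data[!(data$rowId %in% train$rowId), ]

# compare the "dependent" variable's distribution across partitions;
# a large gap between these tables signals the drift described above
prop.table(table(train$am))
prop.table(table(test$am))
```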