I have a data set with 10,000 observations. My target variable has two classes, "Y" and "N". Below is the distribution of "Y" and "N":
> table(data$Target_Var)
Y N
2000 8000
Now I want to create a balanced training data set such that 50% (1000) of the "Y" rows are in training. Since the training data set is supposed to be balanced, it will also contain 1000 "N" rows, for a total of 2000 observations.
table(Training$Target_Var)
Y N
1000 1000
The test data set will be unbalanced, but with the same ratio of "Y" to "N" as in the population, i.e., the test set will have 5000 rows: 1000 "Y" and 4000 "N".
table(Test$Target_Var)
Y N
1000 4000
Now, I can write a function to do this, but is there a built-in R function that does it? I explored the sampling functions of the caret and sampling packages, but could not find one that creates a BALANCED training data set. SMOTE balances the classes, but it does so by creating new observations rather than by sampling existing rows.
I was able to do it in two steps. Suppose I have the following data set:
Step 1: Create the test data set with 50% of the 'Y' rows (i.e., 1000 rows) and 4000 rows of 'N'. This has the same distribution of 'Y' and 'N' as the population.
Step 2: Create the balanced training data set from the remaining data (1000 rows of 'Y' and 1000 rows of 'N').
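The two steps above can be sketched in base R roughly as follows. This is only a minimal sketch, not a built-in function: the data frame is simulated to match the counts in the question, and the `id` column and the names `y_idx`, `n_idx`, `test_idx`, `rest_y`, `rest_n` and `train_idx` are hypothetical helpers introduced for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Simulated stand-in for the real data: 2000 "Y" and 8000 "N" (assumption)
data <- data.frame(
  id = 1:10000,
  Target_Var = factor(rep(c("Y", "N"), times = c(2000, 8000)),
                      levels = c("Y", "N"))
)

# Row positions of each class
y_idx <- which(data$Target_Var == "Y")
n_idx <- which(data$Target_Var == "N")

# Step 1: test set with 1000 "Y" and 4000 "N" (same 1:4 ratio as the population)
test_idx <- c(sample(y_idx, 1000), sample(n_idx, 4000))
Test <- data[test_idx, ]

# Step 2: balanced training set from the rows not used in the test set
rest_y <- setdiff(y_idx, test_idx)               # the remaining 1000 "Y" rows
rest_n <- setdiff(n_idx, test_idx)               # the remaining 4000 "N" rows
train_idx <- c(rest_y, sample(rest_n, length(rest_y)))
Training <- data[train_idx, ]

table(Training$Target_Var)
table(Test$Target_Var)
```

With this approach the training and test sets do not share rows, and the 3000 remaining "N" rows are simply left unused.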