How to create a balanced training and an unbalanced test data set in R?

I have a data set with 10,000 observations. My target variable has two classes, "Y" and "N". Below is the distribution of "Y" and "N":

> table(data$Target_Var)
Y    N 
2000 8000 

Now I want to create a balanced training data set that contains 50% (1000) of the "Y" rows. Because the training data set is supposed to be balanced, it will also contain 1000 "N" rows, for a total of 2000 observations.

table(Training$Target_Var)
Y    N 
1000 1000

The test data set will be unbalanced, with the same ratio of "Y" to "N" as in the population, i.e., the test set will have 5000 observations: 1000 "Y" rows and 4000 "N" rows.

table(Test$Target_Var)
Y    N 
1000 4000 

I can write a function to do this, but is there a built-in R function for it? I explored the sampling functions of the caret and sampling packages, but could not find one that creates a BALANCED training data set. SMOTE balances the classes, but it does so by creating new (synthetic) observations.

Best answer

I was able to do it in two steps. Suppose I have the following data set, where "A" plays the role of "Y" and "B" the role of "N":

data <- data.frame(Target_Var = rep("A", 2000), Pop = rep(1:100, 20))
data <- rbind(data, data.frame(Target_Var = rep("B", 8000), Pop = rep(1:100, 80)))

> table(data$Target_Var)
A    B 
2000 8000 

Step 1: Create the test data set with 50% of the 'Y' rows (i.e., 1000 rows) and 50% of the 'N' rows (4000 rows). This has the same distribution of 'Y' and 'N' as the population.

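library(caret)  # provides createDataPartition()
# Stratified 50% sample: half of each class goes to the test set
test_index <- createDataPartition(data$Target_Var, p = 0.5, list = FALSE)
Test <- data[test_index, ]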

table(Test$Target_Var)
A    B 
1000 4000 
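A quick way to confirm that the test set mirrors the population is to compare class proportions rather than raw counts. The snippet below is only an illustration: the vectors `population` and `test_set` are stand-ins built from the counts shown above, not the actual data frames.

```r
# Stand-in vectors with the class counts shown above (assumed for illustration)
population <- rep(c("A", "B"), times = c(2000, 8000))
test_set   <- rep(c("A", "B"), times = c(1000, 4000))

# Both calls give the same class shares: A = 0.2, B = 0.8
prop.table(table(population))
prop.table(table(test_set))
```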

Step 2: Create a balanced training data set from the remaining data (1000 rows of 'Y' and 1000 rows of 'N').

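library(sampling)  # provides strata() and getdata()
Remaining <- data[-test_index, ]
s <- strata(Remaining, stratanames = "Target_Var", size = c(1000, 1000), method = "srswor")
Training <- getdata(Remaining, s)  # strata() returns only sampling info, so fetch the full rows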

table(Training$Target_Var)
A    B 
1000 1000
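If you would rather avoid the package dependencies, the same two steps can be sketched in base R. This is a minimal sketch under assumptions, not a replacement for caret or sampling: the helper name `balanced_split`, its arguments, and the seed are hypothetical, and it draws simple random samples without replacement within each class.

```r
# Hypothetical helper (not part of caret or sampling): split a data frame into
# a stratified test set and a class-balanced training set.
balanced_split <- function(df, target, test_frac = 0.5, n_per_class = 1000) {
  # Row indices grouped by class
  idx_by_class <- split(seq_len(nrow(df)), df[[target]])

  # Step 1: take test_frac of every class, preserving the population ratio
  test_idx <- unlist(lapply(idx_by_class, function(i) {
    i[sample.int(length(i), round(length(i) * test_frac))]
  }), use.names = FALSE)

  # Step 2: from the remaining rows, draw n_per_class rows of each class
  train_idx <- unlist(lapply(idx_by_class, function(i) {
    rest <- setdiff(i, test_idx)
    stopifnot(length(rest) >= n_per_class)  # fail loudly if a class is too small
    rest[sample.int(length(rest), n_per_class)]
  }), use.names = FALSE)

  list(Training = df[train_idx, ], Test = df[test_idx, ])
}

set.seed(42)  # arbitrary seed, for reproducibility only
data <- data.frame(Target_Var = rep(c("A", "B"), times = c(2000, 8000)),
                   Pop = c(rep(1:100, 20), rep(1:100, 80)))
parts <- balanced_split(data, "Target_Var")
table(parts$Test$Target_Var)      # A: 1000, B: 4000
table(parts$Training$Target_Var)  # A: 1000, B: 1000
```

Because the training rows are drawn only from rows not already in the test set, the two sets never overlap.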