Stratified sampling with fixed proportions of observation types in R

I have a sample where 50% of the observations are White and 50% African-American.

I would like to draw a random subsample in which the proportions are changed to 80% White and 20% African-American.

I have tried the stratified function, but I could not find an option that lets me set the share of each stratum.

Thank you in advance for your help!

There are 2 best solutions below

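I would split the data into White and African-American subsets and then sample from each subset separately. Because the two groups are the same size, keeping 80% of the White rows and 20% of the African-American rows gives an 80/20 split. Here df_white and df_african are the two subsets.

  ## keep 80% of the White subset
  smp_size_w <- floor(0.8 * nrow(df_white))

  ## set the seed to make your partition reproducible
  set.seed(42)
  data_ind_w <- sample(seq_len(nrow(df_white)), size = smp_size_w)

And for the African-American subset:

  ## keep 20% of the African-American subset
  smp_size_a <- floor(0.2 * nrow(df_african))

  ## the seed set above already makes this draw reproducible
  data_ind_a <- sample(seq_len(nrow(df_african)), size = smp_size_a)

The new data is then the two samples stacked row-wise:

  ## rbind() stacks the rows; c() would flatten the data frames into a list of columns
  new_data <- rbind(df_white[data_ind_w, ], df_african[data_ind_a, ])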
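
To try the steps of this answer end to end, here is a minimal base R sketch on simulated data. The data frame df and its columns race and value are assumptions made for illustration; they are not taken from the question.

```r
# Simulated data: 50 White and 50 African-American rows.
# The names df, race and value are hypothetical.
df <- data.frame(
  race  = rep(c("White", "African-American"), each = 50),
  value = 1:100
)

# Split into the two subsets
df_white   <- df[df$race == "White", ]
df_african <- df[df$race == "African-American", ]

set.seed(42)  # make both draws reproducible

# Keep 80% of the White rows and 20% of the African-American rows
data_ind_w <- sample(seq_len(nrow(df_white)),
                     size = floor(0.8 * nrow(df_white)))
data_ind_a <- sample(seq_len(nrow(df_african)),
                     size = floor(0.2 * nrow(df_african)))

# Stack the two samples row-wise
new_data <- rbind(df_white[data_ind_w, ], df_african[data_ind_a, ])

# Realized shares of each group in the subsample
prop.table(table(new_data$race))
```

This only gives an 80/20 split because the two groups start out the same size; with unequal groups the two sampling fractions would have to be adjusted.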

If your original dataset had 100 rows (50 White and 50 African-American), then a 50-row subsample with an 80/20 split would contain 40 White rows and 10 African-American rows. Knowing these values, you can pass them to stratified as a named vector: stratified(mydf, "group", size = c("White" = 40, "African-American" = 10)).
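
The size arithmetic can also be sketched as a small helper for other totals or shares. The function allocate_sizes below is hypothetical (it is not part of splitstackshape); it multiplies a chosen subsample size by the target shares and stops if a group does not have enough rows.

```r
# Hypothetical helper: turn a total subsample size and target shares into
# per-group counts, and check that each group has enough rows available.
allocate_sizes <- function(n_total, shares, available) {
  stopifnot(abs(sum(shares) - 1) < 1e-8,
            setequal(names(shares), names(available)))
  sizes <- round(n_total * shares)
  short <- sizes > available[names(sizes)]
  if (any(short)) {
    stop("not enough rows in: ", paste(names(sizes)[short], collapse = ", "))
  }
  sizes
}

available <- c("White" = 50, "African-American" = 50)   # rows per group
shares    <- c("White" = 0.8, "African-American" = 0.2) # target proportions

# A 50-row subsample with an 80/20 split
sizes <- allocate_sizes(50, shares, available)
sizes
```

The resulting named vector has the same shape as the size argument used in this answer. Because of rounding, the counts may not add up exactly to n_total for other inputs.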

Example:

mydf <- data.frame(group = rep(c("White", "African-American"), each = 50), value = 1:100)
mydf
library(splitstackshape)
set.seed(1)
x <- stratified(mydf, "group", size = c("White" = 40, "African-American" = 10))
summary(x)
 #              group        value      
 # African-American:10   Min.   : 1.00  
 # White           :40   1st Qu.:15.25  
 #                       Median :31.00  
 #                       Mean   :34.88  
 #                       3rd Qu.:47.50  
 #                       Max.   :93.00