I'm working on psychological assessment data that was collected in a non-random sample design. My dataframe is formed of "sex" (male and female) and "level of education"("elementary", "high school", "college"). However, my empirical distribution is different from the true distribution.

I know that true parameter for sex is 0.7 female and 0.3 male. I know as well that the true parameter for schooling is elementary equals 0.5, high school = 0.3, and college equal 0.2

I would like to have a code that could "cut" (adjust?) my dataframe to match these characteristics. I know my final dataframe will have fewer participants than my current one. I'm wondering if a for / loop solution is duable in this case.

Dat:

df2 = data.frame(
  sex = rep(c("m","f"),135),
  schooling = c("elementary","highschool","college")
)

prop.table(table(df2$sex))
prop.table(table(df2$schooling))
1

There are 1 best solutions below

0
zephryl On BEST ANSWER

You could weight your observations by your desired proportions, then use dplyr::slice_sample():

set.seed(13)
library(dplyr)

prop_sex <- c(f = 0.7, m = 0.3)
prop_school <- c(elementary = 0.5, highschool = 0.3, college = 0.2)

df3 <- df2 %>%
  mutate(wt = prop_sex[sex] * prop_school[schooling]) %>%
  slice_sample(prop = 0.6, weight_by = wt)

prop.table(table(df3$sex))
#        f        m 
# 0.617284 0.382716 

prop.table(table(df3$schooling))
#   college elementary highschool 
# 0.2716049  0.4259259  0.3024691 

Depending on the level of precision you want, you could iterate until the proportions fall within a certain tolerance of your targets.

df2 <- mutate(df2, wt = prop_sex[sex] * prop_school[schooling])

max_iter <- 1000
tol <- 0.05
for (i in seq(max_iter)) {
  df3 <- slice_sample(df2, prop = 0.6, weight_by = wt)
  obs_sex <- prop.table(table(df3$sex))
  obs_school <- prop.table(table(df3$schooling))
  if (
    max(abs(obs_sex - prop_sex[names(obs_sex)])) < tol &&
    max(abs(obs_school - prop_school[names(obs_school)])) < tol
  ) break
  if (i == max_iter) warning("max iterations reached")
}

prop.table(table(df3$sex))
#         f         m 
# 0.6728395 0.3271605 

prop.table(table(df3$schooling))
#   college elementary highschool 
# 0.2469136  0.4567901  0.2962963 

You may also want to look into the survey or srvyr packages for working with weighted data.