I would like to split my data proportionally. For example, I have 100 rows and I want to randomly sample one row from every two rows. Using rsample from tidymodels, I assumed I could do the following.
library(tidymodels)
dat <- as_tibble(seq(1:100))
split <- initial_split(dat, prop = 0.5, breaks = 50)
testing <- testing(split)
When I check the data, the split hasn't done what I expected. It is close, but not exact. I thought the breaks argument generates bins that are sampled from, so breaks = 50 would split the 100 rows into 50 bins with two rows per bin. I have also tried strata = value to stratify across the rows, but I cannot get this to work either.
I am using this as an example, but I am also curious how this would work when sampling one row from every four, and so on.
Have I misunderstood the breaks argument?
There is an argument that protects users from trying to create stratified splits that are too small, and that is what you are running up against; it's called pool.
Created on 2022-02-22 by the reprex package (v2.0.1)
We really don't recommend turning pool down to zero like this, but you can do it here to see how the strata and prop arguments work.
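To make the idea concrete, here is a minimal sketch of a split with pool = 0. It is not the original reprex. The bin column is a hypothetical helper added for illustration: it labels each consecutive pair of rows as its own stratum. A factor is used on the assumption that numeric strata are binned by quantiles and that rsample may reduce the number of bins for a data set this small, so explicit labels are the safer way to get exactly two rows per bin.

```r
library(rsample)
library(dplyr)

set.seed(123)

# `bin` is a hypothetical helper column: rows 1-2 get bin 0, rows 3-4 get bin 1, ...
dat <- tibble(value = 1:100) %>%
  mutate(bin = factor((value - 1) %/% 2))

# pool = 0 stops rsample from pooling the 50 tiny strata together.
# Expect a warning that stratifying such small groups is risky.
split <- initial_split(dat, prop = 0.5, strata = bin, pool = 0)

# Check whether every bin contributed exactly one row to the testing set
all(count(testing(split), bin)$n == 1)
```

For one row in every four, the same sketch should carry over by building the bins with (value - 1) %/% 4 and setting prop = 0.25. In that case the single sampled row per bin ends up in training(split), and the other three rows go to testing(split).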