How to proportionally split data using initial_split in R

I would like to proportionally split the data I have. For example, I have 100 rows and I want to randomly sample 1 row out of every two consecutive rows. Using rsample from tidymodels, I assumed I would do the following.

dat <- as_tibble(seq(1:100))

split <- initial_split(dat, prop = 0.5, breaks = 50)

testing <- testing(split)

When checking the data, the split hasn't done what I thought it would. It seems close but not exact. I thought the breaks argument generates bins which are then sampled from, so breaks = 50 would split the 100 rows into 50 bins with two rows per bin. I have also tried strata = value to stratify across the rows, but I cannot get this to work either.

I am using this as an example, but I am also curious how this would work when sampling 1 row out of every four, etc.

Have I misunderstood the breaks argument?

Best answer

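The breaks argument only applies when strata is set to a numeric column, so on its own it has no effect. When you do stratify, you are running up against an argument that protects users from creating stratified splits on strata that are too small; it's called pool: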

library(rsample)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dat <- tibble(value = seq(1:100), strat = as.factor(rep(1:50, each = 2))) 
dat
#> # A tibble: 100 × 2
#>    value strat
#>    <int> <fct>
#>  1     1 1    
#>  2     2 1    
#>  3     3 2    
#>  4     4 2    
#>  5     5 3    
#>  6     6 3    
#>  7     7 4    
#>  8     8 4    
#>  9     9 5    
#> 10    10 5    
#> # … with 90 more rows

split <- initial_split(dat, prop = 0.5, strata = strat, pool = 0.0)
#> Warning: Stratifying groups that make up 0% of the data may be statistically risky.
#> • Consider increasing `pool` to at least 0.1
split
#> <Analysis/Assess/Total>
#> <50/50/100>

training(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#>    value strat
#>    <int> <fct>
#>  1     1 1    
#>  2     4 2    
#>  3     5 3    
#>  4     8 4    
#>  5    10 5    
#>  6    12 6    
#>  7    13 7    
#>  8    16 8    
#>  9    17 9    
#> 10    20 10   
#> # … with 40 more rows
testing(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#>    value strat
#>    <int> <fct>
#>  1     2 1    
#>  2     3 2    
#>  3     6 3    
#>  4     7 4    
#>  5     9 5    
#>  6    11 6    
#>  7    14 7    
#>  8    15 8    
#>  9    18 9    
#> 10    19 10   
#> # … with 40 more rows

Created on 2022-02-22 by the reprex package (v2.0.1)

We really don't recommend turning pool down to zero like this, but you can do it here to see how the strata and prop arguments work.
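
For the "1 row out of every four" case from the question, the same idea can be sketched as below. This is a minimal illustration rather than a recommended workflow: the names dat4, split4 and sampled are made up for the example, the strata column is built by hand so that each run of four consecutive rows forms one group, and it assumes rsample draws floor(n * prop) rows from each stratum, so prop = 0.25 puts one row per group in the training set. The same caveat about pool = 0 applies, and the warning shown above is expected here too.

```r
library(rsample)

# Hypothetical example data: one stratum per run of four consecutive rows,
# giving 25 strata of four rows each.
dat4 <- data.frame(
  value = 1:100,
  strat = factor(rep(1:25, each = 4))
)

set.seed(123)  # arbitrary seed, only to make the draw reproducible

# prop = 0.25 asks for a quarter of each four-row stratum, i.e. one row.
# pool = 0 stops rsample from pooling the tiny strata together
# (this triggers the same "statistically risky" warning as above).
split4 <- initial_split(dat4, prop = 0.25, strata = strat, pool = 0)

# The training set holds the sampled rows; testing(split4) holds the rest.
sampled <- training(split4)

# Number of sampled rows per stratum.
table(sampled$strat)
```

Here training(split4) plays the role of the "1 in 4" sample and testing(split4) holds the remaining three rows from each group.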