Splitting the group which contains the bottom 25th percentile of volume into 2 rows

55 Views Asked by At
library(tidytable)

my_dt <- tidytable(
     price = c(4.5, 4.6, 4.7, 4.8, 5.1),
     volume = c(0.4,1.2,0.3,8.7,6.5)
)

total = my_dt %>% summarize(total = sum(volume)) %>% pull(total)

trim = total*0.25

What I would like to do is find the price value where 25% of the total volume occurs. And then splitting that row into a 'trimmed' portion and a non trimmed portion. So in this example I would end up with something that looks like this:

post_trim <- tidytable(
         price = c(4.5, 4.6, 4.7, 4.8, 4.8, 5.1),
         volume = c(0.4, 1.2, 0.3, 2.375, 6.325, 6.5),
         trim = c(True, True, True, True, False, False)
    )

The way I had initially thought of how to do this would be to iterate through the rows of the dataframe, where the rows would be returned as lists and tracking the remainder after subtracting the quantity until I find a row where the remainder is less then the volume:

remainder = trim
pivot = 0

for (row in my_dt){
    if (remainder <= row[2]){
        pivot = row[1]
        break
    } else { 
        remainder = remainder - row[2]
    }

And then from here I just delete the row entry for where price == pivot and re-enter two rows one where volume is the remainder and the other where the volume is original volume - remainder.

I know loops over rows are generally not a good idea though so I'm wondering how I would accomplish this without doing that?

2

There are 2 best solutions below

0
zephryl On BEST ANSWER

This kind of problem is a bit easier to work with in base R:

trim <- sum(my_dt$volume) * 0.25
csum <- cumsum(my_dt$volume)
# identify 25th %ile row
trow <- min(which(csum >= trim))

# duplicate 25th %ile row
my_dt <- my_dt[c(1:trow, trow:nrow(my_dt)), ]
# trim value, with safeguard if first row
my_dt$volume[[trow]] <- if (trow == 1) trim else trim - csum[[trow - 1]]
# remainder in next row
my_dt$volume[[trow + 1]] <- csum[[trow]] - trim
# add logical flag
my_dt$trim <- seq(nrow(my_dt)) <= trow

Result:

# A tibble: 6 × 3
  price volume trim 
  <dbl>  <dbl> <lgl>
1   4.5   0.4  TRUE 
2   4.6   1.2  TRUE 
3   4.7   0.3  TRUE 
4   4.8   2.37 TRUE 
5   4.8   6.32 FALSE
6   5.1   6.5  FALSE
0
Adriano Mello On

tidyverse solution is really easy (or am I missing something)?

trim <- 0.25 * sum(my_dt$volume, na.rm = TRUE)
post_trim <- mutate(my_dt, trim = if_else(cumsum(volume) < trim, TRUE, FALSE))

> post_trim
# A tibble: 5 × 3
  price volume trim 
  <dbl>  <dbl> <lgl>
1   4.5    0.4 TRUE 
2   4.6    1.2 TRUE 
3   4.7    0.3 TRUE 
4   4.8    8.7 FALSE
5   5.1    6.5 FALSE

PS. OP's my_dt and my_trim have different price and volume data at the end.

# my_dt -------------------
my_dt <- tibble::tribble(
  ~price, ~volume,
  4.5,     0.4,
  4.6,     1.2,
  4.7,     0.3,
  4.8,     8.7,
  5.1,     6.5)