Aged inventory schedule using dplyr, cumsum and lag

109 Views Asked by At

This is my first stackoverflow post after years of passive browsing. I am stuck on this problem and it's driving me nuts! Thanks so much for helping.

I have a data frame of inventory supply and demand bucketed by age. I have collected this data across time for many products. Demand for inventory of a certain age can be met with supply of the same age or younger. I am trying to calculate how much demand can be met with supply within each age bucket, filled youngest to oldest.

The data frame will be large (10^7 rows) so I am trying to do this with dplyr, mutate, lag, and cumsum versus a loop, which I suspect will be slow.

Here is a sample group within my data set (product and date groupings omitted):

library(dplyr)

Inventory <- data.frame(
  Age = c(90, 120, 270, 365, Inf),
  Demand = c(0, 5000, 25, 5000, 10),
  Supply = c(4000, 50, 4000, 300, 0))

View(Inventory)

The result I am expecting is:

Result <- Inventory
Result$Start = c(0, 4000, 0, 3975, 0)
Result$In = c(4000, 50, 4000, 300, 0)
Result$Out = c(0, 4050, 25, 4275, 0)
Result$End = c(4000, 0, 3975, 0, 0)
Result$Short = c(0, 950, 0, 725, 10)

View(Result)

I applied standard inventory calculations above:

  • Start = End
  • In = Supply
  • Out = min(Demand, Start + In)
  • End = Start + In - Out
  • Short = Demand - Out

I am having no luck using dplyr, but I think there is a solution using clever combinations of max, min, lag, and cumsum.

3

There are 3 best solutions below

3
On BEST ANSWER

If speed becomes an issue over a large number of rows then potentially the fastest way to deal with iterating calculations is via Rcpp.

You essentially need a cummulative-sum-but-floored-at-zero function, which adds the Supply - Demand outcome of each day to the last total and zeros it if it's negative. Here's a cumnominus trial function which gives the right table and can be used in dplyr:

library(dplyr)

cumnominus <- Rcpp::cppFunction("NumericVector cumnominus(NumericVector x) {
  int n = x.size();
  
  NumericVector sumout(n);
  
  sumout[0] = (x[0] < 0) ? 0 : x[0];
  
  for(int i = 1; i < n; i++) {
      
      sumout[i] = (x[i] < 0) ? 0 : x[i] + sumout[i - 1];
    
    }
  
  return sumout;
}")

Inventory |> 
  mutate(In = Supply,
         End = cumnominus(Supply - Demand),
         Start = lag(End, default = 0),
         Short = pmax(0, Demand - (Start + Supply)),
         Out = pmin(Demand, Start + In)) |> 
  select(Age, Demand, Supply, Start, In, Out, End, Short)

#>   Age Demand Supply Start   In  Out  End Short
#> 1  90      0   4000     0 4000    0 4000     0
#> 2 120   5000     50  4000   50 4050    0   950
#> 3 270     25   4000     0 4000   25 3975     0
#> 4 365   5000    300  3975  300 4275    0   725
#> 5 Inf     10      0     0    0    0    0    10

Result
#>   Age Demand Supply Start   In  Out  End Short
#> 1  90      0   4000     0 4000    0 4000     0
#> 2 120   5000     50  4000   50 4050    0   950
#> 3 270     25   4000     0 4000   25 3975     0
#> 4 365   5000    300  3975  300 4275    0   725
#> 5 Inf     10      0     0    0    0    0    10

As a bit of a test against an R-only loop on a dataframe of 5m rows it takes around 0.05s compared to R-loop of 8.5s:

cumnominus_r <- function(x) {
  out_sum <- integer(length(x))
  out_sum[1] <- max(0, x[1])
  for (i in 2:length(x)) {
    out_sum[i] <- ifelse(x[i] < 0, 0, out_sum[i - 1] + x[i])
  }
  
  out_sum
}

big_df <- tibble(
  Demand = sample(seq(1000, 6000, 500), 5000000, replace = TRUE),
  Supply = sample(seq(1000, 6000, 500), 5000000, replace = TRUE)
) 


bench::mark(
  Rcpp_fun = big_df |> 
    mutate(End = cumnominus(Supply - Demand)),
  R_only_fun = big_df |> 
    mutate(End = cumnominus_r(Supply - Demand))
)


#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 Rcpp_fun     43.15ms  52.95ms    16.1      77.7MB     8.94
#> 2 R_only_fun     8.59s    8.59s     0.116    95.4MB    20.8 
0
On

The issue with using dplyr in this context is as, Sean has stated, to create all of the Start column we'd need the End column, and vice versa. You could potentially do this using accumulate2(), but I don't imagine it would be any faster than a for loop.

Inventory$In <- Inventory$Supply

for (r in seq_len(nrow(Inventory))) {
  Inventory$Start[r] <- if (r == 1) 0 else Inventory[r - 1, "End"]
  Inventory$Out[r] <- min(Inventory$Demand[r], Inventory$Start[r] + Inventory$In[r])
  Inventory$End[r] <- Inventory$Start[r] + Inventory$In[r] - Inventory$Out[r]
  Inventory$Short[r] <- Inventory$Demand[r] - Inventory$Out[r]
}
0
On

Andy Baxter solved this with an Rcpp function that works in mutate. Slight edit to his function to get it to work in all cases:

cumnominus <- Rcpp::cppFunction("NumericVector cumnominus(NumericVector x) {
  
  int n = x.size();
  
  NumericVector sumout(n);
  
  for(int i = 0; i < n; i++) {
      
      sumout[i] =  (x[i] + sumout[i - 1]) > 0 ? x[i] + sumout[i - 1] : 0;
    
    }
  
  return sumout;
}")

And then the dplyr::mutate() steps per Andy's answer.