Rolling cumulative sum with dplyr and lag not restricted to lag range

1.1k Views Asked by At

I have a large data frame with data of the following structure:

   name       date val1 val2
1     A 2017-01-01    0    2
2     A 2017-01-02    1    1
3     A 2017-01-03    1    0
4     A 2017-01-04    0    3
5     A 2017-01-05    1    1
6     A 2017-01-06    0    0
7     B 2017-01-01    0    0
8     B 2017-01-02    0    3
9     B 2017-01-03    1    2
10    B 2017-01-04    1    1
11    B 2017-01-05    0    0
12    B 2017-01-06    1    0
13    C 2017-01-01    0    2
14    C 2017-01-02    0    1
15    C 2017-01-03    1    2
16    C 2017-01-04    0    0
17    C 2017-01-05    0    0
18    C 2017-01-06    1    3

For any date within each group of name, I would now like to calculate the cumsum() of val1 for the last 2 occurrences and of val2 for the last 3 occurrences.

I am trying this with the following code (based on this answer: https://stackoverflow.com/a/27649238/1162278; incl. creation of sample data set):

library(dplyr)
library(data.table)

dates <- seq(as.Date('2017-01-01'), as.Date('2017-01-06'), by = '1 day')

d <- CJ(
  name = c('A', 'B', 'C'),
  date = dates
) %>% 
  left_join(
    data.frame(
      name = c(rep('A',6), rep('B',6), rep('C',6)),
      date = c(rep(dates, 3)),
      val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1),
      val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3)
    )
  )


d %>% 
  group_by(name) %>% 
  mutate(
    val1_l2 = dplyr::lag(cumsum(val1), k=2),
    val2_l3 = dplyr::lag(cumsum(val2), k=3)
  )

This yields:

    name       date  val1  val2 val1_l2 val2_l3
   <chr>     <date> <dbl> <dbl>   <dbl>   <dbl>
 1     A 2017-01-01     0     2      NA      NA
 2     A 2017-01-02     1     1       0       2
 3     A 2017-01-03     1     0       1       3
 4     A 2017-01-04     0     3       2       3
 5     A 2017-01-05     1     1       2       6
 6     A 2017-01-06     0     0       3       7
 7     B 2017-01-01     0     0      NA      NA
 8     B 2017-01-02     0     3       0       0
 9     B 2017-01-03     1     2       0       3
10     B 2017-01-04     1     1       1       5
11     B 2017-01-05     0     0       2       6
12     B 2017-01-06     1     0       2       6
13     C 2017-01-01     0     2      NA      NA
14     C 2017-01-02     0     1       0       2
15     C 2017-01-03     1     2       0       3
16     C 2017-01-04     0     0       1       5
17     C 2017-01-05     0     0       1       5
18     C 2017-01-06     1     3       1       5

However, it seems like the cumsum() is always calculated for all previous records within the name group, and not for the rolling range of k=2 and k=3 for val1 and val2, respectively.

Example:

Row   Variable   Calculated   Expected
  5   val1_l2        2           1
  5   val2_l3        6           4

What am I doing wrong?

1

There are 1 best solutions below

0
On

We probably do not need to use lag here. We can replace all values to be 0 except the last two or three rows and then use cumsum. Here is an example. Notice that d2 is the final output. n():(n() - 1) or n():(n() - 2) indicate the last two or three rows. ifelse(row_number() %in% ...) check if the row numbers match the last two or three rows.

d2 <- d %>%
  group_by(name) %>% 
  mutate(val1_l2 = ifelse(row_number() %in% n():(n() - 1), val1, 0),
         val2_l3 = ifelse(row_number() %in% n():(n() - 2), val2, 0)) %>%
  mutate(val1_l2 = cumsum(val1_l2), 
         val2_l3 = cumsum(val2_l3))

d2
# A tibble: 18 x 6
# Groups:   name [3]
    name       date  val1  val2 val1_l2 val2_l3
   <chr>     <date> <dbl> <dbl>   <dbl>   <dbl>
 1     A 2017-01-01     0     2       0       0
 2     A 2017-01-02     1     1       0       0
 3     A 2017-01-03     1     0       0       0
 4     A 2017-01-04     0     3       0       3
 5     A 2017-01-05     1     1       1       4
 6     A 2017-01-06     0     0       1       4
 7     B 2017-01-01     0     0       0       0
 8     B 2017-01-02     0     3       0       0
 9     B 2017-01-03     1     2       0       0
10     B 2017-01-04     1     1       0       1
11     B 2017-01-05     0     0       0       1
12     B 2017-01-06     1     0       1       1
13     C 2017-01-01     0     2       0       0
14     C 2017-01-02     0     1       0       0
15     C 2017-01-03     1     2       0       0
16     C 2017-01-04     0     0       0       0
17     C 2017-01-05     0     0       0       0
18     C 2017-01-06     1     3       1       3

Data

library(dplyr)
library(data.table)

dates <- seq(as.Date('2017-01-01'), as.Date('2017-01-06'), by = '1 day')

d <- CJ(
  name = c('A', 'B', 'C'),
  date = dates
) %>% 
  left_join(
    data.frame(
      name = c(rep('A',6), rep('B',6), rep('C',6)),
      date = c(rep(dates, 3)),
      val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1),
      val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3)
    )
  )