I have a large data frame with data of the following structure:
name date val1 val2
1 A 2017-01-01 0 2
2 A 2017-01-02 1 1
3 A 2017-01-03 1 0
4 A 2017-01-04 0 3
5 A 2017-01-05 1 1
6 A 2017-01-06 0 0
7 B 2017-01-01 0 0
8 B 2017-01-02 0 3
9 B 2017-01-03 1 2
10 B 2017-01-04 1 1
11 B 2017-01-05 0 0
12 B 2017-01-06 1 0
13 C 2017-01-01 0 2
14 C 2017-01-02 0 1
15 C 2017-01-03 1 2
16 C 2017-01-04 0 0
17 C 2017-01-05 0 0
18 C 2017-01-06 1 3
For any date
within each group of name
, I would now like to calculate the cumsum()
of val1
for the last 2 occurrences and of val2
for the last 3 occurrences.
I am trying this with the following code (based on this answer: https://stackoverflow.com/a/27649238/1162278; incl. creation of sample data set):
library(dplyr)
library(data.table)
dates <- seq(as.Date('2017-01-01'), as.Date('2017-01-06'), by = '1 day')
d <- CJ(
name = c('A', 'B', 'C'),
date = dates
) %>%
left_join(
data.frame(
name = c(rep('A',6), rep('B',6), rep('C',6)),
date = c(rep(dates, 3)),
val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1),
val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3)
)
)
d %>%
group_by(name) %>%
mutate(
val1_l2 = dplyr::lag(cumsum(val1), k=2),
val2_l3 = dplyr::lag(cumsum(val2), k=3)
)
This yields:
name date val1 val2 val1_l2 val2_l3
<chr> <date> <dbl> <dbl> <dbl> <dbl>
1 A 2017-01-01 0 2 NA NA
2 A 2017-01-02 1 1 0 2
3 A 2017-01-03 1 0 1 3
4 A 2017-01-04 0 3 2 3
5 A 2017-01-05 1 1 2 6
6 A 2017-01-06 0 0 3 7
7 B 2017-01-01 0 0 NA NA
8 B 2017-01-02 0 3 0 0
9 B 2017-01-03 1 2 0 3
10 B 2017-01-04 1 1 1 5
11 B 2017-01-05 0 0 2 6
12 B 2017-01-06 1 0 2 6
13 C 2017-01-01 0 2 NA NA
14 C 2017-01-02 0 1 0 2
15 C 2017-01-03 1 2 0 3
16 C 2017-01-04 0 0 1 5
17 C 2017-01-05 0 0 1 5
18 C 2017-01-06 1 3 1 5
However, it seems like the cumsum()
is always calculated for all previous records within the name
group, and not for the rolling range of k=2
and k=3
for val1
and val2
, respectively.
Example:
Row Variable Calculated Expected
5 val1_l2 2 1
5 val2_l3 6 4
What am I doing wrong?
We probably do not need to use
lag
here. We can replace all values to be 0 except the last two or three rows and then usecumsum
. Here is an example. Notice thatd2
is the final output.n():(n() - 1)
orn():(n() - 2)
indicate the last two or three rows.ifelse(row_number() %in% ...)
check if the row numbers match the last two or three rows.Data