rolling sum with conditions

I have a dataset like:

  ID    DATETIME    CODE  Value
999 1/2/2024 16:22  TX     100
123 1/2/2024 16:47  IP     100
666 1/2/2024 17:13  IP      85
666 1/2/2024 17:38  IP     100
123 1/2/2024 18:03  TX      90
666 1/2/2024 18:28  TX      85
666 1/2/2024 18:54  IP     100
123 1/2/2024 19:19  CA     100
666 1/2/2024 19:44  OX      95
999 1/2/2024 20:09  18      75
123 1/2/2024 20:35  12     100
654 1/2/2024 21:00  IP      85

Here is the reprex of the above:

structure(list(ID = c("999", "123", "666", "666", "123", "666", 
"666", "123", "666", "999", "123", "654"), DATETIME = structure(c(1706804520, 
1706806020, 1706807580, 1706809080, 1706810580, 1706812080, 1706813640, 
1706815140, 1706816640, 1706818140, 1706819700, 1706821200), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), CODE = c("TX", "IP", "IP", "IP", "TX", 
"TX", "IP", "CA", "OX", "18", "12", "IP"), Value = c(100, 100, 
85, 100, 90, 85, 100, 100, 95, 75, 100, 85)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -12L))

I'd like to add a column with the cumulative sum of Value, by ID, over the last 2 hours, counting only rows that meet a condition (CODE == "IP"). Like this:

ID  DATETIME       CODE Value   cum_IP
999 1/2/2024 16:22  TX  100      0
123 1/2/2024 16:47  IP  100    100
666 1/2/2024 17:13  IP   85     85
666 1/2/2024 17:38  IP  100    185
123 1/2/2024 18:03  TX   90      0
666 1/2/2024 18:28  TX   85      0
666 1/2/2024 18:54  IP  100    285
123 1/2/2024 19:19  CA  100      0
666 1/2/2024 19:44  OX   95      0
999 1/2/2024 20:09  18   75      0
123 1/2/2024 20:35  12  100      0
654 1/2/2024 21:00  IP   85     85

I hope I haven't made any mistakes computing the column by hand, but the idea should be clear: a function that computes an aggregate (a sum or a simple count) over rows, given a grouping variable (ID), a predicate (CODE == "IP" in this case, but it could be, e.g., is.number(CODE)) and a window (the 2 hours up to the current row).

r2evans On BEST ANSWER

Try the below. I'm choosing to group by CODE as well, to reduce the conditioning inside the code. We use sapply as a simple rolling window over the previous 2 hours (7200 seconds). I'm using a single if conditional per group so that we don't sapply(.) through the data unnecessarily.

dplyr

library(dplyr)
quux %>%
  mutate(
    cum_IP = if (first(CODE) == "IP") {
        sapply(DATETIME, function(tm) sum(Value[between(DATETIME, tm-7200, tm)]))
      } else 0,
    .by = c(ID, CODE))
# # A tibble: 12 × 5
#    ID    DATETIME                CODE  Value cum_IP
#    <chr> <dttm>                  <chr> <dbl>  <dbl>
#  1 999   2024-02-01 16:22:00.000 TX      100      0
#  2 123   2024-02-01 16:47:00.000 IP      100    100
#  3 666   2024-02-01 17:13:00.000 IP       85     85
#  4 666   2024-02-01 17:38:00.000 IP      100    185
#  5 123   2024-02-01 18:03:00.000 TX       90      0
#  6 666   2024-02-01 18:28:00.000 TX       85      0
#  7 666   2024-02-01 18:54:00.000 IP      100    285
#  8 123   2024-02-01 19:19:00.000 CA      100      0
#  9 666   2024-02-01 19:44:00.000 OX       95      0
# 10 999   2024-02-01 20:09:00.000 18       75      0
# 11 123   2024-02-01 20:35:00.000 12      100      0
# 12 654   2024-02-01 21:00:00.000 IP       85     85

.by= requires dplyr_1.1.0 or newer; replace it with group_by(..) if you are on an older version.
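
For older dplyr versions, the same idea can be sketched with group_by(). This is a minimal sketch rather than the tested code above: it rebuilds the question's data as quux, writes the window test with plain comparisons instead of between(), and stores the result in res (a name chosen here only for illustration).

```r
library(dplyr)

# The question's sample data, rebuilt compactly (timestamps are seconds since the epoch, UTC)
quux <- tibble(
  ID = c("999", "123", "666", "666", "123", "666",
         "666", "123", "666", "999", "123", "654"),
  DATETIME = as.POSIXct(
    c(1706804520, 1706806020, 1706807580, 1706809080, 1706810580, 1706812080,
      1706813640, 1706815140, 1706816640, 1706818140, 1706819700, 1706821200),
    origin = "1970-01-01", tz = "UTC"),
  CODE = c("TX", "IP", "IP", "IP", "TX", "TX", "IP", "CA", "OX", "18", "12", "IP"),
  Value = c(100, 100, 85, 100, 90, 85, 100, 100, 95, 75, 100, 85)
)

res <- quux %>%
  group_by(ID, CODE) %>%              # replaces .by = c(ID, CODE)
  mutate(
    cum_IP = if (first(CODE) == "IP") {
      # for each row, sum the group's values in the 2 hours (7200 s) up to that row
      sapply(DATETIME, function(tm) sum(Value[DATETIME >= tm - 7200 & DATETIME <= tm]))
    } else 0
  ) %>%
  ungroup()                           # drop the grouping so later steps are ungrouped
```

Unlike .by=, group_by() leaves the result grouped, which is why the sketch ends with ungroup().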

data.table

(Since you tagged data.table.)

library(data.table)
as.data.table(quux) |>
  _[, cum_IP := if (CODE == "IP") sapply(DATETIME, function(tm) sum(Value[between(DATETIME, tm-7200, tm)])) else 0,
    by = .(ID, CODE)]

We can do CODE instead of first(CODE) because in data.table when grouping by a variable, the inner expression only sees length-1 for the grouping variables.

The use of |> _[..] requires R-4.3 or newer. Other ways of using data.table in a pipeline exist.
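
One option that does not depend on the pipe placeholder is to drop the pipe and index the data.table directly. A minimal sketch, assuming the question's data is rebuilt as below; DT is a name chosen here for illustration, and the window test uses plain comparisons rather than between():

```r
library(data.table)

# The question's sample data, rebuilt as a data.table (timestamps in UTC)
DT <- data.table(
  ID = c("999", "123", "666", "666", "123", "666",
         "666", "123", "666", "999", "123", "654"),
  DATETIME = as.POSIXct(
    c(1706804520, 1706806020, 1706807580, 1706809080, 1706810580, 1706812080,
      1706813640, 1706815140, 1706816640, 1706818140, 1706819700, 1706821200),
    origin = "1970-01-01", tz = "UTC"),
  CODE = c("TX", "IP", "IP", "IP", "TX", "TX", "IP", "CA", "OX", "18", "12", "IP"),
  Value = c(100, 100, 85, 100, 90, 85, 100, 100, 95, 75, 100, 85)
)

# := adds cum_IP by reference; within each (ID, CODE) group, CODE has length 1
DT[, cum_IP := if (CODE == "IP") {
       # sum the group's values in the 2 hours (7200 s) up to each row
       sapply(DATETIME, function(tm) sum(Value[DATETIME >= tm - 7200 & DATETIME <= tm]))
     } else 0,
   by = .(ID, CODE)]
```

Because := modifies DT in place, no reassignment is needed; print DT afterwards to see the new column.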

G. Grothendieck On

Here is an SQL left self join, with the question's data in a data frame named dat:

library(sqldf)

sqldf("select a.*, (a.CODE = 'IP') * sum(b.Value * (b.CODE == 'IP')) cum_IP 
  from dat a
  left join dat b on a.ID = b.ID and 
                     b.DATETIME between a.DATETIME - 2 * 60 * 60 and a.DATETIME
  group by a.rowid")

giving

    ID            DATETIME CODE Value cum_IP
1  999 2024-02-01 11:22:00   TX   100      0
2  123 2024-02-01 11:47:00   IP   100    100
3  666 2024-02-01 12:13:00   IP    85     85
4  666 2024-02-01 12:38:00   IP   100    185
5  123 2024-02-01 13:03:00   TX    90      0
6  666 2024-02-01 13:28:00   TX    85      0
7  666 2024-02-01 13:54:00   IP   100    285
8  123 2024-02-01 14:19:00   CA   100      0
9  666 2024-02-01 14:44:00   OX    95      0
10 999 2024-02-01 15:09:00   18    75      0
11 123 2024-02-01 15:35:00   12   100      0
12 654 2024-02-01 16:00:00   IP    85     85
Onyambu On
library(tidyverse)
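df %>%
  mutate(
    # flag the rows that satisfy the predicate
    cum_IP = CODE == 'IP',
    # keep the flag only where the gap to the previous row of the same ID is at most 2 hours
    cum_IP = cum_IP & c(0, as.numeric(diff(DATETIME), units = 'hours')) <= 2,
    # running total of the flagged values, zeroed on unflagged rows
    cum_IP = cumsum(Value * cum_IP) * cum_IP,
    .by = ID)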

# A tibble: 12 × 5
   ID    DATETIME            CODE  Value cum_IP
   <chr> <dttm>              <chr> <dbl>  <dbl>
 1 999   2024-02-01 16:22:00 TX      100      0
 2 123   2024-02-01 16:47:00 IP      100    100
 3 666   2024-02-01 17:13:00 IP       85     85
 4 666   2024-02-01 17:38:00 IP      100    185
 5 123   2024-02-01 18:03:00 TX       90      0
 6 666   2024-02-01 18:28:00 TX       85      0
 7 666   2024-02-01 18:54:00 IP      100    285
 8 123   2024-02-01 19:19:00 CA      100      0
 9 666   2024-02-01 19:44:00 OX       95      0
10 999   2024-02-01 20:09:00 18       75      0
11 123   2024-02-01 20:35:00 12      100      0
12 654   2024-02-01 21:00:00 IP       85     85