We have written a package to analyse a large number of events in relation to time windows.
To do the analysis we need to establish a number of attributes of the windows and cross-references between them.
This has been done using data.table in its native syntax. Examples of some of the steps is included in the reprex below.
We are now looking to re-frame this package using dplyr/dtplyr for readability and sharing with other parties.
While I can write the 'queries' in dplyr syntax, I am not seeing a tidyverse way to apply updates to the underlying tables - adding columns, updating rows etc. without repeatedly creating and replacing copies. When the data is large, the 'update in place' features of data.table are very desirable. Is there a way to take advantage of this in the dplyr syntax? (I have hit barriers with immutable = FALSE and attempts to use rows_update())
library(data.table)
set.seed <- 123
#Create a table of events with timestamp and an event type (501 events randomly generated over the previous 30 days)
DT1 <- data.table(timeStamp = as.POSIXct('2021-03-25') - as.integer(runif(501)*60*1440*30),
eventType=c('A', 'B', 'C'))
setkey(DT1, timeStamp)
print(DT1)
#> timeStamp eventType
#> 1: 2021-02-23 00:42:37 A
#> 2: 2021-02-23 04:21:43 A
#> 3: 2021-02-23 05:23:51 C
#> 4: 2021-02-23 06:45:36 C
#> 5: 2021-02-23 08:34:32 B
#> ---
#> 497: 2021-03-24 11:32:09 A
#> 498: 2021-03-24 13:49:53 B
#> 499: 2021-03-24 14:26:55 C
#> 500: 2021-03-24 18:11:33 C
#> 501: 2021-03-24 20:13:51 A
#Create a table of time windows. One for each date represented with an early and late time for each
#Assign this a class (in this example the value of the most common eventType)
DT2 <- DT1[,keyby=.(date=lubridate::date(timeStamp)),
.(earlyTime = min(timeStamp - 1),
lateTime = max(timeStamp + 1),
as = sum(eventType == 'A'),
bs = sum(eventType == 'B'),
cs = sum(eventType == 'C'))][
,.(date,
earlyTime,
lateTime,
class=ifelse(as >= bs & as >= cs, 'A', ifelse(bs >= cs, 'B', 'C')))]
print(head(DT2))
#> date earlyTime lateTime class
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13 B
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14 B
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44 A
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51 B
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46 A
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44 A
#Give each row in DT2 a row number (so that we can readily cross-reference between rows)
DT2[order(lateTime), rn := .I]
#For each row, get the row number of the previous instance of this class
DT2[order(class, rn), prevOfClass := shift(rn, 1), by=.(class)]
print(head(DT2))
#> date earlyTime lateTime class rn prevOfClass
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13 B 1 NA
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14 B 2 1
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44 A 3 NA
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51 B 4 2
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46 A 5 3
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44 A 6 5
#For each row that is not a 'C' find the previous and next instances of a C type row
#Note that when we assigned rn we ensured that the rows were in ascending time order
#so rn can be used as a proxy for sorting by time
DT2[class=='C'][DT2[class != 'C'],
on=.(rn > rn),
by=.EACHI,
.(rn=i.rn, nextC = min(x.rn), prevC = min(x.prevOfClass))]
#> rn rn nextC prevC
#> 1: 1 1 8 NA
#> 2: 2 2 8 NA
#> 3: 3 3 8 NA
#> 4: 4 4 8 NA
#> 5: 5 5 8 NA
#> 6: 6 6 8 NA
#> 7: 7 7 8 NA
#> 8: 9 9 13 8
#> 9: 10 10 13 8
#> 10: 11 11 13 8
#> 11: 12 12 13 8
#> 12: 14 14 16 13
#> 13: 15 15 16 13
#> 14: 17 17 26 16
#> 15: 18 18 26 16
#> 16: 19 19 26 16
#> 17: 20 20 26 16
#> 18: 21 21 26 16
#> 19: 22 22 26 16
#> 20: 23 23 26 16
#> 21: 24 24 26 16
#> 22: 25 25 26 16
#> 23: 28 28 30 27
#> 24: 29 29 30 27
#> rn rn nextC prevC
#But I want to add this information as additional columns to the base table
DT2[DT2[class=='C'][DT2[class != 'C'],
on=.(rn > rn),
by=.EACHI,
.(rn=i.rn, nextC = min(x.rn), prevC = min(x.prevOfClass))],
on = .(rn),
':='(nextC=i.nextC, prevC = i.prevC)
]
print(DT2[,.(rn, date, class, prevOfClass, nextC, prevC)])
#> rn date class prevOfClass nextC prevC
#> 1: 1 2021-02-23 B NA 8 NA
#> 2: 2 2021-02-24 B 1 8 NA
#> 3: 3 2021-02-25 A NA 8 NA
#> 4: 4 2021-02-26 B 2 8 NA
#> 5: 5 2021-02-27 A 3 8 NA
#> 6: 6 2021-02-28 A 5 8 NA
#> 7: 7 2021-03-01 A 6 8 NA
#> 8: 8 2021-03-02 C NA NA NA
#> 9: 9 2021-03-03 A 7 13 8
#> 10: 10 2021-03-04 A 9 13 8
#> 11: 11 2021-03-05 B 4 13 8
#> 12: 12 2021-03-06 A 10 13 8
#> 13: 13 2021-03-07 C 8 NA NA
#> 14: 14 2021-03-08 A 12 16 13
#> 15: 15 2021-03-09 B 11 16 13
#> 16: 16 2021-03-10 C 13 NA NA
#> 17: 17 2021-03-11 A 14 26 16
#> 18: 18 2021-03-12 B 15 26 16
#> 19: 19 2021-03-13 A 17 26 16
#> 20: 20 2021-03-14 B 18 26 16
#> 21: 21 2021-03-15 A 19 26 16
#> 22: 22 2021-03-16 A 21 26 16
#> 23: 23 2021-03-17 A 22 26 16
#> 24: 24 2021-03-18 A 23 26 16
#> 25: 25 2021-03-19 B 20 26 16
#> 26: 26 2021-03-20 C 16 NA NA
#> 27: 27 2021-03-21 C 26 NA NA
#> 28: 28 2021-03-22 B 25 30 27
#> 29: 29 2021-03-23 A 24 30 27
#> 30: 30 2021-03-24 C 27 NA NA
#> rn date class prevOfClass nextC prevC
#What would be the best approach to this using dplyr / dtplyr syntax?
#In practice there are many hundreds of thousands of rows in the tables
#and...
#There are many more update and enrichments that need to be applied
#some of which add new columns, others will update just a few rows
#in a column
#So 'mutate in place/by reference' is highly desirable
Created on 2021-03-25 by the reprex package (v1.0.0)