data.table non-equi join with min/max of joined value


I'm trying to do a non-equi join in data.table and extract the min/max of the joined values.

library(data.table)

set.seed(42)
dtA <- data.table(id=rep(c("A","B"),each=3), start=rep(1:3, times=2), end=rep(2:4, times=2))
dtB <- data.table(id=rep(c("A","B"),times=20), time=sort(runif(40, 1, 4)))

For each row of dtA, I'd like the min/max of time over the dtB rows where time falls between start and end (matching on id). Nominally this is just a non-equi join, but I can't find a combination of by=.EACHI or mult="..." that gets me what I want: the min/max I get back are not aligned with the range they should summarize. Unfortunately roll= does not support non-equi ranges.

dtA[dtB, c("Min", "Max") := .(min(time), max(time)),
    on = .(id, start <= time, end > time), mult = "first"]
#        id start   end      Min      Max
#    <char> <int> <int>    <num>    <num>
# 1:      A     1     2 1.011845 3.966675
# 2:      A     2     3 1.011845 3.966675
# 3:      A     3     4 1.011845 3.966675
# 4:      B     1     2 1.011845 3.966675
# 5:      B     2     3 1.011845 3.966675
# 6:      B     3     4 1.011845 3.966675
dtA[dtB, c("Min", "Max") := .(min(time), max(time)),
    on = .(id, start <= time, end > time), by = .EACHI]
#        id start   end      Min      Max
#    <char> <int> <int>    <num>    <num>
# 1:      A     1     2 1.858419 1.858419
# 2:      A     2     3 2.970977 2.970977
# 3:      A     3     4 3.934679 3.934679
# 4:      B     1     2 1.766286 1.766286
# 5:      B     2     3 2.925237 2.925237
# 6:      B     3     4 3.966675 3.966675

The second is the closest (its "Max" is correct), but what I'd like to get is:

       id start   end      Min      Max
   <char> <int> <int>    <num>    <num>
1:      A     1     2 1.011845 1.858419
2:      A     2     3 2.170610 2.970977
3:      A     3     4 3.115194 3.934679
4:      B     1     2 1.022002 1.766286
5:      B     2     3 2.164325 2.925237
6:      B     3     4 3.055509 3.966675

The real problem has around 400K rows of ranges joining against roughly 2 million rows of values, so I'd prefer not to do a full expansion of both frames only to manually cut it back down to the size of dtA.
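
For concreteness, the brute-force expansion I'd like to avoid would look something like this (a sketch: materialize every matching range/time pair with a plain non-equi join, then aggregate back down):

expanded <- dtA[dtB, on = .(id, start <= time, end > time), nomatch = NULL,
                .(id, start = x.start, end = x.end, time = i.time)]
expanded[, .(Min = min(time), Max = max(time)), by = .(id, start, end)]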

(I'm open to suggestions using the collapse package.)

3 Answers

Answer by thelatemail (accepted)

Switch the join around so it's dtB[dtA], and then assign back into dtA:

dtA[,
    c("min","max") := dtB[
        dtA,
        on=.(id, time >= start, time < end), 
        .(min=min(x.time), max=max(x.time)),
        by=.EACHI][, c("min","max")]
    ]
dtA

#       id start   end      min      max
#   <char> <int> <int>    <num>    <num>
#1:      A     1     2 1.011845 1.858419
#2:      A     2     3 2.170610 2.970977
#3:      A     3     4 3.115194 3.934679
#4:      B     1     2 1.022002 1.766286
#5:      B     2     3 2.164325 2.925237
#6:      B     3     4 3.055509 3.966675

You can see why it needs to be spun around: otherwise the .EACHI group is each individual row of dtB rather than the set of dtB rows matching each range in dtA. (Note also the x.time prefix in j: the columns named in on= are returned with i's values, so x.time is needed to reach dtB's actual time values.)

dtB[dtA, on=.(id, time >= start, time < end), .N, by=.EACHI]
#       id  time  time     N
#   <char> <int> <int> <int>
#1:      A     1     2     5
#2:      A     2     3     6
#3:      A     3     4     9
#4:      B     1     2     4
#5:      B     2     3     6
#6:      B     3     4    10

dtA[dtB, on=.(id, start <= time, end > time), .N, by=.EACHI][, .(freq=.N), by=N]
#       N  freq
#   <int> <int>
#1:     1    40

This makes sense in the context of the help("special-symbols", package = "data.table") description:

Its usage is 'by=.EACHI' (or 'keyby=.EACHI') which invokes grouping-by-each-row-of-i

In the DT[i, j, by] logic, dtA therefore needs to be i so that its rows define the groups.
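
If you want the aggregate as a standalone table rather than assigning into dtA, something like this sketch works; the two range columns come back named after dtB's time column but holding dtA's start/end values, so a positional rename tidies the result:

res <- dtB[dtA, on = .(id, time >= start, time < end),
           .(Min = min(x.time), Max = max(x.time)), by = .EACHI]
setnames(res, 2:3, c("start", "end"))  # these join columns hold dtA's start/end values
res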

Answer by jblood94

Update based on comments

Joins can be avoided altogether by using nafill after row-binding dtA to dtB and sorting. The solution below runs potentially orders of magnitude faster than a .EACHI non-equi join at the cost of considerably increased code complexity.

Unlike the previous answer, this makes no assumption about overlapping/non-overlapping intervals.

i <- 1:nrow(dtA)
dtA[
    , c("Min", "Max") := setorder(
      rbindlist(
        list( # list order is important
          dtA[, .(id, time = end, rw = .I, isStart = -1L)],
          dtA[, .(id, time = start, rw = .I, isStart = 1L)],
          dtB[, `:=`(rw = NA_integer_, isStart = 0L)]
        )
      ), id, time
    )[ # set the time at start/end points to NA
      !is.na(rw), time := NA_real_
    ][ # fill forward to get the max in each interval
      isStart <= 0L, time := nafill(time, "locf")
    ][ # fill backward to get the min in each interval
      isStart >= 0L, time := nafill(time, "nocb")
      # get the min and max, ordered by dtA row
    ][!is.na(rw), setorder(.SD, -isStart, rw)[, .(time[i], time[-i])]]
  # remove the no matches
  ][Min >= end | Min < start, c("Min", "Max") := NA_real_]

dtA
#>    id start end      Min      Max
#> 1:  A     1   2 1.011845 1.858419
#> 2:  A     2   3 2.170610 2.970977
#> 3:  A     3   4 3.115194 3.934679
#> 4:  B     1   2 1.022002 1.766286
#> 5:  B     2   3 2.164325 2.925237
#> 6:  B     3   4 3.055509 3.966675

As a function, along with a function for the .EACHI approach from the accepted answer:

fsort <- function(dtA, dtB) {
  i <- 1:nrow(dtA)
  dtA[
    , c("Min", "Max") := setorder(
      rbindlist(
        list(
          dtA[, .(id, time = end, rw = .I, isStart = -1L)],
          dtA[, .(id, time = start, rw = .I, isStart = 1L)],
          dtB[, `:=`(rw = NA_integer_, isStart = 0L)]
        )
      ), id, time
    )[
      !is.na(rw), time := NA_real_
    ][
      isStart <= 0L, time := nafill(time, "locf")
    ][
      isStart >= 0L, time := nafill(time, "nocb")
    ][!is.na(rw), setorder(.SD, -isStart, rw)[, .(time[i], time[-i])]]
  ][Min >= end | Min < start, c("Min", "Max") := NA_real_]
}

fEACHI <- function(dtA, dtB) {
  dtA[
    , c("Min","Max") := dtB[
      dtA,
      on = .(id, time >= start, time < end), 
      .(Min=min(x.time), Max=max(x.time)),
      by=.EACHI][, c("Min","Max")]
  ]
}

Benchmarking on a much larger dataset:

dtA <- data.table(id = rep(c("A", "B"), 2e5), start = sort(runif(4e5, 0, 2e5)), end = sort(runif(4e5, 0, 2e5)))
dtA[start > end, c("end", "start") := .(start, end)]
dtB <- data.table(id = rep(c("A", "B"), 1e6), time = sort(runif(2e6, 0, 2e5)))
setorder(dtA, id, start)

microbenchmark::microbenchmark(
  sort = fsort(dta, dtB),
  .EACHI = fEACHI(dta, dtB),
  setup = dta <- copy(dtA),
  check = "equal",
  times = 10
)
#> Unit: milliseconds
#>    expr       min       lq      mean    median        uq       max neval
#>    sort  258.6646  270.124  295.0626  304.3474  308.0906  322.8272    10
#>  .EACHI 2065.2458 2094.445 2115.2688 2104.5641 2152.5112 2157.1872    10

fEACHI gets increasingly slow as the interval widths increase, while fsort scales very nicely. For the example dataset below, fsort still takes a fraction of a second, while fEACHI explodes to 400 seconds.

dtA <- data.table(id = rep(c("A", "B"), 2e5), start = runif(4e5, 0, 2e5), end = runif(4e5, 0, 2e5))
dtA[start > end, c("end", "start") := .(start, end)]
dtB <- data.table(id = rep(c("A", "B"), 1e6), time = sort(runif(2e6, 0, 2e5)))
setorder(dtA, id, start)

system.time(dt1 <- fsort(copy(dtA), dtB))
#>    user  system elapsed 
#>    0.38    0.08    0.26
system.time(dt2 <- fEACHI(copy(dtA), dtB))
#>    user  system elapsed 
#>  520.44   23.03  399.72
identical(dt1, dt2)
#> [1] TRUE
Answer by r2evans

Ultimately I think it's not possible to resolve this without doing a full expansion somewhere, so the next hurdle is runtime. With this 6-row sample, thelatemail's approach was faster, but as the data scales up the collapse version reaches parity, and on my actual data it takes half the time (which is still on the order of seconds).

Also, I've changed the code below to use first/last instead of min/max, reflecting another aspect of the data I'm working with (analogous in almost every way as far as performance and data management go).

thelatemail's:

dtA[,
    c("min","max") := dtB[
      dtA,
      on=.(id, time >= start, time < end),
      .(min=first(x.time), max=last(x.time)),
      by=.EACHI][, c("min","max")]
    ]

The equivalent in collapse:

joined <- dtB[, tm := time][dtA, on = .(id, tm >= start, tm < end)]
setnames(joined, c("tm", "tm.1"), c("start", "end"))
g <- collapse::GRP(joined, c("id", "start", "end"))
dtA[
  cbind(collapse::ffirst(joined[,.(time)], g = g),
        collapse::flast(joined[,.(time)], g = g),
        g$groups),
  on = c("id", "start", "end")]

AndreWildberg's recommendation of a pure dplyr (>= 1.1.0) version is also notable, though I can't benchmark it at the moment (I'm still working with dplyr < 1.1.0).
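
For reference, such a dplyr version might look something like this sketch, using join_by() for the non-equi condition and summarise(.by = ) for the grouping (my guess at the shape of that suggestion, not necessarily the recommended code):

library(dplyr)  # >= 1.1.0 for join_by() inequality conditions and summarise(.by = )

dtA |>
  left_join(dtB, by = join_by(id, start <= time, end > time)) |>
  summarise(Min = min(time), Max = max(time), .by = c(id, start, end))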