Non-equi join in tidyverse

2k Views Asked by At

I was wondering whether someone knows if the dplyr extension packages (dbplyr and dtplyr) allow non-equi joins within the usual dplyr workflow? I rarely need data.table, but fast non-equi joins are the only moments where I always need to setDT, then perform the join and then switch back with as_tibble(). I scrolled through the issues in the packages on github but didn't find whether this is something that's planned or already implemented.

3

There are 3 best solutions below

0
On BEST ANSWER

Non-equi joins are available since 1.1.0 via the function join_by. To create non-equi joins, you can use <, >, >=, <=, or helpers between, within, overlaps and nearest.

library(dplyr)

#Example from https://github.com/tidyverse/dplyr/pull/5910.

set.seed(123)
dates <- as.Date("2019-01-01") + 0:4
needles <- tibble(dates = dates, x = sample(length(dates)))

set.seed(123)
lower <- as.Date("2019-01-01") + sample(6, 5, replace = TRUE)
upper <- lower + sample(2, 5, replace = TRUE)
haystack <- tibble(lower = lower, upper = upper, y = sample(length(lower)))

needles
#> # A tibble: 5 x 2
#>   dates          x
#>   <date>     <int>
#> 1 2019-01-01     3
#> 2 2019-01-02     2
#> 3 2019-01-03     5
#> 4 2019-01-04     4
#> 5 2019-01-05     1
haystack
#> # A tibble: 5 x 3
#>   lower      upper          y
#>   <date>     <date>     <int>
#> 1 2019-01-04 2019-01-06     1
#> 2 2019-01-07 2019-01-08     2
#> 3 2019-01-04 2019-01-05     3
#> 4 2019-01-03 2019-01-05     4
#> 5 2019-01-03 2019-01-05     5

# Non-equi join
# For each row in `needles`, find locations in `haystack` matching the condition
left_join(needles, haystack, by = join_by(dates >= lower, dates <= upper))
#> # A tibble: 12 x 5
#>    dates          x lower      upper          y
#>    <date>     <int> <date>     <date>     <int>
#>  1 2019-01-01     3 NA         NA            NA
#>  2 2019-01-02     2 NA         NA            NA
#>  3 2019-01-03     5 2019-01-03 2019-01-05     4
#>  4 2019-01-03     5 2019-01-03 2019-01-05     5
#>  5 2019-01-04     4 2019-01-04 2019-01-06     1
#>  6 2019-01-04     4 2019-01-04 2019-01-05     3
#>  7 2019-01-04     4 2019-01-03 2019-01-05     4
#>  8 2019-01-04     4 2019-01-03 2019-01-05     5
#>  9 2019-01-05     1 2019-01-04 2019-01-06     1
#> 10 2019-01-05     1 2019-01-04 2019-01-05     3
#> 11 2019-01-05     1 2019-01-03 2019-01-05     4
#> 12 2019-01-05     1 2019-01-03 2019-01-05     5
2
On

For dbplyr: While SQL supports non-equi joins, I have not found a dplyr approach that is equivalent. My usual work around is very similar to the r-bloggers link posted by @Waldi to join on the equality conditions and then filter on the inequality conditions.

For example:

output = join(df1, df2, by = c("df1_id" = "df2_id")) %>%
  filter(df1_date <= df2_date)

This translates to SQL similar to:

SELECT *
FROM df1
JOIN df2
ON df1_id = df2_id
WHERE df1_date <= df2_date

Which is not that different from:

SELECT *
FROM df1
JOIN df2
ON df1_id = df2_id
AND df1_date <= df2_date
0
On

There has been a new option for this in dbplyr since the version 1.4.0: sql_on. Citing Kirill Müller:

There is #2240 for dplyr, but it's going to take a while. For databases we already have a workaround [ie. generic SQL joins].

library(dplyr)
library(dbplyr)
tbl1 <- memdb_frame(a = 1:3, b = 4:2)
tbl2 <- memdb_frame(c = 1:3, b = 2:0)
left_join(tbl1, tbl2, sql_on = "LHS.b < RHS.c")