"fuzzy" inner_join in dplyr to keep both rows that do AND not exactly match

364 Views Asked by At

I am working with two datasets that I would like to join based not exact matches between them, but rather approximate matches. My question is similar to this OP.

Here are examples of what my two dataframes look like.

df1 is this one:

 x
 4.8
 12  
 4  
 3.5
 12.5
 18  

df2 is this one:

 x     y
 4.8   6.6
 12    1  
 4.5   1  
 3.5   0.5
 13    1.8
 15    2

I am currently using inner_join(df1, df2, by=c("x") to join the two together.

This gives me:

 x     y
 4.8   6.6
 12    1
 3.5   0.5

However, what I really want to do is join the two dfs based on these conditions:

  1. any exact matches are joined first (exactly like how inner_join() currently works)
  2. BUT, if there are no exact matches, then join to any match ± 0.5

The kind of output I am trying to get would look like this:

 x     y
 4.8   6.6
 12    1
 4     1    #the y value is from x=4.5 in df1
 4     0.5  #the y value is from x=3.5 in df1
 3.5   0.5
 12.5  1    #the y value is from x=12 in df1
 12.5  1.8  #the y value is from x=13 in df1 

I typically work in dplyr, so a dplyr solution would be appreciated. But, I am also open to other suggestions because I don't know if dplyr will be flexible enough to do a "fuzzy" join.

(I am aware of the fuzzyjoin package, but it doesn't seem to get at exactly what I am trying to do here)

2

There are 2 best solutions below

4
On BEST ANSWER

A possible solution, with no join:

library(tidyverse)

df1 %>%
  rename(x1 = x) %>%
  crossing(df2) %>%
  mutate(diff = abs(x1-x)) %>% 
  filter(diff <= 0.5) %>% 
  group_by(x1) %>% 
  mutate(aux = any(diff == 0)) %>% 
  filter(aux*(diff == 0) | !aux) %>% 
  select(-diff, -aux) %>% 
  ungroup

#> # A tibble: 7 × 3
#>      x1     x     y
#>   <dbl> <dbl> <dbl>
#> 1   3.5   3.5   0.5
#> 2   4     3.5   0.5
#> 3   4     4.5   1  
#> 4   4.8   4.8   6.6
#> 5  12    12     1  
#> 6  12.5  12     1  
#> 7  12.5  13     1.8
1
On

You could use {powerjoin}

library(powerjoin)
power_left_join(
  df1, df2,
  by = ~ .x$x == .y$x | ! .x$x %in% .y$x & .x$x <= .y$x +.5 & .x$x >= .y$x -.5, 
  keep = "left")
#>      x   y
#> 1  4.8 6.6
#> 2 12.0 1.0
#> 3  4.0 1.0
#> 4  4.0 0.5
#> 5  3.5 0.5
#> 6 12.5 1.0
#> 7 12.5 1.8

Created on 2022-04-14 by the reprex package (v2.0.1)