R: How to left join two tables based on fuzzy matching strings that are not exactly the same


I am trying to left join Table 1 ('Person Name') to Table 2 ('Name') and get the values from the Work Group column in Table 2.

Table 1 (df1) and Table 2 (df2) are reproduced below as a reprex:

df1 <- read.table(text="
Person_Name
PEREZ, MINDY
PEREZ, ABA
CLARKE, LINDA
THOMAS, NICOLE", header=T, sep="|") 

df2 <- read.table(text=
'Name Work_Group 
"Perez-Tie, Mindy" "Group A"
"Rulnick-Perez, Aba" "Group C"
"Mcabe-Clarke, Linda" "Group A"
"Thomas, Nicole" "Group B"', header=T)

This is my current code; however, it does not show any matches. I am not sure what I might be doing wrong, but any help would be greatly appreciated!

library(fuzzyjoin)
library(stringr)
library(dplyr)

ci_str_detect <- function(x, y) {str_detect(x, regex(y, ignore_case = TRUE))}

result <- fuzzy_left_join(df1, df2, by = c("Person_Name" = "Name"),
                          match_fun = ci_str_detect) %>%
  select(Person_Name, Name, Work_Group)
print(result)

2 Answers

Answer from Mark:

Here's working code:

library(tidyverse)

# Make sure the two data frames have the same format and the same columns
df1 <- df1 %>%
  mutate(Person_Name = str_to_title(Person_Name)) %>%
  separate_wider_delim(Person_Name, delim = ", ", names = c("Last_Name", "First_Name"))

df2 <- df2 %>%
  mutate(Name = str_to_title(Name)) %>%
  separate_wider_delim(Name, delim = ", ", names = c("Last_Name", "First_Name"))

# Maybe the simplest way of doing things (but maybe not what you had in mind ;) )
left_join(df1, df2, by = "First_Name")

# But this is more robust
cross_join(df1, df2) %>%
  filter(str_detect(Last_Name.y, Last_Name.x), str_detect(First_Name.y, First_Name.x)) %>%
  select(Last_Name = Last_Name.y, First_Name = First_Name.y, Work_Group)

The problem with your code is that, in short, the strings don't match. Take the first row as an example: "perez, mindy" will not match "perez-tie, mindy", nor the reverse (both are shown here in lowercase to make it more obvious). You want to match the first names and last names separately.
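
As a quick illustration with stringr, using that first row:

library(stringr)

# Neither full string contains the other, so str_detect() finds no match
# in either direction:
str_detect("PEREZ, MINDY", regex("Perez-Tie, Mindy", ignore_case = TRUE))  # FALSE
str_detect("Perez-Tie, Mindy", regex("PEREZ, MINDY", ignore_case = TRUE))  # FALSE

# The individual name parts, however, do match:
str_detect("Perez-Tie", regex("PEREZ", ignore_case = TRUE))  # TRUE
str_detect("Mindy", regex("MINDY", ignore_case = TRUE))      # TRUE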

Another thing you can look at is the stringdist_join function from the fuzzyjoin package. I wasn't able to get it working correctly for all the rows, but it might be useful in your case.
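
For example, a minimal sketch along those lines, applied directly to the raw name columns; the method and max_dist here are only illustrative guesses, and, as noted, they may well leave some rows unmatched:

library(fuzzyjoin)

# Illustrative only: fuzzy-join on the raw names with a string-distance cutoff.
# The max_dist value is a guess and may need tuning for this data.
stringdist_left_join(df1, df2,
                     by = c("Person_Name" = "Name"),
                     method = "jw",
                     ignore_case = TRUE,
                     max_dist = 0.25,
                     distance_col = "distance")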

Answer from LMc:

Here is an option using fuzzyjoin that requires a bit of string manipulation before matching:

Updated based on comment from @moodymudskipper

library(fuzzyjoin)

rev_str <- \(x) paste(rev(x), collapse = " ")

df1$rev_name = sapply(strsplit(df1$Person_Name, ", "), rev_str)
df2$rev_name = sapply(strsplit(df2$Name, ", "), rev_str)

stringdist_left_join(df1, df2, by = "rev_name",
                     method = "jw",
                     p = 0.15,
                     distance_col = "distance",
                     ignore_case = TRUE,
                     max_dist = 0.15) |> 
  subset(select = -c(rev_name.x, rev_name.y, distance))

This uses the Jaro–Winkler distance, where you can add a weight (p) that gives a bonus to strings that start the same way. That is why I reversed the names first: in your reprex the first names tended to be accurate, while the last names carried extra hyphenated parts. I selected a distance cutoff of 0.15 after looking at the values in the distance column, which I included in the code just so you could see them. You might have to tune this for your real data set (a quick way to inspect the distances is sketched after the output below).

Note: values closer to zero indicate that strings are more similar.

Output

     Person_Name                Name Work_Group
1   PEREZ, MINDY    Perez-Tie, Mindy    Group A
2     PEREZ, ABA  Rulnick-Perez, Aba    Group C
3  CLARKE, LINDA Mcabe-Clarke, Linda    Group A
4 THOMAS, NICOLE      Thomas, Nicole    Group B
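
If the 0.15 cutoff does not transfer to your real data, here is a minimal sketch of how the distances could be inspected first (assuming df1 and df2 still carry the rev_name columns built above; max_dist = 1 keeps every candidate pair, since Jaro–Winkler distances are bounded by 1):

library(fuzzyjoin)

# Keep all candidate pairs, then look at the distances to decide on a cutoff.
all_pairs <- stringdist_left_join(df1, df2, by = "rev_name",
                                  method = "jw",
                                  p = 0.15,
                                  distance_col = "distance",
                                  ignore_case = TRUE,
                                  max_dist = 1)

# Sort by distance to see where the good matches end and the noise begins
all_pairs[order(all_pairs$distance), c("Person_Name", "Name", "distance")]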