R: How to left join two tables based on fuzzy matching strings that are not exactly the same


I am trying to left join Table 1 ('Person Name') to Table 2 ('Name') and get the values from the Work Group column in Table 2.

Table 1 (df1) and Table 2 (df2) are reproduced below as a reprex:

df1 <- read.table(text="
Person_Name
PEREZ, MINDY
PEREZ, ABA
CLARKE, LINDA
THOMAS, NICOLE", header=T, sep="|") 

df2 <- read.table(text=
'Name Work_Group 
"Perez-Tie, Mindy" "Group A"
"Rulnick-Perez, Aba" "Group C"
"Mcabe-Clarke, Linda" "Group A"
"Thomas, Nicole" "Group B"', header=T)

This is my current code; however, it does not show any matches. I am not sure what I might be doing wrong, but any help would be greatly appreciated!

library(fuzzyjoin)
library(stringr)
library(dplyr)

ci_str_detect <- function(x, y) {str_detect(x, regex(y, ignore_case = TRUE))}

result <- fuzzy_left_join(df1, df2, by = c("Person_Name" = "Name"),
                          match_fun = ci_str_detect) %>%
  select(Person_Name, Name, Work_Group)
print(result)

2 Answers

Answer from Mark:

Here's working code:

library(tidyverse)

# Make sure the two data frames have the same format and the same columns
df1 <- df1 %>%
  mutate(Person_Name = str_to_title(Person_Name)) %>%
  separate_wider_delim(Person_Name, delim = ", ", names = c("Last_Name", "First_Name"))

df2 <- df2 %>%
  mutate(Name = str_to_title(Name)) %>%
  separate_wider_delim(Name, delim = ", ", names = c("Last_Name", "First_Name"))

# Maybe the simplest way of doing things (but maybe not what you had in mind ;) )
left_join(df1, df2, by = "First_Name")

# But this is more robust
cross_join(df1, df2) %>%
  filter(str_detect(Last_Name.y, Last_Name.x), str_detect(First_Name.y, First_Name.x)) %>%
  select(Last_Name = Last_Name.y, First_Name = First_Name.y, Work_Group)

The problem with your code is that, in short, the strings don't match. Take the first row as an example: "perez, mindy" will not match "perez-tie, mindy", nor the reverse (both are shown here in lowercase to make it more obvious). You want to match the first names and last names separately.
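
As a quick illustration with stringr, using that first row:

library(stringr)

# Neither full string contains the other, so str_detect() finds no match
# in either direction:
str_detect("PEREZ, MINDY", regex("Perez-Tie, Mindy", ignore_case = TRUE))  # FALSE
str_detect("Perez-Tie, Mindy", regex("PEREZ, MINDY", ignore_case = TRUE))  # FALSE

# The individual name parts, however, do match:
str_detect("Perez-Tie", regex("PEREZ", ignore_case = TRUE))  # TRUE
str_detect("Mindy", regex("MINDY", ignore_case = TRUE))      # TRUE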

Another thing you can look at is the stringdist_join function from the fuzzyjoin package. I wasn't able to get it working correctly for all the rows, but it might be useful in your case.
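
For example, a minimal sketch along those lines, applied directly to the raw name columns; the method and max_dist here are only illustrative guesses, and, as noted, they may well leave some rows unmatched:

library(fuzzyjoin)

# Illustrative only: fuzzy-join on the raw names with a string-distance cutoff.
# The max_dist value is a guess and may need tuning for this data.
stringdist_left_join(df1, df2,
                     by = c("Person_Name" = "Name"),
                     method = "jw",
                     ignore_case = TRUE,
                     max_dist = 0.25,
                     distance_col = "distance")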

Answer from LMc:

Here is an option using fuzzyjoin that requires a bit of string manipulation before matching:

Updated based on comment from @moodymudskipper

library(fuzzyjoin)

rev_str <- \(x) paste(rev(x), collapse = " ")

df1$rev_name = sapply(strsplit(df1$Person_Name, ", "), rev_str)
df2$rev_name = sapply(strsplit(df2$Name, ", "), rev_str)

stringdist_left_join(df1, df2, by = "rev_name",
                     method = "jw",
                     p = 0.15,
                     distance_col = "distance",
                     ignore_case = TRUE,
                     max_dist = 0.15) |> 
  subset(select = -c(rev_name.x, rev_name.y, distance))

This uses the Jaro–Winkler distance, where you can add a weight (p) that gives a bonus to strings that start the same way. That is why I reversed the names first: in your reprex the first names tended to be accurate, while the last names carried extra hyphenated parts. I selected a distance cutoff of 0.15 after looking at the values in the distance column, which I included in the code just so you could see them. You might have to tune this for your real data set (a quick way to inspect the distances is sketched after the output below).

Note: values closer to zero indicate that strings are more similar.

Output

     Person_Name                Name Work_Group
1   PEREZ, MINDY    Perez-Tie, Mindy    Group A
2     PEREZ, ABA  Rulnick-Perez, Aba    Group C
3  CLARKE, LINDA Mcabe-Clarke, Linda    Group A
4 THOMAS, NICOLE      Thomas, Nicole    Group B
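
If the 0.15 cutoff does not transfer to your real data, here is a minimal sketch of how the distances could be inspected first (assuming df1 and df2 still carry the rev_name columns built above; max_dist = 1 keeps every candidate pair, since Jaro–Winkler distances are bounded by 1):

library(fuzzyjoin)

# Keep all candidate pairs, then look at the distances to decide on a cutoff.
all_pairs <- stringdist_left_join(df1, df2, by = "rev_name",
                                  method = "jw",
                                  p = 0.15,
                                  distance_col = "distance",
                                  ignore_case = TRUE,
                                  max_dist = 1)

# Sort by distance to see where the good matches end and the noise begins
all_pairs[order(all_pairs$distance), c("Person_Name", "Name", "distance")]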