Fuzzy Matching player names in R

144 Views Asked by At

In R, I have two dataframes, one with full names and one with abbreviated names, I want to dplyr join them to see which one has a flag.

However, it is very hard to get matched names, even when I match last names, there are same last names. I'm giving part of the data as example below. Is there anyway I can efficiently and correctly join these two dataframe without any mismatches?

data_1 like:

Player Flag
W. Fofana 1
N'Golo Kante 1
Frank Anguissa 1

data_2 like:

Player
David Datro Fofana
Wesley Fofana
N'Golo Kante
Andre-Frank Zambo Anguissa

I tried Last name matching:

data_1 <- data_1 %>%
  mutate(player = sub('.*\\s', '', player)) 

data_2 <- data_2 %>%
  mutate(player = sub('.*\\s', '', player)) 

This can match "Andre-Frank Zambo Anguissa" to "Frank Anguissa", but will mismatch W. Fofana and David Datro Fofana.

But if I consider first name, I don't know how to keep cases like "Andre-Frank Zambo Anguissa" and "Frank Anguissa" matched. As that is a quite unique last name, by matching last name I actually can get correct match for it, if involve first name in, I don't think it will work. Any suggestions? Thanks a lot

I tried fuzzyjoin too, but it's hard to choose a max_dist that allows all correct match/join.

0

There are 0 best solutions below