In R, I have two dataframes, one with full names and one with abbreviated names, I want to dplyr join them to see which one has a flag.
However, it is very hard to get matched names, even when I match last names, there are same last names. I'm giving part of the data as example below. Is there anyway I can efficiently and correctly join these two dataframe without any mismatches?
data_1 like:
Player | Flag |
---|---|
W. Fofana | 1 |
N'Golo Kante | 1 |
Frank Anguissa | 1 |
data_2 like:
Player |
---|
David Datro Fofana |
Wesley Fofana |
N'Golo Kante |
Andre-Frank Zambo Anguissa |
I tried Last name matching:
data_1 <- data_1 %>%
mutate(player = sub('.*\\s', '', player))
data_2 <- data_2 %>%
mutate(player = sub('.*\\s', '', player))
This can match "Andre-Frank Zambo Anguissa" to "Frank Anguissa", but will mismatch W. Fofana and David Datro Fofana.
But if I consider first name, I don't know how to keep cases like "Andre-Frank Zambo Anguissa" and "Frank Anguissa" matched. As that is a quite unique last name, by matching last name I actually can get correct match for it, if involve first name in, I don't think it will work. Any suggestions? Thanks a lot
I tried fuzzyjoin
too, but it's hard to choose a max_dist
that allows all correct match/join.