I'm running into an issue with encoding and partial matching.
I have two data frames, A and B. A was read in with UTF-8 encoding and B with Latin1. This could already be part of the issue, although I'm not sure. It was the only way I knew how to import them properly.
edit: I should clarify. This is just sample data. Both dataframes contain a large number of rows and other columns as well.
A
ID  Name         Expense
1   Mike Adall   3
2   Brian Adams  4
3   Adrián       1
4   Floyd Oid    1
5   Semi Ajayi   4
6   Jomu Aké     3

B
Employee           Category
Lothar Fiend       B2
Rohan Sudarsh      A2
Adrián Silva       A1
Semi Ajayi         A1
Micheal Adall      A1
Jomü Ria Aké       B1
Brian Adams        B2
Floyd Öid Matheus  B1
I've been trying to extract B$Employee and partially match it against A$Name to create a new df C that would include B$Category. This is the output that I would like.
edit: With Category, I would also want to include all the other columns of both A & B excluding Employee.
C
ID Name Expense Category
1 Mike Adall 3 A1
2 Brian Adams 4 B2
3 Adrián 1 A1
4 Floyd Oid 1 B1
5 Semi Ajayi 4 A1
6 Jomu Aké 3 B1
So far I've got it to match 80% of the characters using the fuzzyjoin package.
C <- A %>% fuzzy_inner_join(B, by = c(Name = "Employee"))
The main issue seems to be odd Latin characters such as Ö, ß, etc., or cases where one occurs at the end of a name, as in 'Aké'. The results seem to vary from name to name.
How could I get it to partially match all the names?
This method will only result in one match (column match), because which.min and max.col return a single index even when there are distance ties. It is important to check ties manually. Ties can be checked in data.frame res, column minMatchSeveral, or in the second script below.

From the ?stringdist-metrics

In addition you can take a look at stringi::stri_trans_general

EDIT: another way to visualize ties
data
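The sample tables from the question can be rebuilt and matched along these lines. This is only a minimal sketch and not the original script: it assumes base R's adist() with partial = TRUE in place of stringdist, and a small chartr() map in place of stringi::stri_trans_general. The helper normalise is hypothetical, and the names res, match and minMatchSeveral are borrowed from the text above. It also assumes a session that handles UTF-8 strings.

```r
# Sample data rebuilt from the question; \u escapes avoid source-encoding issues.
A <- data.frame(
  ID      = 1:6,
  Name    = c("Mike Adall", "Brian Adams", "Adri\u00e1n", "Floyd Oid",
              "Semi Ajayi", "Jomu Ak\u00e9"),
  Expense = c(3, 4, 1, 1, 4, 3),
  stringsAsFactors = FALSE
)
B <- data.frame(
  Employee = c("Lothar Fiend", "Rohan Sudarsh", "Adri\u00e1n Silva",
               "Semi Ajayi", "Micheal Adall", "Jom\u00fc Ria Ak\u00e9",
               "Brian Adams", "Floyd \u00d6id Matheus"),
  Category = c("B2", "A2", "A1", "A1", "A1", "B1", "B2", "B1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip the accents that occur in the sample, then
# lower-case. The character map is an assumption; extend it for real data.
normalise <- function(x) {
  tolower(chartr("\u00e1\u00e9\u00fc\u00d6\u00f6", "aeuOo", x))
}

# Approximate distance of each Name to the closest substring of each Employee.
# Rows correspond to A$Name, columns to B$Employee.
d <- adist(normalise(A$Name), normalise(B$Employee), partial = TRUE)

best    <- apply(d, 1, which.min)  # first minimum only, so ties are hidden
minDist <- apply(d, 1, min)

res <- data.frame(
  A,
  match           = B$Employee[best],            # kept for manual checking
  Category        = B$Category[best],
  minMatchSeveral = rowSums(d == minDist) > 1,   # TRUE if the minimum is tied
  stringsAsFactors = FALSE
)
res
```

Rows where minMatchSeveral is TRUE should be inspected by hand, since which.min silently keeps the first of the tied candidates. The match column can be dropped once the pairs have been checked.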