I'm looking for a way to merge two data files based on partial matching of participants' full names that are sometimes entered in different formats and sometimes misspelled. I know there are some different function options for partial matches (eg agrep and pmatch) and for merging data files but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) in the merged data file store both original name formats and d) retain unique values even if they don't have a match.
For example, I have the following two data files:
File name: Employee Data
Full Name Date Started Orders ANGELA MUIR 6/15/14 25 EILEEN COWIE 6/15/14 44 LAURA CUMMING 10/6/14 43 ELENA POPA 1/21/15 37 KAREN MACEWAN 3/15/99 39
File name: Assessment data
Candidate Leading Factor SI-D SI-I Angie muir I -3 12 Caroline Burn S -5 -3 Eileen Mary Cowie S -5 5 Elena Pope C -4 7 Henry LeFeuvre C -5 -1 Jennifer Ford S -3 -2 Karen McEwan I -4 10 Laura Cumming S 0 6 Mandip Johal C -2 2 Mubarak Hussain D 6 -1
I want to merge them based on names (Full Name in df1 and Candidate in df2) ignoring middle name (eg Eilen Cowie = Eileen Mary Cowie), extra spaces (Laura Cumming = Laura Cumming); misspells (e.g. Elena Popa = Elena Pope) etc.
The ideal output would look like this:
Name Full Name Candidate Date Started Orders Leading Factor SI-D SI-I ANGELA MUIR ANGELA MUIR Angie muir 6/15/14 25 I -3 12 Caroline Burn N/A Caroline Burn N/A N/A S -5 -3 EILEEN COWIE EILEEN COWIE Eileen Mary Cowie 6/15/14 44 S -5 5 ELENA POPA ELENA POPA Elena Pope 1/21/15 37 C -4 7 Henry LeFeuvre N/A Henry LeFeuvre N/A N/A C -5 -1 Jennifer Ford N/A Jennifer Ford N/A N/A S -3 -2 KAREN MACEWAN KAREN MACEWAN Karen McEwan 3/15/99 39 I -4 10 LAURA CUMMING LAURA CUMMING Laura Cumming 10/6/14 43 S 0 6 Mandip Johal N/A Mandip Johal N/A N/A C -2 2 Mubarak Hussain N/A Mubarak Hussain N/A N/A D 6 -1
Any suggestions would be greatly appreciated!
Here's a process that may help. You will have to inspect the results and make adjustments as needed.
I had to adjust the argument
max.distance
in theindex
function. It may not work for your data, but adjust and test if it works. If this doesn't help, there is a package calledstringdist
that may have a more robust matching function inamatch
.Data