R - Merging two data files based on partial matching of inconsistent full name formats

175 Views Asked by At

I'm looking for a way to merge two data files based on partial matching of participants' full names that are sometimes entered in different formats and sometimes misspelled. I know there are some different function options for partial matches (eg agrep and pmatch) and for merging data files but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) in the merged data file store both original name formats and d) retain unique values even if they don't have a match.

For example, I have the following two data files:

File name: Employee Data

Full Name Date Started Orders ANGELA MUIR 6/15/14 25 EILEEN COWIE 6/15/14 44 LAURA CUMMING 10/6/14 43 ELENA POPA 1/21/15 37 KAREN MACEWAN 3/15/99 39

File name: Assessment data

Candidate Leading Factor SI-D SI-I Angie muir I -3 12 Caroline Burn S -5 -3 Eileen Mary Cowie S -5 5 Elena Pope C -4 7 Henry LeFeuvre C -5 -1 Jennifer Ford S -3 -2 Karen McEwan I -4 10 Laura Cumming S 0 6 Mandip Johal C -2 2 Mubarak Hussain D 6 -1

I want to merge them based on names (Full Name in df1 and Candidate in df2) ignoring middle name (eg Eilen Cowie = Eileen Mary Cowie), extra spaces (Laura Cumming = Laura Cumming); misspells (e.g. Elena Popa = Elena Pope) etc.

The ideal output would look like this:

Name Full Name Candidate Date Started Orders Leading Factor SI-D SI-I ANGELA MUIR ANGELA MUIR Angie muir 6/15/14 25 I -3 12 Caroline Burn N/A Caroline Burn N/A N/A S -5 -3 EILEEN COWIE EILEEN COWIE Eileen Mary Cowie 6/15/14 44 S -5 5 ELENA POPA ELENA POPA Elena Pope 1/21/15 37 C -4 7 Henry LeFeuvre N/A Henry LeFeuvre N/A N/A C -5 -1 Jennifer Ford N/A Jennifer Ford N/A N/A S -3 -2 KAREN MACEWAN KAREN MACEWAN Karen McEwan 3/15/99 39 I -4 10 LAURA CUMMING LAURA CUMMING Laura Cumming 10/6/14 43 S 0 6 Mandip Johal N/A Mandip Johal N/A N/A C -2 2 Mubarak Hussain N/A Mubarak Hussain N/A N/A D 6 -1

Any suggestions would be greatly appreciated!

1

There are 1 best solutions below

0
On

Here's a process that may help. You will have to inspect the results and make adjustments as needed.

df1

#            v1      v2
#1  ANGELA MUIR 6/15/14
#2 EILEEN COWIE 6/15/14
#3 AnGela Smith  5/3/14

df2

#                 u1   u2
#1 Eileen Mary Cowie  I-3
#2        Angie muir -5 5

index <- sapply(df1$v1, function(x) {
  agrep(x, df2$u1, ignore.case=TRUE, max.distance = .5)
}
)
index <- unlist(index)
df2$u1[index] <- names(index)
merge(df1, df2, by.x='v1', by.y='u1')

#            v1      v2   u2
#1  ANGELA MUIR 6/15/14 -5 5
#2 EILEEN COWIE 6/15/14  I-3

I had to adjust the argument max.distance in the index function. It may not work for your data, but adjust and test if it works. If this doesn't help, there is a package called stringdist that may have a more robust matching function in amatch.

Data

v1 <- c('ANGELA MUIR', 'EILEEN COWIE', 'AnGela Smith')
v2 <- c('6/15/14', '6/15/14', '5/3/14')
u1 <- c('Eileen Mary Cowie', 'Angie muir')
u2 <- c('I-3', '-5 5')
df1 <- data.frame(v1, v2, stringsAsFactors=F)
df2 <- data.frame(u1, u2, stringsAsFactors = F)