Order-independent fuzzy matching of "Firstname Lastname"/"Lastname Firstname" in R?


I have two lists of names for the same set of students, collected separately. There are numerous typographical errors, and I have been using fuzzy matching to link the two lists. I am 99+% there with agrep and similar, but am stuck on the following basic problem: how can I match (for example) the forenames "Adrian Bruce" and "Bruce Adrian"? The Levenshtein edit distance is no good for this particular case, because it counts character-level insertions, deletions and substitutions, so two names that differ only in word order still score as far apart.

This must be a very common problem, but I cannot find any standard R package or routine that addresses it. I presume I am missing something obvious?


There are 2 best solutions below


The technique I usually use is pretty robust and relatively insensitive to ordering, punctuation, etc. It is based on "n-grams": the overlapping runs of n consecutive characters in a string. For n=2 these are called "bigrams". For instance:

"Adrian Bruce" --> ("Ad","dr","ri","ia","an","n "," B","Br","ru","uc","ce")
"Bruce Adrian" --> ("Br","ru","uc","ce","e "," A","Ad","dr","ri","ia","an")

Each string has 11 bigrams, and 9 of them are common to both. The similarity score is therefore high: 9/11, or 0.818, where 1.000 is a perfect match.

I am not very familiar with R, but if a package does not exist, this technique is very easy to code. You can write code that loops through the bigrams of string 1 and tallies how many of them also occur in string 2.
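
For what it's worth, here is a minimal base-R sketch of that tally. The function names `ngrams` and `bigram_similarity` are made up for illustration, and dividing by the larger of the two bigram counts is an assumption on my part (the example above only covers strings with the same number of bigrams), so treat it as a starting point rather than a reference implementation.

```r
# Split a string into its overlapping character n-grams (bigrams by default).
# `ngrams` is a hypothetical helper name, not a function from any package.
ngrams <- function(s, n = 2) {
  len <- nchar(s)
  if (len < n) return(character(0))
  starts <- seq_len(len - n + 1)
  substring(s, starts, starts + n - 1)
}

# Tally how many n-grams of `a` also occur in `b`, then scale to [0, 1].
# Assumption: the denominator is the larger of the two n-gram counts.
bigram_similarity <- function(a, b, n = 2) {
  ga <- ngrams(a, n)
  gb <- ngrams(b, n)
  denom <- max(length(ga), length(gb))
  if (denom == 0) return(0)
  sum(ga %in% gb) / denom
}

bigram_similarity("Adrian Bruce", "Bruce Adrian")  # 9/11, about 0.818
```

The comparison is case-sensitive as written; wrapping both inputs in `tolower()` before splitting would be one way to relax that.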

3
On

Well, one fairly easy way is to swap the words and match again...

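# Candidate names to search
y <- c("Bruce Almighty", "Lee, Bruce", "Leroy Brown")
# Swap the text on either side of the last space, e.g. "Lee, Bruce" -> "Bruce Lee,"
y2 <- sub("(.*) (.*)", "\\2 \\1", y)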

agrep("Bruce Lee", y)  # No match
agrep("Bruce Lee", y2) # Match!
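
If you want both passes in one call, something along these lines might do. `agrep_either_order` is a hypothetical wrapper name, and it simply takes the union of the indices that `agrep` returns for the original and the swapped strings, so it is only a sketch of the idea above.

```r
# Candidate names, same as in the example above.
y <- c("Bruce Almighty", "Lee, Bruce", "Leroy Brown")

# Hypothetical helper: indices of `candidates` that agrep() matches
# in either the original or the swapped word order.
# Extra arguments (e.g. max.distance) are passed through to agrep().
agrep_either_order <- function(pattern, candidates, ...) {
  swapped <- sub("(.*) (.*)", "\\2 \\1", candidates)
  hits <- union(agrep(pattern, candidates, ...),
                agrep(pattern, swapped, ...))
  sort(hits)
}

agrep_either_order("Bruce Lee", y)  # index of "Lee, Bruce"
```

Note that the regular expression is greedy, so the swap happens at the last space only; a three-word name like "Mary Ann Smith" becomes "Smith Mary Ann".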