I have two lists of names for the same set of students, which were collected separately. There are numerous typographical errors, and I have been using fuzzy matching to link the two lists. I am 99+% there with agrep and similar, but am stuck on the following basic problem: how can I match (for example) the forenames "Adrian Bruce" and "Bruce Adrian"? The Levenshtein edit distance is no good for this particular case, as it counts single-character insertions, deletions and substitutions, so two words in swapped order come out as far apart.
This must be a very common problem, but I cannot find any standard R package or routine for addressing it. I presume I am missing something obvious...???
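The technique I usually use is pretty robust and relatively insensitive to ordering, punctuation, etc. It's based on objects called "n-grams": all the overlapping runs of n consecutive characters in a string. If n=2, they are called "bigrams". For instance (a space is shown as _):

Adrian Bruce:  Ad dr ri ia an n_ _B Br ru uc ce
Bruce Adrian:  Br ru uc ce e_ _A Ad dr ri ia an
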
Each string has 11 bigrams, and 9 of them are common to both. Thus, the similarity score is very high: 9/11, or about 0.818, where 1.000 is a perfect match.
I am not very familiar with R, but if a package does not exist, this technique is very easy to code. You can write code that loops through the bigrams of string 1, tallies how many are contained in string 2, and divides that tally by the number of bigrams in string 1.
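
As a rough illustration of that loop, here is a minimal sketch. It is written in Python rather than R, purely as a language-neutral example; the function names bigrams and ngram_similarity are made up for this post, and the choice to divide by the bigram count of the first string is an assumption (other normalisations are possible, and this one is not symmetric when the strings differ in length).

```python
def bigrams(s, n=2):
    """Return the overlapping n-character substrings of s (bigrams when n=2)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_similarity(a, b, n=2):
    """Fraction of the n-grams of `a` that also occur somewhere in `b`.

    Assumption: the score is normalised by the n-gram count of `a`,
    so 1.0 means every n-gram of `a` was found in `b`.
    """
    grams_a = bigrams(a, n)
    if not grams_a:
        # String shorter than n characters: nothing to compare.
        return 0.0
    grams_b = set(bigrams(b, n))
    # Loop through the n-grams of string 1 and tally those present in string 2.
    shared = sum(1 for g in grams_a if g in grams_b)
    return shared / len(grams_a)


print(round(ngram_similarity("Adrian Bruce", "Bruce Adrian"), 3))  # 0.818
```

The comparison here is case-sensitive and keeps spaces and punctuation; lower-casing and stripping punctuation first would probably make it more forgiving on real name lists.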