fuzzy string matching with agrep()


I'm matching a list of company names against itself with R and agrep() because the data was stored incorrectly in a legacy system: there was no 4th normal form and companies were recorded on the same level as customers, which means a new company entry for every new customer and therefore many different company names for one company. This works fine in a lot of cases.

Sometimes, especially for short strings, I get matches that look weird to me, for example (ABC is the first company name):

ABC ABAXIS Europe GmbH

ABC ABB Europe

ABC ABB Group

ABC ABB Stotz Kontakt GmbH

ABC ABM Financial News

ABC ABN AMRO Bank NV

ABC AC Klöser GmbH

ABC ACCBank

ABC ACEA S.p.A.

I'm using agrep() with the following parameters:

agrep(vector1, vector2, value = TRUE, ignore.case = FALSE, max.distance = 0.01)

Is there any other way than the max distance to tweak agrep() or a better way to do this?
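A minimal sketch of why a short pattern such as "ABC" matches so broadly. It relies on two points from the agrep() documentation: a max.distance below 1 is a fraction of the pattern length and is rounded up to the next integer, and the pattern only has to match approximately somewhere inside the target string. The sample vector is hypothetical and only reuses names from the list above.

```r
# Hypothetical sample, reusing a few names from the list above
candidates <- c("ABB Europe", "ACCBank", "ACEA S.p.A.")

# 0.01 * nchar("ABC") = 0.03, which is rounded up to 1 allowed edit,
# so any string containing something one edit away from "ABC" matches
loose <- agrepl("ABC", candidates, max.distance = 0.01, ignore.case = FALSE)

# With max.distance = 0 no edits are allowed, so only strings that
# actually contain "ABC" would match
strict <- agrepl("ABC", candidates, max.distance = 0, ignore.case = FALSE)
```

For a three-character pattern, one allowed edit is already a third of the string, which is why lowering the fraction further does not help.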

Thanks in advance

There is 1 solution below


For a similar problem, I used the second method described in this article: http://bigdata-doctor.com/fuzzy-string-matching-survival-skill-tackle-unstructured-information-r/#comment-942

It matches each record with its most similar one, which of course is not optimal if false positives are a problem for you.
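As a rough sketch of that idea (not necessarily the article's exact code), the snippet below pairs each name with its closest counterpart using adist() from base R. The two vectors and the names `names_a`, `names_b` and `result` are made up for illustration.

```r
# Hypothetical sample data, not taken from the question
names_a <- c("ABB Europe", "ACC Bank", "ABN AMRO Bank NV")
names_b <- c("ABB Group", "ACCBank", "ABN AMRO Bank", "ABAXIS Europe GmbH")

# Generalized Levenshtein distance between every pair of names
d <- adist(names_a, names_b, ignore.case = TRUE)

# For each name in names_a, take the index of the closest name in names_b
best <- apply(d, 1, which.min)

result <- data.frame(
  name     = names_a,
  match    = names_b[best],
  distance = d[cbind(seq_along(names_a), best)],
  stringsAsFactors = FALSE
)
```

Because every name gets a partner regardless of how far away it is, filtering `result` on `distance` (possibly relative to the name length) is one way to limit false positives.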

Additionally, you may find this function useful for removing whitespace before and after the names:

  trim <- function (x) gsub("^\\s+|\\s+$", "", x) #Defining function that returns string w/o leading or trailing whitespace

I also used the removeWords() function from the "tm" package. In your case, removing "ABC " may be useful.
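A small sketch of that cleanup step in base R, so it runs without installing "tm". The input vector is hypothetical and assumes the names share a leading "ABC" token; with "tm" loaded, removeWords(raw, "ABC") could take the place of the sub() call.

```r
# Hypothetical input: names that share a leading "ABC" token
raw <- c("ABC ABB Europe ", " ABC ACCBank", "ABC ABN AMRO Bank NV")

# Drop the leading "ABC" token together with the whitespace after it
no_prefix <- sub("^\\s*ABC\\s+", "", raw)

# Strip any remaining leading or trailing whitespace (base R, R >= 3.2.0)
clean <- trimws(no_prefix)
```

Running the matching on `clean` instead of `raw` keeps the shared token from dominating the distance between short names.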