I'm matching a list of company names against itself with R and agrep() because the data was stored badly in a legacy system: it was not in 4th normal form, and companies were recorded on the same level as customers. That means a new company entry was created for every new customer, which leads to many different company names for one company. The matching works fine in a lot of cases.
Sometimes, especially for short strings, I get matches that look weird, at least to me. For example (ABC is the first company name):
ABC ABAXIS Europe GmbH
ABC ABB Europe
ABC ABB Group
ABC ABB Stotz Kontakt GmbH
ABC ABM Financial News
ABC ABN AMRO Bank NV
ABC AC Klöser GmbH
ABC ACCBank
ABC ACEA S.p.A.
I'm using agrep() with the following parameters:
agrep(vector1, vector2, value = TRUE, ignore.case = FALSE, max.distance = 0.01)
Is there any way other than max.distance to tweak agrep(), or a better way to do this?
Thanks in advance
For a similar problem, I used the second method described in this article: http://bigdata-doctor.com/fuzzy-string-matching-survival-skill-tackle-unstructured-information-r/#comment-942
It matches each record with the one most similar to it, which of course is not ideal if having some false positives is a problem for you.
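To illustrate the idea of pairing each name with its closest counterpart, here is a minimal sketch in base R using adist(). It is not necessarily the article's exact code; the sample names and the variable names (companies, nearest, result) are made up for illustration.

```r
# Hypothetical sample; in practice this would be the company-name column.
companies <- c("ABB Europe", "ABB Europ", "ABB Group",
               "ABN AMRO Bank NV", "ABN Amro Bank")

# Pairwise generalized Levenshtein distances between whole strings.
d <- adist(companies, ignore.case = TRUE)
diag(d) <- NA  # set the diagonal to NA so a name is never matched to itself

# For each row, the index of the closest other name (which.min skips NA).
nearest <- apply(d, 1, which.min)

result <- data.frame(
  name       = companies,
  best_match = companies[nearest],
  distance   = d[cbind(seq_along(companies), nearest)]
)
print(result)
```

Unlike agrep(), which looks for an approximate match of the pattern inside each string, adist() with its default settings compares whole strings, so a short name is less likely to match just because it resembles the start of a longer one. Since every name gets a best match, you would probably still want a cutoff on the distance column to drop pairs that are too far apart.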
Additionally, you may find this function useful for removing white space before and after the names:
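A minimal sketch of such a helper, assuming a simple regex-based trim (the name trim is only illustrative; in R 3.2.0 and later, base trimws() does the same job):

```r
# Remove leading and trailing white space from each element of x.
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

trim("  ABB Europe  ")
```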
I also used the removeWords() function from the "tm" package. In your case, removing "ABC" may be useful.
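As a rough sketch of how that could look, assuming the tm package is installed (the sample vector and the names company_names and cleaned are made up for illustration):

```r
library(tm)  # provides removeWords()

# Hypothetical sample taken from the matches listed in the question.
company_names <- c("ABC ABB Europe", "ABC ABB Group ")

# Drop the unwanted token, then trim the white space it leaves behind.
cleaned <- trimws(removeWords(company_names, "ABC"))
print(cleaned)
```

The second argument of removeWords() is a character vector, so several tokens can be removed in one call if other words turn out to get in the way of the matching.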