I read large lists of county names that people wrote down manually, and these now need to be matched to a predefined list of counties.
I understand that those matches cannot be perfect, but the operator should be shown a list of "wrong" county names together with their best matches, and ideally should only need to click "ok" to accept them and proceed.
So far I have only used Levenshtein distance, which is great for catching typos but not for abbreviations. Real-world examples that would be obvious to a human, but which Levenshtein does not match correctly:
- Input: Siegen
- Should be: Siegen-Wittgenstein
- Levenshtein: Hagen
or
- Input: Rhein.-Berg. Kreis
- Should be: Rheinisch-Bergischer Kreis
- Levenshtein: Rhein-Sieg-Kreis
How can I catch those abbreviations as well?
I use PHP, but this is more a question about the right algorithm than about PHP.
As you're dealing with geographic data, perhaps a geocoding approach could work? An example API would be Nominatim, but there are many other (paid) options.
If I use your two examples in the debugging interface of Nominatim I get the following:
Example 1:
Example 2:
The quality is mixed, so it definitely requires additional filtering. Without any filtering you get various result types (administrative regions, a mountain peak, a charity) and also various countries (Germany and France).
However, you can apply filters on the detailed output to return only county names in a specific country to get the output you want.
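As a rough illustration of that filtering step, here is a minimal Python sketch (the same logic ports directly to PHP). It assumes you query the Nominatim `/search` endpoint with `format=jsonv2`, `addressdetails=1` and `countrycodes=de`, and that each hit carries `category`, `type` and an `address` object with `county` and `country_code` keys. The function names and the sample hits are made up for illustration, not real API output, so check the field names against the responses you actually receive. The public Nominatim server also expects an identifying User-Agent and a low request rate.

```python
from urllib.parse import urlencode

# Public Nominatim search endpoint; swap in your own instance if you host one.
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(query, country_code="de", limit=10):
    """Build a search URL that asks for address details and restricts the country."""
    params = {
        "q": query,
        "format": "jsonv2",
        "addressdetails": 1,
        "countrycodes": country_code,
        "limit": limit,
    }
    return NOMINATIM_SEARCH + "?" + urlencode(params)


def filter_counties(hits, country_code="de"):
    """Return distinct county names from administrative-boundary hits only."""
    names = []
    for hit in hits:
        # "category" is used by jsonv2; the older json format calls it "class".
        category = hit.get("category", hit.get("class"))
        address = hit.get("address", {})
        if category != "boundary" or hit.get("type") != "administrative":
            continue  # drops peaks, charities and other non-boundary results
        if address.get("country_code") != country_code:
            continue  # drops hits from other countries
        county = address.get("county")
        if county and county not in names:
            names.append(county)
    return names


# Hypothetical, hand-written hits that mimic the mixed results described above.
sample_hits = [
    {"category": "boundary", "type": "administrative",
     "address": {"county": "Siegen-Wittgenstein", "country_code": "de"}},
    {"category": "natural", "type": "peak",
     "address": {"county": "Siegen-Wittgenstein", "country_code": "de"}},
    {"category": "boundary", "type": "administrative",
     "address": {"county": "Moselle", "country_code": "fr"}},
    {"category": "office", "type": "charity",
     "address": {"country_code": "de"}},
]

print(filter_counties(sample_hits))
```

The county strings that come back may still differ slightly from your predefined list (for example a leading "Kreis"), so a final Levenshtein pass against that list can be used to pick the entry to show to the operator.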