Match county names to predefined list

Question

Asked by J Fabian Meier on 03 December 2023 at 14:32

I read large lists of county names that people wrote down manually and that now need to be matched to a predefined list of counties.

I understand that those matches cannot be perfect, but the operator should be shown a list of "wrong" county names together with their best matches, and ideally just needs to click "ok" to accept them and proceed.

Until now, I have used only Levenshtein distance, which is great for catching typos but not for abbreviations. Real-world examples that would be obvious to a human, but which Levenshtein does not match correctly:

Input: Siegen
Should be: Siegen-Wittgenstein
Levenshtein: Hagen

or

Input: Rhein.-Berg. Kreis
Should be: Rheinisch-Bergischer Kreis
Levenshtein: Rhein-Sieg-Kreis

How can I catch those abbreviations as well?

I use PHP, but this is more a question about the right algorithm than about PHP.

**Saaru Lindestøkke** · Answer 1 · 2023-12-03T14:46:02.057000

As you're dealing with geographic data, perhaps a geocoding approach could work? An example API would be Nominatim, but there are many other (paid) options.

If I use your two examples in the debugging interface of Nominatim I get the following:

Example 1:

Input: Siegen
Expected output: Siegen-Wittgenstein
Nominatim outputs:
- Siegen, Kreis Siegen-Wittgenstein, North Rhine-Westphalia, Germany
- Siegen, Haguenau-Wissembourg, Bas-Rhin, Grand Est, Metropolitan France, 67160, France
- Siegen, Kreis Siegen-Wittgenstein, North Rhine-Westphalia, 57072, Germany

Example 2:

Input: Rhein.-Berg. Kreis
Expected output: Rheinisch-Bergischer Kreis
Nominatim outputs:
- Geschäftsstelle DRK Kreisverband Rhein.-Berg. Kreis, 261, Hauptstraße, Heidkamp, Bergisch Gladbach, Rheinisch-Bergischer Kreis, North Rhine-Westphalia, 51465, Germany
- Auf dem Kreis, Bad Berleburg, Kreis Siegen-Wittgenstein, North Rhine-Westphalia, Germany
- Region Rhein-Neckar (HE), Kreis Bergstraße, Hesse, Germany

The quality is mixed, so additional filtering is definitely required. Without it, you get various result types (administrative regions, a mountain peak, a charity) and results from several countries (Germany and France).

However, you can apply filters on the detailed output to return only county names in a specific country to get the output you want.
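
The filtering described above could look roughly like the following Python sketch (the asker uses PHP, but the idea carries over). It is a sketch under assumptions, not a tested integration: it assumes the Nominatim search endpoint is called with `addressdetails=1` and `countrycodes`, and that each result then carries an `address` object with a `county` field. The helper names (`build_search_url`, `fetch_results`, `normalize`, `county_candidates`) and the prefix-stripping rule are hypothetical and chosen only for illustration.

```python
import json
import urllib.parse
import urllib.request

NOMINATIM_SEARCH_URL = "https://nominatim.openstreetmap.org/search"


def build_search_url(query, country_code="de", limit=10):
    """Build a search URL restricted to one country, with address details."""
    params = {
        "q": query,
        "format": "jsonv2",
        "addressdetails": 1,  # ask for a structured address per result
        "countrycodes": country_code,
        "limit": limit,
    }
    return NOMINATIM_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_results(query, country_code="de"):
    """Query Nominatim and return the parsed JSON list of results."""
    request = urllib.request.Request(
        build_search_url(query, country_code),
        # Hypothetical client name; the public server expects an identifying User-Agent.
        headers={"User-Agent": "county-matcher-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def normalize(name):
    """Lowercase a county name and drop a leading 'Kreis ' or 'Landkreis '.

    Assumption: the geocoder may report 'Kreis Siegen-Wittgenstein' where the
    predefined list says 'Siegen-Wittgenstein'.
    """
    name = name.strip().lower()
    for prefix in ("landkreis ", "kreis "):
        if name.startswith(prefix):
            return name[len(prefix):]
    return name


def county_candidates(results, known_counties):
    """Return predefined county names found in the results, in result order."""
    known = {normalize(county): county for county in known_counties}
    candidates = []
    for result in results:
        county = result.get("address", {}).get("county")
        if county is None:
            continue  # e.g. a result without county information
        match = known.get(normalize(county))
        if match is not None and match not in candidates:
            candidates.append(match)
    return candidates
```

A call such as `county_candidates(fetch_results("Siegen"), predefined_list)` would then give the operator a short list of suggestions to confirm. Inputs for which the list comes back empty can still fall back to Levenshtein distance. When using the public Nominatim server, keep the request rate low (its usage policy allows at most one request per second).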
