How to standardize city names inserted by user

298 Views Asked by At

I need to write a small ETL pipeline because I need to move some data from a source database to a target database (a datawarehouse) to perform some analysis on data.

Among those data, I need to clean and conform the name of cities. Cities are inserted manually by international users, conseguently for a single city I can have multiple names (for example London or Londra). In my source database I do not have only big cities but I have also small villages.

Well, if I do not standardize city names, our analysis could be nonsensical.

Which is the best practices to standardize cities in my target database? Have any idea or suggestion I can undertake?

Thank you

1

There are 1 best solutions below

2
On

The only reliable way to do this is to use commercial address validation software - preferably in your source system when the data is being created but it could be integrated into your data pipeline processes.

Assuming you can't afford/justify the use of commercial software, the only other solution is to create your own translation table i.e. a table that holds the values that are entered and what value you want them to be translated to.

While you can build this table based on historic data, there will always be new values that are not in the table, so you would need a process to identify these, add the new record to your translation data and then fix the affected records. You would also need to accept that there would be un-cleansed data in your warehouse for a period of time after each data load