Split address text into components using Machine Learning


I have a CSV file with each row representing the different components of an address, such as City, Street, House No, etc., and then a column with a combined address in one line, with a predefined format, e.g. Street House No, Zip Code, City.

What I want is to identify the different components in user-entered address text. For example, I would like to know whether the user has entered all the components or just the street name and the city, and then what the values of those components are.

Can I achieve this with a Machine Learning technique, so that I use my CSV file to teach the model how an address text is split into its components, and then expect it to return the components of new input based on that training?

Some more information: I'm implementing this in .NET, so a solution based on ML.NET, or something that can be easily integrated with .NET, would be preferable.

We can also look at this problem independently of the address-parsing context. Shouldn't we be able to teach a model how a sentence is composed of different parts in a given context, and then expect the model to suggest the parts of a new sentence?


Daniel Perez Efremova

Before developing a custom model yourself, I suggest you give the libpostal project a try.

(I am going to assume that you are developing in Python)

It has several interesting features already built, such as:

  • international address parsing
  • normalization
  • address detection

Here is the example from the pylibpostal documentation:

>>> from pylibpostal.expand import expand_address
>>> expand_address('Quatre vingt douze Ave des Champs-Élysées')
['92 avenue des champs-elysees',
 '92 avenue des champs elysees',
 '92 avenue des champselysees']
>>> from pylibpostal.parser import parse_address
>>> parse_address('The Book Club 100-106 Leonard St, Shoreditch, London,EC2A 4RH, UK')
[('the book club', 'house'),
 ('100-106', 'house_number'),
 ('leonard st', 'road'),
 ('shoreditch', 'suburb'),
 ('london', 'city'),
 ('ec2a 4rh', 'postcode'),
 ('uk', 'country')]
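
To answer the "which components did the user enter" part of the question, the (value, label) pairs returned by the parser can be folded into a dictionary and compared against the components you expect. This is only a minimal sketch: the `summarize_components` helper and the `EXPECTED` tuple are names I made up for illustration, and the parser output is hard-coded from the example above so the snippet runs without libpostal installed.

```python
# Sample (value, label) pairs copied from the parse_address output above,
# hard-coded so this sketch runs without libpostal installed.
parsed = [
    ("the book club", "house"),
    ("100-106", "house_number"),
    ("leonard st", "road"),
    ("shoreditch", "suburb"),
    ("london", "city"),
    ("ec2a 4rh", "postcode"),
    ("uk", "country"),
]

# Hypothetical list of components the application cares about.
EXPECTED = ("road", "house_number", "postcode", "city", "state")


def summarize_components(pairs, expected=EXPECTED):
    """Return (found, missing): found maps label -> value, missing lists absent labels."""
    found = {}
    for value, label in pairs:
        # Keep the first value seen for each label.
        found.setdefault(label, value)
    missing = [label for label in expected if label not in found]
    return found, missing


found, missing = summarize_components(parsed)
print(found["road"], found["city"])
print("missing:", missing)
```

With the sample above, `missing` contains only the labels from `EXPECTED` that the parser did not emit, which is the "did the user enter everything" check the question asks for.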

However, libpostal is a C library, so it is not trivial to install or to integrate with popular languages such as Python: you need to install additional dependencies first.

If you limit your scope to US, Canadian, or British addresses, there are simpler alternatives such as pyap (Python Addresses Parser). It is not as general or as powerful as libpostal, but because it is based on regular expressions it is faster and easier to install and maintain.

import pyap
test_address = """
    Lorem ipsum
    225 E. John Carpenter Freeway,
    Suite 1500 Irving, Texas 75062
    Dorem sit amet
    """
addresses = pyap.parse(test_address, country='US')
for address in addresses:
    print(address)
    print(address.as_dict())

>> 225 E. John Carpenter Freeway, Suite 1500 Irving, Texas 75062
>> {
  "full_address": "225 E. John Carpenter Freeway, Suite 1500 Irving, Texas 75062",
  "full_street": "225 E. John Carpenter Freeway, Suite 1500",
  "street_number": "225",
  "street_name": "E. John Carpenter",
  "street_type": "Freeway",
  "route_id": null,
  "post_direction": null,
  "floor": null,
  "building_id": null,
  "occupancy": "Suite 1500",
  "city": "Irving",
  "region1": "Texas",
  "postal_code": "75062",
  "country_id": "US",
  "match_start": 15,
  "match_end": 76
}
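
If your addresses really do follow the single predefined format from the question (Street House No, Zip Code, City), the same regex idea can be sketched by hand with Python's standard `re` module. This is not how pyap is implemented. The pattern, the group names, and the `split_address` helper below are my own assumptions for illustration, including the guess that a zip code is 4 to 6 digits.

```python
import re

# Hypothetical pattern for the fixed format "Street HouseNo, ZipCode, City".
# Every group after the street is optional, so partial input still matches.
ADDRESS_RE = re.compile(
    r"""^\s*
    (?P<street>[^,\d]+?)                 # street name: no commas or digits
    (?:\s+(?P<house_no>\d+\w*))?         # optional house number, e.g. 12 or 12a
    (?:\s*,\s*(?P<zip_code>\d{4,6}))?    # optional zip code (assumed 4 to 6 digits)
    (?:\s*,\s*(?P<city>[^,\d]+?))?       # optional city
    \s*$""",
    re.VERBOSE,
)


def split_address(text):
    """Return a dict of the components that were present, or {} if nothing matched."""
    match = ADDRESS_RE.match(text)
    if match is None:
        return {}
    # Drop the groups that did not participate in the match.
    return {name: value for name, value in match.groupdict().items() if value is not None}


print(split_address("Main Street 12, 12345, Springfield"))
print(split_address("Main Street, Springfield"))
```

The keys of the returned dictionary tell you which components the user entered. A hand-written pattern like this breaks quickly once the input deviates from the fixed format, which is where libpostal or a trained model becomes worth the extra setup.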