Python: How to extract addresses from a sentence/paragraph (non-Regex approach)?


I am working on a project that requires extracting addresses from free-form sentences.

For example, given the input sentence: Hi, Mr. Sam D. Richards lives here Shop No / 123, 3rd Floor, ABC Building, Behind CDE Mart, Aloha Road, 12345. If you need any help, call me on 12345678

I want to extract just the address, i.e. Shop No / 123, 3rd Floor, ABC Building, Behind CDE Mart, Aloha Road, 12345

What I have tried so far:

I tried pyap, but it is also regex-based, so it does not generalize to addresses from countries other than the US, Canada, and the UK. Regex seems like a poor fit in general, because neither the addresses nor the surrounding sentences follow a fixed pattern. I also tried locationtagger, which only manages to return the country or the city.

Is there a better way to do this?


There are 2 best solutions below

David Dale On BEST ANSWER

If there is no obvious pattern for a regex to capture, you can try an ML-based approach. This is an instance of a well-known problem, named entity recognition (NER), which is typically solved as sequence tagging: a model is trained to predict, for each token (e.g. a word or a subword), whether it is part of an address or not.

You can look for a model that is already trained to extract addresses (e.g. here https://huggingface.co/models?search=address), or fine-tune a BERT-based model on your own dataset (here is a recipe).
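
To make the sequence-tagging idea concrete, here is a minimal sketch of the decoding step only: turning per-token labels into address strings. The B-ADDR / I-ADDR / O label names, the whitespace tokenization, and the hand-written tags are all assumptions for illustration; in practice the tags would come from a trained model and its own tokenizer.

```python
def extract_addresses(tokens, tags):
    """Group tokens tagged B-ADDR / I-ADDR into address strings.

    The label names are hypothetical; use whatever label set your model emits.
    """
    addresses, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-ADDR":
            # A new address starts; flush any address still in progress.
            if current:
                addresses.append(" ".join(current))
            current = [token]
        elif tag == "I-ADDR" and current:
            # Continuation of the address currently being built.
            current.append(token)
        else:
            # "O" (outside), or a stray I-ADDR with no preceding B-ADDR.
            if current:
                addresses.append(" ".join(current))
            current = []
    if current:
        addresses.append(" ".join(current))
    return addresses


# Hand-written tags standing in for model predictions.
tokens = "Richards lives here Shop No / 123, Aloha Road, 12345. Call me".split()
tags = ["O", "O", "O"] + ["B-ADDR"] + ["I-ADDR"] * 6 + ["O", "O"]
print(extract_addresses(tokens, tags))
```

With whitespace tokenization, punctuation stays attached to the neighbouring token, so a real pipeline would likely need a finer tokenizer or some post-processing of the extracted span.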

habrewning On

Addresses have a well-known structure, so it should be possible to parse them with a grammar parser. pyparsing has a scanning feature (scanString) that searches the text for matches of a pattern without having to parse everything around them. Here is an example that detects the three addresses in the example string.

#!/usr/bin/env python3

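from pyparsing import (Combine, FollowedBy, Literal, OneOrMore, Optional,
                       White, Word, ZeroOrMore, alphanums, alphas, nums)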

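# German grammar: <name>, <street> <house number>, <postcode> <town>
# A word starts with an uppercase letter; the body also allows lowercase umlauts and ß.
GermanWord = Word("ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ", alphas + "äöüß")
# One or more words, optionally joined by hyphens.
GermanWordComposition = GermanWord + ZeroOrMore(Optional(Literal("-")) + GermanWord)
GermanName = GermanWordComposition
GermanStreet = GermanWordComposition
# House number with an optional single-letter suffix. The lookahead for whitespace
# keeps the first letter of a following word such as "in" from being consumed.
GermanHouseNumber = Word(nums) + Optional(Word(alphas, exact=1) + FollowedBy(White()))
GermanAddressSeparator = Literal(",") | Literal("in")
GermanPostCode = Word(nums, exact=5)
GermanTown = GermanWordComposition

German_Address = GermanName + GermanAddressSeparator + GermanStreet + GermanHouseNumber \
    + GermanAddressSeparator + GermanPostCode + GermanTown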


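# English-style grammar, tailored to the example address from the question.
# A word starts with an uppercase letter, followed by letters or digits.
EnglishWord = Word("ABCDEFGHIJKLMNOPQRSTUVWXYZ", alphanums)
EnglishNumber = Word(nums)
EnglishComposition = OneOrMore(EnglishWord)
# Unit suffix introduced by "-" or "/", e.g. "/ 123".
EnglishExtension = Word("-/", exact=1) + (EnglishComposition | EnglishNumber)
EnglishAddressSeparator = Literal(",")
# "1st Floor", "2nd Floor", "3rd Floor" or "<n>th Floor".
EnglishFloor = (Literal("1st") | Literal("2nd") | Literal("3rd") | Combine(EnglishNumber + Literal("th"))) + Literal("Floor")
EnglishWhere = EnglishComposition
EnglishStreet = EnglishComposition


# <unit>[, <floor>][, <building>][, <landmark>], <street>, <postcode>
# The separator sits inside the optional floor part, so the floor is truly optional.
EnglishAddress = (
    EnglishComposition + Optional(EnglishExtension)
    + Optional(EnglishAddressSeparator + EnglishFloor)
    + Optional(EnglishAddressSeparator + EnglishWhere)
    + Optional(EnglishAddressSeparator + EnglishWhere)
    + EnglishAddressSeparator + EnglishStreet
    + EnglishAddressSeparator + EnglishNumber
)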

Address = EnglishAddress | German_Address


test_1 = (
    "I am writing to Peter Meyer, Moritzstraße 22, 54543 Musterdorf a letter. "
    "But the letter arrived at Hubert Figge, Große Straße 14 in 45434 Berlin. "
    "In the letter was written: Hi, Mr. Sam D. Richards lives here "
    "Shop No / 123, 3rd Floor, ABC Building, Behind CDE Mart, Aloha Road, 12345. "
    "If you need any help, call me on 12345678."
)

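# scanString yields a (tokens, start, end) tuple for every match found in the text.
for tokens, start, end in Address.scanString(test_1):
    print(test_1[start:end])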