How to combine the Output of Regex Findall in Pandas

Question

321 Views Asked by pljvp At 19 February 2020 at 19:31

I'm exploring regex with pandas in a Jupyter notebook. My goal is to extract house number additions from an address line, using a set of regex patterns.

I'm building on this gist: https://gist.github.com/christiaanwesterbeek/c574beaf73adcfd74997 and I use the following input from a .csv file:

Afleveradres
Dorpstraat 2
Dorpstr. 2
Dorpstraat 2
Laan 1933 2
18 Septemberplein 12
Kerkstraat 42-f3
Kerk straat 2b
42nd street, 1337a
1e Constantijn Huigensstraat 9b
Maas-Waalweg 15
De Dompelaar 1 B
Kümmersbrucker Straße 2
Friedrichstädter Straße 42-46
Höhenstraße 5A  
Saturnusstraat 60-75
Saturnusstraat 60 - 75
Plein \'40-\'45 10
Plein 1945 1
Steenkade t/o 56
Steenkade a/b Twee Gezusters
1, rue de l\'eglise
Herestraat 49 BOX1043
Maas-Waalweg 15 15

My goal is to extract the street names, house numbers and house number additions.

So far I basically use:

import pandas as pd

# get data
file_base_name = 'examples'
dfa = pd.read_csv(file_base_name + '.csv', sep=';')

# get number: the first run of digits preceded by a comma or whitespace
dfa['num'] = dfa['Afleveradres'].str.extract(r"([,\s]+\d+)\s*", expand=False)
dfa['num'] = dfa['num'].str.strip()

# split leftover values into street & addition
# (regex=True is required explicitly in pandas 2.0 and later)
dfa['tmp'] = dfa['Afleveradres'].str.replace(r"([,\s]+\d+)\s*", ';', regex=True)

# new data frame with the split value columns
new = dfa['tmp'].str.split(';', n=1, expand=True)

# street name: everything before the first separator
dfa['str'] = new[0]

# addition: everything after the first separator
dfa['add'] = new[1]
dfa.drop(['tmp'], axis=1, inplace=True)

which results in the following listing of street names, numbers and additions:

;Afleveradres;str;add;num
0;Dorpstraat 2;Dorpstraat;;2
1;Dorpstr. 2;Dorpstr.;;2
2;Dorpstraat 2;Dorpstraat;;2
3;Laan 1933 2;Laan;2;1933
4;18 Septemberplein 12;18 Septemberplein;;12
5;Kerkstraat 42-f3;Kerkstraat;-f3;42
6;Kerk straat 2b;Kerk straat;b;2
7;42nd street, 1337a;42nd street;a;, 1337
8;1e Constantijn Huigensstraat 9b;1e Constantijn Huigensstraat;b;9
9;Maas-Waalweg 15;Maas-Waalweg;;15
10;De Dompelaar 1 B;De Dompelaar;B;1

So far so good, for now. Next, I'd like to correct for house number ranges, like '42-46' and '60 - 75'.

A re.findall call returns the expected values:

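import re

def rem(text):
    # Flag strings that start with one of the listed punctuation characters
    # (re.match only tests the beginning of the string).
    pattern = r'[,@\'?\.$%_]'
    if re.match(pattern, text):
        tmp = 'Y'
    else:
        tmp = 'N'
    return tmp

def extract_numrange(row):
    # Find house number ranges such as "42-46" or "60 - 75" in the address.
    r = row['Afleveradres']
    num_range1 = re.findall(r'([,\s]+\d+\-+\d+)\s*|([,\s]+\d+\s+\-+\s+\d+)\s*', r)

    return num_range1
    # return rem(num_range1)

dfa['excep'] = dfa.apply(extract_numrange, axis=1)
dfa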

Output of re.findall (last column):

    Afleveradres                   str                      add  num  excep
15  Friedrichstädter Straße 42-46  Friedrichstädter Straße  -46  42   [( 42-46, )]
16  Höhenstraße 5A                 Höhenstraße              A    5    []
17  Saturnusstraat 60-75           Saturnusstraat           -75  60   [( 60-75, )]
18  Saturnusstraat 60 - 75         Saturnusstraat           -;   60   [(, 60 - 75)]

But how do I clean this output, from [( 42-46, )] and [(, 60 - 75)] into something like 42-46 and 60 - 75 in a new column?

Or are there better approaches for my question?

**Wiktor Stribiżew** · Accepted Answer · 2020-02-19T20:44:49.130000

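The problem comes from the fact that there are two capturing groups: when a pattern contains more than one capturing group, re.findall returns a tuple per match, with an empty string for each group that did not participate. You need to revamp the pattern to use only a single capturing group, or get rid of the group altogether.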

Your pattern is of the (Group1)\s*|(Group2)\s* type. As you see, all you need is to re-group the parts into (Group1|Group2)\s*.

So, the quickest fix is

([,\s]+\d+\-+\d+|[,\s]+\d+\s+\-+\s+\d+)\s*

See the regex demo.
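
To illustrate the difference, here is a small sketch that runs the original two-group pattern and the regrouped single-group pattern over a few addresses taken from the question. The variable names (addresses, two_groups, one_group) are made up for this example.

```python
import re

# Sample addresses copied from the question's input.
addresses = [
    "Friedrichstädter Straße 42-46",
    "Saturnusstraat 60 - 75",
    "Höhenstraße 5A",
]

# Original pattern: two capturing groups, so findall yields tuples.
two_groups = r'([,\s]+\d+\-+\d+)\s*|([,\s]+\d+\s+\-+\s+\d+)\s*'
# Regrouped pattern: one capturing group, so findall yields plain strings.
one_group = r'([,\s]+\d+\-+\d+|[,\s]+\d+\s+\-+\s+\d+)\s*'

tuples = [re.findall(two_groups, a) for a in addresses]
strings = [re.findall(one_group, a) for a in addresses]

print(tuples)   # [[(' 42-46', '')], [('', ' 60 - 75')], []]
print(strings)  # [[' 42-46'], [' 60 - 75'], []]
```

Note that the single-group result still carries the leading space, because the delimiter is inside the group.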

However, I think you do not need the whitespace on both ends. If so, move the parts you do not want to capture out of the group:

[,\s]+(\d+\-+\d+|\d+\s+\-+\s+\d+)\s*
^^^^^^

See this regex demo.
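
As a quick check, the sketch below applies this pattern with re.findall. The sample list and the name cleaned are only for illustration; the third address is included to show a row that should not produce a range.

```python
import re

# The delimiter [,\s]+ is matched but sits outside the capturing group,
# so it is not part of the returned value.
pattern = r'[,\s]+(\d+\-+\d+|\d+\s+\-+\s+\d+)\s*'

samples = [
    "Friedrichstädter Straße 42-46",
    "Saturnusstraat 60 - 75",
    "Plein '40-'45 10",
]

# Map each address to the list of ranges found in it.
cleaned = {s: re.findall(pattern, s) for s in samples}

for address, ranges in cleaned.items():
    print(address, "->", ranges)
```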

You can probably reduce this even further to

[,\s](\d+(?:-+|\s+-+\s+)\d+)

See this regex demo. Here, (?:-+|\s+-+\s+) is a non-capturing group, so it does not add an extra item to the result: with only one capturing group left, re.findall returns a plain list of strings.
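
To get the range into a new column without apply, one option is Series.str.extract, which returns the first match of the capturing group per row. This is a sketch under assumptions: the small frame below stands in for the CSV from the question, and the column name excep is reused from it. Rows without a range come back as NaN.

```python
import pandas as pd

# Stand-in for the CSV data from the question (rows 15 to 18).
dfa = pd.DataFrame({"Afleveradres": [
    "Friedrichstädter Straße 42-46",
    "Höhenstraße 5A",
    "Saturnusstraat 60-75",
    "Saturnusstraat 60 - 75",
]})

# One capturing group; the hyphen alternatives are non-capturing.
pattern = r"[,\s](\d+(?:-+|\s+-+\s+)\d+)"

# expand=False returns a Series for a single-group pattern.
dfa["excep"] = dfa["Afleveradres"].str.extract(pattern, expand=False)

print(dfa)
```

If an address can contain several ranges, Series.str.findall(pattern) keeps all of them as a list per row instead.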
