I'm exploring regex with pandas in a jupyter notebook. My goal is to extract housenumberadditions from an addressline, using a set of regex patterns.
I'm building upon this post: https://gist.github.com/christiaanwesterbeek/c574beaf73adcfd74997 and I use this for input from a .csv:
Afleveradres
Dorpstraat 2
Dorpstr. 2
Dorpstraat 2
Laan 1933 2
18 Septemberplein 12
Kerkstraat 42-f3
Kerk straat 2b
42nd street, 1337a
1e Constantijn Huigensstraat 9b
Maas-Waalweg 15
De Dompelaar 1 B
Kümmersbrucker Straße 2
Friedrichstädter Straße 42-46
Höhenstraße 5A
Saturnusstraat 60-75
Saturnusstraat 60 - 75
Plein \'40-\'45 10
Plein 1945 1
Steenkade t/o 56
Steenkade a/b Twee Gezusters
1, rue de l\'eglise
Herestraat 49 BOX1043
Maas-Waalweg 15 15
My goal is to extract the streetnames, housenumbers & housenumberadditions.
So far I basically use:
# get data
file_base_name = 'examples'
dfa = pd.read_csv(''+file_base_name+'.csv', sep=';')
#get number
dfa['num'] = dfa['Afleveradres'].str.extract(r"([,\s]+\d+)\s*")
dfa['num'] = dfa['num'].str.strip()
# split leftover values into street & addition
dfa['tmp']=dfa.Afleveradres.str.replace(r"([,\s]+\d+)\s*", ';')
# new data frame with split value columns
new = dfa["tmp"].str.split(";", n = 1, expand = True)
# making separate first name column from new data frame
dfa["str"]= new[0]
# making separate last name column from new data frame
dfa["add"]= new[1]
dfa.drop(['tmp'], axis=1, inplace=True)
which results in: listing streenames, numbers & addition:
;Afleveradres;str;add;num
0;Dorpstraat 2;Dorpstraat;;2
1;Dorpstr. 2;Dorpstr.;;2
2;Dorpstraat 2;Dorpstraat;;2
3;Laan 1933 2;Laan;2;1933
4;18 Septemberplein 12;18 Septemberplein;;12
5;Kerkstraat 42-f3;Kerkstraat;-f3;42
6;Kerk straat 2b;Kerk straat;b;2
7;42nd street, 1337a;42nd street;a;, 1337
8;1e Constantijn Huigensstraat 9b;1e Constantijn Huigensstraat;b;9
9;Maas-Waalweg 15;Maas-Waalweg;;15
10;De Dompelaar 1 B;De Dompelaar;B;1
So far so good, for now. Next, I'd like to correct for housenumber ranges, like '42-46' and '60 - 65'.
A re.findall returns expected values:
import re
def rem(str):
pattern = r'[,@\'?\.$%_]'
if re.match(pattern, str):
tmp = 'Y'
else:
tmp = 'N'
return tmp
def extract_numrange(row):
r = ''+row['Afleveradres']
num_range1 = re.findall(r'([,\s]+\d+\-+\d+)\s*|([,\s]+\d+\s+\-+\s+\d+)\s*',r)
return num_range1
# return rem(num_range1)
dfa['excep'] = dfa.apply(extract_numrange, axis=1)
dfa
15 Friedrichstädter Straße 42-46 Friedrichstädter Straße -46 42 [( 42-46, )]
16 Höhenstraße 5A Höhenstraße A 5 []
17 Saturnusstraat 60-75 Saturnusstraat -75 60 [( 60-75, )]
18 Saturnusstraat 60 - 75 Saturnusstraat -; 60 [(, 60 - 75)]
But how do I clean this output, from [( 42-46, )] and [(, 60 - 75)] into something like 42-46 and 60 - 75 in a new column?
Or are there better approaches for my question?
The problem comes from the fact there are two capturing groups. You need to re-vamp the pattern to use only a single capturing group, or get rid of the group altogether.
Your pattern is of the
(Group1)\s*|(Group2)\s*type. As you see, all you need is to re-group the parts into(Group1|Group2)\s*.So, the quickest fix is
See the regex demo.
However, I think you do not need the whitespaces on both ends. Then, move those patterns you do not want to capture out of the grouping:
See this regex demo.
Probably, you may reduce this even further to
See this regex demo, the
(?:-+|\s+-+\s+)is a non-capturing group that won't result in additional tuple item.