Simpler, safer string manipulation in Python


I do a lot of amateur data cleaning and scrubbing with Python; it's a lot faster than using Excel. But I feel like I must be doing everything the hard way. The biggest pain is that I don't know how to safely index into lists or strings without getting errors or littering my code with layer after layer of unreadable try/except.

Here's an example of what I just came up with to extract the city/state combo from Trulia profile URLs. Sometimes they don't give a state, but the patterns are pretty standardized.

import re
import string

checkstr = 'http://www.trulia.com/profile/agent-name-agent-orlando-fl-24408364/'

city = ''
state = ''
citystrs = re.findall(r'-agent-(.*)-\d', checkstr)[0:1]
print citystrs
for citystr in citystrs:
    if '-' in citystr:
        if len(citystr.split('-')[-1]) == 2:
            state = citystr.split('-')[-1].upper().strip()
            city = string.replace(citystr.upper(), state, '')
            city = string.replace(city, '-', ' ').title().strip()
        else:
            city = string.replace(citystr, '-', ' ').title().strip()
    else:
        city = citystr.title().strip()

print city, state

I have no need for multiple "answers," but I use the slice [0:1] and the for loop because I don't want an error to stop my code (I'm doing this ~2 million times) whenever the pattern doesn't match and findall(...)[0] would raise an IndexError.

Can I get a few pointers for the pythonic (and efficient) way to do this more simply?

EDIT 1: I'm not looking for nonconforming strings. I'm hoping to be safe enough to let it run through everything and "do the best it can" (i.e., more conforming > less).

EDIT 2: One very obvious detail left out of the example: cities with multi-word names have interior dashes ('-'), e.g. agent-name-los-angeles-82348233/

There are 3 best solutions below

0
On
  • When you are looking for only the first match it would be clearer to use re.search instead of findall.
  • If multiple matches are possible (as suggested by your use of [0:1]), note that .* is greedy. For example from the string -agent-orlando-fl-24408364-agent-orlando-fl-24408364 your regex captures orlando-fl-24408364-agent-orlando-fl. Use .*? instead.
  • The rpartition string method splits at the last occurrence of the separator and always returns three strings, which makes it easier to deal with corner cases.
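The greedy-versus-lazy point and the rpartition point can be checked with a small sketch like the one below. It is written in Python 3 syntax (the snippets in this thread are Python 2), the test string is the doubled example from the bullet above, and the variable names are only illustrative.

```python
import re

doubled = '-agent-orlando-fl-24408364-agent-orlando-fl-24408364'

# Greedy .* runs on to the last "-<digit>" in the string.
greedy = re.search(r'-agent-(.*)-\d', doubled).group(1)
# Lazy .*? stops at the first "-<digit>".
lazy = re.search(r'-agent-(.*?)-\d', doubled).group(1)

print(greedy)  # orlando-fl-24408364-agent-orlando-fl
print(lazy)    # orlando-fl

# rpartition always returns a 3-tuple, even when the separator is absent,
# so unpacking into three names never raises.
with_state = 'orlando-fl'.rpartition('-')
without_state = 'orlando'.rpartition('-')
print(with_state)     # ('orlando', '-', 'fl')
print(without_state)  # ('', '', 'orlando')
```

When the separator is missing, the whole string lands in the last slot of the tuple, which is why a length check on that slot is enough to tell "state present" from "no state".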

Proposed code:

m = re.search(r'-agent-(.*?)-\d', checkstr)
if m:
    citystr = m.group(1)
    city, _, state = citystr.rpartition('-')
    if len(state) != 2:
        city = citystr
        state = ''
    city = city.replace('-', ' ').title()
    state = state.upper()
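
For a run over millions of URLs, the same logic could be wrapped in a function that always returns a pair, so the calling loop needs no try/except. This is only a sketch in Python 3 syntax: the names AGENT_RE and parse_city_state are hypothetical, and returning empty strings for non-matching input is an assumption about what "do the best it can" should mean.

```python
import re

# Hypothetical module-level name; compiled once and reused for every URL.
AGENT_RE = re.compile(r'-agent-(.*?)-\d')

def parse_city_state(url):
    """Return (city, state); both are '' when the pattern does not match."""
    m = AGENT_RE.search(url)
    if not m:
        return '', ''
    citystr = m.group(1)
    city, _, state = citystr.rpartition('-')
    if len(state) != 2:
        # No two-letter suffix: treat the whole capture as the city.
        city, state = citystr, ''
    return city.replace('-', ' ').title(), state.upper()

base = 'http://www.trulia.com/profile/'
print(parse_city_state(base + 'agent-name-agent-orlando-fl-24408364/'))   # ('Orlando', 'FL')
print(parse_city_state(base + 'agent-name-agent-los-angeles-82348233/'))  # ('Los Angeles', '')
print(parse_city_state(base))                                             # ('', '')
```

One caveat: a city whose final word happens to be two letters long would be read as a state by the length check, so that case would need extra handling (for example, a lookup against known state codes).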
0
On

Why not use slices all the way?

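if '-' in citystr:
    # find() returns the index of the first dash, so this assumes a
    # single-word city followed by the state
    sep_index = citystr.find('-')
    city = citystr[0:sep_index].title()
    state = citystr[sep_index+1:].upper()
else:
    # no dash at all: the whole string is the city
    city = citystr.title()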

Timing with timeit(number=10000), in seconds:

yours : 3.56353430347
mine :  1.04823075931
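
A comparison along these lines could be reproduced with a sketch such as the following (Python 3 syntax). The answer does not say exactly which statements were timed, so the two wrapper functions here are assumptions: split_based is a rough port of the question's branch logic using str methods, and slice_based is the slicing code above. Absolute numbers depend on the machine and Python version, so none are shown.

```python
import timeit

def split_based(citystr):
    # Rough Python 3 port of the question's branch logic
    # (str methods instead of the Python 2 string module).
    state = ''
    if '-' in citystr:
        if len(citystr.split('-')[-1]) == 2:
            state = citystr.split('-')[-1].upper().strip()
            city = citystr.upper().replace(state, '')
            city = city.replace('-', ' ').title().strip()
        else:
            city = citystr.replace('-', ' ').title().strip()
    else:
        city = citystr.title().strip()
    return city, state

def slice_based(citystr):
    # The slicing approach: split at the first dash.
    state = ''
    if '-' in citystr:
        sep_index = citystr.find('-')
        city = citystr[:sep_index].title()
        state = citystr[sep_index + 1:].upper()
    else:
        city = citystr.title()
    return city, state

for func in (split_based, slice_based):
    elapsed = timeit.timeit(lambda: func('orlando-fl'), number=10000)
    print(func.__name__, elapsed)
```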
1
On

Here is how I would do it:

import re

reg = re.compile(r'-agent-(?P<city>[^-]*)(?:-(?P<state>[^-]*))?-\d')    

checkstr = 'http://www.trulia.com/profile/agent-name-agent-orlando-fl-24408364/'

m = reg.search(checkstr)

city = m.group('city').title()
state = m.group('state').upper() if (m.group('state')) else ''

print city, state

If you need to use the pattern several times, you can compile it once and for all with re.compile.

Instead of using .*, which is very permissive and causes backtracking, I use [^-]* (zero or more characters that are not a dash), which stops before the first dash.

The state and the preceding dash are in an optional group: (?:-(?P<state>[^-]*))?. So, even if the string doesn't have the state part, the pattern succeeds.

With this change, re.findall is no longer needed; you can use re.search, which returns a single match. Note that if you are unsure of the string format, you can always add a test to check that there is a match.
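
A minimal sketch of such a test, in Python 3 syntax, might look like this. The function name city_state and the choice to return empty strings when nothing matches are assumptions for illustration, not part of the code above.

```python
import re

reg = re.compile(r'-agent-(?P<city>[^-]*)(?:-(?P<state>[^-]*))?-\d')

def city_state(checkstr):
    m = reg.search(checkstr)
    if m is None:
        # No match: fall back to empty values instead of raising.
        return '', ''
    city = m.group('city').title()
    # The optional group yields None when the state part is absent.
    state = (m.group('state') or '').upper()
    return city, state

base = 'http://www.trulia.com/profile/'
print(city_state(base + 'agent-name-agent-orlando-fl-24408364/'))  # ('Orlando', 'FL')
print(city_state(base + 'agent-name-agent-orlando-24408364/'))     # ('Orlando', '')
print(city_state(base))                                            # ('', '')
```

Because [^-]* cannot cross a dash, hyphenated city names such as the one in the question's EDIT 2 would need a looser pattern than this one.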

To make the code more readable, I use named captures (?P<name>...), so you can retrieve the content of a group with m.group('name'). If you want a small speed gain, you can use numbered groups instead, but the difference isn't very significant.