I do a lot of amateur data cleaning and scrubbing with Python - it's a lot faster than using Excel. But I feel like I must be doing everything the hard way. The biggest pain is that I don't know how to safely get values out of list or string indexes without triggering errors or littering my code with layer after layer of unreadable try/except.
Here's an example of what I just came up with to pull the city/state combo out of Trulia profile URLs. Sometimes they don't give a state, but the patterns are pretty standardized.
import re
import string

checkstr = 'http://www.trulia.com/profile/agent-name-agent-orlando-fl-24408364/'
state = ''
citystrs = re.findall(r'-agent-(.*)-\d', checkstr)[0:1]
print citystrs
for citystr in citystrs:
    if '-' in citystr:
        if len(citystr.split('-')[-1]) == 2:
            # A trailing two-letter token is assumed to be the state code.
            state = citystr.split('-')[-1].upper().strip()
            city = string.replace(citystr.upper(), state, '')
            city = string.replace(city, '-', ' ').title().strip()
        else:
            city = string.replace(citystr, '-', ' ').title().strip()
    else:
        city = citystr.title().strip()
print city, state
I have no need for multiple "answers," but I use the slice [0:1] and the for loop because I don't want an IndexError from findall(...)[0] to stop my code (I'm doing this ~2 million times) whenever the pattern doesn't fit.
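To illustrate what I mean (the URL below is a made-up nonconforming one):

import re

bad = 'http://www.trulia.com/profile/oddly-formatted-url/'
print re.findall(r'-agent-(.*)-\d', bad)        # [] - no match, no error
print re.findall(r'-agent-(.*)-\d', bad)[0:1]   # [] - the loop body simply never runs
# re.findall(r'-agent-(.*)-\d', bad)[0]         # this would raise IndexError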
Can I get a few pointers on the Pythonic (and efficient) way to do this more simply?
EDIT 1: I'm not trying to catch nonconforming strings. I'm hoping to be safe enough to let it run through everything and "do the best it can" (i.e., handling more conforming strings is better than fewer).
EDIT 2: One very obvious detail I left out of the example: multi-word city names have interior dashes ('-'), e.g. agent-name-los-angeles-82348233/. See the sketch below.
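For instance, with my current approach (the full URL here is a made-up one following the same pattern):

import re

checkstr = 'http://www.trulia.com/profile/agent-name-agent-los-angeles-82348233/'
citystr = re.findall(r'-agent-(.*)-\d', checkstr)[0]
print citystr                      # 'los-angeles'
print len(citystr.split('-')[-1])  # 7, so 'angeles' is not mistaken for a state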
Use re.search instead of findall (then you don't need the [0:1] slice). Also note that .* is greedy: for example, from the string -agent-orlando-fl-24408364-agent-orlando-fl-24408364 your regex captures orlando-fl-24408364-agent-orlando-fl. Use .*? instead. The rpartition string method splits at the last occurrence of the separator and always returns three strings, which makes it easier to deal with corner cases. Proposed code:
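A minimal sketch of that approach, combining re.search, a non-greedy .*?, and rpartition (the helper name parse_city_state is mine, not from the original answer):

import re

def parse_city_state(url):
    """Best-effort extraction of (city, state) from a Trulia profile URL."""
    match = re.search(r'-agent-(.*?)-\d', url)
    if match is None:
        return '', ''  # nonconforming URL: degrade gracefully instead of raising
    citystr = match.group(1)
    head, sep, tail = citystr.rpartition('-')
    if sep and len(tail) == 2:
        # A trailing two-letter token is taken to be the state code.
        city, state = head, tail.upper()
    else:
        city, state = citystr, ''
    return city.replace('-', ' ').title(), state

print parse_city_state('http://www.trulia.com/profile/agent-name-agent-orlando-fl-24408364/')
# ('Orlando', 'FL')
print parse_city_state('http://www.trulia.com/profile/agent-name-agent-los-angeles-82348233/')
# ('Los Angeles', '')

Note that rpartition never raises: with no dash present it returns ('', '', citystr), so a single-word city with no state falls through to the else branch without any special handling.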