I'm processing strings with regexes across a bunch of files in a directory. For each line in a file, I run a series of try/except statements that attempt to match a pattern, and if the pattern matches, I transform the input. After I have processed each line, I write it to a new file. I have a lot of these try/except blocks followed by if-statements (I only included two here as an illustration). My issue is that after processing a few files, the script slows down so much that it almost stalls completely. I don't know what in my code is causing the slowdown, but I have a feeling it is the combination of try/except and if-statements. How can I streamline the transformations so that the data is processed at a reasonable speed?
Or is it that I need a more efficient iterator that does not tax memory to the same extent?
Any feedback would be much appreciated!
    import re
    import glob

    fileCounter = 0
    for infile in glob.iglob(r'\input-files\*.txt'):
        fileCounter += 1
        outfile = r'\output-files\output_%s.txt' % fileCounter
        with open(infile, "rb") as inlist, open(outfile, "wb") as outlist:
            for inline in inlist:
                inword = inline.strip('\r\n')
                #apply some text transformations
                #Transformation #1
                try: result = re.match('^[AEIOUYaeiouy]([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy](.*\[=\].*)*', inword).group()
                except: result = None
                if result == inword:
                    inword = re.sub('(?<=^[AEIOUYaeiouy])(?=([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy])', '[=]', inword)
                #Transformation #2 etc.
                try: result = re.match('(.*\[=\].*)*(\w?\w?)[AEIOUYaąeęioóuy]\[=\][ćsśz][ptkbdg][aąeęioóuyrfw](.*\[=\].*)*', inword).group()
                except: result = None
                if result == inword:
                    inword = re.sub('(?<=[AEIOUYaąeęioóuy])\[=\](?=[ćsśz][ptkbdg][aąeęioóuyrfw])', '', inword)
                    inword = re.sub('(?<=[AEIOUYaąeęioóuy][ćsśz])(?=[ptkbdg][aąeęioóuyrfw])', '[=]', inword)
                outline = inword + "\n"
                outlist.write(outline)
        print "Processed file number %s" % fileCounter
    print "*** Processing completed ***"
try/except is indeed not the most efficient way (nor the most readable one) to test the result of `re.match()`, but the performance penalty should still be (more or less) constant: it should not get worse during execution (barring some worst case triggered by your data), so chances are the problem is elsewhere.

FWIW, you can start by replacing your try/except blocks with the canonical solution: instead of calling `.group()` on the result of `re.match()` inside a try/except, keep the returned match object and test it before calling `.group()`.

This will slightly improve performance but, most importantly, it will make your code more readable and much more maintainable; at the very least it won't hide unexpected errors.
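A minimal sketch of that replacement, for illustration only. The pattern, the sample words and both function names are made up here and are much simpler than the ones in the question; the first function mirrors the try/except style, the second tests the match object directly.

```python
import re

# Hypothetical, simplified pattern (not one of the patterns from the question).
PATTERN = r'^[aeiouy][bcdfg][aeiouy]'

def matched_text_with_try(word):
    """try/except style: rely on the AttributeError raised by None.group()."""
    try:
        return re.match(PATTERN, word).group()
    except AttributeError:
        return None

def matched_text_with_test(word):
    """Canonical style: keep the match object and test it first."""
    match = re.match(PATTERN, word)
    if match is None:
        return None
    return match.group()
```

Both functions return the same values; the second simply makes the "no match" case explicit instead of routing it through an exception.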
As a side note, never use a bare except clause; always catch only the exceptions you expect (here it would have been an `AttributeError`, since `re.match()` returns `None` when nothing matched and `None` has of course no attribute `group`).

This will very probably NOT solve your problem, but at least you'll then know the issue is elsewhere.
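To illustrate why a bare except is risky, here is a small sketch (the function names and the deliberate typo are invented for this example). Both functions contain the same misspelled variable name; the bare except silently turns the resulting `NameError` into "no match", while catching only `AttributeError` lets the real bug surface.

```python
import re

def group_bare_except(pattern, word):
    """Bare except: the typo below is swallowed and looks like 'no match'."""
    try:
        return re.match(pattern, wrod).group()  # deliberate typo: 'wrod'
    except:
        return None

def group_narrow_except(pattern, word):
    """Only AttributeError is caught, so the NameError from the typo propagates."""
    try:
        return re.match(pattern, wrod).group()  # same deliberate typo
    except AttributeError:
        return None
```

Calling `group_bare_except('a', 'abc')` returns `None` even though the pattern would match, whereas `group_narrow_except('a', 'abc')` raises `NameError` and points straight at the mistake.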