Streamlining a series of try-except + if-statements for faster processing in Python


I'm processing strings with regexes in a bunch of files in a directory. To each line in a file, I apply a series of try-statements that attempt to match a pattern; if a pattern matches, I transform the input. After I have processed each line, I write it to a new file. I have a lot of these try-except blocks followed by if-statements (I only included two here as an illustration). My issue is that after processing a few files, the script slows down so much that it almost stalls completely. I don't know what in my code is causing the slowdown, but I have a feeling it is the combination of try-except and if-statements. How can I streamline the transformations so that the data is processed at a reasonable speed?

Or is it that I need a more efficient iterator that does not tax memory to the same extent?

Any feedback would be much appreciated!

import re
import glob

fileCounter = 0 

for infile in glob.iglob(r'\input-files\*.txt'):

    fileCounter += 1
    outfile = r'\output-files\output_%s.txt' % fileCounter

    with open(infile, "rb") as inList, open(outfile, "wb") as outlist:

        for inline in inList:

            inword = inline.strip('\r\n')

            #apply some text transformations
            #Transformation #1
            try: result = re.match('^[AEIOUYaeiouy]([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy](.*\[=\].*)*', inword).group()
            except: result = None

            if result == inword:
                inword = re.sub('(?<=^[AEIOUYaeiouy])(?=([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy])', '[=]', inword)

            #Transformation #2 etc.
            try: result = re.match('(.*\[=\].*)*(\w?\w?)[AEIOUYaąeęioóuy]\[=\][ćsśz][ptkbdg][aąeęioóuyrfw](.*\[=\].*)*', inword).group()
            except: result = None

            if result == inword:   
                inword =  re.sub('(?<=[AEIOUYaąeęioóuy])\[=\](?=[ćsśz][ptkbdg][aąeęioóuyrfw])', '', inword)
                inword =  re.sub('(?<=[AEIOUYaąeęioóuy][ćsśz])(?=[ptkbdg][aąeęioóuyrfw])', '[=]', inword)

            outline = inword + "\n"
            outlist.write(outline)

    print "Processed file number %s" % fileCounter          
print "*** Processing completed ***" 

Best answer

try/except is indeed neither the most efficient nor the most readable way to test the result of re.match(), but its cost should be (more or less) constant per call. Performance should not degrade during execution (unless your data triggers some worst case), so chances are the problem is elsewhere.

FWIW you can start by replacing your try/except blocks with the canonical solution, i.e. instead of:

try:
    result = re.match(someexp, yourline).group()
except:
    result = None

you want:

match = re.match(someexp, yourline)
result = match.group() if match else None

This will slightly improve performance but, most importantly, it will make your code more readable and much more maintainable: at least it won't hide any unexpected error.
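As a rough sketch of how this could look inside the loop from the question, the check can be wrapped in a small helper. The names `matched_text` and `mark_first_syllable` and the simplified ASCII-only pattern `VCV` are hypothetical stand-ins used for illustration, not the patterns from the question:

```python
import re

def matched_text(pattern, line):
    """Return the text re.match() found at the start of line, or None."""
    match = re.match(pattern, line)
    return match.group() if match else None

# Simplified stand-in for the question's patterns (assumption):
# a vowel, a single consonant, a vowel.
VCV = r'^[aeiouy][bcdfghjklmnprstwz][aeiouy]'

def mark_first_syllable(word):
    """Insert '[=]' after the leading vowel, but only if the whole word matched."""
    if matched_text(VCV, word) == word:
        return word[0] + '[=]' + word[1:]
    return word
```

With this shape, a word that does not match simply passes through unchanged, and any exception that is not "no match" still surfaces instead of being swallowed.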

As a side note, never use a bare except clause; catch only the exceptions you expect (here it would have been an AttributeError, since re.match() returns None when nothing matched and None of course has no attribute group).
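To illustrate the difference, here is a minimal sketch comparing a narrow except clause with a bare one. Both function names are hypothetical, and passing None as the line is just one made-up example of an unexpected input:

```python
import re

def group_or_none(pattern, line):
    """try/except form that catches only the expected failure."""
    try:
        return re.match(pattern, line).group()
    except AttributeError:
        # re.match() returned None: nothing matched at the start of line.
        return None

def group_or_none_bare(pattern, line):
    """Anti-pattern, shown for comparison only: the bare except hides bugs."""
    try:
        return re.match(pattern, line).group()
    except:
        # Swallows everything, including the TypeError raised for bad input.
        return None

# A line that is not a string is a bug somewhere upstream.
# The bare version silently reports "no match" ...
hidden = group_or_none_bare('a+', None)

# ... while the narrow version lets the TypeError propagate.
try:
    group_or_none('a+', None)
    surfaced = False
except TypeError:
    surfaced = True
```

A bare except also catches KeyboardInterrupt, so a script written that way can be harder to stop with Ctrl-C while it is inside the try block.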

This will very probably NOT solve your problem but at least you'll then know the issue is elsewhere.
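One possible way to look for where the time actually goes is to time the match for each line and inspect the slowest ones. This is only a sketch: the helper name `slowest_lines` and its `top` parameter are hypothetical, and it measures a single pattern rather than the whole loop from the question.

```python
import re
from timeit import default_timer

def slowest_lines(lines, pattern, top=3):
    """Return up to `top` (seconds, line) pairs, slowest re.match() first."""
    compiled = re.compile(pattern)
    timings = []
    for line in lines:
        start = default_timer()
        compiled.match(line)
        timings.append((default_timer() - start, line))
    # Sort on elapsed time only, largest first.
    return sorted(timings, key=lambda pair: pair[0], reverse=True)[:top]

# Made-up sample input; in practice this would be the lines of one file.
sample = ['oko', 'tak', 'a[=]b']
report = slowest_lines(sample, r'[aeiouy][a-z]+', top=2)
```

If a few lines dominate the report, the data is probably triggering a worst case for that pattern. If every line takes about the same time, the slowdown is more likely outside the regex calls.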