How to avoid creating unnecessary lists?


I keep coming across situations where I pull some information from a file or wherever, then have to massage the data to the final desired form through several steps. For example:

def insight_pull(file):
    with open(file) as in_f:
        lines = in_f.readlines()

        dirty = [line.split('    ') for line in lines]
        clean = [i[1] for i in dirty]
        cleaner = [[clean[i],clean[i + 1]] for i in range(0, len(clean),2)]
        cleanest = [i[0].split() + i[1].split() for i in cleaner]


        with open("Output_File.txt", "w") as out_f:
            out_f.writelines(' '.join(i) + '\n' for i in cleanest)

As per the example above:

    # Pull raw data from the file, splitting each line on '    ' (four spaces).
    dirty = [line.split('    ') for line in lines]

    # Select every 2nd element from each nested list.
    clean = [i[1] for i in dirty]

    # Couple every 2nd element with its predecessor into a new list.
    cleaner = [[clean[i],clean[i + 1]] for i in range(0, len(clean),2)]

    # Split each entry in cleaner into the final formatted list.
    cleanest = [i[0].split() + i[1].split() for i in cleaner]

Seeing as I can't put all of the edits into one line or loop (since each edit depends on the edit before it), is there a better way to structure code like this?

Apologies if the question is a bit vague. Any input is much appreciated.

4 Answers

Accepted answer

Generator expressions

You are correct in not wanting to create multiple lists. Each of your list comprehensions creates an entire new list, wasting memory, and you loop over each list in turn!

@VPfB's idea of using generators is a good solution if you have other places in your code where you reuse them. If you don't need to reuse the generators, use generator expressions.

Generator expressions are lazy, like generators, so when they are chained together, as here, everything is evaluated in a single pass at the end, when writelines consumes the chain.

def insight_pull(file):
    with open(file) as in_f:
        dirty = (line.split('    ') for line in in_f)    # Combine with next
        clean = (i[1] for i in dirty)
        cleaner = (pair for pair in zip(clean,clean))    # Redundantly silly
        cleanest = (i[0].split() + i[1].split() for i in cleaner)

        # Don't build a single (possibly huge) string with join
        with open("Output_File.txt", "w") as out_f:
            out_f.writelines(' '.join(i) + '\n' for i in cleanest)

Leaving the above as it directly matches your question, you can go further:

def insight_pull(file):
    with open(file) as in_f:
        clean = (line.split('    ')[1] for line in in_f)
        cleaner = zip(clean,clean)
        cleanest = (i[0].split() + i[1].split() for i in cleaner)

        with open("Output_File.txt", "w") as out_f:
            for line in cleanest:
                out_f.write(' '.join(line) + '\n')
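The zip(clean, clean) step works because both arguments are the same iterator: zip advances that single iterator once for each slot of the tuple, so consecutive items end up paired. A minimal standalone demonstration of the idiom:

```python
# Zipping an iterator with itself pairs consecutive items, because zip
# advances the one underlying iterator twice per output tuple.
it = iter(['a', 'b', 'c', 'd'])
pairs = list(zip(it, it))
# pairs == [('a', 'b'), ('c', 'd')]
```

Note this only works with a single iterator object; zipping a list with itself would instead pair each item with itself.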
Answer

I am assuming from your example that only the cleanest list is of any practical value to you; the rest are just intermediate steps and can be discarded without concern.

Assuming that is the case, why not just reuse the same variable for each intermediate step, so that you are not holding multiple lists in memory?

def insight_pull(file):
    with open(file) as in_f:
        my_list = in_f.readlines()

        my_list = [line.split('    ') for line in my_list]
        my_list = [i[1] for i in my_list]
        my_list = [[my_list[i],my_list[i + 1]] for i in range(0, len(my_list),2)]
        my_list = [i[0].split() + i[1].split() for i in my_list]


    with open("Output_File.txt", "w") as out_f:
        out_f.writelines(' '.join(i) + '\n' for i in my_list)
Answer

If you are thinking in terms of performance, you are looking for generators. Generators are much like lists, but they are evaluated lazily: each element is only produced once it is needed. In the following sequence I don't actually create three full lists; each element is evaluated only once. The code below is just an example use of generators (I understood your code to be an illustration of the issue you run into, not a concrete problem):

# All even values from 2-18
even = (i*2 for i in range(1, 10))

# Only those divisible by 3
multiples_of_3 = (val for val in even if val % 3 == 0)

# And finally, we want to evaluate the remaining values as hex
hexes = [hex(val) for val in multiples_of_3]
# output: ['0x6', '0xc', '0x12']

The first two expressions are generators, and the last is just a list comprehension. This will save a lot of memory when there are many steps, as you don't create intermediate lists. Do note that generators cannot be indexed, and they can only be evaluated once (they are just streams of values).
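The single-pass behaviour is easy to demonstrate; this small sketch consumes a generator and shows that a second pass yields nothing:

```python
# A generator can be consumed only once; afterwards it is exhausted.
gen = (i * 2 for i in range(3))
first_pass = list(gen)    # [0, 2, 4]
second_pass = list(gen)   # [] -- the generator is already exhausted
```

If you need to iterate the same values twice, materialise them into a list first, or recreate the generator.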

Answer

To achieve the goal, I would recommend pipeline processing. I found an article which explains the technique: generator pipelines.

Here is my attempt at a direct transformation of your loop into a pipeline. The code is untested (we have no data to test with) and may contain bugs.

The leading f in func names stands for filter.

def fromfile(name):
    # plain file reader: yield the file line by line
    with open(name) as in_f:
        for line in in_f:
            yield line

def fsplit(pp):
    for line in pp: 
        yield line.split('    ')

def fitem1(pp):
    for item in pp: 
        yield item[1]

def fpairs(pp):
    # pair consecutive items from the stream
    for x in pp:
        try:
            yield [x, next(pp)]
        except StopIteration:
            break

def fcleanup(pp):
    for i in pp: 
        yield i[0].split() + i[1].split()

pipeline = fcleanup(fpairs(fitem1(fsplit(fromfile(NAME)))))

output = list(pipeline)

For real-world usage I would aggregate the first three filters, and also the last two.
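That aggregation might be sketched as follows (untested against real data; the four-space separator and the field index are taken from the code above, and NAME is still a placeholder for your input file):

```python
def fparse(name):
    # Aggregates fromfile, fsplit and fitem1: yield the second
    # four-space-separated field of every line in the file.
    with open(name) as in_f:
        for line in in_f:
            yield line.split('    ')[1]

def fformat(pp):
    # Aggregates fpairs and fcleanup: take items two at a time and
    # merge their whitespace-split tokens into one flat list.
    for first in pp:
        second = next(pp, None)
        if second is None:
            break
        yield first.split() + second.split()

# pipeline = fformat(fparse(NAME))
```

Two functions instead of five keeps the pipeline readable while preserving the lazy, one-item-at-a-time evaluation.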