Python: appending to an empty list vs filling a preallocated numpy array (efficiency)


I have written two algorithms and I want to check which one is more efficient and uses less memory. The first one creates a numpy array up front and fills it in place. The second one starts with an empty Python list and appends values to it. Which one is better? First program:

    import numpy as np

    f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
    lines = f.readlines()
    f.close()
    zeros = np.zeros((60343, 4917))

    for l in lines:
        row = l.split(",")
        for element in row:
            # list.index() rescans the list from the start on every call
            zeros[lines.index(l), row.index(element)] = element

    X = zeros[1, :]
    Y = zeros[:, 0]
    one_hot = np.ones((counter, 2))  # counter is defined elsewhere

The second one:

    import numpy as np

    f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
    lines = f.readlines()
    f.close()
    X = []
    Y = []

    for l in lines:
        row = l.split(",")
        X.append([float(elem) for elem in row[1:]])
        Y.append(float(row[0]))

    X = np.array(X)
    Y = np.array(Y)
    one_hot = np.ones((counter, 2))  # counter is defined elsewhere

My theory is that the first one is slower but uses less memory and is more stable when working with large files. The second one is faster but uses much more memory and is less stable with large files (543 MB, 70,000 lines).
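A minimal sketch for testing this theory (the measure helper and its build argument are hypothetical, not from my actual code; resource is POSIX-only, and each variant should run in a fresh interpreter so the memory peaks do not mask each other):

    import resource
    import timeit

    def measure(build):
        # 'build' is a hypothetical zero-argument function wrapping
        # variant 1 or variant 2
        start = timeit.default_timer()
        build()
        elapsed = timeit.default_timer() - start
        # peak resident set size of the process so far; note the units:
        # kilobytes on Linux, bytes on OS X
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print elapsed, peak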

Thanks!

3 Answers

BEST ANSWER

Well, I finally made some changes thanks to the answers. My two programs:

    import numpy as np
    import timeit

    f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
    zeros = np.zeros((60343, 4917))
    counter = 0

    start = timeit.default_timer()
    for l in f:
        row = l.split(",")
        counter2 = 0
        for element in row:
            # fill the preallocated array one element at a time
            zeros[counter, counter2] = element
            counter2 += 1
        counter += 1
    stop = timeit.default_timer()
    print stop - start
    f.close()

Time of the first program: 122.243036032 seconds.

Second program:

    import numpy as np
    import timeit

    f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
    zeros = np.zeros((60343, 4917))
    counter = 0

    start = timeit.default_timer()
    for l in f:
        row = l.split(",")
        # assign the whole row at once instead of element by element
        zeros[counter, :] = row
        counter += 1
    stop = timeit.default_timer()
    print stop - start
    f.close()

Time of the second program: 102.208696127 seconds! Assigning the whole row at once moves the inner loop from Python into numpy, which I assume is where the speedup comes from. Thanks!

ANSWER

The problem with both versions is that you load the whole file into memory first with file.readlines(); you should iterate over the file object directly to get one line at a time.

    import numpy as np
    from itertools import izip

    # generator function: yields one parsed row at a time instead of
    # loading the whole file into memory
    def func():
        with open('filename.txt') as f:
            for line in f:
                row = map(float, line.split(","))
                yield row[1:], row[0]

    X, Y = izip(*func())
    X = np.array(X)
    Y = np.array(Y)
    ...

I am sure a pure numpy solution is going to be faster than this.
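For instance, a sketch of that idea (assuming the file is strictly comma-separated numbers with no header, which matches the parsing above):

    import numpy as np

    # parse the whole comma-separated file in numpy's C code
    # instead of a Python-level loop
    data = np.loadtxt('filename.txt', delimiter=',')
    Y = data[:, 0]    # first column: labels
    X = data[:, 1:]   # remaining columns: features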

ANSWER

Python has a useful profiler in its standard library. It's really easy to use: just wrap your code in a function and call cProfile.run in the following fashion:

import cProfile
cProfile.run('my_function()')
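For example (my_function here is a hypothetical placeholder; put the parsing loop you want to measure inside it):

    import cProfile

    def my_function():
        # hypothetical stand-in for the code being profiled
        return sum(i * i for i in xrange(1000000))

    # prints how much time was spent in each function my_function() calls
    cProfile.run('my_function()')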

One piece of advice for both cases: you really do not need to read all the lines into a list. If you just iterate over the file, you get the lines one at a time without storing them all in memory:

    with open('some_file.txt') as f:
        for line in f:
            # do something with line

In terms of memory usage, a numpy array of floats is significantly more compact than the equivalent Python list.
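A minimal sketch of the difference for a million floats (illustrative, not measured on the original data):

    import sys
    import numpy as np

    n = 1000000
    arr = np.zeros(n)         # one 8-byte float64 per element
    lst = list(arr)           # one pointer per slot, each to a separate object

    print arr.nbytes          # size of the array's raw buffer
    print sys.getsizeof(lst)  # size of the list's pointer table only; the
                              # objects it references are counted separately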