Efficient way of computing statistics for a large amount of data


I have over 65 million numeric values stored in a text file. I need to compute the maximum, minimum, average, standard deviation, as well as the 25th, 50th, and 75th percentiles.

Normally I would use the code below, but I need a more efficient way to compute these metrics because I cannot store every value in a list. How can I calculate these values more efficiently in Python?

import numpy as np

# What I would normally do if the whole dataset fit in memory as a list:
np.average(mylist)
np.min(mylist)
np.max(mylist)
np.std(mylist)
np.percentile(mylist, 25)
np.percentile(mylist, 50)
np.percentile(mylist, 75)

maxx = float('-inf')
minx = float('+inf')
sumz = 0
for index, p in enumerate(open("foo.txt", "r")):
    maxx = max(maxx, float(p))
    minx = min(minx, float(p))
    sumz += float(p)
index += 1  # enumerate starts at 0, so this turns the last index into the line count
my_max = maxx
my_min = minx
my_avg = sumz / index

5 Answers

Answer 1 (score 0)

I think you are on the right track, iterating over the file and keeping track of the max and min values. To calculate the std, you should also keep a sum of squares inside the loop: sum_of_squares += float(p)**2. After the loop you can then calculate std = sqrt(sum_of_squares / n - (sumz / n)**2), where n is the number of values (note that this shortcut formula can suffer from numerical problems). For performance, you might want to iterate over the file in decent-sized chunks of data.
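
A minimal single-pass sketch of that idea (assuming foo.txt holds one number per line, as in the question; the variable names are illustrative):

import math

n = 0
sumz = 0.0
sum_of_squares = 0.0
maxx = float('-inf')
minx = float('inf')

with open("foo.txt") as f:
    for line in f:
        x = float(line)
        n += 1
        sumz += x
        sum_of_squares += x * x
        maxx = max(maxx, x)
        minx = min(minx, x)

my_avg = sumz / n
# E[x^2] - E[x]^2; can lose precision when the mean is large compared to the spread
my_std = math.sqrt(sum_of_squares / n - my_avg ** 2)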

To calculate the median and percentiles in a 'continuous' way, you could build up a histogram inside your loop. After the loop, you can get approximate percentiles and the median by converting the histogram to the CDF; the error will depend on the number of bins.
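
A rough sketch of the histogram/CDF idea (it assumes you already know a value range lo..hi, e.g. from a first min/max pass; the bin count and chunk size are illustrative, and values outside the range are dropped by np.histogram):

import numpy as np

lo, hi, nbins = 0.0, 1000.0, 10000   # assumed value range and bin count
edges = np.linspace(lo, hi, nbins + 1)
counts = np.zeros(nbins, dtype=np.int64)

with open("foo.txt") as f:
    chunk = []
    for line in f:
        chunk.append(float(line))
        if len(chunk) == 1000000:
            counts += np.histogram(chunk, bins=edges)[0]
            chunk = []
    if chunk:
        counts += np.histogram(chunk, bins=edges)[0]

cdf = np.cumsum(counts) / float(counts.sum())
# Approximate percentile: right edge of the first bin whose cumulative share reaches the target.
p25, p50, p75 = (edges[1:][np.searchsorted(cdf, q)] for q in (0.25, 0.50, 0.75))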

Answer 2 (score 2)

Use a binary file. Then you can use numpy.memmap to map it into memory and run all sorts of algorithms on it, even if the dataset is larger than RAM.

You can even use numpy.memmap to create a memory-mapped array and read your data into it from the text file; you can work on it, and when you are done, you also have the data in binary format.
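
A sketch of that workflow (the one-time text-to-binary conversion is done in chunks; the file names, dtype, and chunk size are just assumptions):

import numpy as np

# One-time conversion: stream the text file into a flat float64 binary file.
with open("foo.txt") as src, open("foo.bin", "wb") as dst:
    buf = []
    for line in src:
        buf.append(float(line))
        if len(buf) == 1000000:
            np.array(buf, dtype=np.float64).tofile(dst)
            buf = []
    if buf:
        np.array(buf, dtype=np.float64).tofile(dst)

# Memory-map the binary file; the OS pages data in as needed.
data = np.memmap("foo.bin", dtype=np.float64, mode="r")
print(data.min(), data.max(), data.mean(), data.std())
print(np.percentile(data, [25, 50, 75]))

One caveat: np.percentile works on a copy of the data that it partially sorts, so that last step still needs enough memory (or swap) to hold the array once.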

Answer 3 (score 3)

To get the percentiles, sort the text file using a command-line program. Use the line count (index in your program) to find the line numbers of the percentiles (index // 4 for the 25th percentile, and so on), then retrieve those lines from the sorted file.
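
For example, something along these lines on a Unix-like system (GNU sort does an external merge sort, so it handles files larger than RAM; index here is the line count from the question's loop, and the file names are illustrative):

import subprocess

# Sort the values numerically (-g = general numeric sort) into a second file.
subprocess.check_call(["sort", "-g", "-o", "sorted.txt", "foo.txt"])

targets = {25: index // 4, 50: index // 2, 75: (3 * index) // 4}
percentiles = {}
with open("sorted.txt") as f:
    for lineno, line in enumerate(f):
        for pct, target in targets.items():
            if lineno == target:
                percentiles[pct] = float(line)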

Answer 4 (score 0)

As Antti Haapala says, the easiest and most efficient way to do this will be to stick with numpy, and just use a memmapped binary file instead of a text file. Yes, converting from one format to the other will take a bit of time—but it'll almost certainly save more time than it costs (because you can use numpy vectorized operations instead of loops), and it will also make your code a lot simpler.

If you can't do that, Python 3.4 will come with a statistics module. A backport to 2.6+ will hopefully be available at some point after the PEP is finalized; at present I believe you can only get stats, the earlier module it's based on, which requires 3.1+. Unfortunately, while stats does do single-pass algorithms on iterators, it doesn't have any convenient way to run multiple algorithms in parallel on the same iterator, so you have to be clever with itertools.tee and zip to force it to interleave the work instead of pulling the whole thing into memory.

And of course there are plenty of other modules out there if you search PyPI for "stats" and/or "statistics" and/or "statistical".

Either way, using a pre-built module will mean someone's already debugged all the problems you're going to run into, and they may have also optimized the code (maybe even ported it to C) to boot.

Answer 5 (score 0)

Most of these operations can be expressed easily in terms of simple arithmetic. In that case, it can actually (surprisingly) be quite efficient to compute simple statistics directly from the Linux command line using awk and sed, e.g. as in this post: http://www.unixcl.com/2008/09/sum-of-and-group-by-using-awk.html

If you need to generalize to more advanced operations, like weighted percentiles, then I'd recommend using Python Pandas (notably the HDFStore capabilities for later retrieval). I've used Pandas with a DataFrame of over 25 million records before (10 columns by 25 million distinct rows). If you're more memory constrained, you could read the data in chunks, calculate partial contributions from each chunk, and store intermediate results, then finish the calculation by loading just the intermediate results, in a sort of serialized map-reduce framework.
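
A rough sketch of that chunked, map-reduce-style approach with pandas (read_csv with chunksize yields one DataFrame at a time; the column name and chunk size are illustrative):

import pandas as pd

count = 0
total = 0.0
sq_total = 0.0
cur_min = float('inf')
cur_max = float('-inf')

# foo.txt has one value per line, so treat it as a single-column CSV and read it in chunks.
for chunk in pd.read_csv("foo.txt", header=None, names=["value"], chunksize=1000000):
    col = chunk["value"]
    count += len(col)
    total += col.sum()
    sq_total += (col ** 2).sum()
    cur_min = min(cur_min, col.min())
    cur_max = max(cur_max, col.max())

mean = total / count
std = (sq_total / count - mean ** 2) ** 0.5  # same numerical caveat as the sum-of-squares formula

Exact percentiles do not combine across chunks this way; for those you would still fall back on the histogram or sort-based approaches above.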