I have over 65 million numeric values stored in a text file. I need to compute the maximum, minimum, average, standard deviation, and the 25th, 50th, and 75th percentiles.
Normally I would use the code below, but I need a more memory-efficient approach because I cannot store all the values in a list. How can I calculate these metrics more efficiently in Python?
import numpy as np

mylist = [float(line) for line in open("foo.txt")]  # too large to fit in memory
np.average(mylist)
np.min(mylist)
np.max(mylist)
np.std(mylist)
np.percentile(mylist, 25)
np.percentile(mylist, 50)
np.percentile(mylist, 75)
This is my streaming attempt so far, which handles the min, max, and average:

maxx = float('-inf')
minx = float('+inf')
sumz = 0.0
count = 0
for line in open("foo.txt", "r"):
    p = float(line)
    maxx = max(maxx, p)
    minx = min(minx, p)
    sumz += p
    count += 1
my_max = maxx
my_min = minx
my_avg = sumz / count
I think you are on the right track: iterate over the file and keep running min, max, and sum. To calculate the standard deviation, also accumulate a sum of squares inside the loop:

sum_of_squares += float(p) ** 2

After the loop you can then compute

std = sqrt(sum_of_squares / n - (sumz / n) ** 2)

(see the formula here), though note that this formula can suffer from numerical problems, since it subtracts two large, nearly equal quantities. For performance, you might also want to read the file in reasonably sized chunks rather than line by line.

To calculate the median and percentiles in a streaming way, build up a histogram inside the loop. After the loop, convert the histogram to a CDF and read off approximate percentiles and the median; the error depends on the number of bins.
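Putting the pieces together, here is a minimal single-pass sketch. It assumes you know (or can guess) a value range `[lo, hi]` for the histogram bins; values outside that range are clamped into the edge bins, which degrades percentile accuracy. The function name and parameters are illustrative, not from your code:

```python
import math

def streaming_stats(path, lo=0.0, hi=1.0, bins=10_000):
    """One pass over the file: exact min/max/mean/std, approximate percentiles.

    lo/hi is an *assumed* value range used to place histogram bins.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    minx = float("inf")
    maxx = float("-inf")
    hist = [0] * bins
    width = (hi - lo) / bins

    with open(path) as f:
        for line in f:
            x = float(line)
            n += 1
            total += x
            total_sq += x * x
            minx = min(minx, x)
            maxx = max(maxx, x)
            b = int((x - lo) / width)
            hist[min(max(b, 0), bins - 1)] += 1  # clamp out-of-range values

    mean = total / n
    # naive sum-of-squares formula; can lose precision when mean >> std
    std = math.sqrt(total_sq / n - mean * mean)

    def percentile(q):
        """Walk the histogram as a CDF until q% of the counts are covered."""
        target = q / 100 * n
        cum = 0
        for i, count in enumerate(hist):
            cum += count
            if cum >= target:
                return lo + (i + 0.5) * width  # bin midpoint
        return hi

    return {
        "min": minx, "max": maxx, "mean": mean, "std": std,
        "p25": percentile(25), "p50": percentile(50), "p75": percentile(75),
    }
```

Memory use is O(bins) regardless of file size, so you can trade percentile accuracy for memory by adjusting `bins`.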
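On the numerical problems: a standard remedy (not in your original code) is Welford's online algorithm, which tracks a running mean and the sum of squared deviations from it, avoiding the subtraction of two large, nearly equal quantities. A sketch:

```python
import math

def welford_std(values):
    """Numerically stable streaming standard deviation (Welford's algorithm)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return math.sqrt(m2 / n)  # population std, matching np.std's default
```

It works directly on a lazy iterator, e.g. `welford_std(float(line) for line in open("foo.txt"))`, so nothing is held in memory.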