Mean value and standard deviation of a very large data set

I am wondering if there is an algorithm that calculates the mean value and standard deviation of an unbounded data set.

For example, I am monitoring a measured value, say, electric current. I would like to have the mean value of all historical data, and whenever a new value comes in, the mean and stdev should be updated. Because the data set is too big to store, I hope the algorithm can update the mean and stdev on the fly without storing the data.

Even if the data were stored, the standard way, (d1+...+dn)/n, doesn't work: the sum will overflow the data representation.

I thought about d1/n + d2/n + ... + dn/n, but if n is huge, the error becomes too big and accumulates. Besides, n is unbounded in this case.

The number of data points is definitely unbounded, and whenever a new one arrives, the values need to be updated.

Does anybody know if there is an algorithm for it?

There are 5 best solutions below

Answer (score 1):

So, I'm not sure if this is an algorithm that has been used before, but I'll provide it anyway. I started with the idea of calculating the standard deviation with the wrong mean and then correcting based on the real mean. Here is a picture of something I wrote about it
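The picture is not reproduced here, but the usual way to make this idea concrete is the shifted-data (assumed-mean) identity: accumulate sums of (x - K) and (x - K)² around some provisional value K, and correct for K at the end. A minimal C++ sketch of that idea (the struct, the choice of K, and the example data are my own illustration, not taken from the original notes):

#include <cmath>
#include <cstddef>
#include <cstdio>

// Shifted-data (assumed-mean) statistics: accumulate around a provisional
// value K so the running sums stay small even when the data are far from 0.
struct ShiftedStats {
    double K = 0.0;      // provisional ("wrong") mean, taken from the first datum
    double sum = 0.0;    // running sum of (x - K)
    double sum_sq = 0.0; // running sum of (x - K)^2
    std::size_t n = 0;

    void add(double x) {
        if (n == 0) K = x;           // any value near the data works
        const double d = x - K;
        sum += d;
        sum_sq += d * d;
        ++n;
    }
    double mean() const { return K + sum / n; }   // correct for the shift
    double variance() const {                     // population variance
        return (sum_sq - sum * sum / n) / n;
    }
    double stdev() const { return std::sqrt(variance()); }
};

int main() {
    ShiftedStats s;
    const double data[] = {3.0, 4.0, 5.0, 4.0, 6.0};
    for (double x : data) s.add(x);
    std::printf("mean=%g stdev=%g\n", s.mean(), s.stdev());
}

Because the sums are taken relative to K, they grow with the spread of the data rather than with its absolute magnitude, which is exactly the "wrong mean first, correct later" effect described above.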

Answer (score 1):

In fact, even on smaller data sets, you should not compute the standard deviation from the raw sum of squares. The problem is called catastrophic cancellation (the link is to Wikipedia).

Wikipedia has also two articles that help you get out of this problem:

  • Kahan summation algorithm which has a carry-over to avoid systematic error when summing a large number of very small values (e.g. when summing all x/n values)
  • Algorithms for calculating variance, in particular the "online" version, should be appropriate for large data sets. It incrementally updates the mean for each observation, so that value remains on the scale of your data! You may need the higher-order version for the variance, because the first online algorithm still computes the sum of squared deviations from the mean, so for large n this may bust your value range. The M2 in the higher-order version should contain the average squared deviation from the mean, which is on the scale of your output. (A sketch of the online update follows this list.)
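As a concrete sketch of the "online" variant referred to in the second item, this is the standard Welford-style update from that Wikipedia article (the C++ wrapper and the names are mine):

#include <cmath>
#include <cstddef>

// Welford's online algorithm: the mean and M2 (the sum of squared
// deviations from the current mean) are updated once per observation,
// so the mean always stays on the scale of the data.
struct Welford {
    std::size_t n = 0;
    double mean = 0.0;
    double M2 = 0.0;   // sum of squared deviations from the running mean

    void add(double x) {
        ++n;
        const double delta = x - mean;
        mean += delta / n;
        M2 += delta * (x - mean);   // note: uses the already-updated mean
    }
    double variance() const { return n > 1 ? M2 / (n - 1) : 0.0; }  // sample variance
    double stdev() const { return std::sqrt(variance()); }
};

In this form M2 is still a sum of squared deviations and grows with n; if that worries you for very long streams, tracking the average squared deviation (M2/n) instead keeps the value on the scale of the output, as described above.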

This is probably one of the most common problems with naive statistical computations.

Note that the problem does not occur when the mean remains around 0 and is much smaller than the variance.

Answer (score 8):

[did the question change? maybe i only read the start. i have updated/edited to give a better reply:]

there is no perfect solution (in constant memory) that i know of, but i can give various approaches.

first, for the basic calculation you only need the sum of all values (sum_x), the sum of their squares (sum_x2), and the total count (n). then:

mean = sum_x / n
stdev = sqrt( sum_x2/n - mean^2 )

and all these values (sum_x, sum_x2, n) can be updated from a stream.
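as a sketch, that bookkeeping is just the following (the names are mine; note this naive version has exactly the precision problem discussed next):

#include <cmath>

// naive streaming mean/stdev: keep sum_x, sum_x2 and n, nothing else.
// fine for modest n, but sum_x2 eventually loses precision or overflows.
struct NaiveStream {
    double sum_x = 0.0, sum_x2 = 0.0;
    long long n = 0;

    void add(double x) { sum_x += x; sum_x2 += x * x; ++n; }
    double mean() const { return sum_x / n; }
    double stdev() const {
        const double m = mean();
        return std::sqrt(sum_x2 / n - m * m);
    }
};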

the problem (as you say) is dealing with overflow and / or limited precision. you can see this with floating point: eventually sum_x2 becomes so large that adding a single squared value no longer changes its internal representation.

a simple way to avoid the problem is to use exact arithmetic, but that will be increasingly slow (and also uses O(log(n)) memory).

another way that can help is to normalise values - if you know that values are typically X then you can do calculations on x - X which makes the sums smaller (obviously you then add X to the mean!). that helps postpone the point at which precision is lost (and can/should be combined with any other method here - when binning, for example, you can use the mean of the previous bin). see this algorithm (knuth's method) for how to do this progressively.

if you don't mind a (small constant factor) O(n) memory cost you could restart every N values (eg million - smarter still would be to adapt this value by detecting when precision is too low), storing the previous mean and stdev and then combining them for the final result (so your mean is the appropriately weighted value from the recent running total and the old binned values).
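a sketch of the 'combine' step for that restart-and-bin approach - instead of storing the stdev itself it stores the sum of squared deviations (M2), because two partial results can then be merged exactly using the parallel/pairwise formula from the same wikipedia article (the code is my illustration, not part of the original answer):

#include <cstddef>

// one bin's worth of statistics.
struct Partial { std::size_t n = 0; double mean = 0.0, M2 = 0.0; };

// merge two partial results exactly; applying this repeatedly lets you
// restart the running sums every N values and still recover global stats.
Partial merge(const Partial& a, const Partial& b) {
    Partial out;
    out.n = a.n + b.n;
    if (out.n == 0) return out;
    const double delta = b.mean - a.mean;
    out.mean = a.mean + delta * static_cast<double>(b.n) / out.n;
    out.M2 = a.M2 + b.M2
           + delta * delta * static_cast<double>(a.n) * b.n / out.n;
    return out;
}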

the binning approach could probably be generalised to multiple levels (you start binning the bins) and would reduce to O(log(n)) memory use, but i haven't worked out the details.

finally, a more practical solution would probably be to do the initial approach for, say, 1000 values, and then start a new sum in parallel. you could display a weighted average of the two and, after another 1000 values, drop the old sums (after gradually decreasing their weight) and start a new set. so you always have two sets of sums, and display a weighted average between them, which gives continuous data that reflect the last 1000 values (only). in some cases that will be good enough, i imagine (it's not an exact value, since it's only for "recent" data, but it's smooth and representative, and uses a fixed amount of memory).
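a rough sketch of that two-window idea, for the mean only (the window size, the linear cross-fade and all the names are my own reading of the suggestion, not a definitive recipe; the stdev can be handled the same way with a second pair of sums):

#include <cstddef>

// keep stats for a finished "old" block and a filling "current" block of
// WINDOW values, and report a blend that fades the old block out as the
// current one fills up.
struct WindowedMean {
    static constexpr std::size_t WINDOW = 1000;
    double old_sum = 0.0, cur_sum = 0.0;
    std::size_t old_n = 0, cur_n = 0;

    void add(double x) {
        cur_sum += x;
        if (++cur_n == WINDOW) {                  // current block is full:
            old_sum = cur_sum; old_n = cur_n;     // it becomes the "old" block
            cur_sum = 0.0; cur_n = 0;             // and a fresh block starts
        }
    }
    double mean() const {
        if (old_n == 0) return cur_n ? cur_sum / cur_n : 0.0;
        if (cur_n == 0) return old_sum / old_n;
        const double w = static_cast<double>(cur_n) / WINDOW;  // weight of the new block
        return (1.0 - w) * (old_sum / old_n) + w * (cur_sum / cur_n);
    }
};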

ps, something that occurred to me later - really, doing this "forever" doesn't make much sense anyway, because you'll get to a point where the values are absolutely dominated by the old data. it would be better to use a "running mean" which gives decreased weight to old values. see for example https://en.wikipedia.org/wiki/Moving_average - however, i don't know of a common equivalent to stdev.
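for reference, the exponentially weighted mean from that article is a one-line update; the variance recurrence shown with it is one that is sometimes used as a companion (an assumption on my part, since - as said - there's no single standard convention for a 'running stdev'):

// exponentially weighted running mean (and one possible running variance).
// alpha is a smoothing factor in (0, 1] chosen by the user.
struct Ewma {
    double alpha;
    double mean = 0.0;
    double var = 0.0;
    bool primed = false;

    explicit Ewma(double a) : alpha(a) {}

    void add(double x) {
        if (!primed) { mean = x; primed = true; return; }
        const double diff = x - mean;
        const double incr = alpha * diff;
        mean += incr;                               // standard EWMA update
        var = (1.0 - alpha) * (var + diff * incr);  // exponentially weighted variance
    }
};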

Answer (score 2):

Interesting question.

Let's discuss the mean first, if only because it's a little simpler.

You are right about the rounding error on a running total. It'll kill your accuracy for large enough a data set. You'd like to presort the data, totaling the small data first; but of course this is impossible in your case. However, you can achieve most of the benefit of sorted data by keeping several running totals.

A conceptual example, C- or C++-style:

#include <cmath> // for fabs()

const double max_small  =    0.001;
const double max_medium = 1000.0;

double total_small  = 0.0;
double total_medium = 0.0;
double total_large  = 0.0;

while(true) {
    const double datum = get_datum(); // (Use here whatever function you use to get a datum.)
    if (!is_datum_valid()) break;
    if (fabs(datum) <= max_small) total_small += datum;
    else if (fabs(datum) <= max_medium) total_medium += datum;
    else total_large += datum;
}

double total = 0.0;
total += total_small;
total += total_medium;
total += total_large;

In realistic code, you'll probably keep more than three running totals -- and of course you'll also keep running totals of the squares of the data -- but the above example conveys the idea. You can fill in details.

Also, adapting @andrewcooke's idea, you can expand the loop in something like this manner:

while(true) {
    const double datum = get_datum();
    if (!is_datum_valid()) break;
    if (fabs(datum) <= max_small) {
        total_small += datum;
        if (fabs(total_small) > max_medium) {
            total_large += total_small;
            total_small = 0.0;
        }
    }
    else if (fabs(datum) <= max_medium) total_medium += datum;
    else total_large += datum;
}

Again, you can fill in details. Good luck.

APPENDIX: THE PRACTICAL CALCULATION OF THE STANDARD DEVIATION

A good question has been raised in the various comment threads here regarding how a standard deviation is to be calculated without prior knowledge of the mean. Fortunately, a trick is known for calculating the standard deviation in exactly this situation. I have prepared two pages of notes that explain the trick here (in PDF).

At any rate, all that is necessary to include the standard deviation among the running statistics is to total not just the data but also the squares of the data; and, of course, the squares can be totaled in like manner as the data, themselves, following the same pattern as in the code above.

Answer (score 6):

No.

(I thought: No, but I've been proven wrong).

You can carry the sum and the count, so that

sum(i) = 500, count(i) = 50 => avg := 10
next value = 20
sum = 520, count = 51 => avg := 10.19

but the stddev can't be built that way. You have to take the delta of every value from the new mean, square those deltas, and only after that divide by N. However: the question is what kind of values these are (from a mathematical viewpoint - keep physics out of it! :) ). Under normal circumstances, I wouldn't expect the values to change much after 2000 elements. Otherwise it might be questionable to compute a mean and stddev in the first place.

And for 2000 elements, it should be possible to calculate the value fast.

Maybe you can use a buffer, and always calculate the avg and stddev for the last 2000 values, every 2000 values. Whether this is meaningful data is something you have to decide.
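A sketch of that buffer idea (the block size of 2000 matches the suggestion above; everything else, including the printing, is just an illustration):

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// block-wise statistics: buffer BLOCK values, report mean and stddev for
// that block, then start over with an empty buffer.
void process_stream(const std::vector<double>& stream) {
    const std::size_t BLOCK = 2000;
    std::vector<double> buf;
    buf.reserve(BLOCK);
    for (double x : stream) {
        buf.push_back(x);
        if (buf.size() == BLOCK) {
            double sum = 0.0;
            for (double v : buf) sum += v;
            const double mean = sum / BLOCK;
            double ssd = 0.0;                 // sum of squared deviations
            for (double v : buf) ssd += (v - mean) * (v - mean);
            std::printf("mean=%g stddev=%g\n", mean, std::sqrt(ssd / (BLOCK - 1)));
            buf.clear();
        }
    }
}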


We can't continue our discussion in chat so well, because chat doesn't support markdown properly. So I use my post to clarify my position, which is spread over comments - mainly with thb, but andrew seems to believe in a sliding calculation of the stddev too.

Here is a wide table to make calculation obvious and easy to follow. The columns are:

  • i: the running index. We first calculate for the values 1-3, then for the values 1-5.
  • x(i) is the data, arbitrarily chosen by me. 3,4,5 and 4,6
  • sum is just what they sum up to. The interesting one is the last of each group: 12 and 22. Note: we don't take the sum of 3 values and then of 2 values, but of the first 3 and then of the first 5.
  • The avg is just 12/3 or 22/5. The avg can be calculated sliding, if you know i and the sum: avg(i+1) = (sum(i) + x(i+1)) / (i+1). No dispute so far.
  • To calculate the stddev, we have to take the difference of each value from the avg, and square it (thereby losing the sign, which would otherwise make the differences cancel - they always sum to 0). A second effect is that a few big distances lead to a bigger stddev than many small distances: (1,1,-1,-1) => 4*1² = 4. In contrast: (2,-2) => 2² + (-2)² = 4+4 = 8. The first column is for the 3 values, the second for the 5 values (to follow the calculation).
  • The next column (last)² does the squaring.
  • sum it up
  • divide by n-1
  • take the square root

Spreadsheet with the calculation (OOCalc screenshot, not reproduced here).
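Since the screenshot cannot be shown, here is the same calculation written out (my arithmetic, following the columns described above):

For the first 3 values (3, 4, 5):
  sum = 12, avg = 12/3 = 4
  differences to avg: -1, 0, 1; squared: 1, 0, 1; summed: 2
  2 / (3-1) = 1; sqrt(1) = 1

For the first 5 values (3, 4, 5, 4, 6):
  sum = 22, avg = 22/5 = 4.4
  differences to avg: -1.4, -0.4, 0.6, -0.4, 1.6; squared: 1.96, 0.16, 0.36, 0.16, 2.56; summed: 5.2
  5.2 / (5-1) = 1.3; sqrt(1.3) ≈ 1.14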

Maybe we can agree that this is a valid way to calculate the stddev. Now the question is how to calculate it if you know the complete line 3 (except x(3)=5), and you now get two individual values (4, 6) as shown in the sheet, but without knowing x(i) for i = 1, 2, 3.

My claim failed: You can.

Okay - I tried to use your formula.

σ² = 1/(N-1) (Sum(xi²) - 1/N (Sum(xi))²)

So for the 5 values I get

  • N=5
  • sum(xi) = 22
  • sum(xi²) = 102

Inserted in your formula:

σ² = 1/(N-1) (Sum(xi²) - 1/N (Sum(xi))²)
σ² = 1/4 (102 - 1/5 (22²))
σ² = 1/4 (102 - 1/5 (484))
σ² = 1/4 (102 - 96.8)
σ² = 1/4 (5.2)
σ² = 1.3
σ  = 1.140

My result was 1.14, yours is 1.14. So there is a shortcut. Very interesting - I'm still surprised.