Calculating the mean and standard deviation of an infinite stream for a sliding window larger than the available memory


Consider an infinite stream of numbers. I want to calculate the mean and standard deviation of the latest N numbers of the stream in O(1) memory (N is larger than the memory I have). On top of that, the stream is distributed, so the mean and standard deviation can only be calculated in a distributed manner as well (there is no single point through which the whole stream passes).

I understand an exact solution is impossible. But what about approximate solutions?


2 Answers

Answer by Matt Timmermans

To calculate the mean and standard deviation of a sequence of numbers, all you need to collect is:

  1. The number of values, n,
  2. The sum of the values, sum(x), and
  3. The sum of the squares of the values, sum(x^2).

It's easy to collect these values for your whole stream in constant space. If the stream is partitioned into substreams, then you can even collect them independently for each substream, and then add them together.
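For concreteness, here is a minimal Python sketch of that idea (the `RunningStats` name is just illustrative): each substream accumulates its own count, sum, and sum of squares, and merging two substreams is plain addition.

```python
import math

class RunningStats:
    """O(1)-space aggregates: count, sum(x), sum(x^2)."""
    def __init__(self):
        self.n, self.s, self.s2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        self.s += x
        self.s2 += x * x

    def merge(self, other):
        # Combining two substreams is just adding their aggregates.
        self.n += other.n
        self.s += other.s
        self.s2 += other.s2

    def mean(self):
        return self.s / self.n

    def stddev(self):
        # Population std dev via sqrt(E[x^2] - E[x]^2); clamp tiny
        # negative values caused by roundoff.
        return math.sqrt(max(0.0, self.s2 / self.n - self.mean() ** 2))

# Two partitions collected independently, then merged:
a, b = RunningStats(), RunningStats()
for x in (1.0, 2.0, 3.0):
    a.add(x)
for x in (4.0, 5.0):
    b.add(x)
a.merge(b)
print(a.mean(), a.stddev())  # 3.0 1.4142...
```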

Your difficulty is that you want a sliding window, so you also have to subtract the contribution of values older than N samples. If you need an updated result after every sample, that takes O(N) memory, because you must remember the sums for every position in order to subtract them later.

If N is very large, however, then usually you don't need an update for every sample. If an update every N/1000 samples is enough, then you only need to remember the count and sums for 1000 blocks of N/1000 samples each, and that is constant space.
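A sketch of that blocking scheme in Python (the class and parameter names are mine; the window covers num_blocks * block_size samples and slides forward one block at a time):

```python
import math
from collections import deque

class BlockedWindowStats:
    """Approximate sliding window: results advance one block at a time,
    so memory is O(num_blocks) regardless of the window size N."""
    def __init__(self, num_blocks=1000, block_size=1000):
        self.block_size = block_size
        # deque(maxlen=...) drops the oldest block automatically
        self.blocks = deque(maxlen=num_blocks)  # entries: (n, sum, sumsq)
        self.cur = [0, 0.0, 0.0]

    def add(self, x):
        self.cur[0] += 1
        self.cur[1] += x
        self.cur[2] += x * x
        if self.cur[0] == self.block_size:
            self.blocks.append(tuple(self.cur))
            self.cur = [0, 0.0, 0.0]

    def stats(self):
        # Mean and std dev over the completed blocks; this only changes
        # once every block_size samples, which is the approximation.
        n = sum(b[0] for b in self.blocks)
        s = sum(b[1] for b in self.blocks)
        s2 = sum(b[2] for b in self.blocks)
        mean = s / n
        return mean, math.sqrt(max(0.0, s2 / n - mean * mean))
```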

Answer by btilly

I am very unclear on how you intend to do the calculation for a fixed last N when the stream is divided up between sources, and (presumably) those sources receive samples somewhat randomly. No single source knows how many of its samples fall within the last N.

Still, I can offer some general principles. The well-known E(x^2) - (E(x))^2 formula is numerically unstable, because squaring introduces large numbers that then cancel, so roundoff error in the large numbers can be left as a significant error in the small result.
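A quick way to see the cancellation (with hypothetical numbers): shift a tiny-variance dataset by a large offset, and the naive formula loses the answer entirely in double precision while a stable method does not.

```python
import statistics

xs = [1e8 + 1.0, 1e8 + 2.0, 1e8 + 3.0]  # large offset, tiny spread

n = len(xs)
mean = sum(xs) / n
naive_var = sum(x * x for x in xs) / n - mean * mean  # E[x^2] - E[x]^2

print(naive_var)                 # 0.0 with IEEE doubles: the variance is gone
print(statistics.pvariance(xs))  # stable result: 0.666...
```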

Welford's algorithm, however, is a numerically stable online algorithm for calculating the standard deviation of a stream. There are even parallel versions of it that can combine partial results computed on multiple devices, though those versions can sometimes struggle with numerical stability.
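A sketch of both pieces in Python; the `merge` method follows the Chan et al. pairwise combination formulas for two partial results:

```python
class Welford:
    """Welford's online mean/variance in O(1) space per stream."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Chan et al.'s pairwise combination of two partial results.
        if other.n == 0:
            return
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n

    def variance(self):
        return self.m2 / self.n if self.n else float("nan")
```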

However, this does not solve the "last N" problem. Depending on your use case, I'm going to suggest that you solve a different problem instead, namely "the last n seconds", and make it approximate.

In particular, you can use an exponentially decaying moving average. The idea is that an observation from t seconds ago has weight 1/k^t. The result is a reasonable approximation to an average taken over roughly the last k seconds. This is how, for instance, Linux implements its load average.

Literally: if s is the time since the last observation, you divide the weight of everything accumulated so far by k^s, give the current observation weight 1, and apply a weighted variant of Welford's algorithm. Then use a variant of Chan's algorithm to combine the per-machine results into global averages.
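A sketch of the per-machine part, assuming wall-clock decay as described (the class name and the default decay constant k are mine, not from the answer; the cross-machine step would then combine these with a weighted Chan-style merge):

```python
import time

class DecayingStats:
    """Exponentially decaying mean/variance: before each update the
    accumulated weight is divided by k**s, where s is the elapsed time;
    the new observation enters with weight 1."""
    def __init__(self, k=2.0):  # k is a tuning choice, not prescribed above
        self.k = k
        self.w = 0.0     # total decayed weight
        self.mean = 0.0
        self.m2 = 0.0    # decayed sum of squared deviations
        self.last = None

    def add(self, x, t=None):
        t = time.monotonic() if t is None else t
        if self.last is not None:
            decay = self.k ** -(t - self.last)
            self.w *= decay   # scaling every weight scales w and m2 alike
            self.m2 *= decay  # (the weighted mean itself is unchanged)
        self.last = t
        # Weighted Welford update with this observation's weight = 1.
        self.w += 1.0
        delta = x - self.mean
        self.mean += delta / self.w
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / self.w if self.w > 0 else float("nan")
```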

This is not perfect, but it is what I would do.