Divide a stream into bins with equal counts


Ideally, I want the following without reading the data from the hard disk too many times. The data is big and memory cannot hold all of it at once.

  1. The input is a stream x[t] read from the hard disk. The stream of numbers contains N elements.
  2. It is possible to build a histogram of x with m bins.
  3. The m bins are defined by the bin edges e0 < e1 < ... < em. For example, if ei <= x[0] < ei+1, then x[0] belongs to the ith bin.
  4. Find bin edges such that each bin holds a nearly equal number of elements from the stream. Ideally, the count in each bin should be within some threshold percent of N/m, since evenly distributing N elements among m bins gives about N/m elements per bin.

Current solution:

import numpy as np


def test_data(size):
    x = np.random.normal(0, 0.5, size // 2)
    x = np.hstack([x, np.random.normal(4, 1, size // 2)])
    return x


def bin_edge_as_index(n_bin, fine_hist, fine_n_bin, data_size):
    # For each target edge, find the index of the first fine bin whose
    # cumulative count exceeds the corresponding multiple of data_size / n_bin.
    cum_sum = np.cumsum(fine_hist)
    bin_id = np.empty((n_bin + 1), dtype=int)

    count_per_bin = data_size * 1.0 / n_bin

    for i in range(1, n_bin):
        bin_id[i] = np.argmax(cum_sum > count_per_bin * i)

    bin_id[0] = 0
    bin_id[n_bin] = fine_n_bin
    return bin_id


def get_bin_count(bin_edge, data):
    # Count the elements falling into each half-open interval
    # [bin_edge[i], bin_edge[i + 1]).
    n_bin = bin_edge.shape[0] - 1
    result = np.zeros((n_bin), dtype=int)
    for i in range(n_bin):
        cmp0 = (bin_edge[i] <= data)
        cmp1 = (data < bin_edge[i + 1])
        result[i] = np.sum(cmp0 & cmp1)
    return result


# Test Setting
test_size = 10000
n_bin = 6
fine_n_bin = 2000  # use a big number and hope it works

# Test Data
x = test_data(test_size)

# Fine Histogram
fine_hist, fine_bin_edge = np.histogram(x, fine_n_bin)

# Index of the bins of the fine histogram that contains
# the required bin edges (e_1, e_2, ... e_n)
bin_id = bin_edge_as_index(
    n_bin, fine_hist, fine_n_bin, test_size)

# Find the bin edges
bin_edge = fine_bin_edge[bin_id]
print("bin_edges:")
print(bin_edge)

# Check
bin_count = get_bin_count(bin_edge, x)
print("bin_counts:")
print(bin_count)
print("ideal count per bin:")
print(test_size * 1.0 / n_bin)

Output of program:

bin_edges:
[-1.86507282 -0.22751473  0.2085489   1.30798591  3.57180559  4.40218207
  7.41287669]
bin_counts:
[1656 1675 1668 1663 1660 1677]
ideal count per bin:
1666.6666666666667

Problem:

I can't specify a threshold s and expect the bin counts to be at most s% away from the ideal count per bin.

Accepted answer:

Assuming that the distribution is not outrageously skewed (like 10000 values between 1.0000001 and 1.0000002 and 10000 others between 9.0000001 and 9.0000002), you can proceed as below.

Compute a histogram with a sufficient resolution, say K bins, which covers the whole range (hopefully known beforehand). This will take a single pass over the data.

Then compute the cumulative histogram and, as you go, identify the m+1 quantile edges (where the cumulative counts cross multiples of N/m).

The accuracy that you will get is dictated by the maximum number of elements in a bin of the original histogram.

For N elements, using a histogram of K bins and assuming some "nonuniformity factor" f (equal to a few units for reasonable distributions), the maximum error will be f·N/K.
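
To put numbers on that bound: with the test setting from the question (N = 10000, K = 2000) and a nonuniformity factor of, say, f ≈ 3, the worst-case error per edge is about f·N/K = 3·10000/2000 = 15 elements, i.e. roughly 1% of the ideal per-bin count N/m ≈ 1667 for m = 6.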


You can improve accuracy if you like by considering m+1 auxiliary histograms which only accumulate the values that fall in the quantile bins of the global histogram. Then you can refine the quantiles to the resolution of these auxiliary histograms.

This will cost you an extra pass, but the error will be reduced to f·N/(K·K'), while using only K and then m·K' bins of histogram space, instead of K·K'.
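
Here is a minimal numpy sketch of that two-pass scheme, in the style of the question's code. The names (refine_edges, K, K_prime) are mine, and for brevity the "stream" is an in-memory array that is simply scanned again; in a real streaming setting the second pass would fill all auxiliary histograms in a single sweep instead of re-scanning per edge.

import numpy as np


def refine_edges(data, m, K=2000, K_prime=200):
    # Pass 1: coarse global histogram over the full range
    # (assumed known, or found in a preliminary pass).
    N = data.size
    lo, hi = data.min(), data.max()
    hist, edges = np.histogram(data, bins=K, range=(lo, hi))
    cum = np.cumsum(hist)

    refined = [lo]
    for i in range(1, m):
        target = N * i / m
        # Coarse bin that contains the i-th quantile edge.
        j = np.searchsorted(cum, target)
        left, right = edges[j], edges[j + 1]

        # Pass 2 (done per edge here for clarity): auxiliary histogram
        # restricted to that single coarse bin.
        aux_hist, aux_edges = np.histogram(
            data[(data >= left) & (data < right)],
            bins=K_prime, range=(left, right))
        below = cum[j - 1] if j > 0 else 0
        k = np.searchsorted(np.cumsum(aux_hist) + below, target)
        # Left edge of the first sub-bin whose cumulative count reaches
        # the target (same convention as the question's code).
        refined.append(aux_edges[min(k, K_prime)])
    refined.append(hi)
    return np.array(refined)

With the question's test data, refine_edges(x, n_bin) should bring the per-bin deviation from N/m down to roughly the f·N/(K·K') level, which for these sizes is essentially integer rounding.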

Another answer:

If you can assume your data is random with a well-defined distribution (that is, any non-trivial fraction of your data taken in sequence "sketches" the same distribution as the entire data set, only at a coarser precision), I imagine there are a number of options:

  1. read a part of your data into some oversampled histogram. Based on this, choose an approximation for the bin edges the way you do now (as explained in your question), then uniformly oversample these bins, then read another chunk of your data into the new bins, and so on. If you have enough data, processing it in chunks of 10% would allow for 10 iterations to improve your bin structure in a single pass (a rough sketch of this option appears at the end of this answer).

  2. start with a number of bins and accumulate some (not all) of the data. Look over the bins and, if one has bin_width*count disproportionately higher than its neighbours (maybe this is where the precision/error comes into play), divide that bin in two and heuristically assign the old bin count to the newly created bins (one possible heuristic: proportional to the counts of the neighbours). At the end, you should have a division roughly controlled by the acceptable error, from which you can more or less interpolate your distribution.

Of course, the above are only ideas for approaches; I can't offer any guarantee about how well they will work.
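
For what it's worth, here is one possible reading of option 1 as numpy code. Everything about it is an illustrative assumption rather than part of the answer: the helper name chunked_edge_refinement, the oversample factor, and shuffling and splitting the question's test data into ten chunks to stand in for the stream. Values that fall outside the current grid in later chunks are simply dropped, which is only acceptable under the "chunks sketch the same distribution" assumption above.

import numpy as np


def chunked_edge_refinement(chunks, m, oversample=50):
    edges = None
    for chunk in chunks:
        if edges is None:
            # First chunk: a uniform, oversampled grid over its range.
            grid = np.linspace(chunk.min(), chunk.max(), m * oversample + 1)
        else:
            # Later chunks: oversample each current bin uniformly.
            grid = np.concatenate(
                [np.linspace(edges[i], edges[i + 1], oversample + 1)[:-1]
                 for i in range(m)] + [edges[-1:]])
        # Histogram this chunk on the oversampled grid; values outside
        # the grid are ignored.
        hist, grid = np.histogram(chunk, bins=grid)
        cum = np.cumsum(hist)
        # Re-derive the m-bin edges from this chunk's counts.
        targets = cum[-1] * np.arange(1, m) / m
        idx = np.searchsorted(cum, targets)
        edges = np.concatenate(([grid[0]], grid[idx], [grid[-1]]))
    return edges


# Example: the question's bimodal test data, shuffled so that every
# chunk resembles the whole stream, then split into ten chunks.
x = np.hstack([np.random.normal(0, 0.5, 5000), np.random.normal(4, 1, 5000)])
np.random.shuffle(x)
print(chunked_edge_refinement(np.array_split(x, 10), m=6))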