Ideally, I want the following without reading the data from the hard disk too many times. The data is big and memory can't hold all of it at once.
- Input is a stream x[t] read from the hard disk. The stream contains N numbers.
- I want a histogram of x with m bins.
- The m bins are defined by the bin edges e_0 < e_1 < ... < e_m. For example, if e_i <= x[0] < e_{i+1}, then x[0] belongs to the i-th bin.
- Find the bin edges that make each bin hold a nearly equal number of elements from the stream. The count in each bin should ideally be within some threshold percent of N/m, since evenly distributing N elements among m bins gives about N/m elements per bin.
Current solution:
import numpy as np
def test_data(size):
    # Bimodal test data: two normal clusters.
    x = np.random.normal(0, 0.5, size // 2)
    x = np.hstack([x, np.random.normal(4, 1, size // 2)])
    return x

def bin_edge_as_index(n_bin, fine_hist, fine_n_bin, data_size):
    # Walk the cumulative counts of the fine histogram and pick the
    # first fine bin that crosses each multiple of data_size / n_bin.
    cum_sum = np.cumsum(fine_hist)
    bin_id = np.empty((n_bin + 1), dtype=int)
    count_per_bin = data_size * 1.0 / n_bin
    for i in range(1, n_bin):
        bin_id[i] = np.argmax(cum_sum > count_per_bin * i)
    bin_id[0] = 0
    bin_id[n_bin] = fine_n_bin
    return bin_id

def get_bin_count(bin_edge, data):
    # Count how many elements fall in each [bin_edge[i], bin_edge[i+1]) interval.
    n_bin = bin_edge.shape[0] - 1
    result = np.zeros((n_bin), dtype=int)
    for i in range(n_bin):
        cmp0 = (bin_edge[i] <= data)
        cmp1 = (data < bin_edge[i + 1])
        result[i] = np.sum(cmp0 & cmp1)
    return result
# Test Setting
test_size = 10000
n_bin = 6
fine_n_bin = 2000 # use a big number and hope it works
# Test Data
x = test_data(test_size)
# Fine Histogram
fine_hist, fine_bin_edge = np.histogram(x, fine_n_bin)
# Index of the bins of the fine histogram that contains
# the required bin edges (e_1, e_2, ... e_n)
bin_id = bin_edge_as_index(
n_bin, fine_hist, fine_n_bin, test_size)
# Find the bin edges
bin_edge = fine_bin_edge[bin_id]
print("bin_edges:")
print(bin_edge)
# Check
bin_count = get_bin_count(bin_edge, x)
print("bin_counts:")
print(bin_count)
print("ideal count per bin:")
print(test_size * 1.0 / n_bin)
Output of program:
bin_edges:
[-1.86507282 -0.22751473 0.2085489 1.30798591 3.57180559 4.40218207
7.41287669]
bin_counts:
[1656 1675 1668 1663 1660 1677]
ideal count per bin:
1666.6666666666667
Problem:
I can't specify a threshold s and expect the bin counts to be at most s% away from the ideal count per bin.
Assuming that the distribution is not outrageously skewed (like 10000 values between 1.0000001 and 1.0000002 and 10000 others between 9.0000001 and 9.0000002), you can proceed as below.
Compute a histogram with a sufficient resolution, say K bins, covering the whole range (hopefully known beforehand). This takes a single pass over the data.

Then compute the cumulated histogram and, as you go, identify the m+1 quantile edges (where the cumulated counts cross multiples of N/m).

The accuracy you get is dictated by the maximum number of elements in a bin of the original histogram.
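As a concrete illustration of this first pass, here is a minimal sketch that accumulates the K-bin histogram from chunks of the stream and then reads the m+1 edges off the cumulated counts. It assumes the overall range [lo, hi] is known beforehand; the chunks iterable and the name streaming_quantile_edges are hypothetical stand-ins, not part of the question's code.

import numpy as np

def streaming_quantile_edges(chunks, lo, hi, m, K=2000):
    # Single pass: accumulate a K-bin histogram over [lo, hi].
    fine_edges = np.linspace(lo, hi, K + 1)
    fine_hist = np.zeros(K, dtype=np.int64)
    N = 0
    for chunk in chunks:                      # one pass over the data
        h, _ = np.histogram(chunk, bins=fine_edges)
        fine_hist += h
        N += chunk.size
    # Place the m+1 edges where the cumulated counts cross k*N/m.
    cum = np.cumsum(fine_hist)
    edges = np.empty(m + 1)
    edges[0], edges[m] = lo, hi
    for k in range(1, m):
        j = np.searchsorted(cum, k * N / m)   # first fine bin reaching the target
        edges[k] = fine_edges[j + 1]          # off by at most one fine bin
    return edges, fine_edges, fine_hist, N

With the question's test data you could call it as streaming_quantile_edges((x[i:i + 1000] for i in range(0, x.size, 1000)), x.min(), x.max(), m=6), which closely reproduces the current solution while only ever holding one chunk in memory.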
For N elements, using a histogram of K bins and assuming some "nonuniformity factor" f (equal to a few units for reasonable distributions), the maximum error will be f·N/K.
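For instance, with the test setting above (N = 10000, K = 2000) and assuming a nonuniformity factor of roughly f ≈ 2, the maximum error is about 2 · 10000 / 2000 = 10 elements, well under 1% of the ideal N/m ≈ 1667 counts per bin.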
You can improve accuracy if you like by considering m+1 auxiliary histograms which only accumulate the values that fall in the quantile bins of the global histogram. Then you can refine the quantiles to the resolution of these auxiliary histograms.

This will cost you an extra pass, but the error will be reduced to f·N/(K·K'), using K then m·K' histogram space only, instead of K·K'.
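A minimal sketch of that refinement pass, continuing from the streaming_quantile_edges sketch above (fine_edges, fine_hist and N come from the first pass). It re-reads the data once, builds a K'-bin auxiliary histogram inside each coarse bin where a quantile crossing occurred, and moves the edge to the auxiliary resolution; the chunks iterable must be recreated for this extra pass, and the names are again hypothetical.

def refine_edges(chunks, fine_edges, fine_hist, N, m, Kp=100):
    # Counts strictly before each coarse bin: cum[j] = sum of fine_hist[:j].
    cum = np.concatenate(([0], np.cumsum(fine_hist)))
    targets = [k * N / m for k in range(1, m)]
    # Coarse bin j containing each quantile crossing.
    js = [np.searchsorted(cum[1:], t) for t in targets]
    # One K'-bin auxiliary histogram per interior edge, spanning coarse bin j.
    aux_edges = [np.linspace(fine_edges[j], fine_edges[j + 1], Kp + 1) for j in js]
    aux_hist = [np.zeros(Kp, dtype=np.int64) for _ in js]
    for chunk in chunks:                          # the extra pass over the data
        for a_e, a_h in zip(aux_edges, aux_hist):
            h, _ = np.histogram(chunk, bins=a_e)  # values outside a_e are ignored
            a_h += h
    refined = []
    for t, j, a_e, a_h in zip(targets, js, aux_edges, aux_hist):
        acum = cum[j] + np.cumsum(a_h)            # cumulated counts at auxiliary resolution
        i = np.searchsorted(acum, t)
        refined.append(a_e[min(i + 1, Kp)])       # now off by at most one auxiliary bin
    return np.array(refined)

Only the m-1 interior edges are refined here, since the outermost edges are just the known range bounds; the outer loop over chunks is the extra pass mentioned above.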