Most efficient way to make a Histogram from an array in Java


I want to count how often numbers occur in a double array by binning them (example array below). This is essentially the functionality provided by histogram() in Python's NumPy. I'm in a constrained environment with access to basic Java Math and the jblas library, but nothing else; no other third-party libraries such as Colt can be installed.

double[] x1 = {1, 1, 2, 2, 1, 3, 2};

I have a separate sorted array, binEdges, which marks the start and end of each bin and looks like the following:

binEdges = [4.9E-324, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 4.0, 4.0, 5.0, 5.0, 7.0, 1.7976931348623157E308]

Note that the binEdges array may contain repeated elements, and I would like to keep them as they are. Hence, with the given binEdges array, the result of the frequency count will look like:

binCounts = [0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

The binCounts array combined with binEdges, read from left to right, is as follows (note the brackets on the bin intervals: each bin includes its lower edge and excludes its upper edge):

Bin interval       Frequency
[4.9E-324, 1.0)    0
[1.0, 1.0)         0
[1.0, 1.0)         0
[1.0, 2.0)         3   (since we have 3 ones in x1)
...                ...
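To pin down the half-open convention, here is a brute-force sketch that checks every value against every bin. It is added only as a correctness baseline and is not the implementation this question is about; the class name ReferenceHistogram is made up for illustration.

```java
import java.util.Arrays;

final class ReferenceHistogram {

    // Checks every value against every bin [lower, upper): O(n * m),
    // so it is only useful for verifying faster implementations.
    static double[] binCounts(double[] x, double[] binEdges) {
        double[] counts = new double[binEdges.length - 1];
        for (int i = 0; i < counts.length; i++) {
            for (double v : x) {
                // A bin with equal edges, such as [1.0, 1.0), can never match.
                if (v >= binEdges[i] && v < binEdges[i + 1]) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] x1 = {1, 1, 2, 2, 1, 3, 2};
        double[] binEdges = {Double.MIN_VALUE, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0,
                             3.0, 4.0, 4.0, 5.0, 5.0, 7.0, Double.MAX_VALUE};
        System.out.println(Arrays.toString(binCounts(x1, binEdges)));
    }
}
```

With the x1 and binEdges from above, this should reproduce the binCounts array shown earlier.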

I currently have the following implementation, which runs in O(n lg n), assuming the sort takes O(n lg n). I'm wondering whether this can be brought below O(n lg n). I have looked through jblas and could not find a library function for binning, so I would welcome any pointers to native Java tricks or a clever indexing scheme. Other suggestions that cut down the running time of the code are also welcome.

Cutting the time down is important because the data at hand is huge.

public static double [] binCounts(double[] x, double[] binEdges){
    double [] ret = new double[binEdges.length - 1];
    Arrays.sort(x); // takes O(nlgn), the loop below is effectively O(n)
    int k = 0;
    for (int i = 0; i < binEdges.length - 1; i++) {    
        if (binEdges[i] == binEdges[i+1])
            continue;
        for (int j = k; j < x.length; j++){
            if (x[j] >= binEdges[i+1])
                break;
            else if (x[j] >= binEdges[i] && x[j] < binEdges[i+1]){
                ret[i] += 1;
                k++;
            }
        }
    }
    return ret;
}

There are 3 best solutions below

3
On BEST ANSWER

You can binary search the bin edges using a TreeMap:

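import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public static double[] binCounts(double[] x, double[] binEdges) {
    int binEdgesSize = binEdges.length;
    // Map each distinct edge to its last index: put() overwrites duplicates,
    // so a repeated edge resolves to the only non-empty bin that starts there.
    NavigableMap<Double, Integer> binEdgesMap = new TreeMap<>();
    for (int i = 0; i < binEdgesSize; ++i)
        binEdgesMap.put(binEdges[i], i);
    double[] ret = new double[binEdgesSize - 1];
    for (double d : x) {
        // floorEntry returns the greatest edge <= d, i.e. the lower edge of d's bin
        // (ceilingEntry would put a value such as 1.5 into the bin starting at 2.0).
        Map.Entry<Double, Integer> e = binEdgesMap.floorEntry(d);
        // Skip values below the first edge or at/above the last edge.
        if (e != null && e.getValue() < ret.length)
            ++ret[e.getValue()];
    }
    return ret;
}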
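The same lookup can also be sketched on the primitive array directly, which avoids boxing every value into a Double. This is a hand-written upper-bound binary search, not a library call, and the names BinarySearchHistogram and upperBound are made up for illustration. It assumes binEdges is sorted, as stated in the question, and runs in O(n log m) for n values and m edges without sorting x.

```java
import java.util.Arrays;

final class BinarySearchHistogram {

    // Index of the first edge strictly greater than value (edges.length if none).
    static int upperBound(double[] edges, double value) {
        int lo = 0, hi = edges.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (edges[mid] <= value) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    static double[] binCounts(double[] x, double[] binEdges) {
        double[] counts = new double[binEdges.length - 1];
        for (double d : x) {
            // The bin of d starts at the last edge <= d; for repeated edges
            // this is the last duplicate, so empty bins like [1.0, 1.0) stay at 0.
            int bin = upperBound(binEdges, d) - 1;
            // Values below the first edge or at/above the last edge are ignored.
            if (bin >= 0 && bin < counts.length) {
                counts[bin]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] x1 = {1, 1, 2, 2, 1, 3, 2};
        double[] binEdges = {Double.MIN_VALUE, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0,
                             3.0, 4.0, 4.0, 5.0, 5.0, 7.0, Double.MAX_VALUE};
        System.out.println(Arrays.toString(binCounts(x1, binEdges)));
    }
}
```

Unlike the version in the question, this sketch leaves x unsorted and unmodified.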
0
On

Take a look at your data and see whether it has any patterns. If it does, you may find a sorting algorithm whose best case fits it, or get some ideas from how image compression works.

Take video game objects as an example: their coordinates usually change only slightly from one frame to the next, so the data is already nearly sorted. A simple bubble sort then mostly runs close to its best-case time complexity.

If your data only takes values from a small set of numbers, consider a single pass that does the counting on the fly, so that you don't need a sorting step at all.

A side note: in my experience, a huge data size is usually a space-complexity problem as well. Think of a machine with limited RAM but a big hard disk. In that case I would expect the bottleneck to be disk reads and writes or, in a distributed system, the network storage. Even something like your new double[binEdges.length - 1] may result in an OutOfMemoryError.
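As a rough illustration of that best case, here is a minimal bubble sort sketch that stops as soon as a pass makes no swaps. The class name EarlyExitBubbleSort and the returned pass count are made up for this example.

```java
final class EarlyExitBubbleSort {

    // Sorts the array in place and returns the number of passes made.
    // On already-sorted input a single pass is enough, which is the O(n) best case.
    static int sort(double[] a) {
        int passes = 0;
        boolean swapped = true;
        for (int end = a.length - 1; swapped && end > 0; end--) {
            swapped = false;
            passes++;
            for (int i = 0; i < end; i++) {
                if (a[i] > a[i + 1]) {
                    double tmp = a[i];
                    a[i] = a[i + 1];
                    a[i + 1] = tmp;
                    swapped = true;
                }
            }
        }
        return passes;
    }

    public static void main(String[] args) {
        double[] nearlySorted = {1, 2, 4, 3};
        int passes = sort(nearlySorted);
        System.out.println(java.util.Arrays.toString(nearlySorted) + " after " + passes + " passes");
    }
}
```

The worst case is still O(n^2), so this only pays off when the input is known to be almost in order.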
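A minimal sketch of such a one-pass count, assuming the distinct values are few enough to keep in a HashMap. The class name ValueCounter is made up for illustration, and mapping the per-value counts onto your bins would be a separate, much smaller step.

```java
import java.util.HashMap;
import java.util.Map;

final class ValueCounter {

    // Counts how often each distinct value occurs, in a single pass over x.
    static Map<Double, Integer> countValues(double[] x) {
        Map<Double, Integer> counts = new HashMap<>();
        for (double v : x) {
            // merge() stores 1 for a new key, otherwise adds 1 to the existing count.
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] x1 = {1, 1, 2, 2, 1, 3, 2};
        // HashMap iteration order is unspecified, so the printed order may vary.
        System.out.println(countValues(x1));
    }
}
```

This takes O(n) expected time and memory proportional to the number of distinct values, not to the length of x.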

Also, try to use HashSet or similar.

0
On

@saka1029 thanks for showing the NavigableMap container class (I didn't know about it). It seems this could be simplified by eliminating the ret object and keeping the counts in the map itself, keyed by the lower edge of each bin. Note that a map holds only one entry per distinct edge, so repeated edges collapse into a single bin:

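import java.util.NavigableMap;
import java.util.TreeMap;

public static double[] binCounts(double[] x, double[] binEdges) {
    int binEdgesSize = binEdges.length;
    // binCount: Key = lower edge of bin; Value = item count
    NavigableMap<Double, Integer> binCount = new TreeMap<>();
    for (int i = 0; i < binEdgesSize; ++i)
        binCount.put(binEdges[i], 0);  // Initialize count to zero
    for (double item : x) {
        Double edge = binCount.floorKey(item);
        if (edge != null)
            // Integer is immutable, so get(edge)++ does not compile; merge() adds 1 instead
            binCount.merge(edge, 1, Integer::sum);
    }
    // One count per distinct edge, in ascending edge order (the last edge included)
    return binCount.values().stream().mapToDouble(Integer::doubleValue).toArray();
}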