I want to count the frequency of occurrence of numbers in a double array by binning (example array below). Essentially the same functionality provided by Python numpy's histogram()
. I'm on a constrained environment and have access to basic Java Math
and jblas library, but nothing else and no other third party libraries like colt are installable.
double[] x1 = {1, 1, 2, 2, 1, 3, 2}
I have a separate sorted array which marks the start and end of the binEdges
and looks like the following:
binEdges = [4.9E-324, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 4.0, 4.0, 5.0, 5.0, 7.0, 1.7976931348623157E308]
Note that the binEdges
array may have repeated elements and I would like to keep them such. Hence, with the given binEdges
array the result of frequency count will looking like:
binCounts = [0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
The binCounts
array combined with the binEdges
, when read from left to right is as follows, note the braces on the bin intervals:
Bin interval frequency
[4.9E-324, 1.0) 0
[1.0, 1.0) 0
[1.0, 1.0) 0
[1.0, 2.0) 3 (since we have 3 ones in x1)
. .
. .
. .
I currently have the following implementation, which runs in O(nlgn)
, assuming the sorting takes O(nlgn)
. I'm wondering if this can be trimmed down to something lower than O(nlgn)
. I have looked around in jblas as well and don't know of a library function for binning, if folks over here have any other insights on other native Java tricks or clever indexing scheme that they can point me to. Other suggestions on improving the code which cut down running time are also welcome.
Cutting the time down is important because the data at hand is huge.
public static double [] binCounts(double[] x, double[] binEdges){
double [] ret = new double[binEdges.length - 1];
Arrays.sort(x); // takes O(nlgn), the loop below is effectively O(n)
int k = 0;
for (int i = 0; i < binEdges.length - 1; i++) {
if (binEdges[i] == binEdges[i+1])
continue;
for (int j = k; j < x.length; j++){
if (x[j] >= binEdges[i+1])
break;
else if (x[j] >= binEdges[i] && x[j] < binEdges[i+1]){
ret[i] += 1;
k++;
}
}
}
return ret;
}
You can binary search for binEdges using TreeMap: