Find the "center" of some histograms with EMD as the distance metric

868 Views Asked by At

Given some histograms of the same number of buckets, I need to find the "center" of those histograms. The "center" is a histogram such that the sum of Earth Mover's Distances to it from all other histograms is the minimum.

For example, given 4 histogram A, B, C, D, the algorithm needs to output a new histogram X such that EMD(X, A) + EMD(X, B) + EMD(X, C) + EMD(X, D) is the minimum.

Simple arithmetic mean cannot find the "center", here is an example.

I need to calculate the "center" of millions of histograms, so how can I find the "center" efficiently. If no fast algorithm exists, is there any good approximate ?

=== edit ===

Added an example to clarify my problem.

2

There are 2 best solutions below

2
On

If by "center" you are referring to the median, that would require sorting of the data set; in this case histograms are sorted already. It is understood that the data of the histograms will likely not be in list form; however, as no alternative is noted the answer will be structured as such.

raw data: list of values histogram data:list of tuples "((min, max), quantity)"

raw data available:

#all data from histogram w/o bucket approximation
data = []
#sorting algorithm should be optimized for the size of the dateset 
#skip this step if data set sorted
data.sort()
median = (data[int(len(data)/2)]+data[round(len(data)/2)])/2

histogram data:

#data from histogram only
data = []
#sort by min
#skip this step if histogram is already sorted
sort(key=lambda x:x[0][0])
#n is the data quantity
n = sum([bucket[1] for bucket in data])

#loop checks to see if median is captured in current bucket 
#by expanding the "net" by the current bucket size
rollingQuant = 0
median = None
for bucket in data:
    rollingQuant += bucket[1]
    #if median captured in bucket True
    if rollingQuant >= n/2:
        median = bucket[0]

issue:

  • median value with only histogram data is impossible to calculate can be estimated; one example of estimation would be defining linear equation with min max and the number of points in the bucket.
0
On

The earth mover distance, informally, treats each distribution or in this case histogram as a pile of dirt. Finding the center according to EMD then consists of finding the point that minimizes the product of each unit of histogram area times the distance it must be moved. E.g. moving a histogram bin with 7 members one unit has the same EMD as moving a histogram bin with 1 member 7 units. I assume your distribution only has one dimension for simplicity, but if not this center point can be found independently along each dimension.

The earth mover distance is actually quite analogous to a concept in physics or statics; the first moment. The first moment for a shape is defined as the product of each unit of area times the distance of that unit of area from the center point. The center point is found such that the total area on each side of the center point (along any principal dimension) is the same.

Fans of statics will recall that this center point is indeed actually just the mean of area distribution along each dimension. The EMD definition yields a similar result. Thus, to find the EMD centroid for your group of histograms, do the following:

Treat each member of each bin as a single unit. Assign each unit the value associated with that bin. Thus, if for a given dimension bin 0-10 has 5 entries, you have 5 units each with value 5 (the mean of the bin). Find the median value across all units, and this is your centroid value for that dimension. Repeat for all dimensions if there are more than one, and that's it!

In this trivial example with 2 histograms, after treating each bin element as one unit, the median (and thus center) is 4. Obviously, moving the center either above or below 4 will result in increasing the EMD by 5 and decreasing it by 4.

Getting to the heart of your question, you can do slightly better than sorting all histogram elements to find the median instead using QuickSelect to obtain O(n) on average time complexity, with O(nlogn) worst-case.