Can scipy.stats.wasserstein_distance be used with empirical distributions of different (unequal) sizes?


For the evaluation of a system, I have measured a metric of interest across three distinct configurations (settings). I thus have three arrays of observations, observations_setting_1, observations_setting_2, and observations_setting_3, which look, for example, like this:

# len(observations_setting_1): 90,000.
observations_setting_1 = [1.56, 23.7782, 10.46799, 3.013, ..., 15.522]

# len(observations_setting_2): 90,000.
observations_setting_2 = [11.8242, 3.998, 3.427, 13.324, ..., 8.01]

# len(observations_setting_3): 82,129.
observations_setting_3 = [4.2532, 19.75, 12.851, 9.0032, ..., 1.296]

The setting that produced observations_setting_1 is considered the baseline, while the other two settings modify some environmental conditions in order to see how the system's performance changes. As you can see from my example, for one of the settings I had to remove a number of observations due to data collection errors during the experiments (I cannot repeat the experiments at this point).

I would now like to quantify how much the empirical distributions of the metric of interest obtained from settings 2 and 3 deviate from the baseline. The 1st Wasserstein distance (also known as Earth Mover's distance) appears well suited for this. SciPy provides a function to compute the distance: scipy.stats.wasserstein_distance.
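To illustrate the call I have in mind, here is a minimal toy example (the values are made up and only serve to show that the function accepts two sample arrays of different lengths; as far as I understand, each array is treated as an empirical distribution with uniform weights unless explicit weights are passed):

import scipy.stats

# Made-up samples of different lengths, purely for illustration.
sample_a = [1.5, 2.3, 0.7, 4.1, 3.3, 2.8]  # 6 observations
sample_b = [2.1, 3.9, 1.2, 0.4]            # 4 observations

# Each array is interpreted as an empirical distribution (uniform weights by default).
print(scipy.stats.wasserstein_distance(sample_a, sample_b))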

My question: Given that setting 3 has fewer observations than settings 1 and 2, can I still use the value computed by scipy.stats.wasserstein_distance to make statements about how much setting 3 diverges from setting 1?

In other words, given:

import scipy.stats

divergence_2_from_1 = scipy.stats.wasserstein_distance(observations_setting_1, observations_setting_2)
divergence_3_from_1 = scipy.stats.wasserstein_distance(observations_setting_1, observations_setting_3)

can I make statements about how much setting 3 diverges from baseline setting 1 compared to how much setting 2 diverges from baseline setting 1, despite the difference in lengths of the input arrays? Am I making a statistical mistake if I use SciPy's 1st Wasserstein distance in this way? If so, is there a way for me to fix it?

I would have expected SciPy to reject my input arrays if equal size were a constraint; however, no error is returned.
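In case it is relevant for an answer: one sanity check I was considering is to repeatedly subsample the larger array down to the size of the smallest one and look at how stable the resulting distances are. Below is a rough sketch of that idea; the gamma-distributed placeholder arrays, the seed, and the 20 repetitions only stand in for my real data and are not meant to be definitive:

import numpy as np
import scipy.stats

rng = np.random.default_rng(seed=0)  # arbitrary seed

# Placeholder data standing in for my real observation arrays.
observations_setting_1 = rng.gamma(shape=2.0, scale=5.0, size=90_000)
observations_setting_3 = rng.gamma(shape=2.5, scale=5.0, size=82_129)

n_smallest = min(len(observations_setting_1), len(observations_setting_3))

distances = []
for _ in range(20):  # number of repetitions chosen arbitrarily
    # Subsample the larger array without replacement to the common size.
    subsample_1 = rng.choice(observations_setting_1, size=n_smallest, replace=False)
    distances.append(scipy.stats.wasserstein_distance(subsample_1, observations_setting_3))

print(np.mean(distances), np.std(distances))

If the subsampled distances were all close to the value computed on the full arrays, I would take that as a sign that the unequal sizes do not distort the comparison, but I am not sure whether that reasoning is sound, hence my question.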

I appreciate any help on this; thanks in advance.
