Wasserstein distance between two distributions in Python


I have distributions of some data before and after an event. I want to find the distance between these two distributions. Put differently, how much would I need to scale the pre-event distribution to bring it close to the post-event distribution? Wasserstein distance seems like a good fit for my problem, but I have some doubts:

  1. Each distribution is a histogram: the X axis is days, and the Y axis is the number of data points on that day. How do I pass these two columns as input to scipy.stats.wasserstein_distance?
  2. The post-event distribution is longer-tailed than the pre-event distribution. What is the best distance metric to measure the change in magnitude along the X axis, as well as the increase on the Y axis?
>>> from scipy.stats import wasserstein_distance
>>> df.head()
   day  number
0    7       1
1    8       1
2   10       2
3   11       1
4   15       4
>>> df_after.head()
   day  number
0    6       1
1   19       1
2   20       1
3   21       1
4   22       2
>>> wasserstein_distance(df['number'], df_after['number'])  # compares only the count columns; how do I pass the whole distribution?
0.8674329501915711
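
For reference, one way the (day, count) pairs might be passed is through the `u_weights` and `v_weights` parameters of `scipy.stats.wasserstein_distance`, with the days as the values and the counts as the weights. This is only a minimal sketch: the two frames below are rebuilt from the `head()` output above, not from the full dataset, so the number it prints does not describe the real data.

```python
import pandas as pd
from scipy.stats import wasserstein_distance

# Toy frames copied from the head() output in the question (assumption:
# the real frames have the same two columns, 'day' and 'number').
df = pd.DataFrame({"day": [7, 8, 10, 11, 15], "number": [1, 1, 2, 1, 4]})
df_after = pd.DataFrame({"day": [6, 19, 20, 21, 22], "number": [1, 1, 1, 1, 2]})

# Days are the support of each distribution; counts are the weights.
dist = wasserstein_distance(
    df["day"],
    df_after["day"],
    u_weights=df["number"],
    v_weights=df_after["number"],
)
print(dist)  # distance in units of days
```

Note that the weights are normalized to sum to 1 inside the function, so this distance reflects how mass is spread over the days, not the change in the total number of data points. The change in total count would have to be measured separately, for example as the ratio of the two `number` sums.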

Here is a sample plot of the real dataset: blue is pre-event and orange is post-event. My end goal is to learn from such distributions and generalize a scaling factor, i.e. how much do I need to scale my pre-event distribution to get to the post-event distribution?

Figure 1: Two distributions of the same object. Blue is pre-event and orange is post-event.
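
To make the "scaling factor" idea concrete, here is a rough sketch of one possible reading: look for the multiplier `s` on the pre-event days that minimizes the Wasserstein distance to the post-event distribution. The interpretation (scaling the day axis only), the search bounds, and the names `scaled_distance` and `best_scale` are assumptions for illustration, and the arrays are again the toy `head()` rows rather than the real data.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import wasserstein_distance

# Toy data from the head() output in the question (not the full dataset).
pre_days = np.array([7, 8, 10, 11, 15], dtype=float)
pre_counts = np.array([1, 1, 2, 1, 4], dtype=float)
post_days = np.array([6, 19, 20, 21, 22], dtype=float)
post_counts = np.array([1, 1, 1, 1, 2], dtype=float)


def scaled_distance(s):
    """Distance between the pre-event days stretched by s and the post-event days."""
    return wasserstein_distance(
        s * pre_days, post_days, u_weights=pre_counts, v_weights=post_counts
    )


# Assumed search range for the scale factor; adjust to the real data.
result = minimize_scalar(scaled_distance, bounds=(0.1, 10.0), method="bounded")
best_scale = result.x

print(best_scale, result.fun)
```

The residual distance `result.fun` indicates how much of the difference a pure stretch of the day axis cannot explain, which may matter here because the post-event distribution has a longer tail.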
