I have distributions of some data before and after an event, and I want to measure the distance between the two. Put differently: how much would I need to scale the pre-event distribution to bring it close to the post-event distribution? The Wasserstein distance seems like a good fit for my problem, but I have some doubts:
- In each distribution, the X axis is days and the Y axis is the number of data points on that day. How do I pass these two columns as input to scipy.stats.wasserstein_distance?
- The post-event distribution is longer-tailed than the pre-event distribution. What is the best distance metric to capture both the change in magnitude along the X axis and the increase on the Y axis?
>>> df.head()
   day  number
0    7       1
1    8       1
2   10       2
3   11       1
4   15       4
>>> df_after.head()
   day  number
0    6       1
1   19       1
2   20       1
3   21       1
4   22       2
>>> wasserstein_distance(df['number'], df_after['number'])  # only looks at one column of each DataFrame; how do I pass the whole distribution?
0.8674329501915711
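For reference, a minimal sketch of what passing both columns might look like. It assumes `day` is the support of each distribution and `number` is the count at each day, and it relies on the `u_weights` and `v_weights` parameters of `scipy.stats.wasserstein_distance`. Only the five rows shown by `head()` above are used, so the resulting value is illustrative rather than a result for the real dataset.

```python
import pandas as pd
from scipy.stats import wasserstein_distance

# Sample rows copied from the df.head() / df_after.head() output above.
df = pd.DataFrame({"day": [7, 8, 10, 11, 15], "number": [1, 1, 2, 1, 4]})
df_after = pd.DataFrame({"day": [6, 19, 20, 21, 22], "number": [1, 1, 1, 1, 2]})

# The days are the values; the per-day counts are the weights.
# SciPy normalizes each weight vector to sum to 1 internally.
distance = wasserstein_distance(
    df["day"].to_numpy(),
    df_after["day"].to_numpy(),
    u_weights=df["number"].to_numpy(),
    v_weights=df_after["number"].to_numpy(),
)
print(distance)  # distance expressed in days
```

Because the weights are normalized, this measures how far mass has to move along the day axis and does not reflect a change in the total number of data points.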
Here is a sample plot of the real dataset: blue is the pre-event distribution and orange is the post-event distribution. My end goal is to learn from such distributions and generalize a scaling factor, i.e. how much do I need to scale the pre-event distribution to get to the post-event distribution?
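To make the "scaling factor" idea concrete, here is one possible sketch: stretch the pre-event day axis by a factor `scale` and pick the value that minimizes the Wasserstein distance to the post-event distribution. This is only one interpretation of "scaling", chosen for illustration. The helper `scaled_distance` and the search bounds `(0.1, 10.0)` are assumptions, and the arrays contain only the five sample rows from `head()`.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import wasserstein_distance

# Sample rows from the head() output; the real data would replace these.
pre_days = np.array([7, 8, 10, 11, 15], dtype=float)
pre_counts = np.array([1, 1, 2, 1, 4], dtype=float)
post_days = np.array([6, 19, 20, 21, 22], dtype=float)
post_counts = np.array([1, 1, 1, 1, 2], dtype=float)


def scaled_distance(scale):
    """Distance after stretching the pre-event day axis by `scale` (hypothetical helper)."""
    return wasserstein_distance(
        scale * pre_days, post_days,
        u_weights=pre_counts, v_weights=post_counts,
    )


# Bounded 1-D search; the bounds are an arbitrary assumption.
result = minimize_scalar(scaled_distance, bounds=(0.1, 10.0), method="bounded")
best_scale = result.x

# The weights are normalized inside wasserstein_distance, so a change in
# total volume (the Y axis) has to be tracked separately.
volume_ratio = post_counts.sum() / pre_counts.sum()

print(best_scale, result.fun, volume_ratio)
```

Reporting the X-axis scale and the Y-axis volume ratio as two separate numbers is a design choice here; a single metric that combines both would need an explicit decision about how to weigh them.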