How to plot the difference between two histograms

I'm plotting two distributions as histplots, and would like to visualize the difference between them. The distributions are rather similar:

my plots

The code I am using to generate one of these plots looks like this:

sns.histplot(
    data=dfs_downvoted_percentages["only_pro"],
    ax=axes[0],
    x="percentage_downvoted",
    bins=30,
    stat="percent",
)

My supervisor suggested plotting the difference between the normalized distributions, that is, subtracting one plot from the other bin by bin. The end result should be a plot where some bins go below 0 (wherever a bin in plot 2 is larger than the same bin in plot 1). Thus, similarities between the plots are erased and differences highlighted.

  1. Does this make sense? The plots are part of a paper which will hopefully be published; I haven't seen such a plot before, but as he explained it, it makes sense to me. Are there better ways to visualize what I want to express? I already have another plot where I filter out all values with x=0, so that the other ones become more visible.
  2. Is there an easy way to achieve this utilizing seaborn?

If not: I know how I can normalize the data and calculate percentage for each bin by hand. But what I couldn't find is a kind of plot that consists of bins and offers the possibility to have negative bins. I know how I could create a lineplot with 30 data points showing the calculated difference, but I'd rather have it visually similar to the original plots with bins instead of a line. What kind of plot could I use for that?

There are 1 best solutions below

On BEST ANSWER
  • Use np.histogram, which returns hist and bin_edges.
    • The same bin_edges must be used for both function calls.
    • Subtract one hist from the other, and plot the difference against bin_edges.
  • Plot h_diff as a bar plot.
    • There is one more bin_edge than there are bars, so select all but the last value, bin_edges[:-1], for the x-axis labels passed to x=.
    • The x-ticks of a sns.barplot are 0-indexed, so reset the ticks with one extra tick, offset them by -0.5, and relabel the ticks with all the bin_edges.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# sample data
np.random.seed(2023)
a = np.random.normal(50, 15, (100,))
b = np.random.normal(30, 8, (100,))

# dataframe from sample distributions
df = pd.DataFrame({'a': a, 'b': b})

# calculate the histogram for each distribution
bin_edges = np.arange(10, 91, 10)

a_hist, _ = np.histogram(df.a, bins=bin_edges) 
b_hist, _ = np.histogram(df.b, bins=bin_edges) 

# calculate the difference
h_diff = a_hist - b_hist

# plot
fig, ax = plt.subplots(figsize=(7, 5))
sns.barplot(x=bin_edges[:-1], y=h_diff, color='tab:blue', ec='k', width=1, alpha=0.8, ax=ax)
# one tick per bin edge, shifted by -0.5 so the ticks sit on the bar boundaries
ax.set_xticks(ticks=np.arange(len(bin_edges)) - 0.5, labels=bin_edges)
ax.margins(x=0.1)
_ = ax.set(title='Difference between Sample A and B: hist(a) - hist(b)', ylabel='Difference', xlabel='Bin Ranges')

enter image description here
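
The example above subtracts raw counts, which is comparable here only because both samples contain 100 values. For the normalized comparison described in the question (stat="percent"), a minimal sketch is to convert each histogram to percentages before subtracting. The helper percent_hist and the unequal sample sizes below are assumptions made for illustration, and matplotlib's ax.bar is swapped in for sns.barplot so the bars sit directly on the numeric bin edges.

```python
import numpy as np
import matplotlib.pyplot as plt

# illustrative sample data with different sizes, so raw counts would not be comparable
rng = np.random.default_rng(2023)
a = rng.normal(50, 15, 300)
b = rng.normal(30, 8, 100)

# the same bin edges are used for both histograms
bin_edges = np.arange(10, 91, 10)


def percent_hist(values, edges):
    """Hypothetical helper: percentage of the in-range values that fall in each bin."""
    counts, _ = np.histogram(values, bins=edges)
    return counts / counts.sum() * 100


# difference in percentage points, negative where b has the larger share
pct_diff = percent_hist(a, bin_edges) - percent_hist(b, bin_edges)

fig, ax = plt.subplots(figsize=(7, 5))
# align='edge' anchors each bar's left side on its bin edge, so no tick offset is needed
ax.bar(bin_edges[:-1], pct_diff, width=np.diff(bin_edges), align='edge',
       color='tab:blue', edgecolor='k', alpha=0.8)
ax.axhline(0, color='k', linewidth=0.8)  # reference line at zero difference
ax.set_xticks(bin_edges)
_ = ax.set(title='Difference in percent: hist(a) - hist(b)',
           ylabel='Difference (percentage points)', xlabel='Bin Ranges')
```

Values outside the bin range are ignored by np.histogram, so each set of percentages sums to 100 over the in-range values and the differences sum to 0.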

  • An alternative, which I think presents the data better and still shows the distribution of both data sets, is to plot the histograms together with dodged bars.
fig, ax = plt.subplots(figsize=(7, 5))
sns.histplot(data=df, multiple='dodge', common_bins=True, ax=ax, bins=bin_edges)

enter image description here