Difference between KDE and Histogram Frequency


The result I observed from the seaborn density plot is quite confusing.

The result for:

sns.distplot(subset['difference_ratio'], kde = True, label =label ,hist=False).set(xlim=(0,1))

is below:

[image: resulting KDE plot]

And the result for:

sns.distplot(subset['difference_ratio'], kde = False, label =label ,hist=True).set(xlim=(0,1))

is below:

[image: resulting histogram]

How can these two plots be explained as showing the same behavior?


There are 2 best solutions below

Answer 1

The default y-axis of a histogram shows the number of samples in each bin. The y-axis of the kdeplot is normalized such that the total area under the curve is one. Setting norm_hist=True does something similar to the histogram's y-axis: all values get scaled such that the areas of the bars sum to one.
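
To make the two y-axis conventions concrete, here is a minimal sketch using plain NumPy rather than seaborn. The sample data and the choice of 10 bins are assumptions for illustration only; `density=True` in `np.histogram` performs the same kind of rescaling described above for norm_hist=True.

```python
import numpy as np

# Illustrative data (an assumption, not the asker's dataset)
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.5, scale=0.1, size=200)

# Raw counts: the bar heights add up to the number of samples
counts, edges = np.histogram(samples, bins=10)

# Density: heights are rescaled so that the total bar area is one
density, _ = np.histogram(samples, bins=edges, density=True)
widths = np.diff(edges)

print("sum of counts:", counts.sum())
print("total bar area:", (density * widths).sum())
```

Each density value is simply the count divided by (number of samples × bin width), which is why the two y-axes can differ by a large factor while the shapes stay the same.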

In a histogram, every sample that lies between the boundaries of a bin falls into that bin. The histogram doesn't differentiate whether the value falls close to the left edge, the right edge, or the center of the bin.
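
As a small illustration of this loss of position information, the sketch below builds two made-up sample sets (hypothetical values) that sit at opposite ends of the same bins and therefore produce identical counts.

```python
import numpy as np

# Five bins of width 0.2 on [0, 1]
edges = np.linspace(0, 1, 6)

# Hypothetical samples: one set hugs the left edge of each bin,
# the other hugs the right edge of the same bins
left = np.array([0.01, 0.02, 0.21, 0.22, 0.41])
right = np.array([0.19, 0.18, 0.39, 0.38, 0.59])

counts_left, _ = np.histogram(left, bins=edges)
counts_right, _ = np.histogram(right, bins=edges)

print(counts_left)
print(counts_right)
```

Both calls give the same counts, even though the underlying values are clearly different.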

A kde plot, on the other hand, takes each individual sample value and draws a small Gaussian bell curve over it. Then, all bell curves are summed together to form the final curve. Because each bell curve has some width, the kde curve extends a bit beyond the range covered by the histogram. In general, a kdeplot assumes the underlying distribution is quite smooth and goes slowly to zero near the edges.
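
The "one bell curve per sample" idea can be sketched by hand in a few lines. This is not seaborn's implementation: the function name `gaussian_kde_manual`, the sample values, and the fixed bandwidth of 0.1 are all assumptions for illustration (seaborn picks a bandwidth from the data). The bumps are averaged rather than summed so that the area under the curve is one.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average of one Gaussian bump per sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    # z[i, j] = standardized distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical samples and a hand-picked bandwidth
samples = np.array([0.2, 0.25, 0.3, 0.6, 0.9])
grid = np.linspace(-1, 2, 3001)
density = gaussian_kde_manual(samples, grid, bandwidth=0.1)

dx = grid[1] - grid[0]
area = density.sum() * dx  # simple Riemann sum over the grid

# Part of the area lies outside the range of the data
outside = (grid < samples.min()) | (grid > samples.max())
mass_outside = density[outside].sum() * dx

print("area under curve:", area)
print("area outside [min, max]:", mass_outside)
```

The second number is noticeably greater than zero: the curve assigns density to regions where no sample was observed, which is why a kde curve can spill past limits such as xlim=(0,1).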

The following plot compares the histogram and the kdeplot for a typical sample. The samples are shown in red, with their position on the x-axis and a random y-value (to avoid too much overlap).

from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns

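# 100 samples: ten random walks of ten steps around 0.5, clipped to [0, 1]
samples = np.clip(0.5 + np.random.uniform(-.2, .2, (10, 10)).cumsum(axis=0).ravel(), 0, 1)

# histogram plus kde curve (note: distplot is deprecated since seaborn 0.11)
ax = sns.distplot(samples)

x, y = ax.lines[-1].get_data()  # get the coordinates of the kde curve
# scatter each sample at a random height below the kde curve
ax.scatter(samples, [np.random.uniform(0, np.interp(samp, x, y)) for samp in samples], color='crimson')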
plt.show()

example plot

Notice that the kde curve smooths things out much more than the histogram, and that the kde curve doesn't drop abruptly to zero.

PS: To exactly align the bins for two (or more) distributions, note that the number of bins is calculated from the number of samples, and that the boundaries are taken from the sample data. If you are certain that both sample sets have exactly the same maximum and minimum, you can just set bins= to the same number.

In general, however, the extremes are different for continuous distributions. In that case you could explicitly calculate the bins:

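# samples is assumed to hold both groups, e.g. a DataFrame with these two columns
xmin = min(min(samples['Detractor']), min(samples['Promoter']))
xmax = max(max(samples['Detractor']), max(samples['Promoter']))
bins = np.linspace(xmin, xmax, 10)  # 10 edges, i.e. 9 equal-width bins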
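
An equivalent way to get shared bin edges, sketched here with NumPy only, is to pool both groups and let `np.histogram_bin_edges` compute the edges once. The two arrays below are made-up stand-ins for the 'Detractor' and 'Promoter' values, and `n_bins` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-ins for the two groups, with different extremes
detractor = rng.uniform(0.0, 0.7, size=80)
promoter = rng.uniform(0.3, 1.0, size=120)

# Compute the edges once, from the pooled data
n_bins = 10
pooled = np.concatenate([detractor, promoter])
shared_edges = np.histogram_bin_edges(pooled, bins=n_bins)

# Both histograms now use exactly the same bins
counts_d, _ = np.histogram(detractor, bins=shared_edges)
counts_p, _ = np.histogram(promoter, bins=shared_edges)

print(shared_edges)
print(counts_d)
print(counts_p)
```

The resulting `shared_edges` array can then be passed as bins= to each plotting call so the bars of both groups line up.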

Answer 2

The two plots look different for the same data because of the number of bins the histogram uses. By default, seaborn's distplot calculates the bins with the Freedman-Diaconis rule, so the default bin width can give the histogram a shape that looks quite different from the kde curve.
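
For reference, the Freedman-Diaconis rule sets the bin width to 2 · IQR / n^(1/3). The sketch below uses NumPy's built-in "fd" estimator as a stand-in; seaborn's own implementation may differ in details, and the data here is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical skewed data on [0, 1]
data = rng.beta(2, 5, size=500)

# Freedman-Diaconis target width: 2 * IQR / n^(1/3)
q75, q25 = np.percentile(data, [75, 25])
fd_width = 2 * (q75 - q25) / len(data) ** (1 / 3)

# NumPy rounds the bin count up so the bins cover the full data range
fd_edges = np.histogram_bin_edges(data, bins="fd")
n_fd_bins = len(fd_edges) - 1

print("target bin width:", fd_width)
print("number of bins chosen:", n_fd_bins)
```

Comparing the printed bin count with a hand-picked value such as bins=10 shows how far the automatic choice can be from a manual one for a given dataset.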

Now if I use:

 sns.distplot(subset['difference_ratio'],bins=10, kde = False, label =label ,hist=True).set(xlim=(0,1))

The output plot now closely resembles the kde plot:

[image: histogram with bins=10]