kdeplot produces unexpected results


I created a simple seaborn KDE plot and wonder whether its output is a bug.

My code is:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.kdeplot(np.array([1,2]), cmap="Reds",  shade=True,  bw=0.01)
sns.kdeplot(np.array([2.4,2.5]), cmap="Blues", shade=True,  bw=0.01)
plt.show()

The blue and red curves show the KDEs of two points each. When the points are close together, the densities are much narrower than when the points are further apart. I find this very counterintuitive, at least to the extent seen here, and I wonder whether it might be a bug. I also could not find a resource describing how the densities are computed from a given set of points. Any help is appreciated.

Plot shows the result of the above code

Best answer

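The bw_method= parameter (called bw= in older versions) is passed directly to scipy.stats.gaussian_kde. The docs there say: "If a scalar, this will be used directly as kde.factor". The explanation of kde.factor says: "The square of kde.factor multiplies the covariance matrix of the data in the kde estimation." So it is a scaling factor relative to the spread of the data, not a width in data coordinates: the standard deviation of each Gaussian kernel is the factor times the standard deviation of the data. The points 1 and 2 are ten times further apart than 2.4 and 2.5, so with the same factor their kernels are ten times wider. This is the documented behavior, not a bug. If more details are needed, you could dive into scipy's source code, or into the research papers referenced in the docs.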
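
To make the scaling concrete, here is a small sketch that reads the kernel width back from scipy. It assumes the covariance attribute of gaussian_kde holds the already-scaled kernel covariance, as its documentation describes; effective_sigma is a helper name made up for this illustration, not part of scipy or seaborn.

```python
import numpy as np
from scipy.stats import gaussian_kde

def effective_sigma(data, factor):
    """Kernel standard deviation that gaussian_kde ends up using (hypothetical helper)."""
    kde = gaussian_kde(data, bw_method=factor)
    # kde.covariance is the data covariance multiplied by factor**2
    return float(np.sqrt(kde.covariance[0, 0]))

wide = effective_sigma(np.array([1.0, 2.0]), 0.01)    # points 1 apart
narrow = effective_sigma(np.array([2.4, 2.5]), 0.01)  # points 0.1 apart

print(f"kernel sigma for [1, 2]:     {wide:.6f}")
print(f"kernel sigma for [2.4, 2.5]: {narrow:.6f}")
print(f"ratio: {wide / narrow:.2f}")
```

The same factor gives kernels whose widths differ by the same ratio as the spreads of the two data sets, which is what the plot in the question shows.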

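If you really want to counter the scaling, you could divide it away: sns.kdeplot(np.array(data), ..., bw_method=0.01/np.std(data, ddof=1)). The ddof=1 matches the unbiased covariance estimate that gaussian_kde computes internally; with the default np.std(data) the result is close but not exact for small samples.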
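
As a sketch of that division, the helper below converts a desired kernel width in data units into a scalar for bw_method=. The name bw_method_for_sigma is hypothetical, and the ddof=1 choice assumes scipy scales by the unbiased sample standard deviation.

```python
import numpy as np

def bw_method_for_sigma(data, sigma):
    """Scalar bw_method such that factor * sample std equals sigma (hypothetical helper)."""
    # Assumption: gaussian_kde scales by the unbiased (ddof=1) sample std
    return sigma / np.std(data, ddof=1)

for data in (np.array([1.0, 2.0]), np.array([2.4, 2.5])):
    factor = bw_method_for_sigma(data, sigma=0.03)
    # factor differs per data set, but factor * std is the same width
    print(data, "factor:", factor, "kernel sigma:", factor * np.std(data, ddof=1))
```

The returned value could then be passed as sns.kdeplot(data, bw_method=factor), so that both data sets get kernels of the same width.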

Or you could create your own version of a Gaussian KDE, with a bandwidth given in data coordinates. It places one Gaussian curve of fixed width on each data point, sums them, and divides by the number of points so that the total area under the curve is 1.

Here is some example code, with kde curves for 1, 2 or 20 input points:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

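def gauss(x, mu=0.0, sigma=1.0):
    # normal probability density with mean mu and standard deviation sigma
    return np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))

def kde(xs, data, sigma=1.0):
    # broadcast xs (column) against data (row): one gauss curve per data point,
    # then average the curves so the total area stays 1
    return gauss(xs.reshape(-1, 1), data.reshape(1, -1), sigma).sum(axis=1) / len(data)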

sns.set()
sigma = 0.03
xs = np.linspace(0, 4, 300)
fig, ax = plt.subplots(figsize=(12, 5))

data1 = np.array([1, 2])
kde1 = kde(xs, data1, sigma=sigma)
ax.plot(xs, kde1, color='crimson', label=f'dist of 1, σ={sigma}')
ax.fill_between(xs, kde1, color='crimson', alpha=0.3)

data2 = np.array([2.4, 2.5])
kde2 = kde(xs, data2, sigma=sigma)
ax.plot(xs, kde2, color='dodgerblue', label=f'dist of 0.1, σ={sigma}')
ax.fill_between(xs, kde2, color='dodgerblue', alpha=0.3)

data3 = np.array([3])
kde3 = kde(xs, data3, sigma=sigma)
ax.plot(xs, kde3, color='limegreen', label=f'1 point, σ={sigma}')
ax.fill_between(xs, kde3, color='limegreen', alpha=0.3)

data4 = np.random.normal(0.01, 0.1, 20).cumsum() + 1.1
kde4 = kde(xs, data4, sigma=sigma)
ax.plot(xs, kde4, color='purple', label=f'20 points, σ={sigma}')
ax.fill_between(xs, kde4, color='purple', alpha=0.3)

ax.margins(x=0)  # remove superfluous whitespace left and right
ax.set_ylim(ymin=0)  # let the plot "sit" onto y=0
ax.legend()
plt.show()

kde curves with bandwidth in data coordinates
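
As a quick sanity check of the normalization, the sketch below integrates the hand-rolled KDE numerically for the three fixed data sets. The area helper is an illustrative name, and the grid is finer than in the plot so that the narrow peaks are resolved.

```python
import numpy as np

def gauss(x, mu=0.0, sigma=1.0):
    return np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))

def kde(xs, data, sigma=1.0):
    return gauss(xs.reshape(-1, 1), data.reshape(1, -1), sigma).sum(axis=1) / len(data)

def area(xs, ys):
    """Trapezoidal estimate of the area under ys over xs (illustrative helper)."""
    return float(np.sum((ys[1:] + ys[:-1]) / 2 * np.diff(xs)))

xs = np.linspace(0, 4, 4001)
datasets = {
    "dist of 1": np.array([1, 2]),
    "dist of 0.1": np.array([2.4, 2.5]),
    "1 point": np.array([3]),
}
areas = {name: area(xs, kde(xs, data, sigma=0.03)) for name, data in datasets.items()}
for name, a in areas.items():
    print(f"{name}: area = {a:.6f}")
```

Each area should come out very close to 1 regardless of how far apart the points are, because every kernel has the same fixed width and the sum is divided by the number of points.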