I have this problem: I want to plot a data distribution where some values occur frequently while others are quite rare. The total number of points is around 30,000. Rendering such a plot as PNG or (god forbid) PDF takes forever, and the resulting PDF is far too large to display.
So I want to subsample the data, just for the plots. What I would like to achieve is to remove many points where they overlap (where the density is high), but keep the points where the density is low with probability close to 1.
Now, numpy.random.choice
allows one to specify a vector of probabilities, which I've computed from the data histogram with a few tweaks. But I can't seem to tune the choice so that the rare points are actually kept.
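For context, here is a minimal sketch of the histogram-based weighting described above, with hypothetical data (the long-tailed sample, bin count, and subsample size are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: an exponential-like distribution with a long right tail.
x = rng.exponential(scale=1.0, size=30_000)

# Estimate the local density with a histogram: each point gets the
# count of the bin it falls into.
counts, edges = np.histogram(x, bins=100)
bin_idx = np.clip(np.digitize(x, edges) - 1, 0, len(counts) - 1)
density = counts[bin_idx].astype(float)

# Weight each point by the inverse of its local density, then normalize
# so the weights form a probability vector for np.random.choice.
weights = 1.0 / density
p = weights / weights.sum()

# Draw a subsample without replacement; points in sparse regions
# (the tail) are strongly favored.
idx = rng.choice(len(x), size=3_000, replace=False, p=p)
subsample = x[idx]
```

With this weighting, each non-empty histogram bin carries roughly equal total probability mass, so the tail bins are heavily over-represented in the subsample relative to their share of the raw data.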
I've attached an image of the data; the right tail of the distribution has orders of magnitude fewer points, so I'd like to keep those. The data is 3-D, but the density varies only along one dimension, so I can use that dimension as a measure of how many points lie in a given location.
One possible approach is to use kernel density estimation (KDE) to build an estimated probability distribution of the data, and then sample each point with a probability proportional to the inverse of its estimated density (or some other function that decreases as the estimated density grows). There are a few tools for computing a KDE in Python; a simple one is
scipy.stats.gaussian_kde
. Here is an example of the idea:
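A minimal sketch of this approach, using hypothetical 1-D data (the sample size, distribution, and 10% subsample ratio are assumptions for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)

# Hypothetical data with a dense head and a sparse right tail.
data = rng.exponential(scale=1.0, size=5_000)

# Fit a Gaussian KDE and evaluate the estimated density at each point.
kde = gaussian_kde(data)
density = kde(data)

# Selection probability proportional to the inverse of the estimated
# density, so points in sparse regions are very likely to survive.
p = 1.0 / density
p /= p.sum()

# Keep 10% of the points for plotting, sampled without replacement.
keep = rng.choice(len(data), size=len(data) // 10, replace=False, p=p)
subsample = data[keep]
```

Note that evaluating a `gaussian_kde` at every data point costs O(n²), so for very large datasets you may want to fit the KDE on a random subset of the data first.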