Python: matplotlib - probability mass function as histogram

9.8k Views Asked by At

I want to draw a histogram and a line plot at the same graph. However, to do that I need to have my histogram as a probability mass function, so I want to have on the y-axis a probability values. However, I don't know how to do that, because using the normed option didn't helped. Below is my source code and a sneak peek of used data. I would be very grateful for all suggestions.

data = [12565, 1342, 5913, 303, 3464, 4504, 5000, 840, 1247, 831, 2771, 4005, 1000, 1580, 7163, 866, 1732, 3361, 2599, 4006, 3583, 1222, 2676, 1401, 2598, 697, 4078, 5016, 1250, 7083, 3378, 600, 1221, 2511, 9244, 1732, 2295, 469, 4583, 1733, 1364, 2430, 540, 2599, 12254, 2500, 6056, 833, 1600, 5317, 8333, 2598, 950, 6086, 4000, 2840, 4851, 6150, 8917, 1108, 2234, 1383, 2174, 2376, 1729, 714, 3800, 1020, 3457, 1246, 7200, 4001, 1211, 1076, 1320, 2078, 4504, 600, 1905, 2765, 2635, 1426, 1430, 1387, 540, 800, 6500, 931, 3792, 2598, 5033, 1040, 1300, 1648, 2200, 2025, 2201, 2074, 8737, 324]
plt.style.use('ggplot')
plt.rc('xtick',labelsize=12)
plt.rc('ytick',labelsize=12)
plt.xlabel("Incomes")
plt.hist(data, bins=50, color="blue", alpha=0.5, normed=True)
plt.show() 
2

There are 2 best solutions below

6
On BEST ANSWER

As far as I know, matplotlib does not have this function built-in. However, it is easy enough to replicate

    import numpy as np
    heights,bins = np.histogram(data,bins=50)
    heights = heights/sum(heights)
    plt.bar(bins[:-1],heights,width=(max(bins) - min(bins))/len(bins), color="blue", alpha=0.5)

Edit: Here is another approach from a similar question:

     weights = np.ones_like(data)/len(data)
     plt.hist(data, bins=50, weights=weights, color="blue", alpha=0.5, normed=False) 
1
On

This is old, but since I found it and was about to use it before I noticed some mistakes, I figured I'd add a comment for a couple of fixes I noticed. In the example @mmdanziger uses the bin edges in plt.bar, however, you need to actually use the centers of the bin. Also they assume that the bins are of equal width, which is fine "most" of the time. But you can also pass it an array of widths, which keep you from inadvertently forgetting and making a mistake. So here's a more complete example:

import numpy as np
heights, bins = np.histogram(data, bins=50)
heights = heights/sum(heights)
bin_centers = 0.5*(bins[1:] + bins[:-1])
bin_widths = np.diff(bins)
plt.bar(bin_centers, heights, width=bin_widths, color="blue", alpha=0.5)

@mmdanziger other option of passing weights = np.ones_like(data)/len(data) to plt.hist() also does the same thing, and for many is an easier approach.