I am trying to use Expectation Maximization (EM) to classify my data. Due to my limited experience with Python and algorithms, when I write the E-step of the algorithm, the method multivariate_normal.pdf() returns zeros, and I have no idea why this happens or how to fix it.
My parameter setup:

import numpy as np
from scipy.stats import multivariate_normal

n_clusters = 2
n_points = len(X)
m1 = X[:, 0].mean()
m2 = X[:, 1].mean()
Mu = [[m1, m2 * 1.1], [m1, m2 * 0.9]]   # nudge the two cluster means apart in the second dimension
Var = [[1, 1], [1, 1]]                  # per-dimension variances for each cluster
Pi = [1 / n_clusters] * 2               # equal mixing weights
W = np.ones((n_points, n_clusters)) / n_clusters
Pi = W.sum(axis=0) / W.sum()            # mixing weights from the initial responsibilities
Function for E-step:
def update_W(X, Mu, Var, Pi):
    # E-step: compute the responsibility of each cluster for each point
    n_points, n_clusters = len(X), len(Pi)
    pdfs = np.zeros((n_points, n_clusters))
    for i in range(n_clusters):
        # weighted density of cluster i at every point
        pdfs[:, i] = Pi[i] * multivariate_normal.pdf(X, Mu[i], np.diag(Var[i]))
    W = pdfs / pdfs.sum(axis=1).reshape(-1, 1)   # normalize each row to sum to 1
    return W
The problem is that the call

multivariate_normal.pdf(X, Mu[i], np.diag(Var[i]))

returns all zeros, which breaks the algorithm (the row normalization in update_W then divides by zero).
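For example, a minimal check (using the X, Mu, Var and Pi from this post; the NaN output is just the consequence of the all-zero pdfs):

vals = multivariate_normal.pdf(X, Mu[0], np.diag(Var[0]))
print(vals[:5])                       # [0. 0. 0. 0. 0.]
print(update_W(X, Mu, Var, Pi)[:2])   # 0 / 0 -> rows full of nan (with a RuntimeWarning)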
My data is here:
X = np.array([[5.04729, 37744.57848238945],
[5.0008, 35899.206357479095],
[4.98011, 33056.83955574036],
[4.9931, 33704.46286725998],
[4.96448, 32990.76251792908],
[4.99008, 32283.805765628815],
[5.04159, 34409.80894064903],
[5.00075, 35270.9743578434],
[5.00899, 36284.97508478165],
[4.97215, 35458.088497161865],
[4.9763, 35349.08898703754],
[5.02148, 38928.51481676102],
[4.96585, 33278.32470035553],
[5.02545, 41023.1229429245],
[5.00402, 39417.85598278046],
[5.02445, 38158.13270044327],
[5.02238, 36516.55680179596],
[4.98606, 34646.19884157181],
[5.0033, 50677.906705379486],
[5.02733, 49949.25080680847],
[5.02716, 39155.43921852112],
[5.04183, 39191.40325546265],
[4.96896, 38266.83190345764],
[4.99652, 36774.488584041595],
[5.03478, 33207.51669681072],
[5.03832, 34724.067808151245],
[5.02347, 32981.52464199066],
[4.96275, 33158.869076013565],
[4.97127, 33810.3498980999],
[5.01482, 35844.073690891266],
[4.97365, 36869.271273851395],
[5.00685, 34672.39159619808],
[4.98033, 34688.13522028923],
[5.00265, 32825.98570096493],
[5.00083, 34160.511184334755],
[5.01027, 33884.14221894741],
[4.95678, 38457.50146174431],
[4.96867, 36320.28945875168],
[5.00721, 33037.772767663],
[5.04357, 33117.379173755646],
[5.0004, 34557.091692984104],
[5.02337, 34421.044409394264],
[5.01924, 36974.69707453251],
[5.04621, 38812.16114796698],
[4.96225, 33006.868394851685],
[5.03166, 33958.025827884674],
[4.95124, 34349.11486387253],
[5.02567, 34259.3874104023],
[5.04868, 41895.62332487106],
[5.00072, 34009.417558074],
[5.04818, 35927.81018912792],
[4.97368, 33997.19724833965],
[4.97671, 33999.519283890724],
[4.95128, 31926.064946889877],
[4.95742, 33177.98467195034],
[4.9803, 36231.32753157616],
[4.99018, 35190.21499049879],
[5.00053, 34551.94658563426],
[4.96793, 33994.06255364418],
[4.98345, 32120.49098587036],
[5.04564, 30271.260227680206],
[4.95421, 30653.037763834],
[5.03442, 31966.675320148468],
[4.9722, 33757.07542133331],
[5.01377, 33405.48862147331],
[5.03417, 33558.4126021862],
[5.02044, 30911.697856664658],
[4.98318, 33883.324236392975],
[4.99876, 36436.7568192482],
[4.98475, 34489.98227977753],
[5.00355, 30205.769625544548],
[5.03646, 33422.97079265118],
[5.04884, 34117.2653195858],
[5.02259, 34993.6039069891],
[4.9737, 47981.5806927681],
[5.00593, 56685.62618494034],
[5.00434, 54395.169895286555],
[4.95768, 35611.35310435295],
[5.02978, 37536.10489606857],
[4.95214, 35577.900700092316],
[4.99925, 33014.68128633499],
[5.03029, 29865.81333065033],
[4.95128, 28760.52792918682],
[5.02051, 37038.269605994225],
[5.04974, 36642.584582567215],
[4.96896, 29041.81211209297],
[4.97972, 37114.91027057171],
[4.99165, 36424.69217658043],
[4.95779, 33859.12481832504],
[5.01495, 32957.517973423004],
[4.95671, 32964.28135430813],
[5.00732, 34051.169529914856],
[4.98396, 32881.329072237015],
[5.02704, 34020.69495886564],
[4.98323, 34851.629052996635],
[5.02463, 34762.618599653244],
[4.95826, 31249.48672771454],
[5.02317, 35223.82992994785],
[4.98121, 31181.02133178711],
[5.00584, 28375.87954646349],
[4.98506, 31617.50852251053],
[5.02145, 31114.64755296707],
[5.03362, 31182.910351395607],
[5.03503, 30093.33342540264],
[4.95797, 36398.88142120838],
[4.97227, 35468.531022787094],
[5.00365, 35569.601973861456],
[4.96697, 33233.918072104454],
[5.01004, 31831.104349255562],
[5.01838, 36176.26664984226],
[5.01747, 35596.35544002056],
[5.0358, 32386.7973613739],
[4.973, 33052.33617353439],
[5.03017, 32395.227126598358],
[4.98371, 34437.29500854015],
[5.02143, 31013.464814782143],
[5.02339, 36368.26537466049],
[5.02942, 36371.447618722916],
[5.04456, 38037.0499355793],
[5.02219, 40177.06857061386],
[4.95168, 37964.482201099396],
[4.99589, 38579.15203022957],
[4.99336, 34731.92454767227],
[4.98333, 34161.41907787323],
[4.95841, 30232.777291665203],
[4.95133, 30595.70202767849],
[5.0252, 30708.32647585869],
[5.04505, 35107.191153526306],
[4.98019, 35513.3441824913],
[4.97158, 30879.74526977539],
[4.96096, 33175.533081412315]])
Thanks in advance!
Let's have a look at some actual numbers:
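A sketch of what I mean (using the X, Mu and Var from the question; the values in the comments are approximate):

import numpy as np
from scipy.stats import multivariate_normal

print(Mu[0])        # roughly [5.0, 38591] -- the second coordinate is m2 * 1.1
print(Var[0])       # [1, 1] -> a standard deviation of 1 in each dimension
print(X[:3, 1])     # second-column values, hundreds to thousands of units
                    # (and therefore standard deviations) away from 38591
print(multivariate_normal.pdf(X, Mu[0], np.diag(Var[0]))[:3])   # [0. 0. 0.]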
Now, let's try the following value and see what we get:
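For example, something like this (the test point is my illustration, chosen to sit about 30 standard deviations from Mu[0] in the second dimension):

import numpy as np
from scipy.stats import multivariate_normal

# the second coordinate, 38561, is about 30 units -- and, with variance 1,
# about 30 standard deviations -- below Mu[0][1] (roughly 38591)
print(multivariate_normal.pdf([Mu[0][0], 38561], Mu[0], np.diag(Var[0])))
# about 6e-197: really, really small, but not zero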
You get a really really low probability value but at least it's not zero.
Let's ignore the first dimension for a second, as it isn't the problematic one: 38561 is about 30 standard deviations away from 38591. For a univariate normal distribution we know that the probability of being more than 3 standard deviations away from the mean is just 0.3%, and it decreases very, very fast:
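For instance (a quick illustration with scipy.stats.norm; the numbers in the comments are approximate):

from scipy.stats import norm

for k in (3, 10, 30):
    # two-sided tail probability P(|Z| > k) of a standard normal
    print(k, 2 * norm.sf(k))
# 3  -> about 2.7e-3   (the 0.3% above)
# 10 -> about 1.5e-23
# 30 -> about 1e-197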
Look at the values stored in X - they are much, much further away from Mu[0] than 30 standard deviations. These values are so unlikely to come from this distribution that, as far as Python is concerned, the probability of that happening is 0.0.
A better idea would be to start with the actual covariance of your data:
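For example (a sketch; initializing via np.cov is my choice of form, and the + 1 is explained in the P.S. below):

import numpy as np
from scipy.stats import multivariate_normal

cov = np.cov(X.T) + 1   # covariance of the data itself, plus 1 to avoid singularity
print(multivariate_normal.pdf(X, Mu[0], cov)[:5])   # small but clearly nonzero densities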
Then suddenly you get decent probabilities :)
P.S. I am adding 1 to the covariance matrix to avoid singularity.