I am trying to vectorize the following Expectation-Maximization / clustering equation for a 2-dimensional Gaussian distribution using numpy. I have a naive approach that I will include at the end of my question:

$$\Sigma_k = \frac{\sum_{n} z_{nk} \, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n} z_{nk}}$$
For context, the variables and dimensions are defined as follows:
- $n$ = data point index (i.e. 1-1000)
- $k$ = cluster index (i.e. 1-3)
- $z_{nk}$ = a conditional probability that datapoint $n$ is in cluster $k$ (in $[0, 1]$)
- $x_n$ = value of datapoint $n$ (shape (2,))
- $\mu_k$ = current estimated multivariate mean of cluster $k$ (shape (2,))
The end product is a numerator that is a sum of (2, 2) shape matrices, and the denominator is a scalar. The final value is a (2, 2) covariance matrix estimate, and it must be computed for each value of $k$ (1, 2, 3).
I've achieved a vectorized approach for other values by defining the following numpy arrays:
- `Z` (shape (1000, 3)) = est. probability values for each datapoint/cluster pair
- `X` (shape (1000, 2)) = multivariate data matrix
- `mu` (shape (3, 2)) = est. cluster means
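
For reference, a hypothetical setup matching those shapes (the random values below are placeholders, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 1000, 3
X = rng.normal(size=(n, 2))        # multivariate data matrix
mu = rng.normal(size=(k, 2))       # est. cluster means
Z = rng.random(size=(n, k))
Z /= Z.sum(axis=1, keepdims=True)  # rows sum to 1, like responsibilities
sigma = np.zeros((k, 2, 2))        # output: one (2, 2) covariance per cluster
```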
My naive code is as follows:
```python
import numpy as np

for kk in range(k):
    numsum = 0
    for ii in range(X.shape[0]):
        # (2, 1) column vector: datapoint minus cluster mean
        diff = (X[ii, :] - mu[kk, :]).reshape(-1, 1)
        # accumulate the probability-weighted outer product
        numsum = numsum + Z[ii, kk] * np.matmul(diff, diff.T)
    sigma[kk] = numsum / np.sum(Z[:, kk])
```
Long story long - is there any better way to do this?

You can use `np.einsum`:
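
A minimal sketch, assuming `Z`, `X`, and `mu` have the shapes described in the question:

```python
import numpy as np

# pairwise differences, shape (1000, 3, 2):
# every datapoint minus every cluster mean
diff = X[:, None, :] - mu[None, :, :]

# numerator, shape (3, 2, 2): for each cluster k, the sum over n of
# Z[n, k] * outer(diff[n, k], diff[n, k]). The subscripts 'nk,nki,nkj->kij'
# weight each outer product by Z and sum over the datapoint axis n.
num = np.einsum('nk,nki,nkj->kij', Z, diff, diff)

# denominator: per-cluster responsibility totals, shape (3,)
den = Z.sum(axis=0)

# broadcast the scalar denominators over the (2, 2) matrices
sigma = num / den[:, None, None]
```

You can check it against your loop with `np.allclose(sigma, sigma_loop)` (where `sigma_loop` is the result of your original code); the einsum form computes all three (2, 2) covariance estimates in one pass with no Python-level loops.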