How to set cluster centers manually in KMeans and predict probabilities instead of classes? (or GMM)


I'm following this example here:

https://www.stackoverflow.com/questions/60205100/define-cluster-centers-manually

The answer there sets the initial positions of the centroids and runs only one iteration, so the centroids end up being the initially set ones. I was able to reproduce this in my code.
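Roughly what I did, following that answer (the data and centers below are just placeholders for illustration):

import numpy as np
from sklearn.cluster import KMeans

# placeholder data and manually chosen centroids, just for illustration
X = np.random.RandomState(0).normal(size=(50, 2))
my_centers = np.array([[0.0, 0.0], [1.0, 1.0]])

# init=my_centers fixes the starting centroids; n_init=1 and max_iter=1
# keep them essentially unchanged, as in the linked answer
km = KMeans(n_clusters=2, init=my_centers, n_init=1, max_iter=1).fit(X)
print(km.cluster_centers_)  # stays close to my_centers
print(km.predict(X))        # hard cluster labels, no probabilities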

I am also looking for probabilities as the result, which I was able to get using:

https://scikit-learn.org/0.16/modules/generated/sklearn.mixture.GMM.html

I tried to use the same approach (the init parameter) used with KMeans, but I don't think there's a way to do that with GMM.

So how can I do it? Are there other algorithms/ways?

PS: I understand that they are different algorithms; I'm only trying to interpret the data better.

1 Answer

It's not very clear what you are trying to achieve here. KMeans works by minimizing the Euclidean distance within clusters, so there is no real notion of probability there. To calculate a probability you need to make additional assumptions, for example that the data within each cluster follows a multivariate Gaussian. Below is a rough estimate, and it really depends on your data.
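To make that concrete, here is a small self-contained sketch (toy data, not your dataset) showing that a fitted KMeans only exposes hard labels and distances, not probabilities:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# two toy blobs around (0, 0) and (2, 2)
X_toy = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])

km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X_toy)
print(km.predict(X_toy[:3]))    # hard labels only
print(km.transform(X_toy[:3]))  # Euclidean distances to each center, not probabilities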

Note that even with a single initialization at your chosen centers, the means can still change slightly depending on your dataset, for example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# manually chosen cluster centers
cts = np.array([[1, 1], [2, 2], [3, 3]])

# three tight blobs generated around those centers
X, y_true = datasets.make_blobs(n_samples=100,
                                centers=cts,
                                cluster_std=0.30,
                                random_state=0)

plt.scatter(X[:, 0], X[:, 1], c=y_true)

[scatter plot of the three generated blobs, coloured by true cluster label]

Now if we run KMeans as in that post, the means will change (slightly):

kmeans = KMeans(n_clusters=3, random_state=0,
                init=cts, n_init=1).fit(X)

kmeans.cluster_centers_

array([[0.99526578, 1.00152973],
       [1.99987588, 2.10819314],
       [2.94674517, 2.96792463]])
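As a quick sanity check of how far the fitted centers drifted from the manually supplied cts, one can compare them directly:

# per-cluster Euclidean shift between the manual centers and the fitted ones
print(np.linalg.norm(kmeans.cluster_centers_ - cts, axis=1))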

And to answer your question: to use a GMM to obtain a rough probability based on the KMeans results, we can do:

# initialise the GMM at the KMeans centers and run a single EM iteration
clf = GaussianMixture(n_components=3, covariance_type='spherical',
                      means_init=kmeans.cluster_centers_, n_init=1, max_iter=1)
clf.fit(X)
clf.predict_proba(X)
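predict_proba returns one row per sample with the responsibility of each of the three components; the rows sum to 1, and taking the argmax recovers hard labels comparable to the KMeans ones. For example:

probs = clf.predict_proba(X)        # shape (100, 3); each row sums to 1
hard_labels = probs.argmax(axis=1)  # hard assignment from the soft probabilities
print(probs[:5])
print(hard_labels[:5])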