K-means: defining initial centers for a tf-idf matrix

I am using k-means for clustering articles and it is working perfectly. Now I want to define initial centers to get more reasonable results.

My Python code:

import numpy as np
from sklearn.cluster import KMeans

tfidf_matrix = tfidf_vectorizer.fit_transform(articles)
# Six hand-picked initial centers, each with only 2 features
X = np.array([[-19.67480000,  -8.546],
              [ 22.010807000, -10.9737],
              [ 11.959700000,  19.2701],
              [ 12.254700000,  11.2381],
              [ 16.649700000, -15.2251],
              [ 19.859700000,  13.2601]], dtype=np.float64)
km = KMeans(n_clusters=6, init=X, n_init=1).fit(tfidf_matrix)

When I try to define the initial centroids, I get the following error:

ValueError: The number of features of the initial centers 2 does not match the number of features of the data 4602.

From the error I get the idea that the dimensions are not equal. How can I transform my initial centers to satisfy the dimensions of the sparse matrix?

There is 1 best solution below

The number of features in the centroids should be the same as the number of features in the data.

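For example, if your input data (tfidf_matrix) has shape (1111, 8262), i.e. 1111 samples with 8262 features, then your 6 centroids must also have 8262 features, and the shape of X should be (6, 8262). In your case the error message reports 4602 features, so X needs the shape (6, 4602).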
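
One way to get centers with the right number of features is to pick one representative article per cluster and use its tf-idf row as the initial center. The following is only a minimal sketch of that idea: the toy corpus and the seed_indices list are made up for illustration, and the rows are converted to a dense array on the assumption that init should be passed as a dense array.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real articles (hypothetical data).
articles = [
    "stock markets fell sharply today",
    "investors worry about stock prices",
    "the team won the football match",
    "football fans celebrate the win",
    "new phone released with better camera",
    "camera quality improves on new phone",
]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(articles)

# One hand-picked seed article per cluster (these indices are an assumption).
seed_indices = [0, 2, 4]

# Take the seed rows and densify them: shape (n_clusters, n_features).
init_centers = tfidf_matrix[seed_indices].toarray()

km = KMeans(n_clusters=len(seed_indices), init=init_centers, n_init=1)
km.fit(tfidf_matrix)

print(init_centers.shape, tfidf_matrix.shape)
```

Because the centers are taken from tfidf_matrix itself, they have the same number of columns as the data, so the feature-count check passes. If the 2-feature centers came from a reduced representation of the articles, the alternative would be to run KMeans on that reduced data rather than on the tf-idf matrix.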