How to choose the initial clusters for K-means from TF-IDF vectors

602 Views Asked by Darsh At 24 June 2025 at 12:12

I'm working on text clustering. I want to select specific documents (as vectors) to be the initial centroids for k-means.

I have created the TF-IDF vectors for my dataset using Mahout, and I would like to choose the initial clusters from those TF-IDF vectors.

Does anyone know how I can specify the initial centroids in Mahout?

There are 2 best solutions below

Rajkumar On 18 November 2014 at 08:30

bin/mahout kmeans
    -c <input clusters directory>
    -k <optional number of initial clusters to sample from input vectors>

If the -k argument is supplied, any clusters in the -c directory will be overwritten and -k random points will be sampled from the input vectors to become the initial cluster centers. So to use your own initial centroids, put them in the -c directory and leave out -k.
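To make the difference between the two modes concrete, here is a minimal Python sketch of the idea only. It is not Mahout code and does not read or write Mahout's file formats; the function names and the toy TF-IDF rows are made up for illustration.

```python
import random


def sample_initial_centers(vectors, k, seed=None):
    """Pick k distinct input vectors at random (the idea behind -k)."""
    if k > len(vectors):
        raise ValueError("k cannot exceed the number of input vectors")
    rng = random.Random(seed)
    return [list(v) for v in rng.sample(vectors, k)]


def pick_initial_centers(vectors, indices):
    """Use hand-picked documents, given by row index, as initial centers."""
    return [list(vectors[i]) for i in indices]


# Toy TF-IDF rows (hypothetical data): one row per document.
tfidf = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.2, 0.8],
]

random_centers = sample_initial_centers(tfidf, k=2, seed=42)
chosen_centers = pick_initial_centers(tfidf, [0, 2])
```

The second function is what the question asks for: the centers are exactly the documents you name, rather than whichever ones the sampler happens to draw.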

Reference: https://mahout.apache.org/users/clustering/k-means-clustering.html

Felipe Martins Melo On 23 April 2015 at 14:57

One possibility is to measure cosine similarity between the TF-IDF vectors and pick, as initial centroids, documents that are as far from one another as possible. Something like this:

Pick any document as document 1.
Pick document 2, the one farthest from document 1.
Pick document 3, the one farthest from both documents 1 and 2.
And so on, until you have k documents.
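The steps above could be sketched in Python roughly as follows. This is an illustration rather than anything Mahout provides. It assumes that "farthest from documents 1 and 2" means the document whose distance to its nearest already-chosen document is largest, and that distance is 1 minus cosine similarity; the function names are made up.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; all-zero vectors count as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def farthest_first_seeds(vectors, k, first=0):
    """Return the indices of k documents chosen by farthest-first traversal."""
    if k > len(vectors):
        raise ValueError("k cannot exceed the number of documents")
    seeds = [first]
    # nearest[i] is the distance from document i to its closest chosen seed.
    nearest = [cosine_distance(v, vectors[first]) for v in vectors]
    while len(seeds) < k:
        # The next seed is the document farthest from all seeds chosen so far.
        nxt = max(range(len(vectors)), key=lambda i: nearest[i])
        seeds.append(nxt)
        for i, v in enumerate(vectors):
            nearest[i] = min(nearest[i], cosine_distance(v, vectors[nxt]))
    return seeds


# Toy TF-IDF rows (hypothetical data): documents 0 and 1 are near-duplicates.
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
seed_indices = farthest_first_seeds(docs, k=3)
```

The vectors at the returned indices would then be written out as the initial clusters. Note that this approach tends to favour outliers, so it may be worth checking the selected documents by hand.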

Taking a look at this might help as well.
