Is k-means repeatable?

Does k-means give roughly the same centroids every time it is run on the exact same data set, given that the initial centroids are chosen randomly?

I'm writing a test k-means program, and the centroids from different runs don't seem to match. I wanted to know if what I'm doing is right.

There are 3 best solutions below

0
On BEST ANSWER

The k-means algorithm requires some initialization of the centroid positions. In most implementations, these centroids are initialized randomly with a method such as the Forgy method or random partitioning, which means that repeated runs of the algorithm on the same data can converge to vastly different results.

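Remember that k-means is iterative: at each "move centroid" step, each centroid is moved to the mean of the points currently assigned to it, which minimizes the sum of squared distances to those points. The algorithm therefore only refines the configuration it starts from and converges to a local optimum, which makes the result heavily dependent on the starting positions.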

Because of this, it's usually advisable to run k-means several times with different initializations and keep the clustering with the lowest error (typically the within-cluster sum of squared distances).
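
As a rough illustration of the restart idea, here is a minimal pure-Python sketch. It assumes Forgy initialization (k data points picked at random) and squared Euclidean distance as the error; the names `sq_dist`, `kmeans` and `best_of` are made up for this example and do not come from any library.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, max_iter=100):
    """Lloyd's algorithm with Forgy initialization; returns (centroids, error)."""
    centroids = rng.sample(points, k)  # random start: the source of run-to-run variation
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    error = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, error

def best_of(points, k, restarts=10, seed=0):
    """Run k-means several times and keep the run with the lowest error."""
    rng = random.Random(seed)  # seeded, so the whole procedure is repeatable
    runs = (kmeans(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[1])

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, error = best_of(points, k=2, restarts=5, seed=1)
print(centroids, error)
```

Keeping the best of several runs does not guarantee the global optimum; it only makes a poor local optimum less likely.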

0
On

Many k-means implementations allow fixing the random number generator to make results reproducible.

ELKI: -kmeans.seed parameter

Weka: -S parameter

In others, you can usually supply the initial centers yourself, and pick them with a seeded pseudo-random generator so that the choice is reproducible.
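
A small sketch of that last approach, using only Python's standard library. The helper name `seeded_initial_centers` is hypothetical, and how the resulting centers are passed to a particular k-means implementation depends on that implementation.

```python
import random

def seeded_initial_centers(points, k, seed):
    """Pick k distinct data points as initial centers, reproducibly."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return rng.sample(points, k)

points = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)]

first = seeded_initial_centers(points, k=2, seed=42)
second = seeded_initial_centers(points, k=2, seed=42)
print(first == second)  # same seed, same data order, same initial centers
```

The data must also be presented in the same order each time, since the generator selects points by position.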

0
On

No, it is not guaranteed.

Consider a simple case of 2-means with 4 points: (1, 1), (-1, 1), (1, -1), (-1, -1) (the corners of a square in the 2D plane). The 2 centroids may end up at {(0, 1), (0, -1)} or at {(1, 0), (-1, 0)}: two very different results, both stable and with the same error.
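
The square example can be checked with a few lines of Python. This is only a sketch: `lloyd_step` and `sse` are names invented here, and one "step" means one assignment pass followed by one centroid update.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd_step(points, centroids):
    """One assignment + update step of k-means; returns the new centroids."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        clusters[nearest].append(p)
    return [
        tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
        for i, cl in enumerate(clusters)
    ]

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

square = [(1, 1), (-1, 1), (1, -1), (-1, -1)]
top_bottom = [(0, 1), (0, -1)]   # clusters: top edge and bottom edge
left_right = [(1, 0), (-1, 0)]   # clusters: right edge and left edge

for centroids in (top_bottom, left_right):
    # A further step leaves the centroids where they are, so each
    # configuration is a converged k-means solution.
    print(lloyd_step(square, centroids) == centroids, sse(square, centroids))
```

Which of the two solutions a run ends in depends only on where the initial centroids happen to fall.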