KMeans evaluation metric not converging. Is this normal behavior or no?

451 Views Asked by Chris At 06 January 2017 at 16:30

I'm working on a problem that necessitates running KMeans separately on ~125 different datasets. Therefore, I'm looking to mathematically calculate the 'optimal' K for each respective dataset. However, the evaluation metric continues decreasing with higher K values.

For a sample dataset, there are 50K rows and 8 columns. Using sklearn's calinski-harabaz score, I'm iterating through different K values to find the optimum / minimum score. However, my code reached k=5,600 and the calinski-harabaz score was still decreasing!

Something weird seems to be happening. Does the metric not work well? Could my data be flawed (see my question about normalizing rows after PCA)? Is there another/better way to mathematically converge on the 'optimal' K? Or should I force myself to manually pick a constant K across all datasets?

Any additional perspectives would be helpful.

Original Q&A

There are 3 best solutions below

Prune On 06 January 2017 at 23:23 BEST ANSWER

SUMMARY

The metric decreases with each increase of K; this strongly suggests that you do not have a natural clustering upon the data set.

DISCUSSION

CH scores depend on the ratio between intra- and inter-cluster densities. For a relatively smooth distribution of points, each increase in K will give you clusters that are slightly more dense, with slightly lower density between them. Try a lattice of points: vary the radius and do the computations by hand; you'll see how that works. At the extreme end, K = n: each point is its own cluster, with infinite density, and 0 density between clusters.

OTHER METRICS

Perhaps the simplest metric is sum-of-squares, which is already part of the clustering computations. Sum the squares of distances from the centroid, divide by n-1 (n=cluster population), and then add/average those over all clusters.

I'm looking for a particular paper that discusses metrics for this very problem; if I can find the reference, I'll update this answer.

N.B. With any metric you choose (as with CH), a failure to find a local minimum suggests that the data really don't have a natural clustering.

WHAT TO DO NEXT?

Render your data in some form you can visualize. If you see a natural clustering, look at the characteristics; how is it that you can see it, but the algebra (metrics) cannot? Formulate a metric that highlights the differences you perceive.

I know, this is an effort similar to the problem you're trying to automate. Welcome to research. :-)

Ted Petrou On 06 January 2017 at 20:55

I don't know anything about the calinski-harabaz score but some score metrics will be monotone increasing/decreasing with respect to increasing K. For instance the mean squared error for linear regression will always decrease each time a new feature is added to the model so other scores that add penalties for increasing number of features have been developed.

There is a very good answer here that covers CH scores well. A simple method that generally works well for these monotone scoring metrics is to plot K vs the score and choose the K where the score is no longer improving 'much'. This is very subjective but can still give good results.

Chris On 11 January 2017 at 18:53

The problem with my question is that the 'best' Calinski-Harabaz score is the maximum, whereas my question assumed the 'best' was the minimum. It is computed by analyzing the ratio of between-cluster dispersion vs. within-cluster dispersion, the former/numerator you want to maximize, the latter/denominator you want to minimize. As it turned out, in this dataset, the 'best' CH score was with 2 clusters (the minimum available for comparison). I actually ran with K=1, and this produced good results as well. As Prune suggested, there appears to be no natural grouping within the dataset.

KMeans evaluation metric not converging. Is this normal behavior or no?

There are 3 best solutions below

Related Questions in ALGORITHM

Related Questions in MACHINE-LEARNING

Related Questions in SCIKIT-LEARN

Related Questions in K-MEANS

Related Questions in EVALUATION

Trending Questions

Popular # Hahtags

Popular Questions