sklearn Clustering: Fastest way to determine optimal number of clusters on large data sets


I use KMeans and silhouette_score from sklearn in Python to compute my clusters, but on >10,000 samples with >1,000 clusters, calculating the silhouette_score is very slow.

  1. Is there a faster method to determine the optimal number of clusters?
  2. Or should I change the clustering algorithm? If yes, which is the best (and fastest) algorithm for a data set with >300,000 samples and lots of clusters?

3 Answers

BEST ANSWER

The most common method to find the number of clusters is the elbow curve method. However, it requires running the KMeans algorithm multiple times to plot the curve. The Wikipedia page https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set lists some common methods for determining the number of clusters.
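A minimal sketch of the elbow method on toy data (make_blobs stands in for your real data set): fit KMeans for a range of k values, record the inertia (within-cluster sum of squares), and look for the k where the curve flattens.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for the real data set.
X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

# Fit KMeans for each candidate k and record the inertia
# (sum of squared distances of samples to their nearest centroid).
ks = range(2, 10)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases as k grows; the "elbow" is where the
# decrease slows sharply. Inspect (or plot) ks vs. inertias to pick k.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `ks` against `inertias` with matplotlib makes the elbow easier to spot, but the printed values are often enough for a quick check.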


The silhouette score, while one of the more attractive measures, is O(n²). This means computing the score is much more expensive than computing the k-means clustering itself!

Furthermore, these scores are only heuristics. They will not yield "optimal" clusterings by any means. They only give a hint on how to choose k, and very often you will find that a different k is much better. So don't trust these scores blindly.
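One practical way around the O(n²) cost: sklearn's `silhouette_score` accepts a `sample_size` parameter, which scores a random subsample instead of all pairs. A short sketch on synthetic data (the sizes here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for a larger data set.
X, _ = make_blobs(n_samples=20_000, centers=10, random_state=0)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# Scoring only a 2,000-point subsample keeps the pairwise-distance
# computation tractable while still estimating the silhouette.
score = silhouette_score(X, labels, sample_size=2000, random_state=0)
print(score)
```

The subsampled score is an estimate, so fix `random_state` if you need reproducible comparisons across different k values.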


MiniBatchKMeans is one of the popular options you can try: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html