Labels obtained from clustering seem visually incorrect


I have the following distance matrix based on 10 datapoints:

import numpy as np

distance_matrix = np.array([[0.        , 0.00981376, 0.0698306 , 0.01313118, 0.05344448,
                             0.0085152 , 0.01996724, 0.14019663, 0.03702411, 0.07054652],
                            [0.00981376, 0.        , 0.06148157, 0.00563764, 0.04473798,
                             0.00905327, 0.01223233, 0.13140022, 0.03114453, 0.06215728],
                            [0.0698306 , 0.06148157, 0.        , 0.05693448, 0.02083512,
                             0.06390897, 0.05107812, 0.07539802, 0.04003773, 0.00703263],
                            [0.01313118, 0.00563764, 0.05693448, 0.        , 0.0408836 ,
                             0.00787845, 0.00799949, 0.12779965, 0.02552774, 0.05766039],
                            [0.05344448, 0.04473798, 0.02083512, 0.0408836 , 0.        ,
                             0.04846382, 0.03638932, 0.0869414 , 0.03579818, 0.0192329 ],
                            [0.0085152 , 0.00905327, 0.06390897, 0.00787845, 0.04846382,
                             0.        , 0.01284173, 0.13540522, 0.03010677, 0.0646998 ],
                            [0.01996724, 0.01223233, 0.05107812, 0.00799949, 0.03638932,
                             0.01284173, 0.        , 0.12310601, 0.01916205, 0.05188323],
                            [0.14019663, 0.13140022, 0.07539802, 0.12779965, 0.0869414 ,
                             0.13540522, 0.12310601, 0.        , 0.11271352, 0.07346808],
                            [0.03702411, 0.03114453, 0.04003773, 0.02552774, 0.03579818,
                             0.03010677, 0.01916205, 0.11271352, 0.        , 0.04157886],
                            [0.07054652, 0.06215728, 0.00703263, 0.05766039, 0.0192329 ,
                             0.0646998 , 0.05188323, 0.07346808, 0.04157886, 0.        ]])

I transform the distance_matrix into an affinity_matrix using the following:

delta = 0.1
affinity_matrix = np.exp(- distance_matrix ** 2 / (2. * delta ** 2))

Which gives

affinity_matrix = np.array([[1.        , 0.99519608, 0.7836321 , 0.99141566, 0.86691389,
                             0.99638113, 0.98026285, 0.37427863, 0.93375682, 0.77970427],
                            [0.99519608, 1.        , 0.82778719, 0.99841211, 0.90477015,
                             0.9959103 , 0.99254642, 0.42176757, 0.95265821, 0.82433657],
                            [0.7836321 , 0.82778719, 1.        , 0.85037594, 0.97852875,
                             0.81528476, 0.8777015 , 0.75258369, 0.92297697, 0.99753016],
                            [0.99141566, 0.99841211, 0.85037594, 1.        , 0.91982353,
                             0.99690131, 0.99680552, 0.44191509, 0.96794184, 0.84684633],
                            [0.86691389, 0.90477015, 0.97852875, 0.91982353, 1.        ,
                             0.88919645, 0.93593511, 0.68527137, 0.9379342 , 0.98167476],
                            [0.99638113, 0.9959103 , 0.81528476, 0.99690131, 0.88919645,
                             1.        , 0.9917884 , 0.39982486, 0.95569077, 0.81114925],
                            [0.98026285, 0.99254642, 0.8777015 , 0.99680552, 0.93593511,
                             0.9917884 , 1.        , 0.46871776, 0.9818083 , 0.87407117],
                            [0.37427863, 0.42176757, 0.75258369, 0.44191509, 0.68527137,
                             0.39982486, 0.46871776, 1.        , 0.52982057, 0.76347268],
                            [0.93375682, 0.95265821, 0.92297697, 0.96794184, 0.9379342 ,
                             0.95569077, 0.9818083 , 0.52982057, 1.        , 0.91719051],
                            [0.77970427, 0.82433657, 0.99753016, 0.84684633, 0.98167476,
                             0.81114925, 0.87407117, 0.76347268, 0.91719051, 1.        ]])
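As a quick sanity check (a minimal sketch, reusing affinity_matrix from above), this Gaussian (RBF) transform maps zero distances to an affinity of 1, keeps the matrix symmetric, and keeps all values in (0, 1]:

# Properties of the Gaussian (RBF) affinities: the diagonal is 1 (zero
# distance), symmetry is preserved, and all values lie in (0, 1].
assert np.allclose(np.diag(affinity_matrix), 1.0)
assert np.allclose(affinity_matrix, affinity_matrix.T)
assert (affinity_matrix > 0).all() and (affinity_matrix <= 1).all()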

I plot the distance_matrix as a heatmap to get a better visual of the data:

import pandas as pd
import seaborn as sns

distance_matrix_df = pd.DataFrame(distance_matrix)
distance_matrix_df.columns = [x + 1 for x in range(10)]
distance_matrix_df.index = [x + 1 for x in range(10)]
sns.heatmap(distance_matrix_df, cmap='RdYlGn_r', annot=True, linewidths=0.5)

[heatmap of the distance matrix]

Next I want to cluster the affinity_matrix into 3 clusters. Before running the actual clustering, I inspect the heatmap to anticipate the clusters. Clearly #8 is an outlier, and I expect it to be a cluster on its own.
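A quick numerical way to back up that visual read (a minimal sketch using the distance_matrix above):

# Mean distance from each point to the other nine; point #8's mean distance
# is roughly twice that of any other point, supporting the outlier reading.
mean_dist = distance_matrix.sum(axis=1) / (len(distance_matrix) - 1)
for i, d in enumerate(mean_dist, start=1):
    print(f"point {i}: mean distance {d:.4f}")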

Next I run the actual clustering.

from sklearn.cluster import SpectralClustering    
clustering = SpectralClustering(n_clusters=3,
                            assign_labels='kmeans',
                            affinity='precomputed').fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1 

The output is

[1, 1, 2, 1, 2, 1, 1, 2, 3, 2]

So, #8 is part of cluster 2, which contains three other data points. Initially, I assumed it would be a cluster on its own. Did I do something wrong? Or can someone show me why #8 is similar to #3, #5 and #10? Please advise.

1 Answer

When we move away from relatively simple clustering algorithms like k-means, whatever intuition we may carry along regarding algorithm results and expected behaviors breaks down; indeed, the scikit-learn documentation on spectral clustering gives an implicit warning about that:

Apply clustering to a projection of the normalized Laplacian.

In practice Spectral Clustering is very useful when the structure of the individual clusters is highly non-convex or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster. For instance when clusters are nested circles on the 2D plane.

Now, even if one pretends to understand exactly what "a projection of the normalized Laplacian" means (I won't), the rest of the description arguably makes it clear enough that here we should not expect results similar to those of more intuitive, distance-based clustering algorithms like k-means.
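If you do want a rough idea of what that projection looks like, here is a minimal sketch (reusing the affinity_matrix above; scikit-learn's actual implementation differs in several details): build the symmetric normalized Laplacian of the affinity graph and look at the rows of its leading eigenvectors, which is what the internal k-means step then clusters.

import numpy as np

# Minimal sketch of the spectral embedding: normalized Laplacian of the
# affinity graph, then the eigenvectors for the smallest eigenvalues.
deg = affinity_matrix.sum(axis=1)                      # node degrees
d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
laplacian = np.eye(len(affinity_matrix)) - d_inv_sqrt @ affinity_matrix @ d_inv_sqrt

eigvals, eigvecs = np.linalg.eigh(laplacian)           # eigenvalues in ascending order
embedding = eigvecs[:, :3]                             # 3 smallest, for 3 clusters
print(np.round(embedding, 3))                          # k-means runs on these rows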

Nevertheless, your own intuition is not unfounded, and it shows if you just try a k-means clustering instead of a spectral one; using your exact data, we get

from sklearn.cluster import KMeans    
clustering = KMeans(n_clusters=3, random_state=42).fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1 
clusters
# result:
array([2, 2, 1, 2, 1, 2, 2, 3, 2, 1], dtype=int32)

where indeed sample #8 stands out as an outlier in a cluster of its own (#3).

That said, the same intuition is not necessarily applicable or useful with other clustering algorithms, whose value is exactly that they can uncover regularities of different kinds in the data - arguably they would not be that useful if they just replicated the results of existing algorithms like k-means, would they?
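If you want yet another distance-based view on the same data, hierarchical clustering also accepts a precomputed distance matrix; here is a minimal sketch (the parameter is metric='precomputed' in scikit-learn >= 1.2, affinity='precomputed' in older versions), which lets you check whether it isolates #8 the way k-means did:

from sklearn.cluster import AgglomerativeClustering

# Average-linkage hierarchical clustering directly on the distance matrix.
agg = AgglomerativeClustering(n_clusters=3,
                              metric='precomputed',
                              linkage='average')
agg_labels = agg.fit_predict(distance_matrix) + 1
print(agg_labels)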

The scikit-learn vignette Comparing different clustering algorithms on toy datasets might be useful to get an idea of how the various clustering algorithms behave on some toy 2D datasets; here is the summary finding:

[summary plot comparing the clustering algorithms across the toy 2D datasets]