Spectral clustering based on an affinity matrix

923 Views Asked by At
    import csv
    import numpy as np
    from sklearn.cluster import SpectralClustering
    reader = csv.reader(open("/Users/Desktop/user2.csv","rU"),
                        dialect=csv.excel_tab)
    x = list(reader)
    result = np.array(x).astype('float')
    lena = result.reshape(182, 182)
    type(lena)
    # Apply spectral clustering (this step goes much faster if you have pyamg
    # installed)
    label = SpectralClustering(n_clusters=5, 
                               affinity='precomputed').fit_predict(lena) 

I have an affinity matrix with 182 users. I want to cluster the users based on the similarity matrix. But the result seems to cluster almost all the users to one cluster. Can anyone explain this?

For the similarity matrix. with 182*182 entries. 6510 entries > 0.001 , max value > 0.97. the dialog matrix is 0. Does this similarity matrix have a problem or is Spectral clustering not suitable for this situation? Are there any other clustering methods which you would recommend?

1

There are 1 best solutions below

0
On

The statistics you give aren't complete enough to give a definitive opinion, but I can make some guesses from the result. I suspect that the problem comes from having a series of well-connected entries strung across your space.

Visualize a simple line of 182 items at 1-unit intervals, with the affinity being the inverse of the cube of the linear distance. Your nearest neighbors have affinity 1/1; the next are 1/8, then 1/27, 1/64, ...

This gives each entity only 20 neighbors (10 on each side) with an affinity >= 0.001, a total of 3640 (if you permit wraparound) plus the main diagonal. This is fewer than your actual example. However, since the immediate neighbor has an affinity of 1, the natural concatenation of almost any clustering algorithm will eventually merge all of these into a single class. This set is highly connected, but not particularly compact.


Thus, the question becomes, "what kind of clustering is appropriate for the shape of your data?" Can you somehow characterize the connectivity of individual entries, their compactness, and their connectivity? Do you see any natural clumping and cut points?