Distance calculation in hierarchical clustering "complete" linkage

1.7k Views Asked by At

I am not able to understand how SciPy Hierarchical Clustering computes distance between original points or clusters in dendogram.

import scipy.cluster.hierarchy as hclus
import numpy
import cPickle

distmatrix = cPickle.load(open("mydistmatrix.pkl", "rb"))
print distmatrix

dendogram = hclus.linkage(distmatrix, method="complete")
numpy.savetxt("mydendogram.txt", dendogram, fmt='%.1f')

DistMatrix is as below, correctly printed. I also print mydendogram.txt, which I cannot make sense of.

Distance Matrix - I've appended i- as line number, this is not part of matrix.

0- [[ 0 11 68 60 60 60 61  7 17 73]
1- [11  0 68 52 52 51 55 17  6 73]
2- [68 68  0 90 90 91 94 73 73  6]
3- [60 52 90  0 10 11 36 62 55 92]
4- [60 52 90 10  0  2 36 63 55 92]
5- [60 51 91 11  2  0 36 63 54 93]
6- [61 55 94 36 36 36  0 63 57 96]
7- [ 7 17 73 62 63 63 63  0 11 68]
8- [17  6 73 55 55 54 57 11  0 68]
9- [73 73  6 92 92 93 96 68 68  0]]

Dendogram - I've appended step number i and new node n+i at end for readability, not part of dendogram.

0 - 4.0 5.0 3.6 2.0 - 10
1 - 2.0 9.0 13.7 2.0 - 11
2 - 1.0 8.0 15.0 2.0 - 12
3 - 0.0 7.0 15.7 2.0 - 13
4 - 3.0 10.0 17.6 3.0 - 14
5 - 12.0 13.0 33.1 4.0 - 15
6 - 6.0 14.0 66.9 4.0 - 16
7 - 15.0 16.0 148.1 8.0 - 17
8 - 11.0 17.0 208.9 10.0 - 18

Now, I can understand node 4 and 5 will merge first, as distance between them is smallest in distance matrix, but distance in matrix is 2 but 3.6 in dendogram. Similarly distance between nodes 2 and 9 is 6, but dendogram shows 13.7. Maximum distance is 208.9 of dendogram even if maximum number in distance matrix is 96. It appears that order of merging is correct, but I don't understand how distance is computed, and that is important to me identify suitable point to cut the tree to obtain the clusters. Documentation (http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage) doesn't help.

Please explain.

1

There are 1 best solutions below

1
On BEST ANSWER

This helps - Use Distance Matrix in scipy.cluster.hierarchy.linkage()?

import scipy.spatial.distance as ssd
distmatrix = ssd.squareform(distmatrix + distmatrix.T)

(Not sure if question should be deleted, or kept as handy reference)