How can I calculate the distance of each observation from the centroids created by scipy.cluster.hierarchy?


I have a vector x which represents text data transformed with tf-idf. I then calculate the distance between all points of the vector using the cosine_similarity() function of sklearn and create the linkage_matrix of the Ward distance using scipy.cluster.hierarchy. This creates a hierarchical clustering, but I cannot figure out how to calculate the distance of each observation from each centroid.

When using kmeans from sklearn, I figured out that I can calculate this by calling the transform() method on the x vector, which returns a matrix with the Euclidean distance between each observation and each cluster center. I would like to do something similar with the scipy.cluster.hierarchy algorithms.

I have tried examining the returned linkage_matrix, as well as scipy.spatial.distance.pdist, but neither seems to be what I need.

Is there any way to achieve this?

Answer by amol goel:
from scipy.cluster.hierarchy import fcluster, ward
Z = fcluster(ward(X), t=threshold, criterion=criterion)

threshold: with criterion="distance", the maximum cophenetic distance allowed between two points in the same cluster

Z is a 1-D array that assigns a cluster number to each point. Now you can estimate the distance:

  • Take a cluster
  • Find its centroid
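  • Find the distance from the centroid to each point (I am not sure what the criteria mean in the scipy library; note that fcluster's criterion argument controls how the flat clusters are cut, not the distance metric, and Ward linkage itself assumes Euclidean distance)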
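The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper name centroid_distances, the toy data, and the choice of criterion="maxclust" are assumptions made for the example. It takes each centroid to be the mean of a cluster's points and uses Euclidean distance, which matches what Ward linkage assumes. A sparse tf-idf matrix would have to be converted to a dense array first (for example with .toarray()).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, ward
from scipy.spatial.distance import cdist


def centroid_distances(X, threshold, criterion="distance"):
    """Hypothetical helper: Ward clustering, then point-to-centroid distances.

    Returns (labels, centroids, distances), where distances[i, j] is the
    Euclidean distance from observation i to the centroid of the j-th cluster
    (clusters ordered by their fcluster label).
    """
    X = np.asarray(X, dtype=float)

    # Flat cluster labels (1-based integers), one per observation.
    labels = fcluster(ward(X), t=threshold, criterion=criterion)

    # Centroid of each cluster, taken here as the mean of its member points.
    cluster_ids = np.unique(labels)
    centroids = np.vstack([X[labels == k].mean(axis=0) for k in cluster_ids])

    # Distance from every observation to every centroid, similar in shape to
    # what KMeans.transform() returns.
    distances = cdist(X, centroids, metric="euclidean")
    return labels, centroids, distances


# Toy data (assumed for illustration): two well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids, distances = centroid_distances(
    X, threshold=2, criterion="maxclust"
)
print(labels)
print(distances)
```

The result has one row per observation and one column per cluster, so distances[i] can be read the same way as a row of the matrix returned by transform() in the kmeans case.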

Further reading: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster