I have a precomputed distance matrix and want to find the medoids of the clusters that HDBSCAN produces from it. According to the scikit-learn docs, you set a parameter and then read an attribute to retrieve these medoids. When I set the parameter store_centers="medoid", fitting the model fails before I can read the attribute .medoids_, and I receive this error:
Traceback (most recent call last):
  File "C:\Users\Desktop\Clustering\Model.py", line 163, in <module>
    cluster(df, 'test.txt')
  File "C:\Users\Desktop\Clustering\Model.py", line 139, in cluster
    clustering = hdb.fit(distance_matrix.tocsr())
  File "C:\Users\Desktop\Clustering\venv\lib\site-packages\sklearn\cluster\_hdbscan\hdbscan.py", line 854, in fit
    self._weighted_cluster_center(X)
  in _weighted_cluster_center
    dist_mat = pairwise_distances(
  File "C:\Users\Desktop\Clustering\venv\lib\site-packages\sklearn\metrics\pairwise.py", line 2157, in pairwise_distances
    X, _ = check_pairwise_arrays(
  File "C:\Users\Desktop\Clustering\venv\lib\site-packages\sklearn\metrics\pairwise.py", line 184, in check_pairwise_arrays
    raise ValueError(
ValueError: Precomputed metric requires shape (n_queries, n_indexed). Got (9, 2292) for 9 indexed.
I'm unsure how my square precomputed matrix is producing a 9x2292 array. Otherwise, the model works fine, and I have no issues manually retrieving the medoids through an MSE operation. I want to produce the medoids this way in the hope of finding the variable eps for each cluster, so that I can fit more data to the clusters.
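For context, a manual medoid lookup on a precomputed matrix might look like the sketch below. It is only an illustration under assumptions: the helper name `manual_medoids` is hypothetical, the matrix is taken to be a dense square array, labels use -1 for noise (as HDBSCAN does), and the medoid is picked as the member with the smallest summed in-cluster distance, which may differ from the operation described above.

```python
import numpy as np


def manual_medoids(dist, labels):
    """Map each cluster label to the row index of its medoid.

    Hypothetical helper: `dist` is a dense square distance matrix and
    `labels` holds one integer label per row, with -1 meaning noise.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    medoids = {}
    for label in np.unique(labels):
        if label < 0:  # skip noise points
            continue
        members = np.flatnonzero(labels == label)
        # Distances restricted to the members of this cluster.
        sub = dist[np.ix_(members, members)]
        # Medoid: the member with the smallest total distance to the others.
        medoids[int(label)] = int(members[sub.sum(axis=1).argmin()])
    return medoids


# Toy data: three close points in one cluster plus one noise point.
dist = np.array([
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.1, 0.8],
    [0.2, 0.1, 0.0, 0.9],
    [0.9, 0.8, 0.9, 0.0],
])
labels = np.array([0, 0, 0, -1])
print(manual_medoids(dist, labels))
```

With a fitted model, the labels would come from `labels_`, and a sparse matrix would first need converting with `.toarray()`.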
EDIT: My code, with a non-reproducible example:
```python
from fuzzywuzzy import fuzz
from sklearn.cluster import HDBSCAN
from scipy.sparse import lil_matrix
import itertools


def dis_matrix(word_list):
    count = 0
    kw_index = {}
    index_kw = {}
    n = len(word_list)
    distance_matrix = lil_matrix((n, n))
    for kw in word_list:
        kw_index[kw] = count
        index_kw[count] = kw
        count += 1
    for x, y in itertools.product(word_list, word_list):
        d = fuzz.ratio(x, y) / 100
        distance = 1 - d if d <= 1 else 0.00000000000001
        index1 = kw_index[x]
        index2 = kw_index[y]
        distance_matrix[index1, index2] = distance
        distance_matrix[index2, index1] = distance
    return distance_matrix, index_kw


CLUSTERING_MIN_SAMPLES = 2
x = ['apple', 'app', 'banana', 'bannana', 'applesauce', 'peaches', 'peach', "appban"]
distance_matrix, index_kw = dis_matrix(x)
hdb = HDBSCAN(cluster_selection_epsilon=.1, metric='precomputed', n_jobs=8, min_samples=CLUSTERING_MIN_SAMPLES, store_centers='medoid')
clustering = hdb.fit(distance_matrix.tocsr())
print(clustering.medoids_)
```