My task is to cluster utterances sent to a chatbot based on sentence similarity, in order to find out which topics users ask about and how important those topics are. I convert the utterances into sentence embeddings with the "all-mpnet-base-v2" model. The resulting vectors are quite large (768 dimensions) and get reduced with UMAP to 15 dimensions. Those reduced vectors are then clustered with HDBSCAN:
import umap
import hdbscan

# Reduce the 768-dimensional sentence embeddings with UMAP
umap_embeddings = (umap.UMAP(n_neighbors=3,
                             n_components=15,
                             metric='cosine',
                             random_state=random_state)
                   .fit_transform(message_embeddings))

# Cluster the reduced vectors with HDBSCAN
clusters = hdbscan.HDBSCAN(min_cluster_size=9,
                           min_samples=min_samples,
                           metric='euclidean',
                           gen_min_span_tree=True,
                           cluster_selection_method='eom').fit(umap_embeddings)
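For reference, message_embeddings above is produced with sentence-transformers; this is a simplified sketch of that step (the example utterances are placeholders, not my real data):

from sentence_transformers import SentenceTransformer

# Load the pretrained sentence embedding model
model = SentenceTransformer('all-mpnet-base-v2')

# Placeholder utterances for illustration; in reality these are the user messages
utterances = ["How do I reset my password?", "What are your opening hours?"]

# Each utterance becomes a 768-dimensional vector
message_embeddings = model.encode(utterances)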
To make this easier to inspect, I reduced the embeddings further to 2D, and in that projection you can see utterances inside a cluster that obviously do not belong there. I don't understand why HDBSCAN clusters them this way.
[Image: wrong clustering in the 2D projection]
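The 2D projection in that image is produced along these lines (a simplified sketch of the plotting step, not my exact code):

import umap
import matplotlib.pyplot as plt

# Separate UMAP reduction to 2 dimensions, used only for visualisation
umap_2d = umap.UMAP(n_neighbors=3,
                    n_components=2,
                    metric='cosine',
                    random_state=random_state).fit_transform(message_embeddings)

# Colour each point by its HDBSCAN label; -1 marks noise points
plt.scatter(umap_2d[:, 0], umap_2d[:, 1], c=clusters.labels_, s=10, cmap='tab20')
plt.colorbar(label='cluster label')
plt.show()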
Does anyone have an idea why this is happening?
I have tried different parameters for UMAP and for HDBSCAN, but sadly the problem persists: there are still utterances that obviously end up in the wrong clusters.
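The kind of parameter sweep I ran looks roughly like this (the values here are only illustrative, not the exact grids I tried; relative_validity_ is available because gen_min_span_tree=True):

import hdbscan

# Illustrative sweep over HDBSCAN parameters; min_samples=None uses the library default
for mcs in (5, 9, 15):
    for ms in (None, 3, 5):
        clusterer = hdbscan.HDBSCAN(min_cluster_size=mcs,
                                    min_samples=ms,
                                    metric='euclidean',
                                    gen_min_span_tree=True,
                                    cluster_selection_method='eom').fit(umap_embeddings)
        n_clusters = clusterer.labels_.max() + 1
        n_noise = (clusterer.labels_ == -1).sum()
        print(mcs, ms, n_clusters, n_noise, clusterer.relative_validity_)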