HDBSCAN puts sentence embeddings that are far apart into the same cluster

My task is to cluster utterances sent to a chatbot by sentence similarity, in order to find out which topics users ask about and how important those topics are. I convert the utterances into sentence embeddings using the "all-mpnet-base-v2" model. The resulting vectors are quite large (768 dimensions), so I reduce them with UMAP to 15 dimensions (n_components = 15 in the code below). The reduced vectors are then clustered with HDBSCAN:

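import hdbscan
import umap

# message_embeddings, random_state and min_samples are defined earlier (not shown).
# Reduce the 768-dimensional sentence embeddings with UMAP (cosine metric).
umap_embeddings = (umap.UMAP(n_neighbors=3,
                             n_components=15,
                             metric='cosine',
                             random_state=random_state)
                       .fit_transform(message_embeddings))

# Cluster the reduced vectors with HDBSCAN (Euclidean metric, EOM selection).
clusters = hdbscan.HDBSCAN(min_cluster_size=9,
                           min_samples=min_samples,
                           metric='euclidean',
                           gen_min_span_tree=True,
                           cluster_selection_method='eom').fit(umap_embeddings)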

To make this easier to inspect, I reduced the embeddings further to 2D. In the plot you can see utterances inside a cluster that obviously do not belong there, and I don't understand why HDBSCAN groups them that way.

[Image: Wrong clustering]
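One check that may help here: the 2D view is a further reduction, so distances in it need not match distances in the space HDBSCAN actually clustered. The sketch below measures, in the clustering space, how far each point lies from the centroid of its own cluster. It is only a rough heuristic, since HDBSCAN clusters can be elongated or non-convex. `distance_to_cluster_centroid` is a hypothetical helper, and the toy arrays stand in for `umap_embeddings` and `clusters.labels_` from the code above.

```python
import numpy as np

def distance_to_cluster_centroid(embeddings, labels):
    """Euclidean distance from each point to the centroid of its own cluster.

    Points labelled -1 (the HDBSCAN noise label) are left as NaN.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    dists = np.full(len(labels), np.nan)
    for label in np.unique(labels):
        if label == -1:
            continue  # noise points belong to no cluster
        mask = labels == label
        centroid = embeddings[mask].mean(axis=0)
        dists[mask] = np.linalg.norm(embeddings[mask] - centroid, axis=1)
    return dists

# Toy stand-ins (assumption): in practice pass umap_embeddings and clusters.labels_.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                   [10.0, 10.0], [50.0, 50.0]])
labels = np.array([0, 0, 0, 0, 0, -1])

dists = distance_to_cluster_centroid(points, labels)
# Index of the clustered point that lies farthest from its centroid.
farthest = int(np.nanargmax(dists))
print(farthest, dists)
```

If the utterances that look misplaced in 2D also have large distances here, the problem is in the clustering itself. If they do not, the 2D projection may be exaggerating the gap.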

Does anyone have an idea on why this is happening?

I have tried different parameters for both UMAP and HDBSCAN, but the problem persists: some clusters still contain utterances that obviously do not belong there.
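When comparing parameter settings, it can help to record the same summary for every run instead of judging each plot by eye. The following is a minimal sketch, assuming the labels come from `clusters.labels_` with -1 marking noise. `summarize_labels` is a hypothetical helper and the example labels are made up for illustration.

```python
import numpy as np

def summarize_labels(labels):
    """Summarize one clustering run: cluster count, noise share, largest cluster."""
    labels = np.asarray(labels)
    is_noise = labels == -1  # HDBSCAN marks noise points with -1
    cluster_ids, sizes = np.unique(labels[~is_noise], return_counts=True)
    return {
        "n_clusters": int(len(cluster_ids)),
        "noise_fraction": float(is_noise.mean()) if len(labels) else 0.0,
        "largest_cluster": int(sizes.max()) if len(sizes) else 0,
    }

# Made-up labels (assumption); in practice pass clusters.labels_ for each run.
summary = summarize_labels([0, 0, 0, 1, 1, -1, -1, 2])
print(summary)
```

A run where one cluster absorbs most of the points, or where the noise fraction is close to zero, is a hint that unrelated utterances are being merged.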
