Can K-Means clustering use cosine distance in sklearn?

44 Views Asked by Rakha At 20 February 2024 at 10:42

I want to cluster my documents using BERT embeddings from Sentence Transformers, specifically bert-base-nli-mean-tokens, and then cluster those embeddings with K-Means. My problem is this: can I use K-Means clustering with cosine distance?

Is there a solution, with code, for this problem?

There is 1 best solution below

Krish On 21 February 2024 at 00:06

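Yes, you can use K-Means clustering with BERT embeddings obtained from Sentence Transformers models such as bert-base-nli-mean-tokens. However, scikit-learn's KMeans uses Euclidean distance and has no option for cosine distance. A common workaround is to L2-normalize the embeddings first: for unit-length vectors, squared Euclidean distance equals 2 × (1 − cosine similarity), so Euclidean distance between normalized embeddings orders pairs of points the same way cosine distance does. Note that the centroids are means of unit vectors and are generally not unit length themselves, so this approximates clustering by cosine distance rather than reproducing it exactly.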

# Notebook cell syntax; in a terminal, run the same command without the leading "!"
!pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Load BERT model
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Your document texts
documents = ["Document 1 text...", "Document 2 text...", "..."]

# Generate BERT embeddings
embeddings = model.encode(documents)

# L2-normalize each embedding to unit length (normalize defaults to norm='l2')
normalized_embeddings = normalize(embeddings)

# Define the K-Means model
num_clusters = 5  # Adjust the number of clusters; it must not exceed the number of documents
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)

# Fit the model
kmeans.fit(normalized_embeddings)

# Get cluster labels
labels = kmeans.labels_

# Output the cluster labels for your documents
print(labels)
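
To see why normalizing first is a reasonable stand-in for cosine distance, here is a minimal sketch that checks the identity ||a − b||² = 2 × (1 − cos(a, b)) for unit-length vectors. It uses only the standard library; the helper names and the toy vectors are made up for illustration and are not part of scikit-learn or Sentence Transformers.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (illustrative helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two raw (unnormalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    """Return the squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors standing in for embeddings (made-up values)
a = [3.0, 4.0, 0.0]
b = [1.0, 0.0, 2.0]

# Squared Euclidean distance after normalization ...
lhs = squared_euclidean(l2_normalize(a), l2_normalize(b))
# ... versus twice the cosine distance of the original vectors
rhs = 2.0 * cosine_distance(a, b)

print(lhs, rhs)  # the two values should agree up to floating-point error
```

Because the two quantities differ only by a constant factor, the nearest neighbor of a normalized point is the same under either measure. The centroids K-Means computes are not renormalized, though, so if exact cosine-based assignment matters, a spherical K-Means variant (which rescales centroids to unit length each iteration) may be worth looking into.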
