Faiss - Determine cluster size after kmeans clustering


I have a set of around 180K sentence embeddings. I have indexed them using a faiss IndexIVFFlat index and clustered them using faiss's k-means clustering functionality. I have 20 clusters. Now I would like to determine the size of each cluster, i.e. how many elements it contains.

I would also like to classify each element of the cluster, so essentially I need to:

  1. determine the size of the cluster
  2. access each element in the cluster and perform classification.

So far I have only managed to look up elements closest to centroids. Here is my code:

import faiss

# sentence_embeddings: float32 array of shape (n, d)
ncentroids = 20  # number of clusters
niter = 10
verbose = True
d = sentence_embeddings.shape[1]  # embedding dimensionality
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True)
kmeans.train(sentence_embeddings)

nlist = 20  # how many cells
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(sentence_embeddings)

index.add(sentence_embeddings)
# look up the 10 embeddings nearest to each k-means centroid
D, I = index.search(kmeans.centroids, 10)

Answer:

Once you have trained your kmeans, you can find the closest centroid for each of your sentence embeddings by searching kmeans.index. You could do something like:

# I contains nearest centroid to each embedding
_, I = kmeans.index.search(sentence_embeddings, 1)
# flattening the result
I_flat = [i[0] for i in I]

I'm not sure what you mean by the second question, but each of your embeddings is now assigned to a cluster, and its label is the corresponding entry in I_flat. The size of a cluster is then the number of times its label appears in I_flat.
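
To make the counting and grouping concrete, here is a minimal sketch in plain Python. It assumes I_flat holds one centroid id per embedding, as produced above. The eight labels below are made up so the snippet runs on its own, and the names cluster_sizes and members are only illustrative.

```python
from collections import Counter, defaultdict

# Stand-in for I_flat from the snippet above: one centroid id per embedding.
# These eight labels are invented for illustration only.
I_flat = [2, 0, 1, 2, 2, 0, 1, 2]

# Cluster size = number of embeddings assigned to each centroid id.
cluster_sizes = Counter(I_flat)

# Group embedding row indices by cluster so each cluster's members can be visited.
members = defaultdict(list)
for row, label in enumerate(I_flat):
    members[label].append(row)

for label in sorted(members):
    print(label, cluster_sizes[label], members[label])
```

With the real data, sentence_embeddings[members[label]] should select the rows belonging to one cluster, which you can then pass to your classifier. If you prefer NumPy, np.bincount(I.ravel(), minlength=ncentroids) should give the same counts as an array, including zeros for empty clusters.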