I have a set of around 180K sentence embeddings. I have indexed them with a faiss IndexIVFFlat index and clustered them into 20 clusters using faiss's k-means implementation. Now I would like to determine the size of each cluster, i.e. how many elements it contains.
I would also like to classify each element of each cluster, so essentially I need to:
- determine the size of the cluster
- access each element in the cluster and perform classification.
So far I have only managed to look up the elements closest to the centroids. Here is my code:
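import faiss
# sentence_embeddings: float32 numpy array of shape (n, d)
ncentroids = 20 # number of k-means clusters
niter = 10
verbose = True
d = sentence_embeddings.shape[1] # embedding dimension
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True)
kmeans.train(sentence_embeddings)
nlist = 20 # how many cells
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(sentence_embeddings)
index.add(sentence_embeddings)
# look up the 10 indexed vectors closest to each k-means centroid
D, I = index.search(kmeans.centroids, 10)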
Once you have trained your k-means, you can find the closest centroid for each of your sentence embeddings. You could do something like:
I'm not sure what you mean by the second question, but each of your embeddings is now assigned to a cluster, and its label is the corresponding entry in I_flat.
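If the second question is about visiting the members of each cluster, one possible approach is to group the row indices by label. This is only a sketch: `members_by_cluster` and `classify_clusters` are hypothetical helper names, and `classify` stands in for whatever classifier you want to run on the embeddings of one cluster.

```python
import numpy as np


def members_by_cluster(labels, ncentroids):
    """Map each cluster id to the row indices of the embeddings assigned to it."""
    labels = np.asarray(labels)
    return {c: np.flatnonzero(labels == c) for c in range(ncentroids)}


def classify_clusters(embeddings, labels, ncentroids, classify):
    """Apply `classify` (a placeholder callable) to the embeddings of each cluster."""
    embeddings = np.asarray(embeddings)
    members = members_by_cluster(labels, ncentroids)
    # embeddings[idx] selects the rows belonging to one cluster.
    return {c: classify(embeddings[idx]) for c, idx in members.items()}
```

For example, `members_by_cluster(I_flat, ncentroids)[3]` would hold the positions in `sentence_embeddings` of every sentence assigned to cluster 3, and the length of that array is the cluster's size.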