How to interpret cluster results after using Doc2vec?


I am using doc2vec to convert the top 100 tweets of my followers into vector representations (say v1, ..., v100). I then use these vectors to run K-Means clustering.

model = Doc2Vec(documents=t, size=100, alpha=.035, window=10, workers=4, min_count=2)

I can see that cluster 0 is dominated by some values (say v10, v12, v23, ...). My question is: what do these v10, v12, etc. represent? Can I deduce that these specific columns cluster specific keywords of the documents?


There are 3 best solutions below

3
On

Don't interpret the individual variables (vector components) on their own. Because of the way these embeddings are trained, they should only be analyzed together.
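One way to see why single components carry no meaning: rotating every vector by the same angle changes each coordinate but leaves all pairwise distances (and therefore the clustering) unchanged. The sketch below uses made-up 2-d vectors rather than real Doc2Vec output, and the helper names `rotate` and `distance` are purely illustrative.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-d vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return [c * x - s * y, s * x + c * y]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two toy "document vectors" (made up for illustration).
a, b = [1.0, 0.0], [0.6, 0.8]

# Apply the same rotation to both vectors.
angle = math.pi / 3
ra, rb = rotate(a, angle), rotate(b, angle)

before = distance(a, b)
after = distance(ra, rb)

print("coordinates before:", a, b)
print("coordinates after: ", ra, rb)
print("distance before/after:", before, after)
```

The coordinates differ after the rotation, yet the distance between the two vectors is the same, so a value in "column 10" says nothing by itself.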

As a starting point:

  1. Find the document vectors most similar to your centroid to see typical cluster members.
  2. Find the term vectors in the embedding most similar to the centroid to get typical words that describe the cluster.
  3. Note the distances to see how good your fit is.
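
The three steps above can be sketched roughly as follows. This is a minimal illustration with made-up 2-d vectors and a hypothetical centroid, not real Doc2Vec output; `cosine` and `nearest` are helper names invented here. With gensim 4.x, the equivalent lookups should be available as `model.dv.most_similar([centroid])` for documents and `model.wv.similar_by_vector(centroid)` for words, but check that against the version you have installed.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(centroid, vectors, topn=3):
    """Return (key, similarity) pairs for the vectors closest to the centroid."""
    scored = [(key, cosine(centroid, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:topn]

# Toy stand-ins (made up) for Doc2Vec document and word vectors.
doc_vectors = {
    "tweet_1": [0.9, 0.1],
    "tweet_2": [0.6, 0.5],
    "tweet_3": [-0.7, 0.6],
}
word_vectors = {
    "football": [1.0, 0.2],
    "goal": [0.5, 0.5],
    "election": [-0.8, 0.5],
}

# Hypothetical centroid, as K-Means would report for one cluster.
centroid = [0.85, 0.2]

# Step 1: typical cluster members.
typical_docs = nearest(centroid, doc_vectors, topn=2)
# Step 2: typical words describing the cluster.
typical_words = nearest(centroid, word_vectors, topn=2)
# Step 3: the similarity scores show how tight the fit is.
print(typical_docs)
print(typical_words)
```

Note that K-Means itself minimizes Euclidean distance, while cosine similarity is the more common choice for querying embeddings, so the two rankings can differ slightly unless the vectors are normalized first.
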
2
On

The clusters themselves do not mean anything specific. You can ask for as many clusters as you want, and all the clustering algorithm will do is try to distribute your vectors among them. If you are familiar with the tweets and know how many different topics you want them separated into, try cleaning them or adding features so that the clustering algorithm can use those to separate them into the clusters of your choice.

Also, if you meant topic modeling, that is different from clustering, and you should look that up as well.

2
On

These values represent the coordinates of the individual tweets (or documents) that you want to cluster. I am assuming that v1 to v100 are the vectors for tweets 1 to 100; otherwise this won't make sense. So if, say, cluster 0 contains v1, v5 and v6, this means that tweets 1, 5 and 6 (the tweets whose vector representations are v1, v5 and v6) belong to cluster 0.
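
To make that mapping concrete, here is a small sketch that groups tweets by their cluster label. The `labels` list is made up for illustration; with scikit-learn you would normally read it from the fitted model's `labels_` attribute, assuming the vectors were passed to K-Means in tweet order.

```python
from collections import defaultdict

# Hypothetical K-Means output: labels[i] is the cluster assigned to tweet i + 1.
labels = [0, 1, 1, 1, 0, 0, 1]
tweets = ["tweet %d" % (i + 1) for i in range(len(labels))]

# Collect the tweets that fall into each cluster.
members = defaultdict(list)
for tweet, label in zip(tweets, labels):
    members[label].append(tweet)

for label in sorted(members):
    print("cluster", label, "->", members[label])
```

Here cluster 0 ends up holding tweets 1, 5 and 6, matching the example above.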