Annotating a few points on a t-SNE plot - if possible, a couple of points per cluster


I have a list of ~500 embedding vectors. Each vector has length 400, which is too long to post, but this is the start of one of them:

[-1.5425615, -0.52326035, 0.48309317, -1.3839878, -1.3774203, -0.44861528, 3.026304, -0.23582345, 4.3516054, -2.1284392, -3.0056703, 1.4997623, 0.51767087, -2.3668504, 0.9771546, -2.5286832, -1.1869463, -1.2889853, -4.272979...]

There are ~500 of these vectors in a list called list_of_vectors.

There is also a list_of_labels, which holds one label per vector.

I want to plot them on a t-SNE plot, so I wrote:

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(list_of_vectors)

Plotting the result gives ~500 dots, each of which has one label (from list_of_labels).

The dots are very roughly clustered. I want to add a couple of labels to each rough cluster so I know which cluster is which. Alternatively, can I colour the clusters differently and have a legend that shows a sample word from each cluster?

Is there a way for me to annotate/label a couple of the dots in each cluster?

Or is there any method that would add, say, 5 or 10 labels to the graph, so I can understand the plot better?

It doesn't have to be super exact, I'm just trying to broadly understand the plot better.

There is 1 solution below

If I understand correctly, you want to annotate some points in your plot with the label of the group they belong to. If that's the case, just iterate over the groups and annotate a few randomly selected points in each one. You can do it with plain matplotlib, plotting each group separately (first script), or you can draw the scatterplot with e.g. seaborn using hue and then loop over the sampled points to annotate them (second script). Either way, the plot is much easier to read if the groups also get different colours:

import matplotlib.pyplot as plt

# data is assumed to be a pandas DataFrame with columns 'x', 'y' and 'label'

# how many samples to annotate per group
m = 4

# create a new figure
plt.figure(figsize=(10, 10))

# loop through labels and plot each group separately (one colour per group)
for label in data['label'].unique():

    group = data.loc[data['label'] == label]

    # plot the given group
    plt.scatter(x=group['x'], y=group['y'], alpha=0.5)

    # randomly sample up to m points (a group may have fewer than m)
    tmp = group.sample(min(m, len(group)))

    # add label to some random points per group
    for _, row in tmp.iterrows():
        plt.annotate(label, (row['x'], row['y']), size=10, weight='bold', color='k')

plt.show()
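
The script above expects a DataFrame called data with columns x, y and label, which the question does not show. A minimal sketch of how it could be built from X_tsne and list_of_labels follows. The random coordinates and the group_0 ... group_4 labels are hypothetical stand-ins so the snippet runs on its own; with the real data you would use the actual t-SNE output and label list instead.

```python
import numpy as np
import pandas as pd

# Stand-in inputs so the sketch runs on its own (hypothetical values).
# With the question's data these would be the real t-SNE output and labels:
#   X_tsne = TSNE(n_components=2).fit_transform(np.asarray(list_of_vectors))
rng = np.random.default_rng(0)
X_tsne = rng.normal(size=(500, 2))
list_of_labels = [f"group_{i % 5}" for i in range(500)]

# One row per embedding: its 2-D coordinates plus its label.
data = pd.DataFrame({
    "x": X_tsne[:, 0],
    "y": X_tsne[:, 1],
    "label": list_of_labels,
})
```

The only requirement is that list_of_labels is in the same order as list_of_vectors, so that row i of X_tsne lines up with label i.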
    

With seaborn:

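import matplotlib.pyplot as plt
import seaborn as sns

# draw all points, coloured by label (m and data as in the first script)
sns.scatterplot(x="x", y="y", hue="label", data=data)

# loop through labels and annotate a few points per group
for label in data['label'].unique():

    # randomly sample up to m points from this group
    group = data.loc[data['label'] == label]
    tmp = group.sample(min(m, len(group)))

    # add label to some random points per group
    for _, row in tmp.iterrows():
        plt.annotate(label, (row['x'], row['y']), size=10, weight='bold', color='k')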
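
Random sampling can occasionally pick a stray point far from the bulk of its group. One possible variation, not part of the scripts above, is to annotate the points closest to each group's centre instead. The sketch below assumes the same x, y and label columns. The helper name points_nearest_centre, the use of the median as the centre, and the demo values are all illustrative choices.

```python
import numpy as np
import pandas as pd

def points_nearest_centre(data: pd.DataFrame, m: int = 2) -> pd.DataFrame:
    """Return up to m rows per label that lie closest to that label's median (x, y)."""
    picked = []
    for label, group in data.groupby("label"):
        centre = group[["x", "y"]].median()
        # Euclidean distance of every point in the group to the group centre
        dist = np.hypot(group["x"] - centre["x"], group["y"] - centre["y"])
        picked.append(group.loc[dist.nsmallest(m).index])
    return pd.concat(picked)

# Tiny hypothetical example: two groups, with one outlier in group "a"
demo = pd.DataFrame({
    "x": [0.0, 0.1, 0.2, 9.0, 5.0, 5.1],
    "y": [0.0, 0.1, 0.0, 9.0, 5.0, 5.2],
    "label": ["a", "a", "a", "a", "b", "b"],
})
to_annotate = points_nearest_centre(demo, m=2)
```

The returned rows can then be passed through the same plt.annotate loop as before, using row['label'] as the text. Since t-SNE layouts are only roughly clustered, the median is used here because it is less sensitive to outliers than the mean.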
