I've trained a 100-dimensional word2vec model on my domain text corpus, merging common phrases (for example, good bye => good_bye). Then I extracted the vectors of 1000 desired words.
So I have a 1000 x 100 numpy array like so:
[[-0.050378,0.855622,1.107467,0.456601,...[100 dimensions],
[-0.040378,0.755622,1.107467,0.456601,...[100 dimensions],
...
...[1000 Vectors]
]
And a corresponding words array like so:
["hello","hi","bye","good_bye"...1000]
I ran K-Means on my data, and the results made sense:

import numpy as np
from sklearn.cluster import KMeans

# Cluster the 1000 x 100 word vectors into 20 groups
X = np.array(words_vectors)
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
for idx, l in enumerate(kmeans.labels_):
    print(l, words[idx])
--- Output ---
0 hello
0 hi
1 bye
1 good_bye
(cluster 0 = greetings, cluster 1 = farewells)
However, some words made me think that hierarchical clustering is more suitable for the task. I tried using AgglomerativeClustering, but unfortunately, for a Python noob like me, things got complicated and I got lost.
How can I cluster my vectors so that the output is a dendrogram, more or less like the one found on this wiki page?
I had the same problem until now! After repeatedly finding your post when searching online (keywords: hierarchical clustering on word2vec), I wanted to offer a possibly valid solution.
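A minimal sketch of one way to get a dendrogram, using scipy.cluster.hierarchy on the vectors and words from the question (the 'ward' linkage and the figure size are just example choices):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array(words_vectors)  # the (1000, 100) array from the question

# Build the merge tree; 'ward' is one common linkage choice for dense vectors
Z = linkage(X, method="ward")

# Draw the dendrogram with the words as leaf labels
plt.figure(figsize=(25, 10))
dendrogram(Z, labels=words, leaf_rotation=90, leaf_font_size=8)
plt.tight_layout()
plt.show()

If you also want flat cluster labels comparable to the KMeans output, scipy's fcluster can cut this tree at a chosen number of clusters or at a distance threshold.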