I have a trained word2vec model in gensim with 300 dimensions and would like to cut the dimensions to 100 (simply drop the last 200 dimensions). What is the easiest and most efficient way to do this in Python?
Gensim Word2Vec model: Cut dimensions
There are 2 answers below.

You should be able to trim the dimensions inside a KeyedVectors instance and then save it, so you don't have to do anything special with the format on disk. For example:
kv = w2v_model.wv                    # the trained model's KeyedVectors
kv.vectors = kv.vectors[:, 0:100]    # keep just the first 100 dimensions
kv.vector_size = 100                 # update the reported dimensionality to match
Now kv can be saved (using either gensim's native .save() or the interchange format via .save_word2vec_format()), or simply used as a subset of the original dimensions.
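For instance, a minimal sketch of both save paths, assuming a reasonably recent gensim and the trimmed kv from above still in memory (the file names are just placeholders):

from gensim.models import KeyedVectors

kv.save("vectors_100d.kv")                                  # gensim's native format; reload with KeyedVectors.load()
kv.save_word2vec_format("vectors_100d.txt", binary=False)   # plain-text word2vec interchange format

kv_reloaded = KeyedVectors.load("vectors_100d.kv")          # round-trip check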
(While any 100 dimensions of a larger embedding are about as likely to be as good as any other 100, you'll be losing some of the 300 dimensions' expressiveness, in arbitrary ways. Re-training with 100 dimensions from the start might do better, as might some sort of dimensionality-reduction algorithm, which would in effect leave you with the "most expressive" 100 dimensions.)
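If you do want a principled reduction rather than an arbitrary slice, one common option (not part of the answer above, just an illustration) is PCA, e.g. via scikit-learn. A sketch, assuming gensim's vectors live in kv.vectors and scikit-learn is installed:

import numpy as np
from sklearn.decomposition import PCA

kv = w2v_model.wv
reduced = PCA(n_components=100).fit_transform(kv.vectors)   # project 300-dim vectors onto the top 100 principal components
kv.vectors = reduced.astype(np.float32)                     # gensim stores vectors as float32
kv.vector_size = 100

Note that PCA changes the coordinate system, so the reduced vectors are no longer a literal subset of the original dimensions, but they retain as much variance as 100 linear components can.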

You could save the output model in the word2vec text format. Make sure to save it as a text file (.txt). The format is as follows: the first line is <vocabulary_size> <embedding_size> (in your case <embedding_size> will be 300), and each remaining line is <word> followed by 300 space-separated floating point numbers. Now you can easily parse this file in Python and discard the last 200 floating point values from each line. Make sure to update <embedding_size> in the first line to 100, and save the result as a new file. Now you can load this new file as a fresh set of word vectors using load_word2vec_format().
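A rough sketch of that parsing step, assuming the original vectors were saved as model_300d.txt and the trimmed copy goes to model_100d.txt (both file names are placeholders):

from gensim.models import KeyedVectors

with open("model_300d.txt", "r", encoding="utf-8") as fin, \
     open("model_100d.txt", "w", encoding="utf-8") as fout:
    vocab_size, _ = fin.readline().split()           # header: <vocabulary_size> <embedding_size>
    fout.write(f"{vocab_size} 100\n")                 # same vocabulary, new embedding size
    for line in fin:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:101]         # keep the word and the first 100 values
        fout.write(word + " " + " ".join(values) + "\n")

kv_100 = KeyedVectors.load_word2vec_format("model_100d.txt", binary=False)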