I have 1000 documents.
For some purpose I need to keep specific words in the vocab. I tokenize the 1000 documents and build a word_freq dict, e.g. {"word1": 100, "word2": 2000, ...}.
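Roughly, such a dict can be built like this (the documents variable and the whitespace tokenizer are just placeholders for illustration):

from collections import Counter

word_freq = Counter()
for doc in documents:               # documents: the 1000 raw texts (placeholder name)
    word_freq.update(doc.split())   # replace .split() with the real tokenizer
word_freq = dict(word_freq)         # e.g. {"word1": 100, "word2": 2000, ...}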
Now I want to build a doc2vec model using this word_freq.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

model = Doc2Vec(vector_size=300,
                window=300,
                min_count=0,
                alpha=0.01,
                min_alpha=0.0007,
                sample=1e-4,
                negative=5,
                dm=1,
                epochs=20,
                workers=16)
model.build_vocab_from_freq(word_freq=word_freq, keep_raw_vocab=False, corpus_count=1000, update=False)
N = 942100020  # total number of raw words across all 1000 docs

model.train(corpus_file=train_data,
            total_examples=model.corpus_count,
            total_words=N,
            epochs=model.epochs)
To speed up training, I use the corpus_file path (LineSentence format) for training, where each document is one line and the document's words are separated by spaces.
Each document is expected to be tagged with its line number in the corpus file (i.e. an integer tag).
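A minimal sketch of what I mean by that format (the file name and the tokenize() helper are placeholders):

with open("train_corpus.txt", "w", encoding="utf-8") as f:
    for doc in documents:                 # the 1000 raw texts (placeholder name)
        tokens = tokenize(doc)            # hypothetical tokenizer
        f.write(" ".join(tokens) + "\n")  # one document per line -> implicit tag = line number

train_data = "train_corpus.txt"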
As a test, I trained the model for a few epochs. To get the words most similar to a given document, e.g. the one with tag=0, I use:
doc_vector = model_copy.dv[tag]
sims = model_copy.wv.most_similar([doc_vector], topn=20)
I got an error at doc_vector = model_copy.dv[tag] saying that tag=0 does not exist!
I debugged, and it seems that model.dv is empty:
model.dv.expandos  # {}
I checked the code of build_vocab(): at some point it calls _scan_vocab(), which populates model.dv with the tags.
However, build_vocab_from_freq() never calls _scan_vocab(), so no tagging happens!?
def _scan_vocab(...):
    ....
    for t, dt in doctags_lookup.items():
        self.dv.key_to_index[t] = dt.index
        self.dv.set_vecattr(t, 'word_count', dt.word_count)
        self.dv.set_vecattr(t, 'doc_count', dt.doc_count)
Note that when I used model.build_vocab(corpus_file=train_data, progress_per=1000) to build the vocab internally, the documents did get integer tags as described above!
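A minimal check that illustrates the difference I'm seeing (attribute names assume gensim 4.x; the commented outputs are what I observe/expect, not guaranteed):

# after model.build_vocab(corpus_file=train_data, progress_per=1000)
print(len(model.dv))      # 1000 -> one doc-vector slot per line of the corpus file
print(model.dv[0].shape)  # (300,) -> integer tag 0 exists

# after model.build_vocab_from_freq(word_freq=word_freq, corpus_count=1000)
print(len(model.dv))      # 0 -> no document tags were registered
model.dv[0]               # raises KeyError: tag 0 does not exist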