How to train doc2vec with pre-built vocab in gensim


I have 1000 documents.

For a specific purpose, I need to keep only certain words in the vocabulary. I tokenized the 1000 documents and built a word_freq dict, e.g. {"word1": 100, "word2": 2000, ...}
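As a rough illustration of the step above, here is a minimal sketch of how such a word_freq dict could be built with the standard library. The helper name build_word_freq, the keep_words set, and the toy documents are hypothetical and only stand in for the real tokenized corpus.

```python
from collections import Counter


def build_word_freq(tokenized_docs, keep_words=None):
    """Count token occurrences over all documents (hypothetical helper).

    If keep_words is given, only those words are kept in the result.
    """
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)
    if keep_words is not None:
        return {w: c for w, c in counts.items() if w in keep_words}
    return dict(counts)


# Toy stand-in for the 1000 tokenized documents.
docs = [["word1", "word2", "word1"], ["word2", "word3"]]
word_freq = build_word_freq(docs, keep_words={"word1", "word2"})
```

A dict of this shape (word mapped to an integer count) is what gets passed as word_freq to build_vocab_from_freq().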

Now I want to build a doc2vec model using this word_freq.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

model = Doc2Vec(vector_size=300,
                window=300,
                min_count=0,
                alpha=0.01,
                min_alpha=0.0007,
                sample=1e-4,
                negative=5,
                dm=1,
                epochs=20,
                workers=16)

model.build_vocab_from_freq(word_freq=tf_idf_vocab, keep_raw_vocab=False, corpus_count=1000, update=False)

N = 942100020  # total number of words across all 1000 docs
model.train(corpus_file=train_data,
            total_examples=model.corpus_count,
            total_words=N,
            epochs=model.epochs)

To speed up training, I used a corpus file (LineSentence format), where each document is one line and the document's words are separated by spaces.

Each document is expected to be tagged with its line number in the corpus file (i.e. a numeric tag).
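The corpus-file layout described above can be sketched as follows. This is only an illustration under assumptions: the helper names write_line_corpus and read_tagged_lines, the temporary file, and the toy documents are hypothetical, and the reader simply mimics the "tag = line number" convention rather than reproducing gensim's internals.

```python
import os
import tempfile


def write_line_corpus(tokenized_docs, path):
    """Write one document per line, tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in tokenized_docs:
            f.write(" ".join(tokens) + "\n")
    return len(tokenized_docs)


def read_tagged_lines(path):
    """Yield (tag, tokens) pairs, where tag is the 0-based line number."""
    with open(path, encoding="utf-8") as f:
        for tag, line in enumerate(f):
            yield tag, line.split()


# Toy stand-in for the real tokenized documents.
toy_docs = [["word1", "word2"], ["word2", "word3", "word1"]]
corpus_path = os.path.join(tempfile.mkdtemp(), "train_data.txt")
n_docs = write_line_corpus(toy_docs, corpus_path)
tagged = list(read_tagged_lines(corpus_path))
```

With this layout, the first document in the file is the one expected under tag 0, the second under tag 1, and so on.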

As a test, I trained the model for a few epochs. To get the words most similar to a given document, e.g. the one with tag=0, I use:

doc_vector = model_copy.dv[tag]

sims = model_copy.wv.most_similar([doc_vector], topn=20)

I got an error at doc_vector = model_copy.dv[tag] saying that tag=0 does not exist! I debugged, and it seems that model.dv is empty!

model.dv.expandos # {}

I checked the code of build_vocab(): at some point it calls _scan_vocab(), which populates model.dv with the tags.

However, build_vocab_from_freq() does not call _scan_vocab(), so no tagging happens!?

def _scan_vocab(...):
    ....
    for t, dt in doctags_lookup.items():
        self.dv.key_to_index[t] = dt.index
        self.dv.set_vecattr(t, 'word_count', dt.word_count)
        self.dv.set_vecattr(t, 'doc_count', dt.doc_count)

Note that when I use model.build_vocab(corpus_file=train_data, progress_per=1000) to build the vocab internally, the documents are tagged with their line numbers, as I explained above!
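Since build_vocab(corpus_file=...) does tag the documents, one possible workaround (an untested assumption, not a confirmed fix) would be to pre-filter the corpus file down to the words in word_freq and then let build_vocab() scan the filtered file. The sketch below covers only the filtering step in plain Python; the helper name filter_corpus_file, the file names, and the toy data are hypothetical.

```python
import os
import tempfile


def filter_corpus_file(src_path, dst_path, keep_words):
    """Copy src to dst, dropping every token that is not in keep_words.

    A line is written for every input line, even if it ends up empty,
    so that line numbers (and therefore document tags) stay unchanged.
    Returns the total number of tokens kept.
    """
    kept_total = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            tokens = [t for t in line.split() if t in keep_words]
            dst.write(" ".join(tokens) + "\n")
            kept_total += len(tokens)
    return kept_total


# Toy stand-in for the real corpus file and word_freq keys.
demo_dir = tempfile.mkdtemp()
raw_path = os.path.join(demo_dir, "raw.txt")
filtered_path = os.path.join(demo_dir, "filtered.txt")
with open(raw_path, "w", encoding="utf-8") as f:
    f.write("word1 noise word2\nnoise word2 word2\n")

total_kept = filter_corpus_file(raw_path, filtered_path, {"word1", "word2"})
with open(filtered_path, encoding="utf-8") as f:
    filtered_text = f.read()
```

The filtered file could then be passed as corpus_file to both build_vocab() and train(), with the returned token count as a candidate for total_words. Whether this matches the intended vocabulary exactly would still need checking against the real data.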
