I am a little new to the doc2vec algorithm and am using gensim for its implementation in Python.
Following the gensim tutorial "Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset", I built the vocabulary, trained a doc2vec model, and saved it to disk with:
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(dm=0, dbow_words=1, size=300, window=8, min_count=2,
                iter=10, workers=cores, alpha=0.025, min_alpha=0.025)
model.build_vocab(art_shuffle, progress_per=10000)
model.train(art_shuffle, total_examples=len(art_shuffle), epochs=10)
model.save('doc2vec_model')
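For context, art_shuffle is a shuffled list of TaggedDocument objects. Simplified, it is built along these lines (articles here is just a stand-in for my actual list of (tag, text) pairs):

import random
from gensim.models.doc2vec import TaggedDocument

# Simplified corpus construction; `articles` is a placeholder
# for my real list of (tag, text) pairs
art_shuffle = [TaggedDocument(words=tokenize(text), tags=[tag])
               for tag, text in articles]
random.shuffle(art_shuffle)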
It creates the following four files in my directory:
doc2vec_model
doc2vec_model.docvecs.doctag_syn0.npy
doc2vec_model.syn1neg.npy
doc2vec_model.wv.syn0.npy
I load the model back using the same filename I used to save it, i.e.:
model = Doc2Vec.load('doc2vec_model')
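In case it helps narrow things down, this is the kind of sanity check I would run on the loaded model; the attribute names match the .npy files listed above:

# Sanity-check sketch: after Doc2Vec.load() these should all be
# numpy arrays backing the .npy files saved alongside the model
print(type(model.wv.syn0))              # expect numpy.ndarray
print(type(model.docvecs.doctag_syn0))  # expect numpy.ndarray
print(type(model.syn1neg))              # expect numpy.ndarray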
After that, when I use this model to infer a vector for a new document, I get an error:
model.infer_vector(tokenize(doc_text))
Traceback (most recent call last):
File "C:\Users\vipul\Documents\NLP_testing\python-nlp\doc2vec_trials\story_prediction_doc2vec.py", line 394, in <module>
inferred_vector = model.infer_vector(tokenize(doc_text))
File "C:\Python27\lib\site-packages\gensim\models\doc2vec.py", line 743, in infer_vector
doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
File "gensim\models\doc2vec_inner.pyx", line 272, in gensim.models.doc2vec_inner.train_document_dbow (./gensim/models/doc2vec_inner.c:3535)
_word_vectors = <REAL_t *>(np.PyArray_DATA(word_vectors))
TypeError: Cannot convert list to numpy.ndarray
Where am I going wrong?
Note: the tokenize() function returns a list of words using NLTK's WordPunctTokenizer.
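Roughly, it is just a thin wrapper around the function form of that tokenizer (a sketch; the real implementation may do extra cleanup such as lowercasing):

from nltk.tokenize import wordpunct_tokenize

# Essentially what tokenize() does: split raw text on word and
# punctuation boundaries and return a list of string tokens
def tokenize(text):
    return wordpunct_tokenize(text)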