BACKGROUND
At the beginning of my project, the goal was to compare received requests/questions by how they differ in content. I trained a Doc2Vec model and the results were pretty good (for reference, my data included 14 million requests).
import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models.phrases import Phrases, Phraser

class PhrasingIterable():
    """Restartable iterable that applies the phraser to each text."""
    def __init__(self, my_phraser, texts):
        self.my_phraser = my_phraser
        self.texts = texts
    def __iter__(self):
        return iter(self.my_phraser[self.texts])

docs = DocumentIterator()
# Detect frequent bigrams and merge them into single tokens
bigram_transformer = Phrases(docs, min_count=1, threshold=10)
bigram = Phraser(bigram_transformer)
corpus = PhrasingIterable(bigram, docs)
# One tag per request (its position in the corpus)
sentences = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]
model = Doc2Vec(window=5,
                vector_size=300,
                min_count=10,
                workers=multiprocessing.cpu_count(),
                epochs=10,
                compute_loss=True)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
However, in a second stage, the focus of the analysis shifted from requests to individuals per week. To measure how an individual's requests differ from week to week, I extracted all words from the requests in a given week t and compared them with all words from the requests in the previous week t-1 using d2v_model.wv.n_similarity. Since I need to replicate this in other areas, it occurred to me that I was wasting too much memory and time training Doc2Vec models when I could use Word2Vec to get the same measure. Thus, I trained the following Word2Vec model:
import multiprocessing
import gensim
from gensim.models import Word2Vec

docs = DocumentIterator()
bigram_transformer = gensim.models.Phrases(docs, min_count=1, threshold=10)
bigram = gensim.models.phrases.Phraser(bigram_transformer)
sentences = PhrasingIterable(bigram, docs)
# size= and iter= are the gensim 3.x names (vector_size= and epochs= in gensim 4)
model = Word2Vec(window=5,
                 size=300,
                 min_count=10,
                 workers=multiprocessing.cpu_count(),
                 iter=10,
                 compute_loss=True)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
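With either trained model, the week-to-week comparison described above can be sketched roughly as follows. This is a minimal NumPy illustration, assuming that n_similarity amounts to the cosine between the mean word vectors of the two sets. The names set_similarity, toy_vectors, and weeks are hypothetical, and the toy dictionary only stands in for model.wv. Unlike the gensim call, this sketch silently skips out-of-vocabulary words.

```python
import numpy as np

def set_similarity(vectors, words_a, words_b):
    """Cosine similarity between the mean vectors of two word sets."""
    # Skip words without a vector (e.g. those dropped by min_count).
    a = [vectors[w] for w in words_a if w in vectors]
    b = [vectors[w] for w in words_b if w in vectors]
    if not a or not b:
        return float("nan")
    mean_a = np.mean(a, axis=0)
    mean_b = np.mean(b, axis=0)
    denom = np.linalg.norm(mean_a) * np.linalg.norm(mean_b)
    return float(np.dot(mean_a, mean_b) / denom)

# Hypothetical toy vectors standing in for model.wv
toy_vectors = {
    "refund": np.array([1.0, 0.0]),
    "invoice": np.array([0.8, 0.2]),
    "password": np.array([0.0, 1.0]),
}

# Hypothetical words extracted per week for one individual
weeks = [["refund", "invoice"], ["invoice", "password"], ["password"]]

# Similarity of week t with week t-1, for t = 1, 2, ...
week_to_week = [
    set_similarity(toy_vectors, weeks[t - 1], weeks[t])
    for t in range(1, len(weeks))
]
print(week_to_week)
```

Since only the word vectors enter this computation, any difference between the two models' results would have to come from the word vectors themselves.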
I again used cosine similarity to compare the content from week to week, this time via w2v_model.wv.n_similarity. As a sanity check, I compared the similarities generated by Word2Vec and Doc2Vec: the correlation coefficient between them is around 0.70, and the scale differs a lot. My implicit assumption was that comparing sets of extracted words using d2v_model.wv.n_similarity was taking advantage of the Word2Vec model within the trained Doc2Vec model.
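For reference, a sanity check of this kind could look roughly like the sketch below. It is only an illustration: the arrays d2v_sims and w2v_sims hold made-up numbers, not my actual results, and the helper names pearson and spearman are hypothetical. A rank correlation is included alongside Pearson's because it is unaffected by the two models using different scales.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman rank correlation; this simple version assumes no ties."""
    def rank(v):
        # argsort of argsort gives each element's 0-based rank
        return np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

# Made-up week-to-week similarities from each model (illustration only)
d2v_sims = np.array([0.91, 0.85, 0.97, 0.60, 0.78])
w2v_sims = np.array([0.55, 0.40, 0.70, 0.35, 0.30])

print("Pearson: ", pearson(d2v_sims, w2v_sims))
print("Spearman:", spearman(d2v_sims, w2v_sims))
```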
MY QUESTION
Should cosine similarity measures between two sets of extracted words differ as we switch from Doc2Vec to Word2Vec? If so, why? If not, any suggestions on my code?