Different cosine similarity coefficients from Doc2Vec and Word2Vec


BACKGROUND

At the beginning of my project, the focus was on comparing the requests/questions we receive in terms of how their content differs. I trained a Doc2Vec model and the results were pretty good (for reference, my data included 14 million requests).

import multiprocessing

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.models.phrases import Phrases, Phraser

class PhrasingIterable:
    """Re-iterable wrapper that applies a trained Phraser to a stream of token lists."""
    def __init__(self, my_phraser, texts):
        self.my_phraser = my_phraser
        self.texts = texts
    def __iter__(self):
        return iter(self.my_phraser[self.texts])

docs = DocumentIterator()  # custom re-iterable over tokenised requests (defined elsewhere)
bigram_transformer = Phrases(docs, min_count=1, threshold=10)
bigram = Phraser(bigram_transformer)
corpus = PhrasingIterable(bigram, docs)
sentences = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]
model = Doc2Vec(window=5,
                vector_size=300,
                min_count=10,
                workers=multiprocessing.cpu_count(),
                epochs=10,
                compute_loss=True)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
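For context, the request-level comparison in that first stage looked roughly like the sketch below; the tag indices and example tokens are purely illustrative, and I assume gensim 3.x, where document vectors live under model.docvecs:

# Minimal sketch (not my exact code): comparing requests via their trained document
# vectors, using the integer tags assigned above. Assumes gensim 3.x (model.docvecs).
sim = model.docvecs.similarity(0, 1)  # cosine similarity between tagged docs 0 and 1
print(f"similarity between request 0 and request 1: {sim:.3f}")

# For an unseen request, infer a vector and look up the closest training documents.
new_tokens = bigram[["hypothetical", "request", "tokens"]]
inferred = model.infer_vector(new_tokens)
print(model.docvecs.most_similar([inferred], topn=3))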

However, in a second stage, the focus of the analysis shifted from requests to individuals per week. To measure how an individual's requests differ from week to week, I extracted all words from the requests in a given week t and compared them with all words from the requests in the previous week t-1 using d2v_model.wv.n_similarity (a minimal sketch of this week-over-week comparison appears further below). Since I need to replicate this in other areas, it occurred to me that I was wasting too much memory and time training Doc2Vec models when I could use Word2Vec to get the same measure. Thus, I trained the following Word2Vec model:

import multiprocessing
import gensim
from gensim.models import Word2Vec

docs = DocumentIterator()
bigram_transformer = gensim.models.Phrases(docs, min_count=1, threshold=10)
bigram = gensim.models.phrases.Phraser(bigram_transformer)
sentences = PhrasingIterable(bigram, docs)
model = Word2Vec(window=5,
                 size=300,       # vector_size= in gensim >= 4.0
                 min_count=10,
                 workers=multiprocessing.cpu_count(),
                 iter=10,        # epochs= in gensim >= 4.0
                 compute_loss=True)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

I again used cosine similarity, via w2v_model.wv.n_similarity, to compare the content from week to week. As a sanity check, I compared the similarities generated by Word2Vec and Doc2Vec: the correlation coefficient between them is around 0.70, and the scales differ a lot. My implicit assumption was that comparing sets of extracted words with d2v_model.wv.n_similarity was simply taking advantage of the Word2Vec word vectors inside the trained Doc2Vec model.
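
The sanity check was roughly as sketched below; words_by_week, d2v_model and w2v_model are hypothetical names for one individual's weekly word lists and the two trained models:

# Minimal sketch of the sanity check (words_by_week, d2v_model and w2v_model are
# hypothetical names). n_similarity returns the cosine similarity between the mean
# vectors of two sets of in-vocabulary words.
import numpy as np

d2v_sims, w2v_sims = [], []
weeks = sorted(words_by_week)                      # e.g. {week_id: [word, word, ...]}
for prev, curr in zip(weeks, weeks[1:]):
    prev_words = [w for w in words_by_week[prev] if w in d2v_model.wv and w in w2v_model.wv]
    curr_words = [w for w in words_by_week[curr] if w in d2v_model.wv and w in w2v_model.wv]
    if prev_words and curr_words:
        d2v_sims.append(d2v_model.wv.n_similarity(prev_words, curr_words))
        w2v_sims.append(w2v_model.wv.n_similarity(prev_words, curr_words))

print(np.corrcoef(d2v_sims, w2v_sims)[0, 1])       # this is where I see ~0.70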

MY QUESTION

Should cosine similarity measures between two sets of extracted words differ when we switch from Doc2Vec to Word2Vec? If so, why? If not, any suggestions on my code?
