Word2vec gensim - Calculating similarity between words isn't working when using phrases


I'm using a gensim word2vec model in order to calculate similarities between two words. Training the model on a 250 MB Wikipedia text gave good results: similarity scores of about 0.7-0.8 for related pairs of words.

The problem is that when I use the Phraser model to add phrases, the similarity score drops to nearly zero for the same exact words.

Results with the phrase model:

speed - velocity - 0.0203503432178
high - low - -0.0435703782446
tall - high - -0.0076987978333
nice - good - 0.0368784716958
computer - computational - 0.00487748035808
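For context, these scores are cosine similarities between the learned word vectors. As a rough sketch (not gensim's actual implementation, which is vectorized with numpy), the computation behind `model.similarity(w1, w2)` amounts to:

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two vectors: dot product over
    the product of the vector norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)
```

Near-zero scores like the ones above mean the two vectors are essentially uncorrelated, which usually indicates that little or no real training happened for those words.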

That probably means I am not using the Phraser model correctly.

My Code:

    data_set_location = **
    sentences = SentenceIterator(data_set_location)

    # Train phrase locator model
    self.phraser = Phraser(Phrases(sentences))

    # Renewing the iterator because it's empty
    sentences = SentenceIterator(data_set_location)

    # Train word to vector model or load it from disk
    self.model = Word2Vec(self.phraser[sentences], size=256, min_count=10, workers=10)



class SentenceIterator(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname), 'r', encoding='utf-8', errors='ignore'):
                yield line.lower().split()

Trying the phraser model alone, it looks like it worked fine:

>>> vectorizer.phraser['new', 'york', 'city', 'the', 'san', 'francisco']
['new_york', 'city', 'the', 'san_francisco']

What can cause such behavior?

Trying to figure out the solution:

Following gojomo's answer, I've tried to create a PhraseIterator:

import os

class PhraseIterator(object):
    def __init__(self, dirname, phraser):
        self.dirname = dirname
        self.phraser = phraser

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname), 'r', encoding='utf-8', errors='ignore'):
                yield self.phraser[line.lower()]

Using this iterator, I've tried to train my Word2vec model:

phrase_iterator = PhraseIterator(text_dir, self.phraser)
self.model = Word2Vec(phrase_iterator, size=256, min_count=10, workers=10)

Word2vec training log:

Using TensorFlow backend.
2017-06-30 19:19:05,388 : INFO : collecting all words and their counts
2017-06-30 19:19:05,456 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2017-06-30 19:20:30,787 : INFO : collected 6227763 word types from a corpus of 28508701 words (unigram + bigrams) and 84 sentences
2017-06-30 19:20:30,793 : INFO : using 6227763 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2017-06-30 19:20:30,793 : INFO : source_vocab length 6227763
2017-06-30 19:21:46,573 : INFO : Phraser added 50000 phrasegrams
2017-06-30 19:22:22,015 : INFO : Phraser built with 70065 70065 phrasegrams
2017-06-30 19:22:23,089 : INFO : saving Phraser object under **/Models/word2vec/phrases_model, separately None
2017-06-30 19:22:23,441 : INFO : saved **/Models/word2vec/phrases_model
2017-06-30 19:22:23,442 : INFO : collecting all words and their counts
2017-06-30 19:22:29,347 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-06-30 19:33:06,667 : INFO : collected 143 word types from a corpus of 163438509 raw words and 84 sentences
2017-06-30 19:33:06,677 : INFO : Loading a fresh vocabulary
2017-06-30 19:33:06,678 : INFO : min_count=10 retains 95 unique words (66% of original 143, drops 48)
2017-06-30 19:33:06,679 : INFO : min_count=10 leaves 163438412 word corpus (99% of original 163438509, drops 97)
2017-06-30 19:33:06,683 : INFO : deleting the raw counts dictionary of 143 items
2017-06-30 19:33:06,683 : INFO : sample=0.001 downsamples 27 most-common words
2017-06-30 19:33:06,683 : INFO : downsampling leaves estimated 30341972 word corpus (18.6% of prior 163438412)
2017-06-30 19:33:06,684 : INFO : estimated required memory for 95 words and 256 dimensions: 242060 bytes
2017-06-30 19:33:06,685 : INFO : resetting layer weights
2017-06-30 19:33:06,724 : INFO : training model with 10 workers on 95 vocabulary and 256 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-06-30 19:33:14,974 : INFO : PROGRESS: at 0.00% examples, 0 words/s, in_qsize 0, out_qsize 0
2017-06-30 19:33:23,229 : INFO : PROGRESS: at 0.24% examples, 607 words/s, in_qsize 0, out_qsize 0
2017-06-30 19:33:31,445 : INFO : PROGRESS: at 0.48% examples, 810 words/s, 
...
2017-06-30 20:19:00,864 : INFO : PROGRESS: at 98.57% examples, 1436 words/s, in_qsize 0, out_qsize 1
2017-06-30 20:19:06,193 : INFO : PROGRESS: at 99.05% examples, 1437 words/s, in_qsize 0, out_qsize 0
2017-06-30 20:19:11,886 : INFO : PROGRESS: at 99.29% examples, 1437 words/s, in_qsize 0, out_qsize 0
2017-06-30 20:19:17,648 : INFO : PROGRESS: at 99.52% examples, 1438 words/s, in_qsize 0, out_qsize 0
2017-06-30 20:19:22,870 : INFO : worker thread finished; awaiting finish of 9 more threads
2017-06-30 20:19:22,908 : INFO : worker thread finished; awaiting finish of 8 more threads
2017-06-30 20:19:22,947 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-06-30 20:19:22,947 : INFO : PROGRESS: at 99.76% examples, 1439 words/s, in_qsize 0, out_qsize 8
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-06-30 20:19:22,949 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-06-30 20:19:22,949 : INFO : training on 817192545 raw words (4004752 effective words) took 2776.2s, 1443 effective words/s
2017-06-30 20:19:22,950 : INFO : saving Word2Vec object under **/Models/word2vec/word2vec_model, separately None
2017-06-30 20:19:22,951 : INFO : not storing attribute syn0norm
2017-06-30 20:19:22,951 : INFO : not storing attribute cum_table
2017-06-30 20:19:22,958 : INFO : saved **/Models/word2vec/word2vec_model

After this training, every similarity calculation between two words produces zero:

speed - velocity - 0
high - low - 0

So it seems that the iterator is not working well, so I've checked it using gojomo's trick:

print(sum(1 for _ in s))
1

print(sum(1 for _ in s))
1

And it's working.

What may be the problem?

2 Answers

BEST ANSWER

First, if your iterable class is working properly – and it looks OK to me – you won't need to "renew the iterator because it's empty". Rather, it will be capable of being iterated over multiple times. You can test if it's working properly as an iterable-object, rather than a single iteration, with code like:

sentences = SentenceIterator(mypath)
print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))

If the same length prints twice, congratulations, you have a true iterable object. (You might want to update the class name to reflect that.) If the second length is 0, you've only got an iterator: it can be consumed once, and then is empty on subsequent attempts. (If so, adjust the class code so that each call to __iter__() starts fresh. But as noted above, I think your code is already correct.)
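As a self-contained illustration of the difference (the names here are made up for the demo): a generator function returns a one-shot iterator, while a class whose `__iter__()` yields values starts a fresh pass every time.

```python
def one_shot():
    """A generator: a one-time iterator, exhausted after a single pass."""
    yield ['hello', 'world']
    yield ['foo', 'bar']

g = one_shot()
print(sum(1 for _ in g))  # 2
print(sum(1 for _ in g))  # 0 -- already consumed

class Restartable(object):
    """A true iterable: each call to __iter__() starts a fresh pass."""
    def __iter__(self):
        yield ['hello', 'world']
        yield ['foo', 'bar']

r = Restartable()
print(sum(1 for _ in r))  # 2
print(sum(1 for _ in r))  # 2 -- restartable, as Word2Vec's multiple passes require
```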

That digression was important, because the true cause of your problem is that self.phraser[sentences] just returns a one-time iterator object, not a repeatable iterable object. Thus, Word2Vec's first vocabulary-discovery step consumes the whole corpus in its one pass, and then all the training passes see nothing, so no training occurs. (If you have INFO-level logging on, this should be evident in the output showing instant training over no examples.)

Try making a PhraserIterable class, which takes a phraser and a sentences iterable, and upon each call to __iter__() starts a new, fresh pass over the sentences. Supply a (confirmed-restartable) instance of that as the corpus for Word2Vec. You should see training take longer as it does its default 5 passes, and then see real results in later token comparisons.
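A minimal sketch of such a class (assuming, as with gensim's Phraser, that `phraser[sentence]` accepts a single list of tokens and returns the phrased list):

```python
class PhraserIterable(object):
    """Wraps a phraser and a restartable sentences iterable so the
    phrase transformation is re-applied on every pass."""
    def __init__(self, phraser, sentences):
        self.phraser = phraser
        self.sentences = sentences  # must itself be restartable

    def __iter__(self):
        # Each call starts a fresh pass over the underlying sentences,
        # yielding one phrased token list per sentence.
        for sentence in self.sentences:
            yield self.phraser[sentence]
```

An instance of this class can then be passed directly as the Word2Vec corpus, e.g. `Word2Vec(PhraserIterable(phraser, sentences), ...)`.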

Separately: the on-the-fly upgrading of the original sentences' unigrams to phraser-calculated bigrams can be computationally expensive. The approach suggested above means that happens 6 times: the vocabulary scan, then the 5 training passes. Where running time is a concern, it can be beneficial to perform the phraser combination once, saving the results to an in-memory object (if your corpus easily fits in RAM) or to a new, simply space-delimited interim file, then use that file as input to the Word2Vec model.
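A sketch of that one-pass interim-file approach (the function name is made up; the output format, one space-delimited sentence per line, is what gensim's LineSentence expects):

```python
import io

def write_phrased_corpus(phraser, sentences, path):
    """Apply the phraser exactly once, writing one space-delimited
    phrased sentence per line."""
    with io.open(path, 'w', encoding='utf-8') as out:
        for sentence in sentences:
            out.write(' '.join(phraser[sentence]) + '\n')
```

The resulting file can then be streamed repeatedly and cheaply, for example via gensim's LineSentence(path), as the Word2Vec corpus.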

ANSWER (by the asker)

Using the help of gojomo, this is the code that worked:

PhraseIterator:

class PhraseIterator(object):
    def __init__(self, phraser, sentences_iterator):
        self.phraser = phraser
        self.sentences_iterator = sentences_iterator

    def __iter__(self):
        yield self.phraser[self.sentences_iterator]

Using this iterator produced an error:

Unhashable type list

So I found a solution, which was to use it this way:

from itertools import chain

phrase_iterator = PhraseIterator(self.phraser, sentences)
self.model = Word2Vec(list(chain(*phrase_iterator)), size=256, min_count=10, workers=10)

Now the similarity calculations work great (way better than before, without phrasing):

speed - velocity - 0.950267364305
high - low - 0.933983275802
tall - high - 0.858025875923
nice - good - 0.878882061037
computer - computational - 0.972395648333