Output of Cosine Similarity is not as expected

I am trying to compute the cosine similarity between two words in a sentence. The sentence is "The black cat sat on the couch and the brown dog slept on the rug".

My Python code is below:

from nltk.tokenize import sent_tokenize, word_tokenize  # requires the NLTK 'punkt' data: nltk.download('punkt')
import warnings

# Suppress warning output
warnings.filterwarnings(action = 'ignore')
 
import gensim
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

sentence = "The black cat sat on the couch and the brown dog slept on the rug"
# Replace any newlines with spaces
f = sentence.replace("\n", " ")
 
data = []

# sentence parsing
for i in sent_tokenize(f):
    temp = []
    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())
    data.append(temp)
print(data)
# Creating Skip Gram model
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 512, window = 5, sg = 1)

# Print results
print("Cosine similarity between 'black' " +
          "and 'brown' - Skip Gram : ",
    model2.wv.similarity('black', 'brown'))

As "black" and "brown" are of colour type, their cosine similarity should be maximum (somewhere around 1). But my result shows following:

[['the', 'black', 'cat', 'sat', 'on', 'the', 'couch', 'and', 'the', 'brown', 'dog', 'slept', 'on', 'the', 'rug']]
Cosine similarity between 'black' and 'brown' - Skip Gram :  0.008911405

Any idea what is wrong here? Is my understanding about cosine similarity correct?

2 Answers

Answer from gojomo:

If you're training your own word2vec model, as you show here, it needs a large dataset of varied in-context examples of word usage to create useful vectors. It's only the push-pull of trying to model tens of thousands of different words, in many subtly-varied usages, that moves the word-vectors to places where they reflect relative meanings.

That usefulness won't happen with a training corpus of just 15 words, or for words with few usage examples. (There's a good reason the default min_count is 5, and in general you should try to increase that value, as your data becomes large enough to allow it, rather than decrease it.)

Generally, word2vec can't be well-demonstrated or understood with toy-sized examples. Further, even to create word-vectors of common dimensionalities (100 to 400 dimensions), it's best to have training texts totalling millions or billions of words. You need even more training words to support even larger dimensions, like your vector_size=512 choice.

So some potential options for you are:

  • if you want to train your own model, find a lot more training texts, use a smaller vector_size, and a larger min_count; or

  • use someone else's pretrained sets of word-vectors, which can be loaded into a Gensim KeyedVectors object (vectors without an associated training model); a sketch of this follows below
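
For the second option, here is a minimal sketch using Gensim's downloader API and the pretrained 'glove-wiki-gigaword-100' vector set (that particular vector set is just one example of a pretrained model, not something required by the answer):

import gensim.downloader as api

# Download (on first use) and load pretrained GloVe vectors as a KeyedVectors object
wv = api.load('glove-wiki-gigaword-100')

# Vectors trained on billions of words place related colour words much closer together
print("Cosine similarity between 'black' and 'brown' - pretrained GloVe : ",
      wv.similarity('black', 'brown'))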

Answer from 00__00__00:

You cannot train a word2vec model on a dozen tokens (the words in your sentence). You need thousands to millions of tokens.
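
To illustrate the scale involved, here is a minimal sketch of a training run on a more realistically sized corpus, the 'text8' corpus (roughly 17 million words of Wikipedia text, a ~30 MB download) available through Gensim's downloader; the corpus choice and parameters are illustrative, not part of the original answer:

import gensim.downloader as api
from gensim.models import Word2Vec

# text8 is a restartable iterable of tokenized sentences
corpus = api.load('text8')

# A smaller vector_size and the default min_count=5 suit a corpus of this size
model = Word2Vec(corpus, vector_size=100, window=5, min_count=5, sg=1)

print("Cosine similarity between 'black' and 'brown' - Skip Gram on text8 : ",
      model.wv.similarity('black', 'brown'))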