http://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec
In the Spark implementation of word2vec, when the number of iterations or data partitions is greater than one, the reported cosine similarity is sometimes greater than 1.
To my knowledge, cosine similarity should always satisfy -1 <= cos <= 1. Does anyone know why?
In the `findSynonyms` method of `word2vec`, it does not compute the true cosine similarity `v1・vi / (|v1| |vi|)`; instead it computes `v1・vi / |vi|`, where `v1` is the vector of the query word and `vi` is the vector of a candidate word. That is why the value sometimes exceeds 1. For ranking the closest words, dividing by `|v1|` is unnecessary because it is the same constant for every candidate, so the ordering of synonyms is unaffected.
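A small numpy sketch (with made-up 2-D vectors standing in for word embeddings) shows how dividing by only `|vi|` can exceed 1, while the true cosine stays bounded and can be recovered by dividing out `|v1|`:

```python
import numpy as np

# Hypothetical vectors standing in for word2vec embeddings.
v1 = np.array([3.0, 4.0])   # query word vector, |v1| = 5
vi = np.array([1.0, 0.0])   # candidate word vector, |vi| = 1

# Score of the form v1·vi / |vi| (normalized by the candidate only).
partial_score = v1.dot(vi) / np.linalg.norm(vi)

# True cosine similarity: v1·vi / (|v1| |vi|), always in [-1, 1].
cosine = v1.dot(vi) / (np.linalg.norm(v1) * np.linalg.norm(vi))

print(partial_score)                          # 3.0 -> exceeds 1
print(cosine)                                 # 0.6
print(partial_score / np.linalg.norm(v1))     # 0.6 -> |v1| divided out
```

Since `|v1|` is the same for every candidate word, dividing every score by it rescales all similarities uniformly without changing their order.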