http://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec
In the Spark implementation of word2vec, when the number of iterations or data partitions is greater than one, the reported cosine similarity is sometimes greater than 1.
To my knowledge, cosine similarity should always satisfy -1 <= cos <= 1. Does anyone know why?
In the `findSynonyms` method of `word2vec`, it does not compute the true cosine similarity `v1・vi / (|v1| |vi|)`; instead it computes `v1・vi / |vi|`, where `v1` is the vector of the query word and `vi` is the vector of a candidate word. That is why the value sometimes exceeds 1. For ranking the closest words, dividing by `|v1|` is unnecessary because it is the same constant for every candidate, so the ordering of synonyms is unaffected.
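A small numpy sketch (with made-up 2-D vectors standing in for word embeddings) shows how dividing by only `|vi|` can exceed 1, while the true cosine stays bounded and can be recovered by dividing out `|v1|`:

```python
import numpy as np

# Hypothetical vectors standing in for word2vec embeddings.
v1 = np.array([3.0, 4.0])   # query word vector, |v1| = 5
vi = np.array([1.0, 0.0])   # candidate word vector, |vi| = 1

# Score of the form v1·vi / |vi| (normalized by the candidate only).
partial_score = v1.dot(vi) / np.linalg.norm(vi)

# True cosine similarity: v1·vi / (|v1| |vi|), always in [-1, 1].
cosine = v1.dot(vi) / (np.linalg.norm(v1) * np.linalg.norm(vi))

print(partial_score)                          # 3.0 -> exceeds 1
print(cosine)                                 # 0.6
print(partial_score / np.linalg.norm(v1))     # 0.6 -> |v1| divided out
```

Since `|v1|` is the same for every candidate word, dividing every score by it rescales all similarities uniformly without changing their order.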