So I have made an AnnoyIndexer and am running some most_similar queries to find the nearest neighbours of some vectors in a 300dimensional vector space. This is the code for it:
def most_similar(self, vector, num_neighbors):
"""Find the approximate `num_neighbors` most similar items.
Parameters
----------
vector : numpy.array
Vector for word/document.
num_neighbors : int
Number of most similar items
Returns
-------
list of (str, float)
List of most similar items in format [(`item`, `cosine_distance`), ... ]
"""
ids, distances = self.index.get_nns_by_vector(
vector, num_neighbors, include_distances=True)
return [(self.labels[ids[i]], 1 - distances[i] / 2) for i in range(len(ids))]
I am wondering why the returned values for the distances are all taken away from 1 and then divided by 2? Surely after doing that, largest/smallest distances are all messed up?
From the documentation of gensim:
The distances returned by the AnnoyIndex are the euclidean distance between the vectors. So the method needs to transform the euclidean distances in cosine distances. The cosine distance is equals to
1 - e/2
wheree
is the euclidean distance value, hence the transformation. See this for a derivation of the equivalence.Also notice that this transformation does not alter the ordinal relationship between the values, consider
0 < d1 < d2 < 1
thend1/2 < d2/2
and1 - d1/2 < 1 - d2/2
, so ifd1
was the distance ofo1
andd2
ofo2
then after the transformationo1
remains closer to the query vector thano2
.