Understanding the most_similar method for an AnnoyIndexer in gensim.similarities.index

1.3k Views Asked by ellie123 At 02 July 2025 at 15:15

So I have made an AnnoyIndexer and am running some most_similar queries to find the nearest neighbours of some vectors in a 300dimensional vector space. This is the code for it:

def most_similar(self, vector, num_neighbors):
    """Find the approximate `num_neighbors` most similar items.
    Parameters
    ----------
    vector : numpy.array
        Vector for word/document.
    num_neighbors : int
        Number of most similar items
    Returns
    -------
    list of (str, float)
        List of most similar items in format [(`item`, `cosine_distance`), ... ]
    """

    ids, distances = self.index.get_nns_by_vector(
        vector, num_neighbors, include_distances=True)

    return [(self.labels[ids[i]], 1 - distances[i] / 2) for i in range(len(ids))]

I am wondering why the returned values for the distances are all taken away from 1 and then divided by 2? Surely after doing that, largest/smallest distances are all messed up?

Original Q&A

There are 1 best solutions below

Dani Mesejo On 17 August 2018 at 22:12

From the documentation of gensim:

"List of most similar items in format [(`item`, `cosine_distance`), ...]"

The distances returned by the AnnoyIndex are the euclidean distance between the vectors. So the method needs to transform the euclidean distances in cosine distances. The cosine distance is equals to 1 - e/2 where e is the euclidean distance value, hence the transformation. See this for a derivation of the equivalence.

Also notice that this transformation does not alter the ordinal relationship between the values, consider 0 < d1 < d2 < 1 then d1/2 < d2/2 and 1 - d1/2 < 1 - d2/2, so if d1 was the distance of o1 and d2 of o2 then after the transformation o1 remains closer to the query vector than o2.

Understanding the most_similar method for an AnnoyIndexer in gensim.similarities.index

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in NLP

Related Questions in GENSIM

Related Questions in WORD2VEC

Related Questions in ANNOY

Trending Questions

Popular # Hahtags

Popular Questions