Understanding the most_similar method for an AnnoyIndexer in gensim.similarities.index

1.2k Views Asked by At

So I have made an AnnoyIndexer and am running some most_similar queries to find the nearest neighbours of some vectors in a 300dimensional vector space. This is the code for it:

def most_similar(self, vector, num_neighbors):
    """Find the approximate `num_neighbors` most similar items.
    Parameters
    ----------
    vector : numpy.array
        Vector for word/document.
    num_neighbors : int
        Number of most similar items
    Returns
    -------
    list of (str, float)
        List of most similar items in format [(`item`, `cosine_distance`), ... ]
    """

    ids, distances = self.index.get_nns_by_vector(
        vector, num_neighbors, include_distances=True)

    return [(self.labels[ids[i]], 1 - distances[i] / 2) for i in range(len(ids))]

I am wondering why the returned values for the distances are all taken away from 1 and then divided by 2? Surely after doing that, largest/smallest distances are all messed up?

1

There are 1 best solutions below

2
On

From the documentation of gensim:

"List of most similar items in format [(`item`, `cosine_distance`), ...]"

The distances returned by the AnnoyIndex are the euclidean distance between the vectors. So the method needs to transform the euclidean distances in cosine distances. The cosine distance is equals to 1 - e/2 where e is the euclidean distance value, hence the transformation. See this for a derivation of the equivalence.

Also notice that this transformation does not alter the ordinal relationship between the values, consider 0 < d1 < d2 < 1 then d1/2 < d2/2 and 1 - d1/2 < 1 - d2/2, so if d1 was the distance of o1 and d2 of o2 then after the transformation o1 remains closer to the query vector than o2.