How to use the spotify's annoy library in python?

10.9k Views Asked by At

I want to know how the annoy library works. I got this test code from github but I'm new to coding, so it is difficult for me to understand .

from annoy import AnnoyIndex
import random
f = 40

t = AnnoyIndex(f, 'angular')  #Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('test.ann')

u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors
1

There are 1 best solutions below

3
On

The annoy library is basically used to solve the problem of Nearest Neightbour search (NNS) for the Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance. With Annoy, you will build a net of dots or vectors in a n dimensional space, in order to then ask it to give you different propierties of this given net. One of the advantatges in using this library is that it is specially efficient when it comes to memory storage. There are many things you can do with it and it gets as complicated as you want, but here is an easy example that may help.

a = AnnoyIndex(3, 'euclidean')
b = AnnoyIndex(3 ,'angular')

Here you first create your object upon which the Annoy package will operate. The first case is a three dimensional space with an euclidean distance. The second has an angular one.

a.add_item(0,[1,0,0])
a.add_item(1,[0,1,0])
a.add_item(2,[2,0,0])
a.add_item(3,[2.5,0,0])
a.add_item(4,[1,0,0.5])

b.add_item(0,[1,0,0]) 
b.add_item(1,[0,1,0])
b.add_item(2,[2,0,0])
b.add_item(3,[2.5,0,0])
b.add_item(4,[1,0,0.5])

a.build(1)
b.build(1)

Here I add the same five items at the same positions in the objects a and b respectively and then I build the net with a/b.build(n), where n is the number of "trees" used. The more trees, the more precise and faster the net will be when performing operations. In this case, we use n=1 as the net is very simple.

    print(a.get_nns_by_item(0, 4))
    print(b.get_nns_by_item(0, 4))

Now I ask for the net to give me all the elements in the net between 0 and 4 listed from closer to further away from the item 0, using its respective distances, obtaining 0,4,1,2,3 and 0,2,3,4,1 respectively. Lets now see why we obtained such outcome. In the following lines I calculate the distance between the item 0 to the other items and then I sort them from smaller to bigger.



Object a , distances using euclidean distance (d((x,y,z),(x',y',z')) = sqrt{( x-x')^2 + (y-y')^2 + (z-z')^2}) :

'''
    d0-0 = 0 
    d0-1 = sqrt{5/4} 
    d0-2= 1.5 
    d0-3 = 2 
    d0-4 = 1/sqrt{2} 
'''

Sorting from smaller to bigger : d0-0,d0-4,d0-1,d0-2,d0-3 (or {0,4,1,2,3})



Object b , distances using angular distance (angle between the two vectors in degrees) :

'''
    angle0-0 = 0
    angle0-1 = 90
    angle0-2  = 0
    angle0-3 = 0
    angle0-4 = 45
'''

In that case, if two elements have the same angle, the sorting gives priority to the smaller item : {0,2,3,4,1}



So far so good. Now lets see your piece of code.

    from annoy import AnnoyIndex
    import random
    f = 40

    t = AnnoyIndex(f, 'angular')  #Length of item vector that will be indexed
    for i in range(1000):
        v = [random.gauss(0, 1) for z in range(f)]
        t.add_item(i, v)

    t.build(10) # 10 trees
    t.save('test.ann')
    

In this first part, we set the number of dimensions of the space of our net to 40 and we choose to use an angular distance. Then, in the next step, we create a thousand items, each one with its correspondenly forty dimensional vector, which can be also understood as a point in a 40th dimensional space. Then, the value for each dimension is set by random.gauss(0,1), which randomly returns a point inside a gaussian distributon given by e^{{x}^2/2}/sqrt{2\pi}. We build then the corresponding net by using 10 trees and save it as 'test.ann'.

u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
    
    

Now we can download it and use it for any purpose we want, and the advantage of it is that the net is already built, and thus it is faster.

print(u.get_nns_by_item(0, 1000)) 

Now we want to get the thousand elements sorted by its angular distance relative to the 0th item. Notice that if instead of 1000 we put a smaller number, the items bigger than this one would automatically be discarted.

For more information about Nearest Neightbour Search:

https://en.wikipedia.org/wiki/Nearest_neighbor_search

Github Annoy package

https://github.com/spotify/annoy