Using annoy with Torchtext for nearest neighbor search

I'm using Torchtext for some NLP tasks, specifically using the built-in embeddings.

I want to do an inverse vector search: generate a noisy vector, find the stored vector closest to it, and get back the word that vector belongs to.

From the torchtext docs, here's how to attach embeddings to a built-in dataset:

from torchtext.vocab import GloVe
from torchtext import data, datasets

embedding = GloVe(name='6B', dim=100)

# Set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False, is_target=True)

# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)

# build the vocabulary
TEXT.build_vocab(train, vectors=embedding, max_size=100000)
LABEL.build_vocab(train)

# Get an example vector
embedding.get_vecs_by_tokens("germany")

Then we can build the annoy index:

from annoy import AnnoyIndex

num_trees = 50
embedding_dims = embedding.dim  # 100 for GloVe(name='6B', dim=100)

ann_index = AnnoyIndex(embedding_dims, 'angular')

# Iterate through each vector in the embedding and add it to the index
for vector_num, vector in enumerate(TEXT.vocab.vectors):
    # Here's the catch: will vector_num correspond to torchtext.vocab.Vocab.itos?
    ann_index.add_item(vector_num, vector.tolist())  # pass a plain list of floats

ann_index.build(num_trees)

Then say I want to retrieve a word using a noisy vector:

# Get an existing vector
original_vec = embedding.get_vecs_by_tokens("germany")
# Add some noise to it
noise = generate_noise_vector(ndims=100)
noisy_vector = original_vec + noise
# Get the vector closest to the noisy vector
closest_item_idx = ann_index.get_nns_by_vector(noisy_vector.tolist(), 1)[0]
# Get word from noisy item
noisy_word = TEXT.vocab.itos[closest_item_idx]

My question concerns the last two lines above: ann_index was built by enumerating TEXT.vocab.vectors, which is a Torch tensor.

The vocab object has its own itos list that, given an index, returns a word.

Can I be certain that the order in which words appear in the itos list is the same as the order of the rows in TEXT.vocab.vectors? If not, how can I map one index to the other?

Best answer

Can I be certain that the order in which words appear in the itos list is the same as the order in TEXT.vocab.vectors?

Yes.

The Field class always instantiates a Vocab object, and since you are passing the pre-trained vectors to TEXT.build_vocab, the Vocab constructor calls the load_vectors function:

if vectors is not None:
    self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache)

In load_vectors, the vectors are filled by enumerating the tokens in itos, so row i of vectors holds the embedding of itos[i]:

for i, token in enumerate(self.itos):
    start_dim = 0
    for v in vectors:
        end_dim = start_dim + v.dim
        self.vectors[i][start_dim:end_dim] = v[token.strip()]
        start_dim = end_dim
    assert(start_dim == tot_dim)

Therefore, you can be certain that itos and vectors will have the same order.
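
To make the alignment concrete, here is a toy sketch in plain Python that does not use torchtext or annoy. The names toy_pretrained, build_rows, and nearest_word are hypothetical: they stand in for GloVe, load_vectors, and an Annoy query respectively. Only the enumerate-over-itos pattern is taken from the source quoted above. The brute-force search ranks by cosine similarity, which gives the same ordering as Annoy's 'angular' metric.

```python
import math

# Hypothetical stand-in for pretrained embeddings (token -> vector).
toy_pretrained = {
    "germany": [0.9, 0.1, 0.0],
    "france": [0.1, 0.9, 0.0],
    "banana": [0.0, 0.1, 0.9],
}

# itos as a Vocab would hold it: special tokens first, then words.
itos = ["<unk>", "<pad>", "germany", "banana", "france"]
dim = 3


def build_rows(itos, pretrained, dim):
    """Fill one row per itos entry, in itos order (mirrors load_vectors)."""
    rows = [[0.0] * dim for _ in itos]
    for i, token in enumerate(itos):
        # Assumption: tokens without a pretrained vector get a zero row.
        rows[i] = list(pretrained.get(token.strip(), [0.0] * dim))
    return rows


def cosine(a, b):
    """Cosine similarity; zero vectors are ranked last."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else -1.0


def nearest_word(query, rows, itos):
    """Brute-force nearest row, mapped back to a word through itos."""
    best = max(range(len(rows)), key=lambda i: cosine(query, rows[i]))
    return itos[best]


rows = build_rows(itos, toy_pretrained, dim)
noisy = [0.85, 0.2, 0.05]  # "germany" plus a little noise
print(nearest_word(noisy, rows, itos))
```

Because row i is written while visiting itos[i], any index returned by a search over the rows can be passed straight to itos, which is what TEXT.vocab.itos[closest_item_idx] does in the question.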