I have 1,000 descriptions for some SKU merchandise, and I want to generate token embeddings plus an inverse mapping (embedding -> item) so I can do semantic search.

For example, here is what I have:

item     description
item1    [word1, word2, word3, word4, ...]
item2    [word1, word2_2, word3_3, word4_4, ...]

As you can see, item1 and item2 share word1, but the two items use it in different contexts. By generating embeddings, we should be able to capture the context of each word.
Here is how I generate the embeddings:
import pandas as pd

# Read the first 100 descriptions into a list.
my_description = []
with open('/content/gdrive/My Drive/my.csv', 'r') as data:
    df = pd.read_csv(data, encoding='utf-8', nrows=100)
for index, row in df.iterrows():
    my_str = row['description']
    my_description.append(my_str)
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True)  # return all hidden states
model.eval()
text2 = my_description[0]

# Add the special tokens.
marked_text2 = "[CLS] " + text2 + " [SEP]"

# Split the sentence into tokens.
tokenized_text2 = tokenizer.tokenize(marked_text2)

# Map the token strings to their vocabulary indices.
indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)

segments_ids2 = [1] * len(tokenized_text2)

tokens_tensor2 = torch.tensor([indexed_tokens2])
segments_tensors2 = torch.tensor([segments_ids2])
with torch.no_grad():
    outputs2 = model(tokens_tensor2, segments_tensors2)
    hidden_states2 = outputs2[2]  # all 13 hidden-state layers

# Stack the layers: (13, 1, num_tokens, 768)
token_embeddings2 = torch.stack(hidden_states2, dim=0)

# Drop the batch dimension: (13, num_tokens, 768)
token_embeddings2 = torch.squeeze(token_embeddings2, dim=1)

# Swap the layer and token dimensions: (num_tokens, 13, 768)
token_embeddings2 = token_embeddings2.permute(1, 0, 2)
# Concatenate the last four layers for each token -> 3,072-dim vectors.
token_vecs_cat2 = []
for token in token_embeddings2:
    cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
    token_vecs_cat2.append(cat_vec)
# Sum the last four layers for each token -> 768-dim vectors.
import numpy as np

token_vecs_sum2 = []
x_token = np.empty((0, 768))
for token in token_embeddings2:
    sum_vec = torch.sum(token[-4:], dim=0)
    token_vecs_sum2.append(sum_vec)
    x_token = np.concatenate((x_token, sum_vec.numpy().reshape((1, -1))), axis=0)
x_token then holds the embeddings for every word/token in one description. For example, if item1 has 500 tokens, the shape of x_token would be (500, 768), since summing the last four layers gives one 768-dimensional vector per token.
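For reference, here is the same per-description step wrapped into a helper (my sketch, not tested on the full data; it reuses the tokenizer and model objects above, tokenizer(text, return_tensors='pt') adds [CLS]/[SEP] by itself, and outputs.hidden_states is available because the model was loaded with output_hidden_states=True):

import numpy as np
import torch

def embed_description(text):
    # Returns one row per token: the sum of the last four hidden layers (768 dims).
    encoded = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**encoded)
    # hidden_states: tuple of 13 tensors, each (1, num_tokens, 768)
    stacked = torch.stack(outputs.hidden_states, dim=0).squeeze(1)  # (13, num_tokens, 768)
    summed = torch.sum(stacked[-4:], dim=0)                         # (num_tokens, 768)
    return summed.numpy()

x_token = embed_description(my_description[0])
print(x_token.shape)  # (num_tokens, 768)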
So for each item I would have something like this:
item     token        embeddings
item 1   token 1      [x1, x2, x3, ...]
item 1   token 2      [x1, x2, x3, ...]
...
item 2   token 1_2    [x1, x2, x3, ...]
item 2   token 2_2    [x1, x2, x3, ...]
...
item n   token 1_n    [x1, x2, x3, ...]
item n   token 2_n    [x1, x2, x3, ...]
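To make the indexing step concrete, this is roughly what I have in mind (a sketch using the embed_description helper above; scikit-learn's NearestNeighbors is only a stand-in for a real ANN library such as faiss or Annoy):

import numpy as np
from sklearn.neighbors import NearestNeighbors

all_vecs = []     # one 768-dim row per token, across all items
row_to_item = []  # row_to_item[i] = id of the item that row i came from

for item_id, description in enumerate(my_description):
    vecs = embed_description(description)  # (num_tokens, 768)
    all_vecs.append(vecs)
    row_to_item.extend([item_id] * vecs.shape[0])

all_vecs = np.vstack(all_vecs)
row_to_item = np.array(row_to_item)

index = NearestNeighbors(n_neighbors=10, metric='cosine').fit(all_vecs)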
Now my question is: how do I perform the search?

Say my search query is a sentence:

"word1 word2 word3 ... wordn"

I can generate an embedding for each token in the query and run an ANN lookup for the top-10 nearest neighbors of each token. But if my query has 10 tokens, I get up to 100 item descriptions back (10 per token). In that case, how do I shortlist to the top-10 item descriptions? Which token should I use?
query = [token1, token2, ..., tokenN]

top-10 nearest neighbors per query token:
query_token1 -> [itemx1_1, itemx1_2, ..., itemx1_10]
query_token2 -> [itemx2_1, itemx2_2, ..., itemx2_10]
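In code, the lookup I am describing would be roughly this (again a sketch, reusing index, row_to_item and embed_description from above):

query_vecs = embed_description("word1 word2 word3")        # (num_query_tokens, 768)
dist, rows = index.kneighbors(query_vecs, n_neighbors=10)  # top-10 rows per query token

# Map each matched row back to its item: one list of 10 candidate items per token.
candidates_per_token = [row_to_item[r] for r in rows]
# With 10 query tokens this is up to 100 (item, distance) hits in total,
# and I don't know how to collapse them into a single top-10 item ranking.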
Am I doing semantic search wrong?