Imagine I have code like the following. I am using the encode function to create embeddings, and from these I then calculate a cosine similarity score; after all, the model I have selected is geared towards cosine similarity (as opposed to dot-product similarity).
My question is: do you always embed the entire string as it is, or would/could you clean the two strings before encoding them? Strip stopwords, maybe keep only nouns or entities. Is this a thing, or would the disjointed, non-grammatical strings that result hurt the embeddings?
from sentence_transformers import SentenceTransformer, util

model_name = 'sentence-transformers/multi-qa-mpnet-base-cos-v1'
model = SentenceTransformer(model_name)

phrase1 = 'Some arbitrarily long string from a book'
phrase2 = 'This is another arbitrarily long string from the same book'

emb1 = model.encode(phrase1)
emb2 = model.encode(phrase2)

# cosine similarity between the two embeddings (this is what util is imported for)
score = util.cos_sim(emb1, emb2)
The cosine similarity scores I get are not spread out very well: there isn't enough separation between good matches and bad matches.

Since you are using sentence embeddings, encoding the whole sentence makes more sense. The model was trained on natural, complete sentences, so feeding it stopword-stripped or noun-only fragments gives it input unlike what it saw during training and is unlikely to help.
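If you want to check empirically whether cleaning changes anything for your data, you can score a few pairs both raw and with stopwords stripped and compare. A minimal sketch, assuming the same model as in your question (the tiny stopword list and the strip_stopwords helper are purely illustrative; in practice you might use NLTK or spaCy):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')

# illustrative stopword list; swap in a proper list (NLTK, spaCy) for real use
STOPWORDS = {'a', 'an', 'the', 'is', 'this', 'from', 'some', 'same'}

def strip_stopwords(text):
    # drop tokens that appear in the stopword list, keep the rest in order
    return ' '.join(tok for tok in text.split() if tok.lower() not in STOPWORDS)

pairs = [
    ('Some arbitrarily long string from a book',
     'This is another arbitrarily long string from the same book'),
    # add known-good and known-bad pairs from your own data here
]

for p1, p2 in pairs:
    raw = util.cos_sim(model.encode(p1), model.encode(p2)).item()
    cleaned = util.cos_sim(model.encode(strip_stopwords(p1)),
                           model.encode(strip_stopwords(p2))).item()
    print(f'raw: {raw:.3f}   cleaned: {cleaned:.3f}')

Running this on a handful of your own known-good and known-bad pairs is the quickest way to see whether cleaning actually buys you any separation.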
An alternative approach to increase separation: if you have an idea of the categories most of your texts fall into, you can use a zero-shot classifier to score each text against every category, then compare texts by those category scores. You can keep refining the categories iteratively, somewhat like a semi-supervised approach.
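A rough sketch of that idea with the Hugging Face zero-shot pipeline; the model name and the candidate categories below are placeholders you would replace and refine for your own texts:

from transformers import pipeline

# zero-shot classifier; facebook/bart-large-mnli is a common choice for English
classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')

# placeholder categories -- refine these as you inspect the scores
categories = ['romance', 'adventure', 'politics', 'science']

texts = [
    'Some arbitrarily long string from a book',
    'This is another arbitrarily long string from the same book',
]

for text in texts:
    result = classifier(text, candidate_labels=categories)
    # 'labels' and 'scores' come back sorted by descending score
    print(text)
    for label, score in zip(result['labels'], result['scores']):
        print(f'  {label}: {score:.3f}')

You can then compare two texts by the similarity of their category-score vectors instead of (or in addition to) their raw sentence embeddings, which can give you more separation along the axes you actually care about.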