Getting a similarity score with spaCy and a transformer model


I've been using spaCy's en_core_web_lg and wanted to try out en_core_web_trf (the transformer model), but I'm having some trouble wrapping my head around the difference in model/pipeline usage.

My use case looks like the following:

import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_trf")

s1 = nlp("Running for president is probably hard.")
s2 = nlp("Space aliens lurk in the night time.")
s1.similarity(s2)

Output:

The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements.
(0.0, Space aliens lurk in the night time.)

Looking at this post, the transformer model does not have word vectors in the same way en_core_web_lg does, but you can get the embeddings via s1._.trf_data.tensors, which look like:

s1._.trf_data.tensors[0].shape
(1, 9, 768)
s1._.trf_data.tensors[1].shape
(1, 768)

So I tried to manually take the cosine similarity (using this post as ref):

from scipy.spatial.distance import cosine

def similarity(obj1, obj2):
    (v1, t1), (v2, t2) = obj1._.trf_data.tensors, obj2._.trf_data.tensors
    try:
        # cosine() returns a distance, so 1 - cosine() gives the similarity
        return ((1 - cosine(v1, v2)) + (1 - cosine(t1, t2))) / 2
    except ValueError:
        return 0.0

But this does not work.
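For context (my illustration, not from the original post): scipy's cosine only accepts 1-D vectors, so passing the 3-D token tensors raises a ValueError, which the try/except then silently turns into 0.0. A minimal sketch with dummy arrays standing in for the trf_data tensors:

```python
import numpy as np
from scipy.spatial.distance import cosine

# Dummy stand-ins for the token-level tensors: (1, n_tokens, 768)
v1 = np.random.rand(1, 9, 768)
v2 = np.random.rand(1, 11, 768)

try:
    cosine(v1, v2)
except ValueError as e:
    # cosine() rejects anything that isn't a 1-D vector
    print("cosine() rejects the 3-D input:", e)
```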

As @polm23 mentioned, using sentence-transformers is a better approach to get sentence similarity.

First install the package: pip install sentence-transformers

Then use this code:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["Running for president is probably hard.", "Space aliens lurk in the night time."]

embedded_list = model.encode(sentences)

similarity = cos_sim(embedded_list[0], embedded_list[1])
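Under the hood, cos_sim computes the standard cosine similarity (dot product divided by the product of the vector norms). A minimal numpy sketch of the same computation, for reference (the helper name cosine_similarity is mine, not part of either library):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    u, v = np.asarray(u), np.asarray(v)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Identical vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```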

But if you are determined to use spaCy for sentence similarity, be aware that the reason your code does not work is that v1 and v2 don't have the same shape, as you can see:

  • s1._.trf_data.tensors[0].shape --> (1, 9, 768)
  • s2._.trf_data.tensors[0].shape --> (1, 11, 768)

So it's not possible to take the cosine similarity of these two arrays directly.
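One common workaround (my suggestion, not from the original answer) is mean pooling: average the token-level vectors along the token axis so each sentence is reduced to a single 768-dimensional vector before comparing. A sketch with dummy numpy arrays standing in for trf_data.tensors[0]:

```python
import numpy as np

# Dummy stand-ins for s1._.trf_data.tensors[0] and s2._.trf_data.tensors[0]:
# different token counts (9 vs 11), same hidden size (768).
t1 = np.random.rand(1, 9, 768)
t2 = np.random.rand(1, 11, 768)

# Mean-pool over the token axis -> one (768,) vector per sentence.
v1 = t1.mean(axis=1).squeeze()
v2 = t2.mean(axis=1).squeeze()

# Now both vectors have the same shape and cosine similarity is defined.
sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(v1.shape, v2.shape, sim)
```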

s1._.trf_data.tensors is a tuple consisting of two arrays:

  • s1._.trf_data.tensors[0] is an array of shape (1, 9, 768), i.e. one 768-dimensional vector for each of the 9 tokens.
  • s1._.trf_data.tensors[1] is an array of shape (1, 768) representing the whole sentence.

So you can get similarity from the sentence-level tensors as follows (note that scipy's cosine is a distance, so subtract it from 1, and the (1, 768) arrays should be flattened to 1-D first):

from scipy.spatial.distance import cosine
similarity = 1 - cosine(s1._.trf_data.tensors[1][0], s2._.trf_data.tensors[1][0])