AIML - Semantic Text similarity NLP


I'm new to AIML. I'm working on a requirement where I need to check whether two sentences are semantically similar. I'm looking for an API or existing NLP service that compares a given sentence against an array of sentences and returns the best-matching one from that array. If there is a proven algorithm or published NLP implementation for this, please let me know.

Thanks in advance.

Answer by Sahar Millis:

There are many ways to check sentence similarity. Unfortunately, there is no single "right way": the best choice depends on the context, the data, the domain, and your preferences.
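Before reaching for heavier tools, a quick lexical baseline built only on Python's standard-library `difflib` can already rank an array of candidates against a query. This is a minimal sketch (the query and candidate sentences are made up for illustration); it measures character overlap, not meaning:

```python
import difflib

query = "my name is John"
candidates = ['I am John', 'my name is Jane', 'your name is John', 'John is my name']

# SequenceMatcher.ratio() returns a 0..1 character-overlap score -- purely lexical, no semantics
scores = [difflib.SequenceMatcher(None, query.lower(), c.lower()).ratio() for c in candidates]

# Pick the candidate with the highest score
best = max(zip(scores, candidates))[1]
print("Best lexical match:", best)
```

This will happily rank "my name is Jane" near the top even though it means something different, which is exactly why the semantic methods below exist.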

For example, you can check textual similarity with ROUGE.

# pip install rouge
from rouge import Rouge

# Initialize Rouge
rouge = Rouge()

# Define reference and hypothesis texts in German
reference = "Das ist ein Beispieltext für die Berechnung des Rouge-Scores."
hypothesis = "Dies ist ein Beispieltext zur Berechnung des Rouge-Scores."

# Calculate Rouge scores
scores = rouge.get_scores(hypothesis, reference)

# Print the scores
print(scores)
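`get_scores` returns a list with one dict of ROUGE variants per hypothesis/reference pair. Here is a sketch of pulling out the ROUGE-L F1; the result dict is hard-coded with illustrative values so the snippet runs standalone, but its shape matches what the `rouge` package returns:

```python
# Hard-coded example of the structure rouge.get_scores returns (values are illustrative)
scores = [{
    'rouge-1': {'r': 0.70, 'p': 0.78, 'f': 0.74},
    'rouge-2': {'r': 0.55, 'p': 0.60, 'f': 0.57},
    'rouge-l': {'r': 0.70, 'p': 0.78, 'f': 0.74},
}]

# F1 of the longest-common-subsequence variant is a common single-number summary
rouge_l_f1 = scores[0]['rouge-l']['f']
print(f"ROUGE-L F1: {rouge_l_f1:.2f}")  # -> ROUGE-L F1: 0.74
```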

Another way is to use Meteor.

import nltk
nltk.download('punkt')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

# Define reference and hypothesis texts
reference = "Hello my name is John"
hypothesis = "My name is not John"

# Tokenize the reference and hypothesis texts
reference_tokens = word_tokenize(reference)
hypothesis_tokens = word_tokenize(hypothesis)

# Calculate Meteor score
meteor = meteor_score([reference_tokens], hypothesis_tokens)

# Print the Meteor score
print("Meteor score:", meteor)

A more complex approach is text similarity based on a Large Language Model. While LLMs are heavy, they are great at understanding the context of natural language. For example:

!pip install -q sentence_transformers

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('bert-base-multilingual-cased')

sentence = 'my name is John'
all_sentences = ['I am John', 'my name is Jane', 'your name is John', 'what is your name?', 'John is my name']

# Compute the BERT embeddings for the sentences
sentence_embeddings = model.encode([sentence] + all_sentences)

# Compute the cosine similarity between the sentence and all other sentences
similarity_scores = cosine_similarity(sentence_embeddings)[0][1:]

# Print or take the max score 
for i, score in enumerate(similarity_scores):
    print(f"Similarity score for sentence {i+1}: {score:.4f}")
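Since the original requirement is to return the best-matching sentence from the array, the last step is just an argmax over those scores. A minimal sketch, using hypothetical score values in place of the model output so it runs standalone:

```python
import numpy as np

all_sentences = ['I am John', 'my name is Jane', 'your name is John', 'what is your name?', 'John is my name']

# Hypothetical cosine-similarity scores, one per sentence in all_sentences
similarity_scores = np.array([0.81, 0.74, 0.66, 0.42, 0.88])

# Index of the highest-scoring candidate
best_idx = int(np.argmax(similarity_scores))
print("Best match:", all_sentences[best_idx])  # -> Best match: John is my name
```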

If your NLP knowledge is limited, you can also lean on a managed AWS/GCP service for document similarity, or on a framework such as LangChain.