How to extract the misclassified labels from evaluating the performance of the model (BinaryClassificationEvaluator)?


I am currently using sentence transformers to measure the similarity between two sentences, and I have labelled data where each pair is marked 1 or 0 (similar, not similar). After training my own model, I can evaluate its performance on a dev/test dataset as shown below.

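from sentence_transformers import InputExample, SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

# df, bi_encoder_model_save_path and cwd are defined elsewhere
# Build one InputExample per labelled sentence pair
dev_samples = []
for index, row in df.iterrows():
    input_example = InputExample(texts=[row['Sent_1'], row['Sent_2']], label=row['Label_0_1'])
    dev_samples.append(input_example)

model = SentenceTransformer(bi_encoder_model_save_path)
dev_evaluator = BinaryClassificationEvaluator.from_input_examples(dev_samples, name="dev_sample")

# CSV file with performance result is exported in the model folder
dev_evaluator(model, output_path=str(cwd))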

In the above example, I am using BinaryClassificationEvaluator.from_input_examples(dev_samples, name="dev_sample")

The result is saved to a CSV file with specific columns, such as: [image: Result from evaluation]

How can I identify or extract the misclassified examples and put them into a confusion matrix? Should I compute a score for each input example using encode(), then treat the pair as similar (1) if the score is above 50% and as not similar (0) otherwise?

https://www.sbert.net/docs/usage/semantic_textual_similarity.html

Github Code

Answer by VonC (best answer)

From what I understand from "Semantic Textual Similarity", you would need to compare the predicted labels against the true labels for each example in your development (dev) dataset.
The BinaryClassificationEvaluator does not directly output misclassified examples (as illustrated by UKPLab/sentence-transformers issue 1516), so you will need to manually calculate the predictions using the encode() method and then compare these predictions to the true labels.

  • The encode() method only produces embeddings; it does not score anything itself. Encode both sentences in each InputExample, then compute the similarity score yourself, for example as the cosine similarity between the two embeddings.

  • Based on the similarity score, determine the predicted label. You mentioned using a threshold of 50% (0.5 if your scores are on a [0, 1] scale). Note that raw cosine similarity ranges over [-1, 1], so 0.5 is only a starting point and the cutoff is worth tuning on your dev set. Pairs scoring at or above the threshold are predicted similar (1), the rest not similar (0).

  • For each InputExample, compare the predicted label to the true label (the label attribute of InputExample).
    Based on the comparison, tally the true positives, false positives, true negatives, and false negatives to construct a confusion matrix.
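
The fixed 0.5 cutoff in the second step is a judgment call. Here is a minimal sketch of picking the cutoff from the dev scores instead, assuming you already have one cosine score per pair. The helper name `best_accuracy_threshold` and the toy numbers are hypothetical and not part of sentence-transformers:

```python
import numpy as np

def best_accuracy_threshold(scores, labels):
    """Return (threshold, accuracy) for the rule 'score >= threshold -> label 1'."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Candidate cutoffs: midpoints between neighbouring distinct scores,
    # plus one value below the minimum and one above the maximum.
    uniq = np.unique(scores)
    candidates = np.concatenate(
        ([uniq[0] - 1e-6], (uniq[:-1] + uniq[1:]) / 2, [uniq[-1] + 1e-6])
    )
    best_t, best_acc = float(candidates[0]), -1.0
    for t in candidates:
        acc = float(np.mean((scores >= t).astype(int) == labels))
        if acc > best_acc:  # ties keep the lowest threshold
            best_t, best_acc = float(t), acc
    return best_t, best_acc

# Toy scores and labels standing in for real dev-set results
toy_scores = [0.91, 0.72, 0.65, 0.40, 0.15]
toy_labels = [1, 1, 0, 1, 0]
threshold, accuracy = best_accuracy_threshold(toy_scores, toy_labels)
print(f"threshold={threshold:.3f} accuracy={accuracy:.2f}")
```

A cutoff tuned this way is fitted to the dev set, so it is best checked on a separate test split before you rely on it.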

As an example:

from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load the trained model (bi_encoder_model_save_path and dev_samples come from the question)
model = SentenceTransformer(bi_encoder_model_save_path)

# Encode each sentence pair and score it with cosine similarity
predictions = []
for example in dev_samples:
    embeddings = model.encode(example.texts)
    # pytorch_cos_sim returns a 1x1 tensor; .item() extracts the plain float
    similarity_score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).item()
    predicted_label = 1 if similarity_score >= 0.5 else 0
    predictions.append((predicted_label, example.label))

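# Construct a confusion matrix, laid out here as [[TP, FP], [FN, TN]]
# (note: scikit-learn's confusion_matrix uses [[TN, FP], [FN, TP]] instead)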
true_positives = sum(1 for pred, true in predictions if pred == true == 1)
false_positives = sum(1 for pred, true in predictions if pred == 1 and true == 0)
true_negatives = sum(1 for pred, true in predictions if pred == true == 0)
false_negatives = sum(1 for pred, true in predictions if pred == 0 and true == 1)

confusion_matrix = np.array([[true_positives, false_positives],
                             [false_negatives, true_negatives]])

print("Confusion Matrix:")
print(confusion_matrix)
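
The snippet above only counts outcomes. To see which pairs were misclassified, you can keep their indices. This is a minimal sketch under the assumption that each side of every pair has been embedded in one batch. The toy 2-d arrays stand in for the output of `model.encode(...)`, and `paired_cosine` and `split_errors` are hypothetical helper names:

```python
import numpy as np

def paired_cosine(emb_a, emb_b):
    """Row-wise cosine similarity between two (n, d) embedding arrays."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

def split_errors(scores, labels, threshold=0.5):
    """Return (false_positive_indices, false_negative_indices)."""
    preds = (np.asarray(scores, dtype=float) >= threshold).astype(int)
    labels = np.asarray(labels, dtype=int)
    false_pos = np.flatnonzero((preds == 1) & (labels == 0))
    false_neg = np.flatnonzero((preds == 0) & (labels == 1))
    return false_pos.tolist(), false_neg.tolist()

# Toy embeddings standing in for model.encode(list_of_sentences)
emb_1 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
emb_2 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
toy_labels = [1, 1, 0, 1]

scores = paired_cosine(emb_1, emb_2)
false_pos, false_neg = split_errors(scores, toy_labels)
print("false positives at rows:", false_pos)
print("false negatives at rows:", false_neg)
```

With real data, `emb_1` would be `model.encode([ex.texts[0] for ex in dev_samples])` and `emb_2` the same for `ex.texts[1]`. Each returned index `i` then maps back to `dev_samples[i].texts`, so you can print or export the offending sentence pairs.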