I am currently using Sentence Transformers to measure the similarity between two sentences, and I have labelled data where each pair is marked 1 or 0 (similar, not similar). After training my own model, I can evaluate its performance on a dev/test dataset as shown below.
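from sentence_transformers import InputExample, SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

# df, bi_encoder_model_save_path and cwd are defined elsewhere
dev_samples = []
for index, row in df.iterrows():
    # One sentence pair plus its 0/1 label per row
    input_example = InputExample(texts=[row['Sent_1'], row['Sent_2']], label=row['Label_0_1'])
    dev_samples.append(input_example)

model = SentenceTransformer(bi_encoder_model_save_path)
dev_evaluator = BinaryClassificationEvaluator.from_input_examples(dev_samples, name="dev_sample")
# CSV file with performance result is exported in the model folder
dev_evaluator(model, output_path=f'''{cwd}''')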
In the above example, I am using BinaryClassificationEvaluator.from_input_examples(dev_samples, name="dev_sample").
The result is saved to a CSV file with specific columns such as:

How can I identify or extract the misclassified labels into a confusion matrix? Should I compute a score for each input example using encode() and treat a score > 50 as similar (1) and anything lower as not similar (0)?
https://www.sbert.net/docs/usage/semantic_textual_similarity.html
From what I understand from "Semantic Textual Similarity", you would need to compare the predicted labels against the true labels for each example in your development (dev) dataset.
The BinaryClassificationEvaluator does not directly output misclassified examples (as illustrated by UKPLab/sentence-transformers issue 1516), so you will need to calculate the predictions manually using the encode() method and then compare these predictions to the true labels.
Use encode() to embed both sentences in each InputExample. Note that encode() only returns the embeddings; the similarity score is computed afterwards, for example as the cosine similarity between the two embeddings.
Based on the similarity score, determine the predicted label. You mentioned using a threshold of 50% (or 0.5 on a normalized [0, 1] scale, if your similarity scores are normalized). Scores above this threshold can be considered similar (1), and scores below it, not similar (0).
For each InputExample, compare the predicted label to the true label (the label attribute of InputExample).
Based on the comparison, tally the true positives, false positives, true negatives, and false negatives to construct a confusion matrix.
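The thresholding and tallying steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the toy vectors stand in for what model.encode() would return for Sent_1 and Sent_2, the helper names cosine_similarity and confusion_from_scores are hypothetical (not part of sentence-transformers), and the 0.5 threshold is simply the value discussed above, not a tuned one.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def confusion_from_scores(scores, true_labels, threshold=0.5):
    """Threshold the scores, tally the confusion matrix cells, and
    collect (index, true_label, predicted_label) for every mistake."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    misclassified = []
    for i, (score, label) in enumerate(zip(scores, true_labels)):
        predicted = 1 if score > threshold else 0
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        elif predicted == 0 and label == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
        if predicted != label:
            misclassified.append((i, label, predicted))
    return counts, misclassified


# Assumption: toy 2-D vectors in place of real sentence embeddings.
emb_sent_1 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb_sent_2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
labels = [1, 0, 0, 1]

scores = [cosine_similarity(a, b) for a, b in zip(emb_sent_1, emb_sent_2)]
counts, misclassified = confusion_from_scores(scores, labels, threshold=0.5)
print(counts)
print(misclassified)
```

The indices in misclassified can be used to look up the original rows of the dev DataFrame, which is one way to inspect the sentence pairs the model got wrong. Since cosine similarity ranges over [-1, 1] rather than [0, 1], a fixed 0.5 cut-off may not suit every model, so it is probably worth trying a few thresholds.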
As an example: