Fine-tune a Sentence Transformer with single-sentence and label data


I am trying to fine-tune a sentence transformer model. My data contains the following columns:

  1. raw_text - the raw chunks of text
  2. label - the corresponding label for the text: True or False (1 or 0)

I want to fine-tune a sentence transformer model so that the embeddings of the True sentences are closer to each other in the vector space than they are to the False sentences.
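To make that goal measurable, here is a minimal sketch of one possible check: the mean cosine similarity among True embeddings minus the mean cosine similarity between True and False embeddings. The helper names `cosine` and `similarity_gap` and the toy vectors are hypothetical, made up for illustration; in practice the vectors would come from the model's embeddings.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_gap(embeddings, labels):
    """Mean True-True similarity minus mean True-False similarity.

    Assumes at least two True (1) examples and one False (0) example.
    """
    pos = [e for e, l in zip(embeddings, labels) if l == 1]
    neg = [e for e, l in zip(embeddings, labels) if l == 0]
    within = [cosine(u, v) for u, v in combinations(pos, 2)]
    across = [cosine(u, v) for u, v in product(pos, neg)]
    return sum(within) / len(within) - sum(across) / len(across)


# Toy 2-D vectors standing in for real sentence embeddings (illustrative only)
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_labels = [1, 1, 0, 0]
gap = similarity_gap(toy_embeddings, toy_labels)
print(f"similarity gap: {gap:.3f}")
```

A larger positive gap after fine-tuning than before would suggest the True sentences have moved closer together relative to the False ones.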

I have been reading about the losses from Loss Overview — Sentence-Transformers documentation

I am really confused about which loss to use for my type of data and use case. I am leaning towards the losses below:

[image not available]

since they match my data format. But as I read more about these losses and how they are computed from anchor, positive, and negative samples, I feel less confident about using them, since my data does not contain such pairs.
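For context, the batch triplet losses such as `BatchAllTripletLoss` do not need explicit pairs: as documented, they build triplets inside each batch from the class labels, where the anchor and positive share a label and the negative has a different one. The sketch below only illustrates that selection rule in plain Python. It is not the library's implementation, and `valid_triplets` is a hypothetical helper.

```python
from itertools import permutations


def valid_triplets(labels):
    """Return (anchor, positive, negative) index triplets implied by class labels.

    Anchor and positive are distinct items with the same label;
    the negative is any item with a different label.
    """
    triplets = []
    for a, p in permutations(range(len(labels)), 2):
        if labels[a] != labels[p]:
            continue
        for n in range(len(labels)):
            if labels[n] != labels[a]:
                triplets.append((a, p, n))
    return triplets


batch_labels = [1, 1, 0, 0]  # a toy batch of four single-sentence examples
triplets = valid_triplets(batch_labels)
print(len(triplets), triplets[:3])

# A batch that contains only one class yields no triplets at all
print(valid_triplets([1, 1, 1]))
```

Two things follow from this rule. With only two classes, such a loss pulls the False sentences together as well as the True ones. And a batch that happens to contain a single class contributes no triplets, which may matter with small batches and imbalanced labels.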

Can someone help me understand whether what I am trying to do is feasible with the existing losses in the Sentence Transformers library?

Below is my code so far, which works:

from sentence_transformers import SentenceTransformer, InputExample, SentencesDataset, losses
from torch.utils.data import DataLoader
import pandas as pd

# Load a pre-trained Sentence Transformer model
# model = SentenceTransformer('stsb-roberta-base')  # Hugging Face says this model produces low-quality embeddings
model = SentenceTransformer('all-mpnet-base-v2')

# Assume 'train_data' is a DataFrame with 'page_raw_text' and 'label' columns
data = pd.DataFrame({'text': train_data['page_raw_text'], 'label': train_data['label']})

# Create one InputExample per sentence; cast the True/False label to an integer class id
examples = [InputExample(texts=[txt], label=int(label)) for txt, label in zip(data['text'], data['label'])]

# Create a DataLoader object and a loss model
train_dataset = SentencesDataset(examples=examples, model=model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)
train_loss = losses.BatchAllTripletLoss(model=model)

# Define the training arguments
num_epochs = 10
evaluation_steps = 1  # only relevant when an evaluator is passed to fit()

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=num_epochs, evaluation_steps=evaluation_steps)
