How should data be formatted to train the Huggingface DPR model?

I am new to machine learning, so maybe I have completely overlooked something, but I am trying to fine-tune the DPR models from the Hugging Face transformers library using a dataset I am building (https://huggingface.co/docs/transformers/model_doc/dpr). In the documentation on the Hugging Face website, the section explaining how the model expects to be fed data is blank. How should I format my question/answer pairs to train the model? I am using PyTorch.

I know DPR uses in-batch negatives, but some resources I have found suggest manually writing negatives and hard negatives, while other resources say the model automatically pulls negatives from the other positive pairs in the batch. I can't tell which is the case.

I read the documentation from Hugging Face (linked above). The explanation section is blank, and I could not find any examples.

I then went through the GitHub page (https://github.com/facebookresearch/DPR). In the README there is a section on retriever data formatting. I am skeptical of this for a few reasons:

1) Hugging Face calls the models context encoders and question encoders, not retrievers, so I am not sure these are referencing the same models.
2) Providing every question its own set of negative answers seems computationally inefficient and doesn't allow for effective batching.
3) The model expects a JSON file? So during training we constantly have to write and read JSON files? I have loaded the pretrained model and computed embeddings without any JSON files, so that doesn't seem to track.

I started reading through all of the .py files, trying to parse the actual formatting aspect of data preprocessing, and quickly got lost.

I have read the original DPR paper, but the authors are training their own model, with their own training data, and that model is different from the one on Hugging Face.

1 Answer

Answered by Jess:

The DPR model consists of two components: a question encoder and a context encoder. For each question, you give the model a few different kinds of passages it can learn from: ones that align with "correct" (positives), ones that align with "incorrect" (negatives), and ones that look relevant but are still incorrect (hard negatives).
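At scoring time the two encoders' outputs are compared with a plain dot product. A quick sketch with random stand-in vectors (in practice these would come from the encoders' pooler_output):

import torch

# Stand-ins for real encoder outputs (both DPR encoders produce 768-dim vectors)
question_embedding = torch.randn(768)
passage_embedding = torch.randn(768)

# DPR's relevance score is just the dot product: higher = more relevant
score = torch.dot(question_embedding, passage_embedding)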

You can build your data by creating a list of dictionaries:

data = [
    {
        "question": "What was the largest dinosaur?",
        "answers": ["Argentinosaurus"],
        "positive_ctxs": [
            {
                "title": "Argentinosaurus",
                "text": "Argentinosaurus is a genus of titanosaur sauropod dinosaur first discovered by Guillermo Heredia in Argentina. The generic name refers to the country in which it was discovered. The dinosaur lived on the then-island continent of South America somewhere between 94 and 97 million years ago, during the Late Cretaceous Period. It is among the largest known dinosaurs."
            }
        ],
        "negative_ctxs": [
            {
                "title": "Tyrannosaurus",
                "text": "Tyrannosaurus is a genus of coelurosaurian theropod dinosaur. The species Tyrannosaurus rex (rex meaning 'king' in Latin), often called T. rex or colloquially T-Rex, is one of the most well-represented of the large theropods. Tyrannosaurus lived throughout what is now western North America, on what was then an island continent known as Laramidia."
            }
        ],
        "hard_negative_ctxs": [
            {
                "title": "Spinosaurus",
                "text": "Spinosaurus is a genus of theropod dinosaur that lived in what now is North Africa. Spinosaurus may be the largest of all known carnivorous dinosaurs, even larger than Tyrannosaurus and Giganotosaurus. Estimates published in 2005, 2007, and 2008 suggested that it was between 12.6–18 meters (41–59 ft) in length and 7 to 20.9 tonnes (7.7 to 23.0 short tons) in weight."
            }
        ]
    },
    # ... more examples in the same shape
]
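
For what it's worth, this is the same shape as the retriever training format described in the facebookresearch/DPR README, but you don't have to round-trip through JSON during training; the JSON file is just how that repo stores the dataset on disk. If you do want to persist your list, the standard library covers it (the file name here is arbitrary):

import json

# Save the examples to disk (optional)
with open("dpr_train_data.json", "w") as f:
    json.dump(data, f, indent=2)

# ...and load them back later
with open("dpr_train_data.json") as f:
    data = json.load(f)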

Then you can encode them like this:

# Imports (install transformers and torch first)
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer, DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from torch.utils.data import DataLoader, TensorDataset
import torch

# Initialize the tokenizers and models
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
question_model = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')

context_tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
context_model = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

# data is your list of dictionaries
questions = [item["question"] for item in data]
positive_contexts = [item["positive_ctxs"][0]["text"] for item in data]  # Just use the first positive context for each question

# Tokenize 
question_inputs = question_tokenizer(questions, return_tensors='pt', padding=True, truncation=True, max_length=512)
context_inputs = context_tokenizer(positive_contexts, return_tensors='pt', padding=True, truncation=True, max_length=512)

# Create a DataLoader; keep the attention masks so padding tokens are ignored
batch_size = 16  # Adjust for your hardware
dataset = TensorDataset(
    question_inputs['input_ids'], question_inputs['attention_mask'],
    context_inputs['input_ids'], context_inputs['attention_mask'],
)
dataloader = DataLoader(dataset, batch_size=batch_size)

Depending on your next steps you may also need to assign your device variable and move the models onto it; for example (adapt to your hardware):
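
# Pick a device; falls back to CPU if no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
question_model.to(device)
context_model.to(device)

# Put both models in training mode before the loop
question_model.train()
context_model.train()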

Then you can forward pass through the training loop:

for batch in dataloader:
    question_ids, question_mask, context_ids, context_mask = batch

    # Move the inputs to the device
    question_ids, question_mask = question_ids.to(device), question_mask.to(device)
    context_ids, context_mask = context_ids.to(device), context_mask.to(device)

    # Forward pass through the models
    question_outputs = question_model(input_ids=question_ids, attention_mask=question_mask)
    context_outputs = context_model(input_ids=context_ids, attention_mask=context_mask)

    # Compute the loss, backward pass, and optimizer step (see the sketch below)
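
To your question about negatives: with this setup you get in-batch negatives for free, as in the DPR paper. For each question, the positive contexts of the other questions in the batch act as its negatives. A sketch of what the loss computation inside the loop could look like (add import torch.nn.functional as F at the top; optimizer is assumed to be something like torch.optim.AdamW over both models' parameters):

    # Embeddings live in pooler_output for the HF DPR encoders
    q_emb = question_outputs.pooler_output   # (batch_size, 768)
    c_emb = context_outputs.pooler_output    # (batch_size, 768)

    # Similarity matrix: entry (i, j) scores question i against context j.
    # The diagonal holds each question's own positive; every off-diagonal
    # context acts as an in-batch negative.
    scores = torch.matmul(q_emb, c_emb.T)                       # (batch_size, batch_size)
    labels = torch.arange(scores.size(0), device=scores.device)

    loss = F.cross_entropy(scores, labels)  # NLL of the positive passage
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Manually written hard negatives (the hard_negative_ctxs above) are an optional extra on top of this: the DPR paper reports that adding one BM25-retrieved hard negative per question helps. You would encode those passages too and append them as extra columns of scores; the labels stay the same.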

Hope this helps!