I am new to machine learning, so maybe I have completely overlooked something, but I am trying to fine-tune the DPR models from the Huggingface transformers library using a dataset I am building (https://huggingface.co/docs/transformers/model_doc/dpr). In the documentation on the Huggingface website, the area explaining how the model expects to be fed data is blank. How should I format my question / answer pairs to train the model? I am using PyTorch.
I know DPR uses in-batch negatives, but some resources I have found suggest manually writing negatives and hard negatives, and other resources say the model automatically pulls negatives from other positive pairs in the batch. I can't find which is the case.
I read the documentation on huggingface.co (see above). The explanation section is blank, and there are no examples that I could find.
I then went through the GitHub page (https://github.com/facebookresearch/DPR). The README has a section on retriever data formatting. I am skeptical of this for a few reasons: 1) Huggingface calls all of its models context encoders and question encoders, not retrievers, so I am not sure they are referencing the same models. 2) Providing every question its own set of negative answers seems computationally inefficient and doesn't allow for effective batching. 3) The model expects a JSON file? Do we constantly have to write and read JSON files during training? I have loaded the pretrained model and computed embeddings without any JSON files, so that doesn't seem to track. I started reading all of the .py files, trying to work out the actual data-preprocessing format, and quickly got lost.
I have read the original DPR paper, but they are training their own model, with their own training data, and that model is different from the one on Huggingface.
The DPR model is two components: a question encoder and a context encoder. Both map text into the same embedding space, and during training you give the model paired datapoints so it can learn "this passage aligns with the question, this one is incorrect, and this one is extremely incorrect" — pulling each question's embedding toward its positive passage and away from the negatives.
You can build your data by creating a list of dictionaries:
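The Huggingface encoder classes themselves only ever see the raw strings you tokenize, so the exact field names are up to you. A minimal sketch, assuming field names borrowed from the facebookresearch/DPR convention (`question`, `positive_ctxs`, `hard_negative_ctxs`):

```python
# Each dict pairs a question with a positive passage and, optionally,
# hard negatives. Field names follow the facebookresearch/DPR convention
# but are an assumption -- you can name them whatever you like.
train_data = [
    {
        "question": "What is the capital of France?",
        "positive_ctxs": ["Paris is the capital and largest city of France."],
        # Hard negatives are optional; with in-batch negatives you can
        # start training without any.
        "hard_negative_ctxs": ["Lyon is the third-largest city of France."],
    },
    {
        "question": "Who wrote Hamlet?",
        "positive_ctxs": ["Hamlet is a tragedy written by William Shakespeare."],
        "hard_negative_ctxs": ["Macbeth is a tragedy by William Shakespeare."],
    },
]

# Pull out parallel lists so that passage i is the positive for question i.
questions = [d["question"] for d in train_data]
positives = [d["positive_ctxs"][0] for d in train_data]
```

Keeping questions and positives index-aligned matters later, because the in-batch-negative loss assumes passage i is the correct answer for question i.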
Then you can encode them like this:
(Depending on your setup you may also need to move the models and tensors to a device such as a GPU.)
Then you can forward pass through the training loop:
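This is also where your in-batch-negatives question is resolved: the model does not pull negatives automatically — in-batch negatives come from how you compute the loss. Within a batch, passage i is the positive for question i, and every other passage in the batch serves as a negative for free. A sketch of that loss, using random tensors as stand-ins for the encoder outputs so it runs on its own:

```python
import torch
import torch.nn.functional as F

# Stand-ins for the encoder outputs: in real training these would be
# q_enc(**q_inputs).pooler_output and ctx_enc(**ctx_inputs).pooler_output,
# with row i of ctx_emb being the positive passage for question i.
batch_size, dim = 4, 768
q_emb = torch.randn(batch_size, dim, requires_grad=True)
ctx_emb = torch.randn(batch_size, dim, requires_grad=True)

# Similarity of every question with every passage in the batch.
# Row i's "correct" column is i; the other columns act as negatives.
scores = q_emb @ ctx_emb.T              # shape: (batch, batch)
labels = torch.arange(batch_size)       # positives lie on the diagonal
loss = F.cross_entropy(scores, labels)  # negative log-likelihood of positives
loss.backward()                         # gradients flow back to both encoders
```

In a real loop you would wrap this in an optimizer step over your batches. If you do have hard negatives in your data, you can encode them too and append them as extra columns of `scores` — that is exactly what the facebookresearch/DPR JSON format feeds in, so neither of the resources you found is wrong: the JSON file supplies hard negatives, while the in-batch negatives come from the loss, not from the file.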
Hope this helps!