I am in the process of creating a custom dataset to benchmark the accuracy of the 'bert-large-uncased-whole-word-masking-finetuned-squad' model for my domain, to understand if I need to fine-tune further, etc.
When looking at the different question answering datasets on the Hugging Face Hub (squad, adversarial_qa, etc.), I see that the answer is commonly formatted as a dictionary with two keys: text (the answer string) and answer_start (the character index where the answer starts in the context).
I'm trying to understand:
- The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
- If I need to go through the process of adding this to my custom dataset (e.g., to make it easier to run existing model evaluation code)
- If so, is there a programmatic way to do this to avoid manual effort?
Any help or direction would be greatly appreciated!
Code example to show format:
import datasets

# Load SQuAD and inspect the answer format of the first training example
ds = datasets.load_dataset('squad')
train = ds['train']
print('Example:\n')
print(train['answers'][0])
Your question is a bit too broad to give a specific answer, but I will try my best to point you in some directions.
There are different types of QA tasks/datasets. The ones you mentioned (SQuAD and adversarial_qa) belong to the field of extractive question answering, where a model must select a span from a given context that answers the given question. For example:
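Context: 'The Eiffel Tower is located in Paris, France.'
Question: 'Where is the Eiffel Tower located?'
Answer span: 'Paris' (starting at character index 31 of the context)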
A simple approach that is often used today is a linear layer that predicts the answer start and answer end from the last hidden state of a transformer encoder (code example). The last hidden state holds one vector for each input token (token != word), and the linear layer is trained to assign high probabilities to tokens that could potentially be the start and end of the answer span. To train a model on your data, the loss function needs to know which tokens should get a high probability (i.e. the start and end tokens of the answer); a short sketch of that char-to-token mapping follows.
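Here is a minimal sketch of how the character-level answer_start gets mapped to token-level labels using a fast tokenizer's offset mapping; the question, context, and answer strings below are made up for illustration:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'bert-large-uncased-whole-word-masking-finetuned-squad')

# Made-up example; in practice these come from your dataset
question = 'Where is the Eiffel Tower located?'
context = 'The Eiffel Tower is located in Paris, France.'
answer_text = 'Paris'
answer_start = context.find(answer_text)      # char index, as in SQuAD
answer_end = answer_start + len(answer_text)  # one past the last answer char

# Fast tokenizers can return the char span covered by every token
enc = tokenizer(question, context, return_offsets_mapping=True)

# sequence_ids() marks which tokens belong to the context (id == 1)
start_token = end_token = None
for i, (seq_id, (start, end)) in enumerate(
        zip(enc.sequence_ids(), enc['offset_mapping'])):
    if seq_id != 1:
        continue
    if start <= answer_start < end:
        start_token = i
    if start < answer_end <= end:
        end_token = i

print(start_token, end_token)  # token-level labels for the start/end loss

These token indices are exactly what the *ForQuestionAnswering models in transformers expect as start_positions and end_positions; when you pass them along with the inputs, the model returns a cross-entropy loss over the start and end logits.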
You should go through this process; otherwise, how would anyone know where the answer starts in your context? They could of course infer it programmatically, but what if your answer string appears twice in the context? Providing an answer start position avoids that ambiguity and allows your users to use your dataset right away with one of the many extractive question answering scripts that are already available out there.
You could simply loop through your dataset and use str.find:
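A minimal sketch, assuming each record carries context and answer_text fields (the field names here are made up; adapt them to your schema):

# Hypothetical records; replace with your own data
my_dataset = [
    {'context': 'The Eiffel Tower is located in Paris, France.',
     'question': 'Where is the Eiffel Tower located?',
     'answer_text': 'Paris'},
]

for example in my_dataset:
    # str.find returns the char index of the first match (-1 if not found)
    start = example['context'].find(example['answer_text'])
    example['answers'] = {'text': [example['answer_text']],
                          'answer_start': [start]}

print(my_dataset[0]['answers'])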
Output:
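{'text': ['Paris'], 'answer_start': [31]}

Note that str.find only returns the first occurrence, so it is worth manually checking examples where the answer string appears more than once in the context (the ambiguity mentioned above).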