How to map token indices from the SQuAD data to tokens from BERT tokenizer?


I am using the SQuAD dataset for answer span selection. After tokenizing the passages with BertTokenizer, for some samples the answer's start and end indices no longer match the actual answer span position in the passage tokens. How can I solve this? One option would be to modify the answer indices (which are also the training targets) accordingly, but how can that be done?

The tokenization in the original dataset is different from how BERT tokenizes the input. In BERT, less frequent words get split into subword units. You can easily find out the character offsets of the tokens in the original dataset.

In newer versions of Transformers, the fast tokenizers (e.g., BertTokenizerFast) have a return_offsets_mapping option. If it is set to True, the output additionally contains, for every token, a tuple (char_start, char_end) with the token's character offsets in the input string. If you have the character offsets of the answer in the original text, you can match them against this output.

from transformers import BertTokenizerFast
tok = BertTokenizerFast.from_pretrained("bert-base-cased")
tok("I am a tokenizer.", return_offsets_mapping=True)

The output:

{'input_ids': [101, 146, 1821, 170, 22559, 17260, 119, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1],
 'offset_mapping': [(0, 0), (0, 1), (2, 4), (5, 6), (7, 12), (12, 16), (16, 17), (0, 0)]}

The (0, 0) spans correspond to special tokens, in the case of BERT [CLS] and [SEP].

Once you have character offsets for both the original answer span and the BERT tokens, you can work out which indices the span occupies in the re-tokenized sequence.
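
A minimal sketch of that last step, assuming a SQuAD-style example where answer_start is a character offset into the passage (char_span_to_token_span is a hypothetical helper name, not part of Transformers):

from transformers import BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-cased")

def char_span_to_token_span(context, answer_text, answer_start):
    # SQuAD stores answer_start as a character offset; the span ends
    # at answer_start + len(answer_text).
    answer_end = answer_start + len(answer_text)
    offsets = tok(context, return_offsets_mapping=True)["offset_mapping"]
    token_start = token_end = None
    for i, (char_start, char_end) in enumerate(offsets):
        if char_start == char_end == 0:
            continue  # (0, 0) marks a special token such as [CLS] or [SEP]
        if token_start is None and char_start <= answer_start < char_end:
            token_start = i
        if char_start < answer_end <= char_end:
            token_end = i
    return token_start, token_end

context = "BERT splits less frequent words into subword units."
answer = "subword units"
print(char_span_to_token_span(context, answer, context.index(answer)))

The returned indices already account for the [CLS] token, so they can serve directly as start/end training targets over the model's input sequence; real training code would additionally need to handle truncation and answers that fall outside the truncated window.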