How to use output from T5 model to replace masked tokens in input sequence

I'm working with the T5 model from the Hugging Face Transformers library, and I have an input sequence with masked tokens that I want to replace with the model's generated output. Here's the code:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_data = "The <extra_id_0> walks in <extra_id_1> park"
input_ids = tokenizer(input_data, return_tensors="pt").input_ids

sequence_ids = model.generate(input_ids)
output_sequences = tokenizer.batch_decode(sequence_ids)
output_sequences

This code produces the following output:

['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']

What I want to do is replace the masked tokens <extra_id_0> and <extra_id_1> in the input sequence with the corresponding output tokens from the model, so that the final output is:

The park offers walks in the park.

I'm hoping someone can help me with the code to achieve this.

Notice that this is the correspondence:

mask in input_data -> answer in output_sequences
<extra_id_0> -> <extra_id_0> park offers (so we extract 'park offers' only)
<extra_id_1> -> <extra_id_1> the  (so we extract 'the' only)

There is 1 best solution below

The T5 model treats tokens that begin with <extra_id as sentinel (mask) tokens. As written in the documentation:

"Each sentinel token represents a unique mask token for this sentence and should start with <extra_id_0>, <extra_id_1>, … up to <extra_id_99>"

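These sentinels are real entries in the T5 vocabulary, which you can verify with the tokenizer from the question (a quick check; the ids shown are what t5-small reports, with sentinels sitting at the end of the vocabulary in reverse order):

# Sentinel tokens are ordinary vocabulary entries
print(tokenizer.convert_tokens_to_ids('<extra_id_0>'))  # 32099 for t5-small
print(tokenizer.convert_tokens_to_ids('<extra_id_1>'))  # 32098
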
In the output, the text between <extra_id_0> and <extra_id_1> is the prediction for mask 0, the text between <extra_id_1> and <extra_id_2> is the prediction for mask 1, and so on.

To extract the answers from your generated output, you can use the following snippet. It takes the decoded text and the number of masks as input and returns a list of strings, where each element is the text predicted for the corresponding mask.

def extract_text(text, num_masks=1):
    # Drop the special tokens that wrap the generated sequence
    text = text.replace('<pad>', '').replace('</s>', '')
    list_of_text = []
    for i in range(num_masks):
        prev_id = f'<extra_id_{i}>'
        curr_id = f'<extra_id_{i+1}>'
        start = text.index(prev_id) + len(prev_id)
        # The sentinel after the last mask may be absent; fall back to the end
        end = text.index(curr_id) if curr_id in text else len(text)
        list_of_text.append(text[start:end].strip())
    return list_of_text
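
For example, you can splice the extracted answers back into the original input (using input_data and output_sequences from the question, with num_masks=2):

filled = input_data
for i, answer in enumerate(extract_text(output_sequences[0], num_masks=2)):
    filled = filled.replace(f'<extra_id_{i}>', answer)

print(filled)  # The park offers walks in the park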

Also, note that T5 is not really the best choice for masked language modelling, as discussed here. Models like BERT are trained specifically for this type of task and can be used directly with the fill-mask pipeline from Hugging Face:

from transformers import pipeline
nlp_fill = pipeline('fill-mask')
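
Continuing from that snippet, a minimal call could look like this. Note that the pipeline's default checkpoint uses its own mask token (<mask> for the default RoBERTa-based model), not T5's <extra_id_n> sentinels:

# The default checkpoint expects <mask> rather than T5's sentinel tokens
for prediction in nlp_fill("The <mask> walks in the park"):
    print(prediction['token_str'], prediction['score'])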