How to prevent DataCollatorForLanguageModeling from using input_ids as labels in CLM tasks?

How do I instruct DataCollatorForLanguageModeling not to derive the labels from the inputs but to use my own labels?

Here's an MWE:

from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

data = {
    'sources': ["This is some text", "Another text athta ljdlsfjsdlf", "Also some bulshit type text who knows wtf?"],
    'targets': ["Some potential target.", "The answer is JoLo!", "Who killed margaret and what was the motive and poential causes!"]
}

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
config = AutoConfig.from_pretrained("openai-community/gpt2")
gpt2model = AutoModelForCausalLM.from_config(config)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, so reuse EOS
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

>> "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained"


tokenized_data = tokenizer(data['sources'])
with tokenizer.as_target_tokenizer():
    # note: this stores a full BatchEncoding (input_ids + attention_mask) under 'labels'
    tokenized_data['labels'] = tokenizer(data['targets'])

>> tokenized_data
{'input_ids': [[1212, 318, 617, 2420], [6610, 2420, 379, 4352, 64, 300, 73, 67, 7278, 69, 8457, 67, 1652], [7583, 617, 4807, 16211, 2099, 2420, 508, 4206, 266, 27110, 30]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'labels': {'input_ids': [[4366, 2785, 2496, 13], [464, 3280, 318, 5302, 27654, 0], [8241, 2923, 6145, 8984, 290, 644, 373, 262, 20289, 290, 745, 1843, 5640, 0]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}}


tokenized_labels = tokenized_data.pop('labels')

outputs = data_collator(tokenized_data)

>> outputs
{'input_ids': tensor([[ 1212,   318,   617,  2420, 50257, 50257, 50257, 50257, 50257, 50257,
         50257, 50257, 50257],
        [ 6610,  2420,   379,  4352,    64,   300,    73,    67,  7278,    69,
          8457,    67,  1652],
        [ 7583,   617,  4807, 16211,  2099,  2420,   508,  4206,   266, 27110,
            30, 50257, 50257]]), 'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]), 'labels': tensor([[ 1212,   318,   617,  2420,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100],
        [ 6610,  2420,   379,  4352,    64,   300,    73,    67,  7278,    69,
          8457,    67,  1652],
        [ 7583,   617,  4807, 16211,  2099,  2420,   508,  4206,   266, 27110,
            30,  -100,  -100]])}

Now outputs['labels'] is just a copy of outputs['input_ids'] (with the padding positions set to -100), which DataCollatorForLanguageModeling does automatically when mlm=False; the actual shifting happens inside the model.
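If I read the collator right, with mlm=False it is essentially doing something like this internally (paraphrased, not the exact library source):

labels = batch["input_ids"].clone()
labels[labels == tokenizer.pad_token_id] = -100  # padding positions are ignored by the loss
batch["labels"] = labels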

The question is: since I do have proper labels for this data, namely tokenized_data['labels'] (the variable tokenized_labels above), how do I use them with the Trainer class?

So dataset.map(...) will tokenize the whole dataset and return tokens for both the source texts and the labels.

Then the data_collator will build its own labels from the input_ids and feed those to the model.

How do I tell Trainer or DataCollator to use my tokenized_labels instead of creating them based on the inputs?
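What I'm imagining is a custom collator along these lines (a rough sketch with my own names, not anything from the transformers library; tokenizer.pad and the -100 ignore index are the only library pieces I'm relying on):

import torch

class CollatorWithMyLabels:
    # hypothetical sketch: pad input_ids/attention_mask with the tokenizer,
    # and pad my pre-tokenized labels separately with -100 so those positions
    # are ignored by the loss
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, examples):
        # each example: {'input_ids': [...], 'attention_mask': [...], 'labels': [...]}
        label_lists = [ex.pop("labels") for ex in examples]
        batch = self.tokenizer.pad(examples, return_tensors="pt")
        max_len = max(len(lab) for lab in label_lists)
        batch["labels"] = torch.tensor(
            [lab + [-100] * (max_len - len(lab)) for lab in label_lists]
        )
        return batch

which I would then pass to Trainer via data_collator=CollatorWithMyLabels(tokenizer). Is that the right direction, or is there a built-in way? (I'm also not sure a causal LM even accepts labels whose length differs from input_ids, which might be part of my confusion.)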
