How to train a language model in Hugging Face with a custom loss?


I'm following Hugging Face's tutorial on training a causal language model.

I want to modify it so that, in addition to predicting the next token, the model also predicts a sentiment vector after certain tokens.

So for example, if my original sequence of tokens is

This is my bad sequence of words # I'm simplifying here by making each word a token, but in practice a word is often multiple tokens

The sequence I want to train my model on is

This [some_vector] is [some_vector] my bad [some_vector] sequence of words [some_vector]

I have the vectors and the sequences, and I can structure the data however is needed. I also know that I can use a cosine similarity loss to measure how close a predicted vector is to the gold vector. But I don't know how to set up training so that the model learns to predict the vectors.
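To be concrete about the loss, here is a minimal sketch of what I have in mind; predicted and gold are placeholder tensors standing in for a batch of predicted vectors and gold sentiment vectors (the 768 dimension is just an example):

import torch
import torch.nn.functional as F

# placeholder tensors: (batch, dim) predicted and gold sentiment vectors
predicted = torch.randn(4, 768)
gold = torch.randn(4, 768)

# 1 - cosine similarity, averaged over the batch; 0 when the vectors align
loss = (1 - F.cosine_similarity(predicted, gold, dim=-1)).mean()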

One idea is to add a single placeholder token to the vocabulary, and have the model's output at that position stand in for the vector:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

# add the placeholder token to the tokenizer vocabulary ("[SENT]" is a hypothetical name)
tokenizer.add_tokens(["[SENT]"])

# add a new, randomly initialized embedding for the new token
model.resize_token_embeddings(len(tokenizer))

In that case, the architecture doesn't need to be changed.
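If I go that route, I assume the training could be set up by subclassing Trainer and overriding compute_loss, combining the usual next-token loss with a cosine loss on the hidden states at the placeholder positions. This is only a sketch of what I have in mind: gold_vectors is a hypothetical batch field carrying one gold vector per [SENT] occurrence, and the gold vectors are assumed to have the model's hidden size (otherwise a projection layer would be needed):

import torch.nn.functional as F
from transformers import Trainer

SENT_ID = tokenizer.convert_tokens_to_ids("[SENT]")

class SentimentVectorTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # hypothetical extra field supplied by a custom data collator:
        # one gold sentiment vector per [SENT] occurrence in the batch
        gold_vectors = inputs.pop("gold_vectors")

        # labels must be present in inputs for outputs.loss to be computed
        outputs = model(**inputs, output_hidden_states=True)
        lm_loss = outputs.loss  # usual next-token cross-entropy

        # use the last hidden state at each [SENT] position as the predicted vector
        hidden = outputs.hidden_states[-1]       # (batch, seq_len, hidden_size)
        mask = inputs["input_ids"] == SENT_ID    # (batch, seq_len)
        predicted = hidden[mask]                 # (num_placeholders, hidden_size)

        vec_loss = (1 - F.cosine_similarity(predicted, gold_vectors, dim=-1)).mean()
        loss = lm_loss + vec_loss
        return (loss, outputs) if return_outputs else loss

For this to work, the dataset would have to carry the gold vectors into each batch (e.g. through a custom data collator), and remove_unused_columns=False would have to be set in TrainingArguments so the Trainer doesn't drop the extra gold_vectors field.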

Another option is to modify the architecture so that the model always predicts the next token but also the vector corresponding to it. But that sounds like a more complex solution.
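If I understand it right, that version would look roughly like the sketch below: a wrapper module (the class name, the single linear vector_head, and the use of the last hidden state are all my assumptions) that returns a predicted vector for every position alongside the usual logits:

import torch.nn as nn

class CausalLMWithVectorHead(nn.Module):
    # wraps a causal LM and adds a regression head that predicts a vector
    # at every position, alongside the usual next-token logits
    def __init__(self, base_model, vector_dim):
        super().__init__()
        self.base = base_model
        self.vector_head = nn.Linear(base_model.config.hidden_size, vector_dim)

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.base(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
            output_hidden_states=True,
        )
        # (batch, seq_len, vector_dim): one predicted vector per token position
        vectors = self.vector_head(outputs.hidden_states[-1])
        return outputs.loss, outputs.logits, vectors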

But again, I'm not entirely sure how to set the training up using their script.
