I am currently using a pretrained RoBERTa model to compute sentiment scores and categories for my dataset. I am truncating the length to 512, but I still get the warning below. What is going wrong here? I am using the following code:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax
model = "j-hartmann/sentiment-roberta-large-english-3-classes"
tokenizer = AutoTokenizer.from_pretrained(model, model_max_length=512, truncation=True)
automodel = AutoModelForSequenceClassification.from_pretrained(model)
This is the warning that I am getting:
Token indices sequence length is longer than the specified maximum sequence length for this model (627 > 512). Running this sequence through the model will result in indexing errors
You have not shared the code where you use the tokenizer to encode/tokenize the inputs, so I'm using my own example to explain how you can achieve this.
The key point is that `truncation` and `max_length` take effect when you *call* the tokenizer on your text, not when you load it with `from_pretrained`. Calling the tokenizer with `padding='max_length'`, `truncation=True`, and `max_length=512` will tokenize any string into exactly `max_length` tokens: by padding if the token count is < `max_length`, or by truncating if the token count is > `max_length`. Note: `max_length` cannot be greater than 512 for this RoBERTa model.
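Here is a minimal sketch of that encoding step (the input string is made up, chosen only to be long enough to exceed 512 tokens):

```python
from transformers import AutoTokenizer

checkpoint = "j-hartmann/sentiment-roberta-large-english-3-classes"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Hypothetical long input: repeated so it exceeds 512 tokens
text = "I really enjoyed this product and would buy it again. " * 50

# truncation/padding/max_length take effect here, at encoding time --
# passing truncation=True to from_pretrained does not truncate anything
encoded = tokenizer(text, padding="max_length", truncation=True, max_length=512)

print(len(encoded["input_ids"]))  # 512: padded or truncated to max_length
```

With this call there is no length warning, because every sequence is cut down to 512 tokens before it ever reaches the model.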