I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset, and in the testing phase I know that BERT can only take up to 512 tokens, so I wrote an if condition to check the length of the test sentence in my dataframe. If it is longer than 512, I split the sentence into sequences of 512 tokens each and then run the tokenizer encode. The length of each sequence is 512; however, after the tokenizer encode the length becomes 707 and I get this error:
The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1
Here is the code I used to do the previous steps:
import math
import torch
import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

# model: the fine-tuned BERT model loaded earlier (on GPU)
pred = []
if len(test_sentence_in_df.split()) > 512:
    n = math.ceil(len(test_sentence_in_df.split()) / 512)
    for i in range(n):
        if i == (n - 1):
            print(i)
            test_sentence = ' '.join(test_sentence_in_df.split()[i*512:])
        else:
            print("i in else", str(i))
            test_sentence = ' '.join(test_sentence_in_df.split()[i*512:(i+1)*512])
        # print(len(test_sentence.split()))  # the length here is 512
        tokenized_sentence = tokenizer.encode(test_sentence)
        input_ids = torch.tensor([tokenized_sentence]).cuda()
        print(len(tokenized_sentence))  # the length here is 707
        with torch.no_grad():
            output = model(input_ids)
        label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
        pred.append(label_indices)
print(pred)
This is because BERT uses word-piece tokenization. When a word is not in the vocabulary, the tokenizer splits it into word pieces: for example, if the word playing is not in the vocabulary, it can be split into play, ##ing. This increases the number of tokens in a given sentence after tokenization, which is why 512 whitespace-separated words can become 707 tokens. You can pass truncation and padding arguments to keep the tokenized length within a fixed maximum:

tokenized_sentence = tokenizer.encode(test_sentence, padding=True, truncation=True, max_length=50, add_special_tokens=True)
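For the 512-token limit in the question, max_length=512 would be the natural value instead of 50. As a further illustration (a sketch, not part of the original answer), you can avoid the word/token mismatch entirely by chunking on word-piece tokens instead of whitespace words: tokenize the whole text once, slice the token IDs into windows of at most 510, and add the [CLS]/[SEP] special tokens to each window yourself. This reuses model, tokenizer and test_sentence_in_df from the question; the chunk_size name is just an illustrative choice.

import torch
import numpy as np

chunk_size = 510  # leave room for the [CLS] and [SEP] special tokens

# Tokenize the full text once, without special tokens, so chunk boundaries
# are measured in word-piece tokens rather than whitespace words.
token_ids = tokenizer.encode(test_sentence_in_df, add_special_tokens=False)

pred = []
for start in range(0, len(token_ids), chunk_size):
    chunk = token_ids[start:start + chunk_size]
    # Re-add the special tokens around each chunk before feeding the model.
    input_ids = torch.tensor(
        [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]]
    ).cuda()
    with torch.no_grad():
        output = model(input_ids)
    label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
    pred.append(label_indices)

With 510 content tokens plus [CLS] and [SEP], each chunk is exactly 512 tokens at most, which is the maximum sequence length bert-base accepts, so the size mismatch error cannot occur.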