"Dimensions must be equal" error when fitting on text tokenized with TensorFlow & Hugging Face

I tokenize my x and y data with Hugging Face's PreTrainedTokenizerFast, where x is the text and y is its summary. When I call model.fit() I get a dimension mismatch error. What can I do?

import os
import tensorflow as tf
from transformers import TFBartModel, PreTrainedTokenizerFast

# Load the pre-trained BART weights (converted from a PyTorch checkpoint) and its tokenizer.
pre_trained_model = TFBartModel.from_pretrained(model_dir, from_pt=True)
pre_trained_tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
pre_trained_tokenizer.add_special_tokens({'pad_token': '[PAD]', 'sep_token': '[/S]'})

# Tokenize the texts (x) and their summaries (y).
x_train_token = pre_trained_tokenizer(x_tr, truncation=True, max_length=MAX_TEXT_LEN, padding=True, return_tensors='tf')
y_train_token = pre_trained_tokenizer(y_tr, truncation=True, max_length=MAX_SUMMARY_LEN, padding=True, return_tensors='tf')

# Wrap BART in a Keras model that returns its last hidden state.
input_ids = tf.keras.Input(shape=(MAX_TEXT_LEN,), dtype=tf.int32, name='input_ids')
input_mask = tf.keras.Input(shape=(MAX_TEXT_LEN,), dtype=tf.int32, name='attention_mask')
embeddings = pre_trained_model(input_ids, attention_mask=input_mask)[0]
model = tf.keras.Model(inputs=[input_ids, input_mask], outputs=embeddings)

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

model.fit(x=[x_train_token['input_ids'], x_train_token['attention_mask']],
          y=y_train_token['input_ids'], batch_size=32, epochs=5)

x_train_token['input_ids'].shape, y_train_token['input_ids'].shape  # (TensorShape([32889, 50]), TensorShape([32889, 10]))
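
Those two shapes are what the sparse-categorical-accuracy metric ends up comparing position by position: the model output keeps one position per source token (50), while the labels keep one position per summary token (10). A minimal sketch with hypothetical stand-in tensors (not the real data, and an assumed hidden size of 768) that reproduces the clash:

import tensorflow as tf

fake_output = tf.zeros((2, 50, 768))             # like pre_trained_model(...)[0]: (batch, MAX_TEXT_LEN, hidden)
fake_labels = tf.zeros((2, 10), dtype=tf.int32)  # like y_train_token['input_ids']: (batch, MAX_SUMMARY_LEN)

metric = tf.keras.metrics.SparseCategoricalAccuracy()
metric.update_state(fake_labels, fake_output)    # fails: 10 label positions vs 50 predicted positions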

ValueError: Dimensions must be equal, but are 10 and 50 for '{{node Equal}} = Equal[T=DT_FLOAT, incompatible_shape_error=true](Cast_1, Cast_2)' with input shapes: [?,10], [?,50]
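
The bare TFBartModel returns hidden states with one position per input token, shape (batch, MAX_TEXT_LEN, hidden), not vocabulary logits over the 10 summary positions, so labels of length 10 can never line up with it. One direction commonly used for BART summarization, offered only as a sketch and not a verified fix for this exact setup, is to switch to TFBartForConditionalGeneration and let its head compute the sequence-to-sequence loss over the summary tokens itself. The padding='max_length' choice, the -100 label masking, and the internal-loss training path below are assumptions that depend on the installed transformers version; the other names are reused from the question's code.

import tensorflow as tf
from transformers import TFBartForConditionalGeneration

# Summarization head on top of BART.
seq2seq = TFBartForConditionalGeneration.from_pretrained(model_dir, from_pt=True)
# If new special tokens were added to the tokenizer, the embedding matrix may need resizing.
seq2seq.resize_token_embeddings(len(pre_trained_tokenizer))

enc = pre_trained_tokenizer(x_tr, truncation=True, max_length=MAX_TEXT_LEN,
                            padding='max_length', return_tensors='tf')
dec = pre_trained_tokenizer(y_tr, truncation=True, max_length=MAX_SUMMARY_LEN,
                            padding='max_length', return_tensors='tf')

# Replace padding positions in the labels with -100 so the loss ignores them.
labels = tf.where(tf.equal(dec['attention_mask'], 1),
                  dec['input_ids'],
                  tf.fill(tf.shape(dec['input_ids']), -100))

features = {'input_ids': enc['input_ids'],
            'attention_mask': enc['attention_mask'],
            'labels': labels}

# Recent transformers releases let Keras fall back to the model's internal
# seq2seq loss when compile() is called without a loss argument.
seq2seq.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))
seq2seq.fit(features, batch_size=8, epochs=5)

With this head the logits have shape (batch, MAX_SUMMARY_LEN, vocab_size), so they align with the tokenized summaries instead of the 50-token source sequence.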
