T5 fine-tuned model outputs <unk> instead of curly braces and other special characters


First off, I'm a beginner when it comes to machine learning as a whole and to transformers, so my apologies if this is a dumb question. I've been fine-tuning T5 for the task of generating MongoDB queries, but I was met with strange output that doesn't look like the intended one. Here is the generation code (the fine-tuned model and tokenizer are loaded beforehand):

import nltk

# Tokenize the prompt and generate a query with the fine-tuned model.
inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10, max_length=64)

# Decode without stripping special tokens, then keep the first sentence.
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=False)[0]
print(decoded_output)

predicted_Query = nltk.sent_tokenize(decoded_output.strip())[0]
print(predicted_Query)

This gives the following output:

<pad> db.movies.find(<unk>"title": "The Poor Little Rich Girl"<unk>, <unk>"writers": 1<unk>)</s>

<pad> db.movies.find(<unk>"title": "The Poor Little Rich Girl"<unk>, <unk>"writers": 1<unk>)</s>

The query is correct for the most part. I assume that the <unk> token is supposed to be a curly brace but the model wasn't able to handle it (as in an OOV case). Note that the dataset used to fine-tune it contains curly braces in the outputs, so I'm confused about how it couldn't recognize them during testing. Would it be a problem with the tokenizer? If so, could I expand the vocab by adding some new tokens? I'm not asking for an answer (although it's welcome), but some guidance would be appreciated. Thank you for your time.

I tested whether the tokenizer can handle curly braces and it showed that it can. Again, I'm new to this, so I'm not really sure I understand the problem well.
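For what it's worth, a more direct check would be something like the following (a minimal sketch, assuming the stock t5-base tokenizer rather than my fine-tuned checkpoint): encode a string containing braces and look for the unknown-token id.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

sample = 'db.movies.find({"title": "The Poor Little Rich Girl"}, {"writers": 1})'
ids = tokenizer(sample).input_ids

# If the unknown-token id shows up, some character has no piece in the vocabulary.
print(tokenizer.unk_token_id in ids)
# Decoding shows whether the braces survive or come back as <unk>.
print(tokenizer.decode(ids))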


There is 1 best solution below


After some research I found a solution. The T5 tokenizer's vocabulary was missing a few characters, curly braces among them, so I used the following to add them.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Register the missing characters as new tokens, then grow the embedding
# matrix so the model has rows for them.
new_words = ['{', '}']
tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))
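One caveat to add (my own assumption, not something covered above): the embedding rows created by resize_token_embeddings start out randomly initialized, so the model needs to be fine-tuned again after this step, and the extended tokenizer should be saved together with the model so the larger vocabulary is reused at inference time. A quick sanity check, where "t5-base-with-braces" is just a placeholder output path:

# '{' should now map to a real id instead of the unknown token.
assert tokenizer.convert_tokens_to_ids('{') != tokenizer.unk_token_id

# Save model and tokenizer together so they stay in sync.
model.save_pretrained("t5-base-with-braces")
tokenizer.save_pretrained("t5-base-with-braces")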