T5 fine-tuned model outputs <unk> instead of curly braces and other special characters


First off, I'm a beginner when it comes to machine learning as a whole and to transformers, so my apologies if this is a dumb question. I've been fine-tuning T5 for the task of generating MongoDB queries, but I was met with strange output that doesn't look like the intended one. Here is the generation code (the fine-tuned model and tokenizer are loaded beforehand):

import nltk

# Tokenize the prompt and generate a query with the fine-tuned model.
inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10, max_length=64)

# Decode without stripping special tokens, then keep the first sentence.
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=False)[0]
print(decoded_output)

predicted_Query = nltk.sent_tokenize(decoded_output.strip())[0]
print(predicted_Query)

This gives the following output:

<pad> db.movies.find(<unk>"title": "The Poor Little Rich Girl"<unk>, <unk>"writers": 1<unk>)</s>

<pad> db.movies.find(<unk>"title": "The Poor Little Rich Girl"<unk>, <unk>"writers": 1<unk>)</s>

The query is correct for the most part. I assume that the <unk> token is supposed to be a curly brace but the model wasn't able to handle it (as in an OOV case). Note that the dataset used to fine-tune it contains curly braces in the outputs, so I'm confused about how it couldn't recognize them during testing. Would it be a problem with the tokenizer? If so, could I expand the vocab by adding some new tokens? I'm not asking for an answer (although it's welcome), but some guidance would be appreciated. Thank you for your time.

I tested whether the tokenizer can handle curly braces and it showed that it can. Again, I'm new to this, so I'm not really sure I understand the problem well.
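For what it's worth, a more direct check would be something like the following (a minimal sketch, assuming the stock t5-base tokenizer rather than my fine-tuned checkpoint): encode a string containing braces and look for the unknown-token id.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

sample = 'db.movies.find({"title": "The Poor Little Rich Girl"}, {"writers": 1})'
ids = tokenizer(sample).input_ids

# If the unknown-token id shows up, some character has no piece in the vocabulary.
print(tokenizer.unk_token_id in ids)
# Decoding shows whether the braces survive or come back as <unk>.
print(tokenizer.decode(ids))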


There is 1 best solution below


After some research I found a solution. The T5 tokenizer's vocabulary was missing a few characters, curly braces among them, so I used the following to add them.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Register the missing characters as new tokens, then grow the embedding
# matrix so the model has rows for them.
new_words = ['{', '}']
tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))
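One caveat to add (my own assumption, not something covered above): the embedding rows created by resize_token_embeddings start out randomly initialized, so the model needs to be fine-tuned again after this step, and the extended tokenizer should be saved together with the model so the larger vocabulary is reused at inference time. A quick sanity check, where "t5-base-with-braces" is just a placeholder output path:

# '{' should now map to a real id instead of the unknown token.
assert tokenizer.convert_tokens_to_ids('{') != tokenizer.unk_token_id

# Save model and tokenizer together so they stay in sync.
model.save_pretrained("t5-base-with-braces")
tokenizer.save_pretrained("t5-base-with-braces")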