Slow and fast tokenizers give different outputs (SentencePiece tokenization)
When I use T5TokenizerFast (the fast tokenizer for the T5 architecture), the output is as expected:

['▁', '</s>', '▁Hello', '▁', '<sep>', '</s>']

But when I use the normal (slow) tokenizer, it splits the special token </s> into pieces:

['▁</', 's', '>', '▁Hello', '<sep>', '</s>']

This is the printed repr of the slow (non-fast) tokenizer:

PreTrainedTokenizer(name_or_path='', vocab_size=60000, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})

And for the fast one:

PreTrainedTokenizerFast(name_or_path='', vocab_size=60000, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})

Code I am using to produce these outputs:

from transformers import T5TokenizerFast

# Fast tokenizer built from a custom SentencePiece model, with an added <sep> token
tokenizer = T5TokenizerFast('new_sp.model', extra_ids=0)
tokenizer.add_tokens(['<sep>'])
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("</s> Hello <sep>")))
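The slow-tokenizer output above comes from the equivalent snippet with T5Tokenizer instead of the fast class (a minimal sketch, assuming the same 'new_sp.model' SentencePiece file and the same added <sep> token):

from transformers import T5Tokenizer

# Slow (pure-Python SentencePiece) tokenizer built from the same model file
slow_tokenizer = T5Tokenizer('new_sp.model', extra_ids=0)
slow_tokenizer.add_tokens(['<sep>'])
print(slow_tokenizer.convert_ids_to_tokens(slow_tokenizer.encode("</s> Hello <sep>")))
# -> ['▁</', 's', '>', '▁Hello', '<sep>', '</s>']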

I would appreciate any help. Thanks.