When I use T5TokenizerFast (the fast tokenizer for the T5 architecture), the output is as expected:
['▁', '</s>', '▁Hello', '▁', '<sep>', '</s>']
But when I use the normal (slow) tokenizer, it splits the special token </s> into pieces, as follows:
['▁</', 's', '>', '▁Hello', '<sep>', '</s>']
This is the printout of the slow tokenizer:
PreTrainedTokenizer(name_or_path='', vocab_size=60000, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})
And for the fast one:
PreTrainedTokenizerFast(name_or_path='', vocab_size=60000, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})
This is the code I am using to produce these outputs:
from transformers import T5TokenizerFast

# Build the fast tokenizer from my custom SentencePiece model, with no extra sentinel ids
tokenizer = T5TokenizerFast('new_sp.model', extra_ids=0)
tokenizer.add_tokens(['<sep>'])
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("</s> Hello <sep>")))
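For reference, a minimal sketch of how the slow-tokenizer output above can be reproduced (assuming T5Tokenizer accepts the same arguments as T5TokenizerFast):

from transformers import T5Tokenizer

# Hypothetical slow-tokenizer counterpart of the snippet above,
# built from the same SentencePiece file 'new_sp.model'
slow_tokenizer = T5Tokenizer('new_sp.model', extra_ids=0)
slow_tokenizer.add_tokens(['<sep>'])
print(slow_tokenizer.convert_ids_to_tokens(slow_tokenizer.encode("</s> Hello <sep>")))
# Prints: ['▁</', 's', '>', '▁Hello', '<sep>', '</s>']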
I would appreciate any help. Thanks.