Preserving formatting in a BERT-tokenized string

If I tokenize a string like this:

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained('bert-base-cased')

tokens = t.tokenize("I don't think the situation is quite as cut-and-dried as that - you should ask him directly.")

Then t.convert_tokens_to_string(tokens) returns "I don ' t think the situation is quite as cut - and - dried as that - you should ask him directly.", with extra spaces around the apostrophe and the hyphens.

Is there some way to preserve the original formatting in the "untokenized" text, perhaps by using a different tokenizer? I am doing masked word replacement, but only on whole words. The BERT tokenizer works well for this because it splits on word boundaries, whereas the GPT-2 tokenizer, for example, preserves formatting better but makes it harder to manipulate single words.
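To make the goal concrete, here is a minimal sketch of the behaviour being asked for: keep the character offsets of each word and edit the original string by slicing, rather than rebuilding it from tokens. It uses only the standard library; the regex and the helper names `word_spans` and `replace_span` are hypothetical stand-ins and only approximate BERT's word splitting. Fast tokenizers in transformers can return comparable offsets via `return_offsets_mapping=True`, which may be one way to get the same effect with the real tokenizer.

```python
import re

# Assumption: a rough approximation of BERT-style pre-tokenization,
# i.e. runs of word characters or single punctuation marks.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")


def word_spans(text):
    """Return (start, end) character offsets for each word or punctuation mark."""
    return [m.span() for m in WORD_OR_PUNCT.finditer(text)]


def replace_span(text, span, replacement):
    """Replace one span of the original text and leave everything else untouched."""
    start, end = span
    return text[:start] + replacement + text[end:]


text = ("I don't think the situation is quite as cut-and-dried as that"
        " - you should ask him directly.")
spans = word_spans(text)

# Pick the span that covers the whole word to be replaced.
target = next(s for s in spans if text[s[0]:s[1]] == "situation")
result = replace_span(text, target, "matter")
print(result)
```

Because the replacement is spliced into the original string, the apostrophe in "don't" and the hyphens in "cut-and-dried" keep their original spacing.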
