If I tokenize a string like this:
from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained('bert-base-cased')
tokens = t.tokenize("I don't think the situation is quite as cut-and-dried as that - you should ask him directly.")
Then t.convert_tokens_to_string(tokens) returns
"I don ' t think the situation is quite as cut - and - dried as that - you should ask him directly."
Is there a way to preserve the original formatting in the "untokenized" text, perhaps by using a different tokenizer? I am doing masked word replacement, but only on whole words. The BERT tokenizer works well for this, whereas e.g. the GPT-2 tokenizer preserves formatting better but doesn't make it as easy to manipulate single words.
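One possible direction, sketched here under assumptions rather than as a known solution: instead of rebuilding the sentence from tokens, keep each token's character offsets and splice the replacement into the original string, so all untouched spacing and punctuation survive verbatim. The helpers token_spans and replace_token below are hypothetical names, and the regex only approximates BERT-style splitting into words and punctuation. With a fast tokenizer from transformers, the offsets could presumably come from calling the tokenizer with return_offsets_mapping=True instead of from the regex, but that is not shown here.

```python
import re


def token_spans(text):
    """Return (token, start, end) for every word or punctuation mark in text.

    The regex is only a rough stand-in for BERT's basic tokenization:
    runs of word characters, or single non-space punctuation characters.
    """
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def replace_token(text, spans, index, replacement):
    """Replace the token at `index` by splicing into the original string."""
    _, start, end = spans[index]
    return text[:start] + replacement + text[end:]


text = ("I don't think the situation is quite as cut-and-dried as that - "
        "you should ask him directly.")
spans = token_spans(text)
words = [tok for tok, _, _ in spans]

# Mask one whole word; everything else keeps its original formatting.
masked = replace_token(text, spans, words.index("situation"), "[MASK]")
print(masked)
```

Because only the slice text[start:end] is swapped out, "don't" and "cut-and-dried" are left exactly as written. The model's prediction for the masked position can be spliced back in the same way.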