I'm currently working on a text-processing pipeline that tokenizes French text with GPT-2 and spaCy, but I'm running into problems with character encoding in the tokenization output. After GPT-2 tokenization, words come out altered; for instance, voilà becomes voilÃł. This breaks the next step of the pipeline, because the altered word can no longer be found in the original text.
I tried chardet.detect(text.encode()) to recover the original word, but it didn't work. Have you ever faced this problem?
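For what it's worth, from digging around I suspect this may not be an encoding corruption at all, but GPT-2's byte-level BPE: before tokenization, every UTF-8 byte is mapped to a printable Unicode character, so multi-byte characters like à turn into pairs like Ãł. Here is a minimal self-contained sketch of that mapping (adapted from the published bytes_to_unicode function in the GPT-2 source, for illustration only), which reproduces exactly the behavior I'm seeing, and its reverse:

```python
# Sketch of GPT-2's byte-level mapping (adapted from the published
# bytes_to_unicode() in the GPT-2 source, for illustration).
def bytes_to_unicode():
    # Printable byte values keep their own code point...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # ...all other bytes are shifted to printable chars
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

word = "voilà"
# "à" is two UTF-8 bytes (0xC3 0xA0); each byte is mapped separately,
# which is where the strange-looking characters come from.
mangled = "".join(byte_encoder[b] for b in word.encode("utf-8"))
print(mangled)  # voilÃł -- matches what I see in the tokenizer output

# Reversing the mapping byte-by-byte recovers the original string.
restored = bytes(byte_decoder[c] for c in mangled).decode("utf-8")
print(restored)  # voilà
```

If this diagnosis is right, chardet can't help because the text is not mis-decoded; it was deliberately remapped, and only the inverse byte table gets it back.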
Thanks in advance!