I'm currently working on a text-processing pipeline that tokenizes French text with GPT-2 and spaCy, but I'm running into problems with character encoding in the tokenization output. After GPT-2 tokenization, words come out altered; for instance, voilà becomes voilÃł. This breaks the next step of the pipeline, because the altered word can no longer be found in the original text.
I tried chardet.detect(text.encode()) to recover the original word, but it didn't work. Have you ever faced this problem?
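For what it's worth, from digging around I suspect this may not be an encoding corruption at all, but GPT-2's byte-level BPE: before tokenization, every UTF-8 byte is mapped to a printable Unicode character, so multi-byte characters like à turn into pairs like Ãł. Here is a minimal self-contained sketch of that mapping (adapted from the published bytes_to_unicode function in the GPT-2 source, for illustration only), which reproduces exactly the behavior I'm seeing, and its reverse:

```python
# Sketch of GPT-2's byte-level mapping (adapted from the published
# bytes_to_unicode() in the GPT-2 source, for illustration).
def bytes_to_unicode():
    # Printable byte values keep their own code point...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # ...all other bytes are shifted to printable chars
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

word = "voilà"
# "à" is two UTF-8 bytes (0xC3 0xA0); each byte is mapped separately,
# which is where the strange-looking characters come from.
mangled = "".join(byte_encoder[b] for b in word.encode("utf-8"))
print(mangled)  # voilÃł -- matches what I see in the tokenizer output

# Reversing the mapping byte-by-byte recovers the original string.
restored = bytes(byte_decoder[c] for c in mangled).decode("utf-8")
print(restored)  # voilà
```

If this diagnosis is right, chardet can't help because the text is not mis-decoded; it was deliberately remapped, and only the inverse byte table gets it back.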
Thanks in advance!