How to avoid 'CharMap' codec error when creating BertLMDataBunch object for French texts


Issue with BertLMDataBunch.from_raw_corpus(): 'charmap' codec can't encode character '\u0627' in position 0: character maps to <undefined>

When creating a BertLMDataBunch object, I get this error: 'charmap' codec can't encode character '\u0627' in position 0: character maps to <undefined>. When I tried encoding my texts as UTF-8, I got this error instead: 'charmap' codec can't encode characters in position 20-25: character maps to <undefined>. I also tried avoiding punctuation and special characters like 'éèêçàôûù', but I got the same error.
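Note: '\u0627' is the Arabic letter alef, not a French accented character. A minimal sketch of why removing accents does not help, assuming the failing write uses the Windows default code page cp1252 (an assumption; the actual encoding depends on the system locale). The helper name can_encode is hypothetical.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


french_accents = "éèêçàôûù"
arabic_alef = "\u0627"

# French accented letters are all part of cp1252, so they are not the cause.
print(can_encode(french_accents, "cp1252"))

# The Arabic letter has no cp1252 mapping; this is what triggers the error.
print(can_encode(arabic_alef, "cp1252"))

# UTF-8 can represent every Unicode character.
print(can_encode(arabic_alef, "utf-8"))

# Reproduce the error message from the question.
try:
    arabic_alef.encode("cp1252")
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)
```

So the error points to non-Latin characters somewhere in the Description column rather than to the French accents.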

df_train is my labeled dataset, and Description is the column with French texts.

from pathlib import Path

DATA_PATH = Path('./data/')

all_texts = df_train['Description'].to_list()
# Round-trip each text through UTF-8 (this leaves valid text unchanged)
all_texts = [x.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore') for x in all_texts]
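The UTF-8 round trip above returns the same strings, because any Python str can already be encoded as UTF-8. A possible workaround, sketched here under the assumption that the file is written with cp1252, is to drop the characters that cp1252 cannot represent. The helper strip_unencodable is hypothetical, and this approach silently discards non-Latin text.

```python
def strip_unencodable(text, encoding="cp1252"):
    """Drop every character that the target encoding cannot represent."""
    return text.encode(encoding, errors="ignore").decode(encoding)


samples = ["j ai envoyé une réclamation", "réclamation \u0627"]
cleaned = [strip_unencodable(x) for x in samples]

# French accents survive; the Arabic letter is removed.
print(cleaned)
```

An alternative that keeps all characters is to run Python in UTF-8 mode (the PYTHONUTF8=1 environment variable or the -X utf8 option, available since Python 3.7), which makes open() default to UTF-8. Whether that resolves this case depends on how the library opens the file.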

The texts also contain numbers.

[Image: the BertLMDataBunch object]

The object I created generates a text file, lm_trained, that contains text like this:

Bonjour Le 21 Avril 2021 j ai envoy� une r�clamation
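The replacement characters in that line are what appears when a file written in cp1252 is read back as UTF-8. A small sketch that reproduces the effect, assuming cp1252 was the encoding used for writing; the file name lm_sample.txt is hypothetical.

```python
import os
import tempfile

sample = "Bonjour Le 21 Avril 2021 j ai envoyé une réclamation"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lm_sample.txt")  # hypothetical file name

    # Write with the assumed Windows default encoding.
    with open(path, "w", encoding="cp1252") as f:
        f.write(sample)

    # Reading as UTF-8 turns each accented byte into U+FFFD.
    with open(path, encoding="utf-8", errors="replace") as f:
        misread = f.read()

    # Reading with the encoding used for writing restores the text.
    with open(path, encoding="cp1252") as f:
        reread = f.read()

print(misread)
print(reread == sample)
```

If this matches the situation, the file itself may be intact and only needs to be opened with the same encoding it was written in.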

Can anyone help me fix this? Thank you!
