I'm working on a project with a dataset coming from Board Game Geek.
The issue concerns the names of the games I'm studying. I think the encoding went wrong somewhere, so the CSV file I received contains escaped characters. For example: Orl\u00e9ans instead of Orléans
When I import the CSV in Python, the names stay like that, and I want to correct these characters.
I think I managed to find the right function:
>>> unicodedata.normalize("NFD", 'Orl\u00e9ans')
'Orléans'
The problem is that I can't apply this function in a for loop.
The string is displayed as 'Orl\u00e9ans'
but it is actually 'Orl\\u00e9ans' (a literal backslash followed by u00e9),
so the function can't do its job.
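To illustrate the difference between the two strings, here is a minimal sketch (the variable names shown and actual are only for illustration). In the first literal Python itself turns \u00e9 into é, while the second holds the six separate characters that are really in the CSV:

```python
shown = 'Orl\u00e9ans'     # Python interprets \u00e9 in the literal as the single character é
actual = 'Orl\\u00e9ans'   # what the CSV holds: a backslash, then 'u', '0', '0', 'e', '9'

print(len(shown))        # 7 characters
print(len(actual))       # 12 characters
print(shown == actual)   # False: they are different strings
```

So in the example above the literal is already decoded before unicodedata.normalize is called, which is presumably why it appeared to work in the interpreter but not on the values read from the file.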
Is there any way to correct this? I have 20,000 entries in the dataset, so I can't fix them one by one.
Thank you
EDIT: I found the answer in this post: Process escape sequences in a string in Python
>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3
>>> decoded_string = myString.decode('string_escape') # python2
>>> print(decoded_string)
spam
eggs
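For applying this to a whole column of names, a possible sketch is below. It is not taken from the linked post: it swaps in a regular expression that replaces only \uXXXX sequences, because encoding to UTF-8 and then decoding with unicode_escape can garble names that already contain real accented characters. The helper decode_u_escapes and the sample list names are hypothetical, and surrogate pairs (two consecutive \uD8xx\uDCxx escapes) are not handled.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Hypothetical sample of names as they might come out of the CSV.
names = ['Orl\\u00e9ans', 'Catan', 'Die Burgen von Burgund']
fixed = [decode_u_escapes(name) for name in names]
print(fixed)
```

Names without any escape sequence pass through unchanged, so the same loop can be run over all the entries.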
Thanks a lot
I would try to use latin1 encoding as follows:
import codecs
with codecs.open(r'$(path to your csv file)', encoding='latin1') as f:
    data = f.read()  # read the whole file, decoded as Latin-1