Need to process the Kannada (regional) language in Python 3.7 without encoding issues

There is a JSON file, info.json, with Kannada letters in it:

{
  "name":"",
  "url":"",
  "desc":"ಹಾಡುಗಳನ್ನು ಈಗ ಆನಂದಿಸಿ."
}

If I try to read this file without specifying an encoding, like

with open('info.json', 'r')

I get the error: 'charmap' codec can't decode byte 0x8d in position 38: character maps to <undefined>

If I use UTF-8 like with open('info.json', 'r', encoding='utf-8')

only the Kannada Content is converted into Escape Unicode Entities like \u0c85\u0ca4\u0ccd\u0ca4\u0cb2\u0cbf\u0ca4\u0ccd\u0ca4

Since this is a string, I am having trouble converting it back to actual Kannada characters.

I tried various kinds of decoding, such as:

str(infoObj['desc'], "utf-8"),
infoObj['desc'].decode('unicode-escape')

I researched this for 5 hours without any success.

I am looking for help with getting the Kannada text back.

Thanks in advance.

There are 2 best solutions below

0
On

It worked for me when I added errors='ignore' along with the utf8 encoding:

with open('info.json', 'r', encoding='utf8', errors='ignore')
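
One caveat with errors='ignore': bytes that cannot be decoded are dropped silently, so text can go missing without any warning. If the file is valid UTF-8, the option changes nothing. A minimal sketch, using a made-up byte string (not the original file) with one deliberately invalid byte:

```python
# Hypothetical sample: valid UTF-8 Kannada text with one stray invalid byte (0xff) appended.
raw = "ಹಾಡು".encode("utf-8") + b"\xff"

# Strict decoding (the default) reports the bad byte instead of hiding it.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='ignore' silently drops the undecodable byte and keeps the rest.
ignored = raw.decode("utf-8", errors="ignore")
```

Here the strict decode raises UnicodeDecodeError, while the ignoring decode returns only the four Kannada code points.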
0
On

If I use UTF-8 like with open('info.json', 'r', encoding='utf-8')

only the Kannada Content is converted into Escape Unicode Entities like \u0c85\u0ca4\u0ccd\u0ca4\u0cb2\u0cbf\u0ca4\u0ccd\u0ca4

No, it is not.

The Kannada content is correctly decoded into a Python string containing the Kannada letters. However, depending on how you display a non-ASCII string, some characters may be shown as their Unicode escape values, may disappear, or may be replaced with a special replacement character.

Python also makes no distinction between a character and its escaped representation:

>>> "\x41\x62" == "Ab"
True
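
The same applies to the escapes quoted in the question. A minimal sketch, taking the first four escapes from that text as the sample string:

```python
import unicodedata

# First four code points of the escaped text quoted in the question.
s = "\u0c85\u0ca4\u0ccd\u0ca4"

# The escapes are only a way of writing the literal; the str holds real Kannada code points.
names = [unicodedata.name(ch) for ch in s]

# ascii() gives the escaped view of the same string, while print(s) would show
# the letters themselves (provided the terminal and font can render Kannada).
escaped_view = ascii(s)
```

The string has length 4, not 24: each \uXXXX in the source stands for a single character.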

So you may have a problem displaying Kannada letters, but not with correctly decoding the JSON file.
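
The question does not show how the text was printed, so the following is an assumption: one common way to end up with \uXXXX escapes is to pass the loaded object through json.dumps, which escapes non-ASCII characters by default. A minimal sketch with the question's JSON inlined so that it is self-contained:

```python
import json

# Same content as info.json in the question, inlined as UTF-8 bytes.
raw = '{"name": "", "url": "", "desc": "ಹಾಡುಗಳನ್ನು ಈಗ ಆನಂದಿಸಿ."}'.encode("utf-8")

# Equivalent to json.load() on a file opened with encoding='utf-8'.
info = json.loads(raw.decode("utf-8"))

desc = info["desc"]                               # a normal str with real Kannada characters
escaped = json.dumps(info)                        # default ensure_ascii=True produces \uXXXX escapes
readable = json.dumps(info, ensure_ascii=False)   # keeps the Kannada characters as they are
```

Both outputs describe the same data; json.loads(escaped) gives back the original dictionary. If the escapes come from somewhere else, such as the console encoding or a repr() of the string, the fix would be different.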