(Python) Incorrect string value (CP1251 to UTF-8)


I have a problem with a .json file that contains Cyrillic symbols. How can I convert CP1251 to UTF-8? (Calling temp_data.decode('utf-8') has no effect, and neither does passing ensure_ascii=False to json.dumps.)

import json

def load_data(filepath):   
    with open(filepath, 'r') as f:
        temp_data = json.load(f)
    return temp_data 


def pretty_print_json(d):
    out_json = json.dumps(d, sort_keys=True, indent=4, separators = (',', ': '))
    print(out_json)

if __name__ == '__main__':
    print("Enter the path to .json file: ")
    in_path = input()
    print("There are pretty printed json format: ")
    pretty_print_json(load_data(in_path))

There are 2 best solutions below

RaminNietzsche On

You can pass the ensure_ascii parameter. If ensure_ascii is true (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the result is a str instance consisting of ASCII characters only. If ensure_ascii is false, the result may be a unicode instance. This usually happens if the input contains unicode strings or the encoding parameter is used.
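As a rough illustration of the difference, here is a minimal sketch. It assumes Python 3, and the sample dictionary is made up for the example:

```python
import json

# Hypothetical sample data containing Cyrillic characters.
data = {"key": "АБВ"}

# Default behaviour (ensure_ascii=True): non-ASCII characters are escaped.
escaped = json.dumps(data)
print(escaped)  # {"key": "\u0410\u0411\u0412"}

# With ensure_ascii=False the Cyrillic characters are kept as they are.
readable = json.dumps(data, ensure_ascii=False)

# Both strings are valid JSON and decode to the same object.
assert json.loads(escaped) == json.loads(readable) == data
```

The escaped form is not corrupted data: it is only a different, ASCII-safe way of writing the same string.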

Change your code to this:

out_json = json.dumps(d, sort_keys=True, indent=4, separators = (',', ': '), ensure_ascii=False)

Here is the full code:

import json

def load_data(filepath):   
    with open(filepath, 'r') as f:
        temp_data = json.load(f)
    return temp_data 


def pretty_print_json(d):
    out_json = json.dumps(d, sort_keys=True, indent=4, separators = (',', ': '), ensure_ascii=False)
    print(out_json)

if __name__ == '__main__':
    print("Enter the path to .json file: ") 
    in_path = raw_input()  # Python 2; use input() on Python 3
    print("There are pretty printed json format: ")
    pretty_print_json(load_data(in_path))

I tested this code with this JSON file.

You can see the result in asciinema.

Mark Tolonen On

This works. If it doesn't work with your data, provide a sample of your data file and specify its encoding:

#coding:utf8
import json

datafile_encoding = 'cp1251'  # Any encoding that supports Cyrillic works.

# Create a test file with Cyrillic symbols.
with open('test.json','w',encoding=datafile_encoding) as f:
    D = {'key':'АБВГДЕЖЗИЙКЛМНОПРСТ', 'key2':'АБВГДЕЖЗИЙКЛМНОПРСТ'}
    json.dump(D,f,ensure_ascii=False)

# specify the encoding of the data file
def load_data(filepath):   
    with open(filepath, 'r', encoding=datafile_encoding) as f:
        temp_data = json.load(f)
    return temp_data 

# Use ensure_ascii=False
def pretty_print_json(d):
    out_json = json.dumps(d, sort_keys=True, ensure_ascii=False, indent=4, separators = (',', ': '))
    print(out_json)

if __name__ == '__main__':
    in_path = 'test.json'
    pretty_print_json(load_data(in_path))

Output:

{
    "key": "АБВГДЕЖЗИЙКЛМНОПРСТ",
    "key2": "АБВГДЕЖЗИЙКЛМНОПРСТ"
}
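
If the goal is to convert the file itself from CP1251 to UTF-8, one possible approach is to read the text with the source encoding and write it back with the target encoding. The sketch below assumes Python 3; the function name convert_encoding, its defaults, and the temporary file names are illustrative and do not come from the answers above:

```python
import json
import os
import tempfile

def convert_encoding(src_path, dst_path, src_encoding="cp1251", dst_encoding="utf-8"):
    """Re-encode a text file (hypothetical helper, not part of any library)."""
    # newline="" disables newline translation so line endings are preserved.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)

# Create a small CP1251-encoded JSON file to convert.
tmp_dir = tempfile.mkdtemp()
src_file = os.path.join(tmp_dir, "cp1251.json")
dst_file = os.path.join(tmp_dir, "utf8.json")
with open(src_file, "w", encoding="cp1251") as f:
    json.dump({"key": "АБВ"}, f, ensure_ascii=False)

convert_encoding(src_file, dst_file)

# Read the converted file back as raw bytes and decode it as UTF-8.
with open(dst_file, "rb") as f:
    raw = f.read()
converted = json.loads(raw.decode("utf-8"))
```

This only works if the source encoding is known; if the file is not actually CP1251, the decoded text may be silently wrong rather than raising an error.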