Problems handling file with Chinese characters


I have a file (with a custom extension) that's dumped by a tool. The content is a mix of English, Chinese, and Korean characters. The file opens fine in Notepad and the non-English characters display correctly. By running the following code I determined that the file is in UTF-16-LE encoding:

import chardet

def check_encoding(filename):
    """Return chardet's best guess at the file's encoding."""
    with open(filename, 'rb') as fh:  # open the file in raw binary mode
        data = fh.read()              # read the raw bytes
    return chardet.detect(data)['encoding']

This returns 'UTF-16LE'.

I have a method that processes this file, where I do this:

file = open(filename, 'r', encoding='utf-16-le')
for line in file:
    ...  # process the line

However when I execute this method I get this error (at the start of the for loop):

UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 106-107: illegal encoding

I've also tried opening the file with encoding='utf-8', but that throws the same kind of error (for the utf-8 codec, at a different position).
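One workaround I'm aware of (sketched below on a made-up byte string standing in for the real file, since the actual dump isn't reproducible here) is to open with errors='replace', which substitutes U+FFFD for undecodable byte pairs instead of raising, so the rest of the file can still be read:

```python
import tempfile, os

# Stand-in for the dumped file: valid UTF-16-LE with a lone high
# surrogate (b'\x00\xd8' = U+D800) injected in the middle.
bad_bytes = 'Hello'.encode('utf-16-le') + b'\x00\xd8' + ' world'.encode('utf-16-le')

with tempfile.NamedTemporaryFile(delete=False, suffix='.xyz') as tmp:
    tmp.write(bad_bytes)
    path = tmp.name

# errors='replace' turns the undecodable surrogate into U+FFFD
with open(path, 'r', encoding='utf-16-le', errors='replace') as fh:
    text = fh.read()

os.remove(path)
```

This hides the problem rather than fixing it, but it at least lets the method run end to end.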

The file is too big for me to manually narrow down to the exact line that's causing problems.

Now here's the interesting part. I make a copy of the file (let's call it copy.xyz). I open the original file in Notepad++, copy all the content, paste it into copy.xyz (also open in Notepad++), and save it (Ctrl+S). If I then run my method on the copy, it doesn't throw any error! Can anyone help me figure out what's going on here and how I can fix the errors in the original file?

EDIT: I opened the file in raw binary mode and ran chardet on each line to guess its encoding. That gave me 12 different encodings. What would be the next step?
