Read files with different encodings using sys.stdin in Python 3


I have many files, some encoded as UTF-8 and some as GBK. My system encoding is UTF-8 (LANG=zh_CN.UTF-8), so I can read the UTF-8 files easily, but I must read the GBK-encoded files as well. I'm following Python 3: How to specify stdin encoding here:

import sys 
import io
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    print(line)

My question is: how can I safely read all the files (both GBK and UTF-8) from sys.stdin? Or is there a better solution?

To slightly expand on this question, I want to handle files like this:

cat *.in | python3 handler.py

*.in matches many files, each encoded as either UTF-8 or GBK.

If I use the following code in handler.py

for line in sys.stdin:
    ...some code

it will throw an error as soon as it tries to process a GBK file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 0: invalid continuation byte

On the other hand, if I use code like this:

input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    ...some code

it will throw an error on any UTF-8 file:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 25: illegal multibyte sequence

I want to find a safe way to handle both types of files (UTF-8 and GBK) within my script.

1 Answer (accepted)

You can read the input as raw bytes and then examine it to decide which encoding to decode it with.

See also Reading binary data from stdin
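To make "raw bytes" concrete: iterating over the binary stream yields undecoded bytes lines. A minimal sketch, using io.BytesIO as a stand-in for sys.stdin.buffer (which behaves the same way but can't be demonstrated outside a pipe):

```python
import io

# io.BytesIO behaves like sys.stdin.buffer: iterating a binary stream
# yields bytes objects, one per line, each still ending in b'\n'.
stream = io.BytesIO('第一行\n'.encode('utf-8') + '第二行\n'.encode('gbk'))

raw_lines = list(stream)
# Each element is bytes, not str, so no decoding has happened yet.
assert all(isinstance(line, bytes) for line in raw_lines)
assert len(raw_lines) == 2
```

In the real script you would simply iterate `sys.stdin.buffer` instead of the BytesIO object.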

Assuming you can read entire lines at a time (i.e. the encoding is consistent within each line), I'd try to decode as UTF-8 first, then fall back to GBK.

import sys

# Iterate the binary buffer so each line arrives as undecoded bytes.
for raw_line in sys.stdin.buffer:
    try:
        line = raw_line.decode('utf-8')
    except UnicodeDecodeError:
        line = raw_line.decode('gbk')
    # ...
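The same fallback can be wrapped in a small helper, which makes it easy to verify in isolation. This is a sketch under the same line-level assumption; note the fallback is heuristic — a GBK line whose bytes happen to form valid UTF-8 would be mis-decoded, though UTF-8 validation is strict enough that this is rare in practice:

```python
def decode_line(raw_line: bytes) -> str:
    """Decode one bytes line, trying UTF-8 first and falling back to GBK."""
    try:
        return raw_line.decode('utf-8')
    except UnicodeDecodeError:
        return raw_line.decode('gbk')

# '你好' encodes to b'\xc4\xe3\xba\xc3' in GBK, which is not valid UTF-8
# (0xe3 is not a continuation byte), so the fallback branch is taken
# and the text round-trips correctly under both encodings.
assert decode_line('你好'.encode('utf-8')) == '你好'
assert decode_line('你好'.encode('gbk')) == '你好'
```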