Why does file.tell() affect encoding?


Calling tell() while reading a GBK-encoded file of mine causes the next call to readline() to raise a UnicodeDecodeError. However, if I don't call tell(), it doesn't raise this error.

C:\tmp>hexdump badtell.txt

000000: 61 20 6B 0D 0A D2 BB B0-E3                       a k......

C:\tmp>type test.py

with open(r'c:\tmp\badtell.txt', "r", encoding='gbk') as f:
    while True:
        pos = f.tell()
        line = f.readline();
        if not line: break
        print(line)

C:\tmp>python test.py

a k

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    line = f.readline();
UnicodeDecodeError: 'gbk' codec can't decode byte 0xd2 in position 0:  incomplete multibyte sequence

When I remove the f.tell() call, the file decodes successfully. Why? I tried Python 3.4 and 3.5 (x64) on Windows 7 and Windows 10, and the behavior is the same on all of them.

Does anyone have any idea? Should I report a bug?

I have a big text file, and I really need the file position ranges of its lines. Is there a workaround?


There are 2 best solutions below


I just replicated this on Python 3.4 x64 on Linux. Looking at the docs for TextIOBase, I don't see anything that says tell() causes problems with reading a file, so maybe it is indeed a bug.

b'\xd2'.decode('gbk')

gives an error like the one that you saw, but in your file that byte is followed by the byte BB, and

b'\xd2\xbb'.decode('gbk')

gives a value equal to '\u4e00', not an error.
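To make the "incomplete multibyte sequence" message a bit more concrete, here is a minimal sketch using only the standard library. The helper name try_decode is mine, purely for illustration. It decodes the lead byte alone, then the full two-byte pair, and then feeds the same two bytes one at a time to an incremental decoder, which holds back the unfinished character instead of raising. This only illustrates the error itself; I am not claiming it is the actual cause of the tell() behavior.

```python
import codecs


def try_decode(raw, encoding="gbk"):
    """Return the decoded text, or the codec's error reason on failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


lone = try_decode(b"\xd2")      # lead byte only: cannot be decoded
pair = try_decode(b"\xd2\xbb")  # lead byte plus trail byte: one character

# An incremental decoder buffers the lead byte until the trail byte arrives.
decoder = codecs.getincrementaldecoder("gbk")()
first = decoder.decode(b"\xd2")   # nothing can be emitted yet
second = decoder.decode(b"\xbb")  # the character is now complete

# ascii() keeps the output printable on any console encoding.
print(ascii(lone), ascii(pair), ascii(first), ascii(second))
```

So the byte 0xd2 is only a problem when the decoder is asked to treat it as the end of the input, which is what the traceback suggests is happening after tell().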

I found a workaround that works for the data in your original question but, as you've since found, not for other data. I wish I knew why. It calls seek() after every tell(), passing the value that tell() returned:

pos = f.tell()
f.seek(pos)
line = f.readline()

An alternative to f.seek(f.tell()) is to use the SEEK_CUR mode of seek() to give the position. With an offset of 0, this does the same as the above code: moves to the current position and gets that position.

import io  # provides io.SEEK_CUR

pos = f.seek(0, io.SEEK_CUR)
line = f.readline()

OK, there is a workaround. It works so far:

with open(r'c:\tmp\badtell.txt', "rb") as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line: break
        # In binary mode the Windows line ending '\r\n' is kept, so strip both characters
        line = line.decode("gbk").rstrip('\r\n')
        print(line)
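For the "file position ranges" part of the question, the same binary-mode idea can be wrapped in a small generator. This is only a sketch under my own assumptions: the name iter_line_spans and the (start, end, text) tuple layout are made up for illustration, and the offsets are byte positions in the file. Splitting on b'\n' before decoding is safe for GBK because its trail bytes never take the value 0x0A, but it would not be safe for an encoding such as UTF-16.

```python
import os
import tempfile


def iter_line_spans(path, encoding="gbk"):
    """Yield (start, end, text) per line; start and end are byte offsets."""
    with open(path, "rb") as f:
        while True:
            start = f.tell()       # binary-mode tell() is a plain byte offset
            raw = f.readline()
            if not raw:
                break
            text = raw.decode(encoding).rstrip("\r\n")
            yield start, f.tell(), text


# Demo with the same bytes as the hexdump in the question.
sample = b"a k\r\n\xd2\xbb\xb0\xe3"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

spans = list(iter_line_spans(tmp.name))
os.remove(tmp.name)

for start, end, text in spans:
    # ascii() keeps the output printable on any console encoding.
    print(start, end, ascii(text))
```

Each range can later be read back with f.seek(start) followed by f.read(end - start) on a file opened in "rb" mode, and then decoded.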

I submitted an issue yesterday: http://bugs.python.org/issue26990

There has been no response yet.