Calling tell() while reading a GBK-encoded file of mine causes the next call to readline() to raise a UnicodeDecodeError. However, if I don't call tell(), it doesn't raise this error.
C:\tmp>hexdump badtell.txt
000000: 61 20 6B 0D 0A D2 BB B0-E3 a k......
C:\tmp>type test.py
with open(r'c:\tmp\badtell.txt', "r", encoding='gbk') as f:
    while True:
        pos = f.tell()
        line = f.readline();
        if not line: break
        print(line)
C:\tmp>python test.py
a k
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    line = f.readline();
UnicodeDecodeError: 'gbk' codec can't decode byte 0xd2 in position 0: incomplete multibyte sequence
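The message in the traceback can be reproduced without the file at all. The following is a minimal sketch that uses only the bytes shown in the hexdump and the standard gbk codec; the variable names are illustrative. It shows that the lead byte D2 is incomplete on its own, that D2 BB decodes cleanly, and that an incremental decoder simply waits for the second byte.

```python
import codecs

lead = b"\xd2"        # first byte of the second line in the hexdump
pair = b"\xd2\xbb"    # the same byte followed by the next one

# Decoding the lead byte alone fails, as in the traceback.
try:
    lead.decode("gbk")
except UnicodeDecodeError as exc:
    lead_error = exc.reason
    print("lead byte alone:", lead_error)

# Decoding both bytes together succeeds.
decoded_pair = pair.decode("gbk")
print("both bytes:", ascii(decoded_pair))

# An incremental decoder buffers the lead byte and emits the
# character only once the trailing byte has arrived.
decoder = codecs.getincrementaldecoder("gbk")()
first = decoder.decode(b"\xd2")
second = decoder.decode(b"\xbb")
print("incremental:", ascii(first), ascii(second))
```

So the bytes in the file are valid GBK; the error only appears when the decoder is given D2 as if it were the end of the input.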
When I remove the f.tell() statement, the file decodes successfully. Why?
I tried Python 3.4 and 3.5 (x64) on Windows 7 and Windows 10, and the result is the same.
Does anyone have any idea? Should I report a bug?
I have a big text file, and I really want to get file position ranges within it. Is there a workaround?
I just replicated this on Python 3.4 x64 on Linux. Looking at the docs for TextIOBase, I don't see anything that says tell() causes problems with reading a file, so maybe it is indeed a bug.
Decoding the byte D2 on its own as GBK gives an error like the one that you saw, but in your file that byte is followed by the byte BB, and decoding D2 BB together gives a value equal to '\u4e00', not an error.
I found a workaround that works for the data in your original question, but not for other data, as you've since found. Wish I knew why! I called seek() after every tell(), with the value that tell() returned, that is, f.seek(f.tell()).
An alternative to f.seek(f.tell()) is to use the SEEK_CUR mode of seek() to give the position, that is, f.seek(0, io.SEEK_CUR). With an offset of 0, this does the same as f.seek(f.tell()): it moves to the current position and returns that position.