I have files with a small header (8 bytes, say zrxxxxxx), followed by a gzipped stream of data. Reading such files works fine most of the time. However, in very specific cases, seeking backwards fails. Here is a simple way to reproduce it:
from gzip import GzipFile
f = open('test.bin', 'rb')
f.read(8) # Read zrxxxxxx
h = GzipFile(fileobj=f, mode='rb')
h.seek(8192)
h.seek(8191) # gzip.BadGzipFile: Not a gzipped file (b'zr')
Unfortunately I cannot share my file, but it looks like any similar file will do.
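A file of this shape is easy to build for testing. The following is a minimal sketch, assuming an arbitrary 16 KiB payload and the placeholder header zrxxxxxx; the helper names `make_test_file` and `try_backward_seek` and the temporary path are made up for illustration. On CPython versions that behave as described here, the backward seek is expected to raise gzip.BadGzipFile, so the script reports what happens rather than assuming it.

```python
import gzip
import os
import tempfile

HEADER = b"zrxxxxxx"  # placeholder 8-byte header, as in the question
PAYLOAD = bytes(range(256)) * 64  # arbitrary 16 KiB of data (assumption)


def make_test_file(path):
    """Write the 8-byte header followed by a gzip stream (hypothetical helper)."""
    with open(path, "wb") as out:
        out.write(HEADER)
        out.write(gzip.compress(PAYLOAD))


def try_backward_seek(path):
    """Return None if the backward seek works, else the exception raised."""
    with open(path, "rb") as f:
        f.read(len(HEADER))  # skip the custom header
        h = gzip.GzipFile(fileobj=f, mode="rb")
        try:
            h.seek(8192)
            h.seek(8191)  # seeking backwards makes the reader rewind
            h.read(1)
        except OSError as exc:  # gzip.BadGzipFile is a subclass of OSError
            return exc
        finally:
            h.close()
        return None


path = os.path.join(tempfile.mkdtemp(), "test.bin")
make_test_file(path)
outcome = try_backward_seek(path)
print("backward seek:", "ok" if outcome is None else repr(outcome))
```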
Debugging the situation, I noticed that DecompressReader.seek (in Lib/_compression.py) sometimes rewinds the original file, which I suspect causes the issue:
    #...
    # Rewind the file to the beginning of the data stream.
    def _rewind(self):
        self._fp.seek(0)
        #...
    def seek(self, offset, whence=io.SEEK_SET):
        #...
        # Make it so that offset is the number of bytes to skip forward.
        if offset < self._pos:
            self._rewind()
        else:
            offset -= self._pos
        #...
Is this a bug? Or is it me doing it wrong?
Any simple workaround?
Looks like a bug in Python. When you ask it to seek backwards, it has to go all the way back to the start of the gzip stream and start over. However the library did not take note of the offset of the file object it was given, so instead of rewinding to the start of the gzip stream, it is rewinding to the start of the file.
As for a workaround, you would need to give GzipFile a custom file object with a replaced seek() operation, such that seek(0) goes to the right place. This seemed to work:
(I don't really know Python, so I'm sure there's a better way. I tried to subclass file so that I would only need to intercept seek(), but somehow file is not actually a class.)
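One way such a wrapper might look is sketched below. This is not necessarily the code this answer originally used; the class name `OffsetFile` and its `base` parameter are hypothetical, and the sketch assumes that GzipFile only needs read(), seek(), tell() and seekable() from the object it is given when reading. The wrapper remembers where the gzip stream starts and shifts absolute seeks by that amount, so that seek(0) lands just past the header.

```python
import gzip
import io


class OffsetFile:
    """Hypothetical wrapper that presents `raw` as if it started at `base`."""

    def __init__(self, raw, base=None):
        self._raw = raw
        # Default to the current position, i.e. just past the custom header.
        self._base = raw.tell() if base is None else base

    def read(self, size=-1):
        return self._raw.read(size)

    def seek(self, offset, whence=io.SEEK_SET):
        # Only absolute seeks need shifting; relative ones already line up.
        if whence == io.SEEK_SET:
            offset += self._base
        return self._raw.seek(offset, whence) - self._base

    def tell(self):
        return self._raw.tell() - self._base

    def seekable(self):
        return self._raw.seekable()


# Usage: an in-memory stand-in for the file from the question (assumption).
payload = bytes(range(256)) * 64
f = io.BytesIO(b"zrxxxxxx" + gzip.compress(payload))
f.read(8)  # Read zrxxxxxx
h = gzip.GzipFile(fileobj=OffsetFile(f), mode="rb")
h.seek(8192)
h.seek(8191)  # the rewind now goes to the start of the gzip stream
data = h.read(4)
```

With a real file, the same pattern applies: open it, read the 8-byte header, and pass OffsetFile(f) as fileobj instead of f.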