Do failures seeking backwards in a gzip.GzipFile mean it's broken?

89 Views Asked by Michel de Ruiter At 05 June 2023 at 15:44

I have files with a small header (8 bytes, say zrxxxxxx), followed by a gzipped stream of data. Reading such files works fine most of the time. However, in very specific cases, seeking backwards fails. Here is a simple way to reproduce the failure:

from gzip import GzipFile

f = open('test.bin', 'rb')
f.read(8)  # Read zrxxxxxx

h = GzipFile(fileobj=f, mode='rb')
h.seek(8192)
h.seek(8191)  # gzip.BadGzipFile: Not a gzipped file (b'zr')

Unfortunately I cannot share my file, but it looks like any similar file will do.
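Since the original file cannot be shared, here is a minimal sketch of how a similar file could be generated. The name make_test_file, the payload size, and the random payload are assumptions made for illustration; only the 8-byte zrxxxxxx header followed by a gzip stream comes from the description above. Whether the error then reproduces may depend on the Python version.

```python
import gzip
import os

HEADER = b"zrxxxxxx"  # 8-byte placeholder header, as described above


def make_test_file(path, payload_size=16384):
    """Write HEADER followed by a gzip stream of payload_size random bytes.

    Hypothetical helper: returns the uncompressed payload so that later
    reads can be checked against it.
    """
    payload = os.urandom(payload_size)
    with open(path, "wb") as out:
        out.write(HEADER)
        out.write(gzip.compress(payload))
    return payload
```

Calling make_test_file('test.bin') before running the snippet above should give a file of the same shape, with more than 8192 bytes of uncompressed data so that both seeks land inside the stream.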

Debugging the situation, I noticed that DecompressReader.seek (in Lib/_compression.py) sometimes rewinds the original file, which I suspect causes the issue:

#...
# Rewind the file to the beginning of the data stream.
def _rewind(self):
    self._fp.seek(0)
    #...

def seek(self, offset, whence=io.SEEK_SET):
    #...
    # Make it so that offset is the number of bytes to skip forward.
    if offset < self._pos:
        self._rewind()
    else:
        offset -= self._pos
    #...

Is this a bug? Or is it me doing it wrong?

Any simple workaround?

Mark Adler On 05 June 2023 at 20:38 BEST ANSWER

Looks like a bug in Python. When you ask it to seek backwards, it has to go all the way back to the start of the gzip stream and decompress from there again. However, the library does not take note of the position the file object was at when it was handed over, so instead of rewinding to the start of the gzip stream, it rewinds to the start of the file. It then tries to parse your 8-byte header as a gzip header, which is why the error message shows b'zr'.

As for a workaround, you would need to give GzipFile a custom file object with a replaced seek() operation, such that seek(0) goes to the right place. This seemed to work:

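from gzip import GzipFile

f = open('test.bin', 'rb')
f.read(8)  # Read zrxxxxxx

class Shift:
    """Present f as if it started at its current position."""
    def __init__(self, f):
        self.f = f
        self.to = f.tell()  # Offset of the start of the gzip stream
    def seek(self, offset):
        # Only absolute offsets are handled: seek(0) maps to the stream start.
        return self.f.seek(self.to + offset)
    def read(self, size=-1):
        return self.f.read(size)

s = Shift(f)
h = GzipFile(fileobj=s, mode='rb')
h.seek(8192)
h.seek(8191)  # Rewinds to the start of the gzip stream, not of the file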

(I don't really know Python, so I'm sure there's a better way. I tried to subclass file so that I would only need to intercept seek(), but somehow file is not actually a class.)
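As one possible way to tidy this up, here is a sketch of the same idea written as an io.RawIOBase subclass, so that seek() accepts a whence argument and tell() is shifted as well. The names OffsetFile and demo are hypothetical, and the demo builds its own throwaway file rather than using the asker's data; treat it as an illustration under those assumptions, not as an official fix.

```python
import gzip
import io
import os
import tempfile


class OffsetFile(io.RawIOBase):
    """Hypothetical wrapper: expose raw as if it began at its current position."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw
        self._base = raw.tell()  # start of the embedded gzip stream

    def readable(self):
        return True

    def seekable(self):
        return True

    def readinto(self, buffer):
        # RawIOBase.read() is implemented on top of readinto().
        return self._raw.readinto(buffer)

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            offset += self._base  # translate to an absolute file position
        return self._raw.seek(offset, whence) - self._base

    def tell(self):
        return self._raw.tell() - self._base


def demo():
    """Seek forwards, then backwards, in a gzip stream that follows a header."""
    payload = bytes(range(256)) * 64  # 16384 bytes of known data
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.bin")
        with open(path, "wb") as out:
            out.write(b"zrxxxxxx")  # 8-byte header, as in the question
            out.write(gzip.compress(payload))
        with open(path, "rb") as f:
            f.read(8)  # skip the header
            with gzip.GzipFile(fileobj=OffsetFile(f), mode="rb") as h:
                h.seek(8192)
                h.seek(8191)  # backwards seek forces a rewind
                return h.read(1) == payload[8191:8192]
```

If the compressed data fits comfortably in memory, a simpler route is to skip the wrapper and pass GzipFile(fileobj=io.BytesIO(f.read()), mode='rb') after reading the header, since the BytesIO object then starts at the beginning of the gzip stream.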
