Why is reading a compressed TAR file in reverse order 100x slower?

First, let's generate a compressed tar archive:

from io import BytesIO
from tarfile import TarInfo
import tarfile

with tarfile.open('foo.tgz', mode='w:gz') as archive:
    for file_name in range(1000):
        file_info = TarInfo(str(file_name))
        file_info.size = 100_000
        archive.addfile(file_info, fileobj=BytesIO(b'a' * 100_000))

Now, if I read the archive contents in natural order:

import tarfile

with tarfile.open('foo.tgz') as archive:
    for file_name in archive.getnames():
        archive.extractfile(file_name).read()

and measure the execution time using the time command, I get less than 1 second on my PC:

real    0m0.591s
user    0m0.560s
sys     0m0.011s

But if I read the archive contents in reverse order:

import tarfile

with tarfile.open('foo.tgz') as archive:
    for file_name in reversed(archive.getnames()):
        archive.extractfile(file_name).read()

the execution time is now around 120 seconds:

real    2m3.050s
user    2m0.910s
sys     0m0.059s

Why is that? Is there a bug in my code? Or is it a feature of the tar format? Is it documented somewhere?

tripleee On BEST ANSWER

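A tar file is strictly sequential, and the gzip stream wrapped around it cannot seek backward: to reach an earlier offset, the reader has to rewind to the start and decompress everything up to that point again. Reading in reverse, you end up decompressing the beginning of the file about 1000 times, the second member 999 times, and so on.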
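One way around this, sketched here under the assumption that the uncompressed archive fits in memory, is to decompress the gzip layer once and open the plain tar data from a seekable buffer. The helper names `build_archive` and `read_reversed` and the small member sizes are made up for illustration; they are not from the question.

```python
import gzip
import io
import tarfile


def build_archive(member_count=20, member_size=1_000):
    """Build a small gzip-compressed tar archive in memory (illustrative sizes)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as archive:
        for index in range(member_count):
            info = tarfile.TarInfo(str(index))
            info.size = member_size
            archive.addfile(info, fileobj=io.BytesIO(b'a' * member_size))
    return buffer.getvalue()


def read_reversed(compressed):
    """Decompress once, then read members in reverse from the plain tar data."""
    # The whole uncompressed archive is held in memory here (an assumption
    # that only holds for archives of modest size).
    plain = io.BytesIO(gzip.decompress(compressed))
    # mode='r:' opens the stream as an uncompressed tar, so seeking backward
    # is just a position change in the in-memory buffer.
    with tarfile.open(fileobj=plain, mode='r:') as archive:
        return [
            (name, archive.extractfile(name).read())
            for name in reversed(archive.getnames())
        ]


members = read_reversed(build_archive())
print(members[0][0], len(members[0][1]))
```

If the archive is too large for memory, reading the members once in natural order and keeping only what you need is the other obvious option.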

Remember, the "tape archive" format was designed at a time when unidirectional tape reels on big spindles were the hardware in use. Having an index would only have wasted space on the tape, as you would literally have to read every byte between where you are and where you want to seek to on the tape anyway.

In contrast, modern archive formats like .zip are designed for use on properly seekable devices, and typically contain an index which lets you quickly move to the position where a specific archive member can be found.
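For comparison, here is a minimal sketch of the same reverse-order read against a zip archive, using Python's standard `zipfile` module. The member count and sizes are arbitrary illustrative values, and the archive is kept in an in-memory buffer only to make the example self-contained.

```python
import io
import zipfile

# Write a small zip archive into an in-memory buffer.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w', compression=zipfile.ZIP_DEFLATED) as archive:
    for index in range(20):
        archive.writestr(str(index), b'a' * 1_000)

# Reopen it and read the members in reverse order. Each member is compressed
# independently and located through the central directory, so reading one
# member does not require decompressing the ones stored before it.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    sizes = [len(archive.read(name)) for name in reversed(names)]

print(len(names), sizes[0])
```

The trade-off is that compressing each member separately usually gives a worse ratio than compressing the whole archive as one stream, which is what `.tgz` does.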