Python 3.8 lzma decompress huge file incremental input and output


I am looking to do, in Python 3.8, the equivalent of:

xz --decompress --stdout < hugefile.xz > hugefile.out

where neither the input nor output might fit well in memory.

As I read the documentation at https://docs.python.org/3/library/lzma.html#lzma.LZMADecompressor, I could use LZMADecompressor to process incrementally available input, and I could use its decompress() function to produce output incrementally.

However, it seems that LZMADecompressor returns its entire decompressed output as a single memory buffer, and that decompress() reads its entire compressed input from a single in-memory buffer.

Granted, the documentation confuses me as to when the input and/or output can be incremental.

So I figure I will have to spawn a separate child process to execute the "xz" binary.

Is there any way of using the lzma Python module for this task?
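For reference, decompress() does accept a max_length argument that caps the size of each returned chunk, and the decompressor's needs_input attribute signals when more compressed input should be fed. A minimal sketch of a fully incremental loop (the function name and chunk sizes here are illustrative, not from the question):

```python
import lzma

def decompress_stream(src, dst, read_size=64 * 1024, max_out=64 * 1024):
    """Decompress an .xz stream from src to dst with bounded memory.

    decompress() is called with max_length so each call returns at most
    max_out bytes; when more output is pending from input already fed,
    needs_input is False and the next call can pass b"" to drain it.
    """
    d = lzma.LZMADecompressor()
    while not d.eof:
        if d.needs_input:
            data = src.read(read_size)
            if not data:
                raise EOFError("truncated .xz stream")
        else:
            data = b""  # drain pending output from earlier input
        dst.write(d.decompress(data, max_length=max_out))
```

Called with src = open("hugefile.xz", "rb") and dst = open("hugefile.out", "wb"), this should keep only on the order of read_size + max_out bytes in memory at a time.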

There is 1 answer below.


Instead of using the low-level LZMADecompressor, use lzma.open to get a file object. Then you can copy the data into another file object with the shutil module:

import lzma
import shutil

with lzma.open("hugefile.xz", "rb") as fsrc:
    with open("hugefile.out", "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst)

Internally, shutil.copyfileobj reads and writes the data in chunks, and the LZMA decompression happens on the fly. This avoids ever holding the whole decompressed data in memory.
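copyfileobj also takes an optional buffer-size argument if the default chunk size (64 KiB on most platforms in CPython) needs tuning for a very large file. A variant of the snippet above with an explicit 1 MiB chunk size (the demo file creation at the top is only there to make the sketch self-contained; in the real use case hugefile.xz already exists):

```python
import lzma
import shutil

# Create a small demo .xz file so the snippet runs standalone;
# skip this step when hugefile.xz already exists.
with lzma.open("hugefile.xz", "wb") as f:
    f.write(b"example data " * 1000)

# Same streaming copy, but passing copyfileobj's optional
# buffer-size argument to read/write in 1 MiB chunks.
with lzma.open("hugefile.xz", "rb") as fsrc, \
        open("hugefile.out", "wb") as fdst:
    shutil.copyfileobj(fsrc, fdst, 1024 * 1024)
```

Larger chunks can reduce syscall overhead on fast disks; memory use stays bounded by the chunk size either way.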