I'm writing a Python library that makes ZIP files in a streaming way. If the uncompressed or compressed data of a member of the ZIP is 4GiB or bigger, then it has to use a particular extension to the original ZIP format: zip64. The issue with always using zip64 is that it has less widespread support. So, I would like to use zip64 only when needed. But whether a member is zip64 has to be specified in the ZIP before its compressed data, and so, when streaming, before the size of the compressed data is known.
In some cases, however, the size of the uncompressed data is known. So, I would like to predict the maximum size that zlib can output based on this uncompressed size, and if this is 4GiB or bigger, use zip64 mode.
In other words, if the total length of chunks in the code below is known, what is the maximum total length of bytes that get_compressed can yield? (I assume this maximum would depend on level, memLevel and wbits.)
import zlib

chunks = (
    b'any',
    b'iterable',
    b'of',
    b'bytes',
    b'-' * 1000000,
)

def get_compressed(level=9, memLevel=9, wbits=-zlib.MAX_WBITS):
    compress_obj = zlib.compressobj(level=level, memLevel=memLevel, wbits=wbits)

    for chunk in chunks:
        if compressed := compress_obj.compress(chunk):
            yield compressed

    if compressed := compress_obj.flush():
        yield compressed

print('length', len(b''.join(get_compressed())))
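Conceptually, the decision I'm after looks something like the sketch below, where max_compressed_size is exactly the hypothetical function I don't know how to write:

def max_compressed_size(uncompressed_size):
    # The hypothetical function this question is asking for: an upper bound
    # on the number of bytes zlib can output for input of this length,
    # presumably dependent on level, memLevel and wbits.
    raise NotImplementedError

def should_use_zip64(uncompressed_size):
    # Decide zip64 before any compressed data is written, using only the
    # known uncompressed size ("4GiB or bigger" as described above).
    return (
        uncompressed_size >= 2**32
        or max_compressed_size(uncompressed_size) >= 2**32
    )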
This is complicated by the fact that the Python zlib module's behaviour is not consistent between Python versions.
I think that Java attempts a sort of "auto zip64 mode" without knowing the uncompressed data size, but libarchive has problems with it.
You could estimate it by compressing some random data. Compressed sizes for 1000 chunks of 1000 bytes each, with varying arguments:
And with 2000 chunks of 2000 bytes each:
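A sketch of that kind of measurement follows; os.urandom and the particular level/memLevel combinations here are illustrative choices, not necessarily the exact runs behind the figures above:

import os
import zlib

def compressed_size(data_chunks, level, memLevel, wbits=-zlib.MAX_WBITS):
    # Feed the chunks through a compressobj and total the output length,
    # mirroring get_compressed in the question.
    compress_obj = zlib.compressobj(level=level, memLevel=memLevel, wbits=wbits)
    total = 0
    for chunk in data_chunks:
        total += len(compress_obj.compress(chunk))
    total += len(compress_obj.flush())
    return total

# Random, so effectively incompressible, data: 1000 chunks of 1000 bytes each
chunks = [os.urandom(1000) for _ in range(1000)]
uncompressed = sum(len(chunk) for chunk in chunks)

for level in (1, 6, 9):
    for memLevel in (8, 9):
        size = compressed_size(chunks, level=level, memLevel=memLevel)
        overhead = 100 * (size - uncompressed) / uncompressed
        print(f'level={level} memLevel={memLevel}: {size} bytes ({overhead:.4f}% overhead)')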
So it looks like if you only change level, it's about 0.015% overhead.
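To turn an overhead figure like that into a conservative zip64 decision, one option is a worst-case estimate modelled on the formula recent zlib versions use for compressBound(); whether that bound actually holds for the level/memLevel/wbits combination in use is an assumption to verify, not something guaranteed by the code above:

ZIP64_THRESHOLD = 2**32  # "4GiB or bigger" per the question

def max_deflate_size(uncompressed_size):
    # Worst-case estimate in the spirit of zlib's compressBound():
    # incompressible input costs a few bytes of overhead per block plus a
    # small constant. Assumes default-ish memLevel/wbits behaviour; verify
    # against the zlib build actually in use.
    return (
        uncompressed_size
        + (uncompressed_size >> 12)
        + (uncompressed_size >> 14)
        + (uncompressed_size >> 25)
        + 13
    )

def needs_zip64(uncompressed_size):
    return (
        uncompressed_size >= ZIP64_THRESHOLD
        or max_deflate_size(uncompressed_size) >= ZIP64_THRESHOLD
    )

print(needs_zip64(100 * 1024 * 1024))  # False: 100MiB stays well under 4GiB
print(needs_zip64(2**32 - 1024))       # True: the bound pushes it past 4GiB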