I'm writing a Python library that makes ZIP files in a streaming way. If the uncompressed or compressed data of a member of the ZIP is 4GiB or bigger, then it has to use a particular extension to the original ZIP format: zip64. The issue with always using zip64 is that it has less widespread support. So, I would like to use zip64 only when needed. But whether a member is zip64 has to be specified in the ZIP before its compressed data, and so, when streaming, before the size of the compressed data is known.
In some cases, however, the size of the uncompressed data is known. So, I would like to predict the maximum size that zlib can output based on this uncompressed size, and if this is 4GiB or bigger, use zip64 mode.
In other words, if the total length of chunks in the code below is known, what is the maximum total length of bytes that get_compressed can yield? (I assume this maximum would depend on level, memLevel and wbits.)
import zlib

chunks = (
    b'any',
    b'iterable',
    b'of',
    b'bytes',
    b'-' * 1000000,
)

def get_compressed(level=9, memLevel=9, wbits=-zlib.MAX_WBITS):
    compress_obj = zlib.compressobj(level=level, memLevel=memLevel, wbits=wbits)

    for chunk in chunks:
        if compressed := compress_obj.compress(chunk):
            yield compressed

    if compressed := compress_obj.flush():
        yield compressed

print('length', len(b''.join(get_compressed())))
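Conceptually, the decision I'm after looks something like the sketch below, where max_compressed_size is exactly the hypothetical function I don't know how to write:

def max_compressed_size(uncompressed_size):
    # The hypothetical function this question is asking for: an upper bound
    # on the number of bytes zlib can output for input of this length,
    # presumably dependent on level, memLevel and wbits.
    raise NotImplementedError

def should_use_zip64(uncompressed_size):
    # Decide zip64 before any compressed data is written, using only the
    # known uncompressed size ("4GiB or bigger" as described above).
    return (
        uncompressed_size >= 2**32
        or max_compressed_size(uncompressed_size) >= 2**32
    )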
This is complicated by the fact that the Python zlib module's behaviour is not consistent between Python versions.
I think that Java attempts a sort of "auto zip64 mode" without knowing the uncompressed data size, but libarchive has problems with it.
You could estimate it by compressing some random data. Compressed sizes for 1000 chunks of 1000 bytes each, with varying arguments:
And with 2000 chunks of 2000 bytes each:
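A sketch of that kind of measurement follows; os.urandom and the particular level/memLevel combinations here are illustrative choices, not necessarily the exact runs behind the figures above:

import os
import zlib

def compressed_size(data_chunks, level, memLevel, wbits=-zlib.MAX_WBITS):
    # Feed the chunks through a compressobj and total the output length,
    # mirroring get_compressed in the question.
    compress_obj = zlib.compressobj(level=level, memLevel=memLevel, wbits=wbits)
    total = 0
    for chunk in data_chunks:
        total += len(compress_obj.compress(chunk))
    total += len(compress_obj.flush())
    return total

# Random, so effectively incompressible, data: 1000 chunks of 1000 bytes each
chunks = [os.urandom(1000) for _ in range(1000)]
uncompressed = sum(len(chunk) for chunk in chunks)

for level in (1, 6, 9):
    for memLevel in (8, 9):
        size = compressed_size(chunks, level=level, memLevel=memLevel)
        overhead = 100 * (size - uncompressed) / uncompressed
        print(f'level={level} memLevel={memLevel}: {size} bytes ({overhead:.4f}% overhead)')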
So it looks like if you only change level, it's about 0.015% overhead.
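To turn an overhead figure like that into a conservative zip64 decision, one option is a worst-case estimate modelled on the formula recent zlib versions use for compressBound(); whether that bound actually holds for the level/memLevel/wbits combination in use is an assumption to verify, not something guaranteed by the code above:

ZIP64_THRESHOLD = 2**32  # "4GiB or bigger" per the question

def max_deflate_size(uncompressed_size):
    # Worst-case estimate in the spirit of zlib's compressBound():
    # incompressible input costs a few bytes of overhead per block plus a
    # small constant. Assumes default-ish memLevel/wbits behaviour; verify
    # against the zlib build actually in use.
    return (
        uncompressed_size
        + (uncompressed_size >> 12)
        + (uncompressed_size >> 14)
        + (uncompressed_size >> 25)
        + 13
    )

def needs_zip64(uncompressed_size):
    return (
        uncompressed_size >= ZIP64_THRESHOLD
        or max_deflate_size(uncompressed_size) >= ZIP64_THRESHOLD
    )

print(needs_zip64(100 * 1024 * 1024))  # False: 100MiB stays well under 4GiB
print(needs_zip64(2**32 - 1024))       # True: the bound pushes it past 4GiB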