How do you git archive a branch using gzip with highest compression level?


I'm trying to git archive a branch using gzip with the highest level compression (9) but it seems to not compress at that level. Here is my command:

git -C /home/user/example.com/ archive --format tar -o /home/user/site_backups/develop-`date +%Y-%m-%dT%H%M`.tar develop | gzip -9

It creates the tar file but the size is over 100MB compared to a zip that was compressed at 86MB using this command:

git -C /home/user/example.com/ archive --format zip -o /home/user/site_backups/develop-`date +%Y-%m-%dT%H%M`.zip develop

Can the output file be compressed more? What am I doing wrong?


There is 1 solution below


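Nowadays (4 years later), you can let git archive do the gzip compression itself. In the original command, -o writes the uncompressed tar straight to the output file, so nothing reaches gzip -9 through the pipe. The command would be:

git -C /home/user/example.com/ archive --format tgz -9 -o /home/user/site_backups/develop-`date +%Y-%m-%dT%H%M`.tar.gz develop

gzip tops out at -9; the higher levels described below only matter for filters such as zstd.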
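As a minimal sketch of the two working ways to get a level-9 gzip archive, the script below builds a throwaway repository and archives it once through a pipe and once with the tgz format. The temporary directory, file names and commit identity are made up for the demo; only the git archive and gzip options come from the question and answer.

```shell
#!/bin/sh
# Sketch: two ways to produce a gzip -9 archive of a commit.
# The throwaway repository and all file names are hypothetical.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
repo="$work/repo"

# Build a tiny repository so the example is self-contained.
git init -q "$repo"
printf 'hello hello hello hello\n' >"$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m initial

# Variant 1: no -o, so the tar stream goes to stdout and into gzip.
git -C "$repo" archive --format=tar HEAD | gzip -9 >"$work/piped.tar.gz"

# Variant 2: let git archive compress and pass the level through.
git -C "$repo" archive --format=tgz -9 -o "$work/direct.tar.gz" HEAD

# Both files should pass gzip's integrity check and list the same entries.
gzip -t "$work/piped.tar.gz"
gzip -t "$work/direct.tar.gz"
piped_list=$(gzip -dc "$work/piped.tar.gz" | tar -tf -)
direct_list=$(gzip -dc "$work/direct.tar.gz" | tar -tf -)
echo "piped:  $piped_list"
echo "direct: $direct_list"
```

The key difference from the command in the question is that variant 1 drops -o, so gzip actually receives the tar data.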

With Git 2.30 (Q1 2021), "git archive" now allows compression level higher than "-9" when generating tar.gz output.

See commit cde8ea9 (09 Nov 2020) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit ede4d63, 18 Nov 2020)

archive: support compression levels beyond 9

Signed-off-by: René Scharfe

Compression programs like zip, gzip, bzip2 and xz allow to adjust the trade-off between CPU cost and size gain with numerical options from -1 for fast compression and -9 for high compression ratio.
zip also accepts -0 for storing files verbatim.
git archive directly supports these single-digit compression levels for ZIP output and passes them to filters like gzip.

Zstandard additionally supports compression level options -10 to -19, or up to -22 with --ultra.
This seems to work with git archive in most cases, e.g. it will produce an archive with -19 without complaining, but since it only supports single-digit compression level options this is the same as -1 -9 and thus -9.

Allow git archive to accept multi-digit compression levels to support the full range supported by zstd.
Explicitly reject them for the ZIP format, as otherwise deflateInit2() would just fail with a somewhat cryptic "stream consistency error".
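To see what the commit describes without needing zstd installed, here is a rough sketch that assumes Git 2.30 or later. It registers a made-up tar filter format called "demo" whose command only records the option it receives and then copies the tar stream unchanged. The format name, the LEVEL_FILE variable and the recording filter are all hypothetical stand-ins for a real compressor such as zstd.

```shell
#!/bin/sh
# Sketch, assuming Git >= 2.30: multi-digit levels reach tar filters, ZIP refuses them.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
repo="$work/repo"

git init -q "$repo"
printf 'data\n' >"$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m initial

# Hypothetical stand-in filter: write its first argument to LEVEL_FILE,
# then pass stdin through untouched (a real setup would name zstd here).
LEVEL_FILE="$work/level.txt"
export LEVEL_FILE
filter='sh -c '\''printf "%s\n" "$1" >"$LEVEL_FILE"; cat'\'' sh'

# git appends the level option to the configured filter command.
git -C "$repo" -c tar.demo.command="$filter" archive --format=demo -19 HEAD >"$work/out.demo"
seen_level=$(cat "$LEVEL_FILE")
echo "filter received: $seen_level"

# The same level with the ZIP format should be refused up front.
if git -C "$repo" archive --format=zip -19 HEAD >/dev/null 2>&1; then
    zip_result=accepted
else
    zip_result=rejected
fi
echo "zip with -19: $zip_result"
```

On a Git older than 2.30 the filter would be expected to see -9 instead, since -19 was then parsed as -1 -9.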


Note that, with Git 2.38 (Q3 2022), "git archive" now (optionally and then by default) avoids spawning an external "gzip" process when creating ".tar.gz" (and ".tgz") archives.

See commit 4f4be00, commit 23fcf8b, commit 76d7602, commit dfce118, commit 96b9e51, commit 650134a (15 Jun 2022) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit b5a2d6c, 11 Jul 2022)

archive-tar: add internal gzip implementation

Original-patch-by: Rohit Ashiwal
Signed-off-by: René Scharfe

Git uses zlib for its own object store, but calls gzip when creating tgz archives.

Add an option to perform the gzip compression for the latter using zlib, without depending on the external gzip binary.

Plug it in by making write_block a function pointer and switching to a compressing variant if the filter command has the magic value "git archive gzip".
Does that indirection slow down tar creation? Not really, at least not in this test:

$ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
'./git -C ../linux archive --format=tar HEAD # {rev}'

Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
  Time (mean ± σ): 4.044 s ± 0.007 s [User: 3.901 s, System: 0.137 s]
  Range (min … max): 4.038 s … 4.059 s 10 runs

Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
  Time (mean ± σ): 4.047 s ± 0.009 s [User: 3.903 s, System: 0.138 s]
  Range (min … max): 4.038 s … 4.066 s 10 runs


How does tgz creation perform?  

$ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
'./git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'

Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
  Time (mean ± σ): 20.404 s ± 0.006 s [User: 23.943 s, System: 0.401 s]
  Range (min … max): 20.395 s … 20.414 s 10 runs

Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
  Time (mean ± σ): 23.807 s ± 0.023 s [User: 23.655 s, System: 0.145 s]
  Range (min … max): 23.782 s … 23.857 s 10 runs


Summary
'./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
  1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'

So the internal implementation takes 17% longer on the Linux repo, but uses 2% less CPU time.  
That's because the external gzip can run in parallel on its own processor, while the internal one works sequentially and avoids the inter-process communication overhead.  
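The 17% and 2% figures can be rechecked from the hyperfine numbers quoted above. This is only a sketch of the arithmetic, assuming "CPU time" means user plus system time; the variable names are made up.

```shell
#!/bin/sh
# Recompute the percentages from the benchmark figures quoted above.
report=$(awk 'BEGIN {
    external_wall = 20.404          # mean wall-clock seconds, gzip -cn
    internal_wall = 23.807          # mean wall-clock seconds, internal gzip
    external_cpu = 23.943 + 0.401   # user + system, gzip -cn
    internal_cpu = 23.655 + 0.145   # user + system, internal gzip
    printf "wall clock: %.1f%% longer\n", (internal_wall / external_wall - 1) * 100
    printf "CPU time:   %.1f%% less\n", (1 - internal_cpu / external_cpu) * 100
}')
echo "$report"
# wall clock: 16.7% longer
# CPU time:   2.2% less
```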

What are the benefits?  

Only an internal sequential implementation can offer this eco mode, and it allows avoiding the gzip(1) requirement.

This implementation uses the helper functions from our zlib.c instead of the convenient gz* functions from zlib, because the latter doesn't give the control over the generated gzip header that the next patch requires.

And:

archive-tar: use internal gzip by default

Signed-off-by: René Scharfe

Drop the dependency on gzip(1) and use our internal implementation to create tar.gz and tgz files.

git archive now includes in its man page:

magic command git archive gzip by default, which invokes an internal implementation of gzip.

So a git archive using an external gzip would be:

git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD >external_gzip.tgz

While the new default one would use the internal zlib:

git archive --format=tgz HEAD >j.tgz

In both cases, the compression level options mentioned above still apply, for example -9 for the smallest gzip output.
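A quick way to convince yourself that the two variants are interchangeable is to decompress both and compare the tar streams. The sketch below does that on a throwaway repository; the repository, file names and commit identity are invented for the demo, and on a Git older than 2.38 both commands would simply end up using the external gzip.

```shell
#!/bin/sh
# Sketch: external gzip vs. the default tgz filter, both at level 9.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
repo="$work/repo"

git init -q "$repo"
printf 'some content to compress\n' >"$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m initial

# Force the external gzip binary.
git -C "$repo" -c tar.tgz.command="gzip -cn" archive --format=tgz -9 HEAD >"$work/external_gzip.tgz"

# Use whatever the installed Git does by default (internal gzip since 2.38).
git -C "$repo" archive --format=tgz -9 HEAD >"$work/default_gzip.tgz"

# The compressed bytes may differ, but the tar payload should not.
gzip -dc "$work/external_gzip.tgz" >"$work/external.tar"
gzip -dc "$work/default_gzip.tgz" >"$work/default.tar"
if cmp -s "$work/external.tar" "$work/default.tar"; then
    same_tar=yes
else
    same_tar=no
fi
echo "identical tar payload: $same_tar"
```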