Determining GZIPOutputStream behavior


The following code produces deterministic output: writing the same string twice yields files with the same shasum.

    try(
            FileOutputStream fos = new FileOutputStream(saveLocation);
            GZIPOutputStream zip = new GZIPOutputStream(fos, GZIP_BUFFER_SIZE);
            BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(zip, StandardCharsets.UTF_8));
            ){
        writer.append(str);
    }

Produces:

a.gz f0200d53f7f9b35647b5dece0146d72cd1c17949

However, if I unzip the file on the command line and re-zip it, I get a different result:

> gunzip -n a.gz ;gzip -n a ; shasum a.gz 

50f478a9ceb292a2d14f1460d7c584b7a856e4d9  a.gz

How can I get it to match the original SHA using /usr/bin/gzip and gunzip?

Stephen C On

I think that the problem is likely to be the Gzip file header.

  • The Gzip format has provision for including a file name and file timestamp in the file headers. (I see you are using the -n when uncompressing and recompressing ... which is probably correct here.)

  • The Gzip format also includes an "operating system id" in the header. This is supposed to identify the source file system type; e.g. 0 for FAT, 3 for UNIX, and so on.

Either of these could lead to differences in the Gzip files and hence different hashes.

If I were going to solve this myself, I would start by using cmp to see where the two compressed files first differ, and then od to see what the differing bytes are. Then refer to the Gzip file format spec to figure out what the differences mean:

  • RFC 1952 - GZIP file format specification version 4.3
  • Wikipedia's gzip page.
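As a rough illustration of what to look for, here is a minimal sketch that decodes the fixed 10-byte header described in RFC 1952. The class and method names (GzipHeaderDump, describe) are made up for this example, and it deliberately ignores the optional fields (file name, comment, extra data) that may follow the fixed part.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: decodes the fixed 10-byte gzip header (RFC 1952, section 2.3).
class GzipHeaderDump {

    static String describe(byte[] gz) {
        // Bytes 0-1 are the magic number 0x1f 0x8b.
        if (gz.length < 10 || (gz[0] & 0xff) != 0x1f || (gz[1] & 0xff) != 0x8b) {
            throw new IllegalArgumentException("not a gzip stream");
        }
        int cm = gz[2] & 0xff;    // compression method (8 = deflate)
        int flg = gz[3] & 0xff;   // flags (FNAME, FCOMMENT, FHCRC, ...)
        // MTIME is a 32-bit little-endian Unix timestamp; 0 means "not available".
        long mtime = (gz[4] & 0xffL)
                | (gz[5] & 0xffL) << 8
                | (gz[6] & 0xffL) << 16
                | (gz[7] & 0xffL) << 24;
        int xfl = gz[8] & 0xff;   // extra flags (compression-level hint)
        int os = gz[9] & 0xff;    // operating system id (0 = FAT, 3 = Unix, 255 = unknown)
        return String.format("CM=%d FLG=0x%02x MTIME=%d XFL=%d OS=%d", cm, flg, mtime, xfl, os);
    }

    public static void main(String[] args) throws IOException {
        // Usage: pass the two .gz files to compare as arguments.
        for (String name : args) {
            System.out.println(name + ": " + describe(Files.readAllBytes(Paths.get(name))));
        }
    }
}
```

Running it on the Java-produced file and on the file produced by gzip -n should show whether the headers differ, and in which field. If the headers turn out to be identical, the difference is in the compressed data itself.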

How can I get it to match the original SHA using gzip and gunzip ?

Assuming that the difference is the OS id, I don't think there is a practical way to solve this with the gzip and gunzip commands.


I looked at the source code for GZIPOutputStream in Java 11, and it is not promising.

  • It is hard-wiring the timestamp to zero.
  • It is hard-wiring the OS identifier to zero (which is supposed to mean FAT).

The hard-wiring is in a private method and would be next to impossible to "fix" by subclassing or reflection. You could copy the code and fix it that way, but then you have to maintain your variant GZIPOutputStream class indefinitely.
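If the OS id really is the only difference, a lighter-weight alternative to copying the class might be to compress into memory and overwrite that one header byte afterwards. The sketch below assumes exactly that. The names GzipOsPatch and gzipWithUnixOsId are hypothetical. It relies on the fact that GZIPOutputStream does not set the FHCRC flag, so the header is not covered by any checksum (the CRC-32 in the trailer covers only the uncompressed data). It does nothing about differences in the compressed data, which can vary between deflate implementations.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip a string, then patch the OS id in the header.
class GzipOsPatch {
    static final int OS_OFFSET = 9;  // index of the OS byte in the fixed header (RFC 1952)
    static final byte OS_UNIX = 3;   // value RFC 1952 assigns to Unix

    static byte[] gzipWithUnixOsId(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream zip = new GZIPOutputStream(buffer)) {
            zip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        byte[] gz = buffer.toByteArray();
        // Assumption: only the OS byte needs to change; everything else is left as written.
        gz[OS_OFFSET] = OS_UNIX;
        return gz;
    }
}
```

The resulting array can then be written to saveLocation with Files.write. Whether its hash matches the output of /usr/bin/gzip still has to be checked against your actual data.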

(I would be looking at changing the application ... or whatever ... so that I didn't need the checksums to be identical. You haven't said why you are doing this. If it is for testing purposes only, try looking for a different way to implement the tests.)
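For example, a test could compare the decompressed contents rather than the hashes of the compressed files. This is only a sketch of that idea; GzipContentCompare, gunzip and sameContent are names invented here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;

// Hypothetical test helper: two gzip files count as equal if they decompress to the same bytes.
class GzipContentCompare {

    static byte[] gunzip(byte[] gz) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gz));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            // Copy the decompressed stream into memory chunk by chunk.
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        }
    }

    static boolean sameContent(byte[] gzA, byte[] gzB) throws IOException {
        // Header fields such as the timestamp and OS id play no part in this comparison.
        return Arrays.equals(gunzip(gzA), gunzip(gzB));
    }
}
```

With this approach it no longer matters which tool or JDK version produced each file, as long as both are valid gzip streams.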