I'm trying to take the hash of a gzipped string in Python and need it to be identical to Java's. But Python's gzip implementation seems to differ from Java's GZIPOutputStream.
Python gzip:
import gzip
import hashlib
gzip_bytes = gzip.compress(bytes('test', 'utf-8'))
gzip_hex = gzip_bytes.hex().upper()
md5 = hashlib.md5(gzip_bytes).hexdigest().upper()
>>>gzip_hex
'1F8B0800678B186002FF2B492D2E01000C7E7FD804000000'
>>>md5
'C4C763E9A0143D36F52306CF4CCC84B8'
Java GZIPOutputStream:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.GZIPOutputStream;

public class HelloWorld {
    private static final char[] HEX_ARRAY = "0123456789ABCDEF".toCharArray();

    // Convert a byte array to an uppercase hex string.
    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for (int j = 0; j < bytes.length; j++) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = HEX_ARRAY[v >>> 4];
            hexChars[j * 2 + 1] = HEX_ARRAY[v & 0x0F];
        }
        return new String(hexChars);
    }

    // Return the MD5 digest of the given bytes as uppercase hex.
    public static String md5(byte[] bytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] thedigest = md.digest(bytes);
            return bytesToHex(thedigest);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 Failed", e);
        }
    }

    public static void main(String[] args) {
        String string = "test";
        // Explicit charset, matching the 'utf-8' used on the Python side.
        final byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
        try {
            final ByteArrayOutputStream bos = new ByteArrayOutputStream();
            final GZIPOutputStream gout = new GZIPOutputStream(bos);
            gout.write(bytes);
            gout.close();
            final byte[] encoded = bos.toByteArray();
            System.out.println("gzip: " + bytesToHex(encoded));
            System.out.println("md5: " + md5(encoded));
        } catch (IOException e) {
            throw new RuntimeException("Failed", e);
        }
    }
}
Prints:
gzip: 1F8B08000000000000002B492D2E01000C7E7FD804000000
md5: 1ED3B12D0249E2565B01B146026C389D
So the two gzip outputs are very similar, but not identical:
1F8B0800678B186002FF2B492D2E01000C7E7FD804000000
1F8B08000000000000002B492D2E01000C7E7FD804000000
Python's gzip.compress() method accepts a compresslevel argument in the range 0-9. I tried all of them, but none gives the desired result.
Is there any way to get the same result as Java's GZIPOutputStream in Python?
Your requirement, "hash of gzipped string in Python and need it to be identical to Java's", cannot be met in general. You need to change the requirement and meet your need differently. I would recommend simply requiring that the decompressed data have identical hashes. In fact, each of the two gzip strings already contains a 32-bit hash (a CRC-32) of the decompressed data, and the two values are identical (0xd87f7e0c). If you want a longer hash, you can append one. The last four bytes are the uncompressed length, modulo 2^32, so you can compare those as well. Just compare the last eight bytes of the two strings and check that they are the same.

The difference between the two gzip strings in your question illustrates the issue. One has a time stamp in the header, and the other does not (the field is set to zeros). Even if they both had time stamps, they would still very likely be different. Some other header bytes also differ, like the one identifying the originating operating system.
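To make the header and trailer comparison concrete, here is a minimal sketch that unpacks the fixed 10-byte gzip header and the 8-byte trailer of the two strings from the question. The helper name gzip_fields is made up for illustration, and the sketch assumes a single-member gzip stream with no optional header fields (FLG is 0 in both strings).

```python
import struct

# The two gzip strings from the question.
PYTHON_GZ = bytes.fromhex("1F8B0800678B186002FF2B492D2E01000C7E7FD804000000")
JAVA_GZ = bytes.fromhex("1F8B08000000000000002B492D2E01000C7E7FD804000000")


def gzip_fields(blob):
    """Hypothetical helper: pick out the header and trailer fields of a gzip blob."""
    # Fixed header: ID1, ID2, CM, FLG, MTIME (4 bytes, little-endian), XFL, OS.
    _id1, _id2, _cm, _flg, mtime, xfl, os_id = struct.unpack("<BBBBIBB", blob[:10])
    # Trailer: CRC-32 and length (mod 2**32) of the uncompressed data.
    crc32, isize = struct.unpack("<II", blob[-8:])
    return {"mtime": mtime, "xfl": xfl, "os": os_id, "crc32": crc32, "isize": isize}


py = gzip_fields(PYTHON_GZ)
java = gzip_fields(JAVA_GZ)
print(py)    # non-zero mtime, different xfl and os values
print(java)  # mtime, xfl and os are all zero
print(PYTHON_GZ[-8:] == JAVA_GZ[-8:])  # the trailers match
```

Only the header fields differ here; the trailer, which describes the uncompressed data, is the part that is safe to compare.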
Furthermore, the compressed data in your examples is extremely short, so it just so happens to be identical in this case. However, for any reasonable amount of data, the compressed data generated by two gzippers will be different, unless they happen to be made with exactly the same deflate code, the same version of that code, and the same memory size and compression level settings. If you are not in control of all of those, you will never be able to assure the same compressed data coming out of them, given identical uncompressed data.
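As a rough illustration of how settings alone can change the output, the sketch below compresses the same input at two compression levels with Python's zlib module. The sample data is arbitrary and made up for this example; the point is only that both streams decompress to the same bytes regardless of whether the compressed bytes match.

```python
import zlib

# Arbitrary sample input, larger than the four-byte "test" string.
data = b"".join(b"line %d of some sample text\n" % i for i in range(2000))

fast = zlib.compress(data, 1)  # fastest setting
best = zlib.compress(data, 9)  # best-compression setting

# Compare the sizes and the raw bytes of the two compressed streams.
print(len(fast), len(best), fast == best)

# Whatever the compressed bytes look like, the payload round-trips identically.
fast_out = zlib.decompress(fast)
best_out = zlib.decompress(best)
print(fast_out == best_out == data)
```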
In short, don't waste your time trying to get identical compressed strings.
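A minimal sketch of the alternative recommended above, hashing the decompressed data rather than the gzip bytes. The function name payload_md5 is made up for illustration; MD5 is used only because the question uses it, and any hash would work the same way.

```python
import gzip
import hashlib

# The two gzip strings from the question.
PYTHON_GZ = bytes.fromhex("1F8B0800678B186002FF2B492D2E01000C7E7FD804000000")
JAVA_GZ = bytes.fromhex("1F8B08000000000000002B492D2E01000C7E7FD804000000")


def payload_md5(gz_bytes):
    """Hypothetical helper: MD5 of the data inside a gzip blob, as uppercase hex."""
    return hashlib.md5(gzip.decompress(gz_bytes)).hexdigest().upper()


python_hash = payload_md5(PYTHON_GZ)
java_hash = payload_md5(JAVA_GZ)
print(python_hash)
print(java_hash)
print(python_hash == java_hash)
```

On the Java side the equivalent would be to hash the bytes before they go into GZIPOutputStream, or after reading them back through a GZIPInputStream.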