DEFLATE: is back-reference really better?


I am making my own DEFLATE compressor, which already beats the zlib library almost every time.

In the DEFLATE format (LZ77), the data stream contains either a byte literal or a back-reference saying that we should copy a byte sequence from previously decoded bytes. Compressors usually perform LZ77 (finding as many back-references as possible) and then build a Huffman tree and compress that byte/reference stream.
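To make the literal/back-reference stream concrete, here is a minimal greedy LZ77 tokenizer sketch. It is illustrative only: the naive linear match search stands in for the hash chains real compressors use, and it ignores DEFLATE's symbol encoding details (it does respect DEFLATE's 3..258 length and 32 KiB window limits).

```python
def lz77_tokens(data, min_len=3, max_len=258, window=32768):
    """Greedy LZ77: each token is ('lit', byte) or ('ref', length, distance)."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # naive search for the longest earlier match (real compressors
        # use hash chains instead of scanning the whole window)
        for j in range(max(0, i - window), i):
            k = 0
            while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k >= min_len and k > best_len:
                best_len, best_dist = k, i - j
        if best_len:
            tokens.append(('ref', best_len, best_dist))
            i += best_len
        else:
            tokens.append(('lit', data[i]))
            i += 1
    return tokens
```

Note that matches may overlap the current position (distance smaller than length), which decodes correctly because the decoder copies byte by byte.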

Consider an extreme case: a reference covering 3 identical bytes, where the length is encoded in 15 bits and the distance in 15+13 bits, while each of those bytes is just 1 bit in our Huffman tree. That is 43 vs. 3 bits: using the reference made the output 14x larger than encoding the bytes directly.
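The arithmetic behind that worst case, spelled out (the bit counts are the assumptions from the example above, not fixed DEFLATE values):

```python
# Cost of the back-reference: Huffman code for the length symbol,
# Huffman code for the distance symbol, plus distance extra bits.
length_code_bits = 15
dist_code_bits = 15
dist_extra_bits = 13
ref_cost = length_code_bits + dist_code_bits + dist_extra_bits   # 43 bits

# Cost of encoding the same 3 bytes as literals, each being a very
# frequent symbol with a 1-bit Huffman code.
literal_bits = 1
lit_cost = 3 * literal_bits                                      # 3 bits

print(ref_cost, lit_cost, round(ref_cost / lit_cost, 1))  # → 43 3 14.3
```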

My problem is the following: I have a 10,097,918 B text file. ZLIB covers 9,089,334 B by references in its LZ77, while my compressor covers 9,305,056 B. So I guess I can say that my LZ77 is better.

But ZLIB gives a 1,329,309 B file, while my program gives a 1,381,153 B file (both build an optimal Huffman tree). So a better LZ77 leads to a worse Huffman encoding. Is there any research about Huffman-friendly LZ77?


1 Answer


It's a hard problem, since you only know the bit costs associated with these decisions after making all the literal-vs-back-reference decisions. But you can iterate: use the costs from the previous run to make these decisions while re-encoding, several times, until you're happy with the result.
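A hedged sketch of that iteration: parse once with default costs, measure the resulting symbol frequencies, derive new per-symbol bit costs, and re-parse. All names are mine, and the entropy-based cost model (self-information instead of actual Huffman code lengths) is a simplifying assumption.

```python
import math
from collections import Counter

def find_match(data, i, min_len=3, max_len=258, window=32768):
    """Naive longest-match search; real compressors use hash chains."""
    best = None
    for j in range(max(0, i - window), i):
        k = 0
        while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
            k += 1
        if k >= min_len and (best is None or k > best[0]):
            best = (k, i - j)
    return best

def costs_from(tokens):
    """Approximate each symbol's bit cost by its self-information -log2(p)."""
    freqs = Counter()
    for t in tokens:
        if t[0] == 'lit':
            freqs[('lit', t[1])] += 1
        else:
            freqs[('len', t[1])] += 1
            freqs[('dist', t[2])] += 1
    total = sum(freqs.values())
    return {s: -math.log2(f / total) for s, f in freqs.items()}

def parse(data, costs):
    """Take a reference only when it is cheaper than the literals it covers."""
    tokens, i = [], 0
    while i < len(data):
        m = find_match(data, i)
        if m:
            length, dist = m
            # unseen symbols fall back to rough default costs (assumption)
            ref = costs.get(('len', length), 8) + costs.get(('dist', dist), 14)
            lits = sum(costs.get(('lit', b), 8) for b in data[i:i + length])
            if ref < lits:
                tokens.append(('ref', length, dist))
                i += length
                continue
        tokens.append(('lit', data[i]))
        i += 1
    return tokens

def iterate(data, passes=4):
    tokens = parse(data, {})               # first pass: default costs only
    for _ in range(passes):
        tokens = parse(data, costs_from(tokens))   # re-cost, then re-parse
    return tokens
```

In practice you would stop when the token stream (or the estimated total size) stops improving between passes, rather than running a fixed number of them.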

This is one of the things Zopfli does to get better compression ratios while keeping the DEFLATE format, in addition to using a shortest-path approach instead of greedy decisions.
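The shortest-path idea can be sketched as a dynamic program (this is my own simplified rendering of the approach, not Zopfli's actual code): treat each byte position as a graph node, a literal as an edge of weight `lit_cost` to the next position, and each available match as an edge jumping ahead by its length, then take the minimum-cost path to the end.

```python
import math

def naive_matches(data, i, min_len=3, max_len=258, window=32768):
    """Yield (length, distance) matches usable at position i (naive search)."""
    for j in range(max(0, i - window), i):
        k = 0
        while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
            k += 1
        if k >= min_len:
            yield k, i - j

def optimal_parse(data, lit_cost, ref_cost, matches=naive_matches):
    """Minimum-cost parse under the given per-token bit-cost functions."""
    n = len(data)
    best = [math.inf] * (n + 1)   # best[i] = cheapest bits to encode data[:i]
    back = [None] * (n + 1)       # back[i] = token on the best path into i
    best[0] = 0.0
    for i in range(n):
        if best[i] == math.inf:
            continue
        # literal edge to the next position
        c = best[i] + lit_cost(data[i])
        if c < best[i + 1]:
            best[i + 1], back[i + 1] = c, ('lit', data[i])
        # one edge per available back-reference
        for length, dist in matches(data, i):
            c = best[i] + ref_cost(length, dist)
            if c < best[i + length]:
                best[i + length], back[i + length] = c, ('ref', length, dist)
    # walk the back pointers to recover the token stream
    tokens, i = [], n
    while i > 0:
        t = back[i]
        tokens.append(t)
        i -= t[1] if t[0] == 'ref' else 1
    return tokens[::-1], best[n]
```

With fixed costs this finds the globally cheapest parse in one pass; combining it with the cost-iteration above (re-deriving `lit_cost` and `ref_cost` from the previous parse) is, roughly, what gets Zopfli its extra ratio.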