Java GZIPOutputStream appears to allocate unnecessary byte arrays?


I have an application that processes text files and stores them, gzipping them to save space. It writes through a chain of OutputStreams, one of which is java.util.zip.GZIPOutputStream, which handles the compression.

To make sure I was not wasting memory somewhere, I profiled my process with async-profiler (via IntelliJ), using a test case that writes a file of about 6 MB of random data several times in a small loop. For reference, I'm using Temurin JDK 18.

I was surprised to see a lot of memory allocations with the GZIPOutputStream (via the parent method): gzip-output-samples 1,601,318,568 samples

That's a bit strange. I know GZIPOutputStream/DeflaterOutputStream uses a buffer, but why is it doing so many allocations? Looking deeper into the code, I noticed that the parent class, java.util.zip.DeflaterOutputStream, does this when it writes a single byte:

    public void write(int b) throws IOException {
        byte[] buf = new byte[1];
        buf[0] = (byte)(b & 0xff);
        write(buf, 0, 1);
    }

So it makes a new single-byte array for every byte written? That seems like a lot of allocations. To see if it makes a difference, I extended GZIPOutputStream with a new class, LowAllocGzipOutputStream, that overrides the method like this:

    private final byte[] singleByteBuff = new byte[1];

    @Override
    public void write(int b) throws IOException {
        singleByteBuff[0] = (byte)(b & 0xff);
        write(singleByteBuff, 0, 1);
    }
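For completeness, here is a self-contained sketch of what that subclass might look like. The constructor and the `roundTrip` helper are illustrative additions, not part of my original code. One assumption worth calling out: because the one-byte buffer is a shared field, this version is only safe when a single thread writes to the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class LowAllocGzipOutputStream extends GZIPOutputStream {
    // Reused for every single-byte write; shared state, so not safe
    // for concurrent writers on the same stream.
    private final byte[] singleByteBuff = new byte[1];

    LowAllocGzipOutputStream(OutputStream out) throws IOException {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        singleByteBuff[0] = (byte) (b & 0xff);
        write(singleByteBuff, 0, 1);
    }

    // Hypothetical demo helper: compresses the input one byte at a time,
    // then decompresses it again so the result can be compared.
    static byte[] roundTrip(byte[] input) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (LowAllocGzipOutputStream gz = new LowAllocGzipOutputStream(sink)) {
                for (byte b : input) {
                    gz.write(b); // resolves to write(int)
                }
            }
            try (GZIPInputStream in =
                     new GZIPInputStream(new ByteArrayInputStream(sink.toByteArray()))) {
                return in.readAllBytes();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```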

I then profiled it again with my test case to see what might happen. The data was quite different: low-allocation-samples 162,262,880 samples

That is a pretty big reduction in allocations: 1,439,055,688 fewer samples, or roughly 90%.

So I'm left with a few questions that I haven't found answers for:

  1. Why does GZIPOutputStream/DeflaterOutputStream allocate byte[]s like this? This is a class that comes with the JDK, so I'm sure it's been profiled and scrutinized heavily, but with my naive understanding it appears to be unnecessarily wasteful? Does the single byte array get optimized away by hotspot or something eventually? Does it not really add pressure to the garbage collector?
  2. Is there a negative consequence to my cached singleByteBuff method? I can't seem to think of any issue it would cause so far. The benefit that I find with it is that my app's memory profile is no longer dominated by DeflaterOutputStream byte[] allocations.

Having spent more time digging into streams I will attempt to answer my own question, with a bit of guesswork:

From what I can measure, there's basically only one sane way to call an OutputStream if you care at all about performance, and that's this method:

    public void write(byte[] b, int off, int len)

There are several reasons for this:

  1. Handling more bytes at a time can be more efficient for things like gzip
  2. Reusing buffers where you can saves memory
  3. Fewer function calls

Reason #3 is less obvious to the average Java developer. Normally you don't think about function calls that much, but if you're processing a billion bytes one byte at a time, they add up! Every call, and every little bit of per-call work, is overhead you could simply be doing less of.

In my original design for my app, I was thinking of OutputStreams as a state machine where each input was a single byte. That is probably the wrong way to think about data streams when buffers are involved.
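As a rough illustration of the three reasons above, here is a minimal sketch of feeding a GZIPOutputStream in chunks from one reusable buffer. The class name `ChunkedGzip`, its methods, and the `bufferSize` parameter are hypothetical names chosen for this example, and the code assumes `bufferSize` is greater than zero.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ChunkedGzip {
    // Compresses everything readable from `in`, using bulk writes only.
    static byte[] gzip(InputStream in, int bufferSize) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        byte[] buffer = new byte[bufferSize]; // allocated once, reused for every chunk
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                gz.write(buffer, 0, n); // one call per chunk, not one per byte
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Decompresses a gzip byte array; included only so the sketch can be checked.
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, the number of calls into the gzip stream depends on the chunk size rather than on the number of bytes.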

The only method you have to implement to make OutputStream work is this:

    public abstract void write(int b)
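To see what implementing only that method gets you, here is a small sketch. The inherited `write(byte[], int, int)` in `java.io.OutputStream` loops over the array and calls `write(int)` once per byte, so a bulk write to such a stream is really a series of single-byte writes. `CountingSink` and `callsForBulkWrite` are hypothetical names made up for this demo.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class CountingSink extends OutputStream {
    int singleByteCalls = 0;

    // The only method a concrete OutputStream is required to implement.
    @Override
    public void write(int b) {
        singleByteCalls++; // discard the byte, just count the call
    }

    // Performs one bulk write of `length` bytes and reports how many
    // times write(int) was invoked to carry it out.
    static int callsForBulkWrite(int length) {
        CountingSink sink = new CountingSink();
        try {
            sink.write(new byte[length], 0, length); // inherited default implementation
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.singleByteCalls;
    }
}
```

This is why a stream that is meant to handle real volumes of data generally overrides the array method as well.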

Sometimes you really do need to write one byte, and this method is convenient for that. However, its convenience and simplicity are a trap. It's there to be used, but if you care about performance you shouldn't call it in a hot path, because every call repeats work that a bulk write would do once.
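If the single-byte calls can't easily be removed from the calling code, one common workaround is to put a `java.io.BufferedOutputStream` in front of the gzip stream, so the single-byte writes are collected in its internal buffer and passed on in bulk. The sketch below is a demo under my own assumptions: `BufferedGzipDemo`, `CountingGzip`, and `bulkWritesFor` are hypothetical names, and the counting subclass exists only to make the batching visible.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

class BufferedGzipDemo {
    // GZIPOutputStream that counts the bulk writes it receives (demo only).
    static final class CountingGzip extends GZIPOutputStream {
        int bulkWrites = 0;

        CountingGzip(OutputStream out) throws IOException {
            super(out);
        }

        @Override
        public synchronized void write(byte[] buf, int off, int len) throws IOException {
            bulkWrites++;
            super.write(buf, off, len);
        }
    }

    // Writes `count` bytes one at a time through a BufferedOutputStream and
    // returns how many write(byte[], int, int) calls reached the gzip stream.
    static int bulkWritesFor(int count, int bufferSize) {
        try {
            CountingGzip gz = new CountingGzip(new ByteArrayOutputStream());
            try (OutputStream out = new BufferedOutputStream(gz, bufferSize)) {
                for (int i = 0; i < count; i++) {
                    out.write(i); // single-byte write, absorbed by the buffer
                }
            } // closing flushes the remainder and closes the gzip stream
            return gz.bulkWrites;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller still pays for one method call per byte, so this reduces the work done inside the gzip stream rather than removing the per-byte overhead entirely.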

This is where I think the reasoning behind GZIPOutputStream's handling of this method comes from. If you've ever implemented a non-trivial OutputStream, it quickly becomes obvious that you want all of your logic to flow into one of the write methods. If you care about performance, you'd never choose write(int b) for that, given how poorly it scales. So the simpler methods are implemented without much care, and if you are only writing a single byte a few times, a few array allocations are inconsequential.
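A minimal sketch of that pattern, assuming a simple wrapper that counts bytes: all of the real logic lives in the array method, and `write(int)` is a thin adapter that forwards to it, much like the DeflaterOutputStream code quoted in the question. `ByteCountingOutputStream` and `countFor` are hypothetical names for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class ByteCountingOutputStream extends OutputStream {
    private final OutputStream out;
    private long bytesWritten = 0;

    ByteCountingOutputStream(OutputStream out) {
        this.out = out;
    }

    // Thin adapter: wrap the byte in an array and reuse the bulk path.
    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    // The one place where the stream's actual logic lives.
    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        bytesWritten += len;
    }

    @Override
    public void flush() throws IOException {
        out.flush();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }

    long bytesWritten() {
        return bytesWritten;
    }

    // Demo helper: one bulk write of `data` plus one single-byte write.
    static long countFor(byte[] data) {
        ByteCountingOutputStream counting =
            new ByteCountingOutputStream(new ByteArrayOutputStream());
        try (counting) {
            counting.write(data, 0, data.length);
            counting.write('!');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counting.bytesWritten();
    }
}
```

The single-byte path costs one small allocation per call here, which is the trade-off the paragraph above describes: it keeps the class simple at the expense of a method that shouldn't be on a hot path anyway.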

In my question, I made GZIPOutputStream's write(int b) more efficient by adding a cached single-byte buffer and, as far as I can tell, it is more efficient! However, if you actually want your code to run efficiently, you'd still never use this method, no matter how optimized it is; your program would still be doing too much unnecessary work.

I suspect that is the design thinking here. write(int b) exists so that you can technically implement OutputStream and still allow a single-byte write, but it's something you should almost always avoid, so why optimize an inherently flawed method?

That said, a bit of javadoc in any of these methods could have gone a long way to help educate me here.