org.apache.hadoop.io.compress.GzipCodec: the GzipOutputStream is never closed, causing a memory leak


In org.apache.hadoop.io.compress.GzipCodec, the GzipOutputStream is not closed, so it leaks memory.

How do I close the GzipOutputStream? Should other streams also be closed? Is there a good alternative?

The Spark version is 2.1.0 and the Hadoop version is 2.8.4.

sparkPairRdd.saveAsHadoopFile(outputPath, String.class, String.class, MultipleTextOutputFormat.class, GzipCodec.class);

There is 1 answer below.


If I am understanding the GzipCodec class correctly, its purpose is to create various compressor and decompressor streams and return them to the caller. It is not responsible for closing those streams. That is the responsibility of the caller.
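To make that ownership concrete, here is a minimal sketch of the pattern, assuming a local Hadoop Configuration and a made-up output path; it shows GzipCodec acting purely as a stream factory, with the caller responsible for calling close():

import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class GzipCodecCallerCloses {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The codec only creates the stream; it keeps no reference to it afterwards.
        GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        OutputStream raw = fs.create(new Path("/tmp/example.gz")); // hypothetical path
        CompressionOutputStream out = codec.createOutputStream(raw);
        try {
            out.write("hello\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            out.close(); // the caller's responsibility; this also closes the wrapped stream
        }
    }
}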

How do I close a GzipOutputStream?

You simply call close() on the object. If saveAsHadoopFile is using GzipCodec to create a GzipOutputStream, then that method is responsible for closing it.

Should other streams also be closed?

The same as for a GzipOutputStream. Call close() on it.

Is there a good alternative?

To calling close() explicitly?

As an alternative, you could manage a stream created by GzipCodec using try-with-resources, as in the sketch below.
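A hedged sketch of what that could look like (the output path and class name are illustrative, not from the question):

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class GzipTryWithResources {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        try (FileSystem fs = FileSystem.newInstance(conf);
             CompressionOutputStream out =
                     codec.createOutputStream(fs.create(new Path("/tmp/twr-example.gz")))) {
            out.write("closed automatically\n".getBytes(StandardCharsets.UTF_8));
        } // both resources are closed here, even if write() throws
    }
}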

But if you are asking if there is a way to avoid managing the streams properly, then the answer is No.


If you are actually encountering a memory (or resource) leak that you think is caused by saveAsHadoopFile not closing the streams it opens, please provide a minimal reproducible example that we can look at. It could be a bug in Hadoop ... or you could be using it incorrectly.
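A minimal reproduction would probably look something like the sketch below, built around the saveAsHadoopFile call shown in the question; the class name, master setting, output path, and sample data are all assumptions added for illustration:

import java.util.Arrays;

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class GzipLeakRepro {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("gzip-leak-repro").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, String> sparkPairRdd = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("key1", "value1"),
                    new Tuple2<>("key2", "value2")));

            // Same call as in the question; observe memory usage across repeated runs.
            sparkPairRdd.saveAsHadoopFile("/tmp/gzip-leak-repro-output",
                    String.class, String.class,
                    MultipleTextOutputFormat.class, GzipCodec.class);
        }
    }
}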