In `org.apache.hadoop.io.compress.GzipCodec`, the `GzipOutputStream` is not closed, which causes a memory leak. How do I close the `GzipOutputStream`? Should other streams be closed as well? Is there a good alternative to calling `close()` explicitly?

Spark version is 2.1.0 and Hadoop version is 2.8.4.
```java
sparkPairRdd.saveAsHadoopFile(outputPath, String.class, String.class,
        MultipleTextOutputFormat.class, GzipCodec.class);
```
If I am understanding the `GzipCodec` class correctly, its purpose is to create various compressor and decompressor streams and return them to the caller. It is not responsible for closing those streams; that is the responsibility of the caller.

> How do I close the GzipOutputStream?

You simply call `close()` on the object. If `saveAsHadoopFile` is using `GzipCodec` to create a `GzipOutputStream`, then that method is responsible for closing it.

> Should other streams be closed as well?

The same as for a `GzipOutputStream`: call `close()` on them.
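For illustration, here is a minimal sketch of what explicit closing looks like when you create a stream from the codec yourself, outside of Spark. The class name and output path are made up for the example; `GzipCodec`, `ReflectionUtils.newInstance`, and `createOutputStream` are the actual Hadoop API.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class ExplicitCloseExample {
    public static void main(String[] args) throws Exception {
        // The codec needs a Configuration; ReflectionUtils injects it.
        GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, new Configuration());

        OutputStream fileOut = Files.newOutputStream(Paths.get("explicit.gz"));
        CompressionOutputStream gzipOut = codec.createOutputStream(fileOut);
        try {
            gzipOut.write("some data\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            // Closing the compression stream finishes the gzip output and
            // closes the wrapped file stream as well.
            gzipOut.close();
        }
    }
}
```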
> Is there a good alternative to calling close() explicitly?

As an alternative, you could manage a stream created by `GzipCodec` using try-with-resources. But if you are asking whether there is a way to avoid managing the streams properly, then the answer is no.
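A minimal sketch of the try-with-resources version, under the same assumptions as above (a hypothetical output path, writing through the codec directly rather than through Spark):

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class TryWithResourcesExample {
    public static void main(String[] args) throws Exception {
        GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, new Configuration());

        // Both streams are closed automatically, in reverse order of creation,
        // even if the write throws.
        try (OutputStream fileOut = Files.newOutputStream(Paths.get("example.gz"));
             CompressionOutputStream gzipOut = codec.createOutputStream(fileOut)) {
            gzipOut.write("some data\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```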
If you are actually encountering a storage leak that you think is due to `saveAsHadoopFile` not closing the streams it opens, please provide a minimal reproducible example that we can look at. It could be a bug in Hadoop ... or you could be using it incorrectly.