In `org.apache.hadoop.io.compress.GzipCodec`, the `GzipOutputStream` is not closed, so there is a memory leak.

How do I close the `GzipOutputStream`? Should the other streams also be closed? Is there a good alternative?
The Spark version is 2.1.0 and the Hadoop version is 2.8.4.
```java
sparkPairRdd.saveAsHadoopFile(outputPath, String.class, String.class,
        MultipleTextOutputFormat.class, GzipCodec.class);
```
If I am understanding the `GzipCodec` class correctly, its purpose is to create the various compressor and decompressor streams and return them to the caller. It is not responsible for closing those streams; that is the responsibility of the caller.

> How do I close the `GzipOutputStream`?

You simply call `close()` on the object. If `saveAsHadoopFile` is using `GzipCodec` to create a `GzipOutputStream`, then that method is responsible for closing it.

> Should the other streams also be closed?

The same as for a `GzipOutputStream`: call `close()` on them.
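For illustration, here is a minimal standalone sketch of creating a compressed stream through the codec and closing it explicitly. The output file name, the sample text, and the `ReflectionUtils`-based setup are my own assumptions for the example, not something taken from your job:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ExplicitCloseExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // The codec needs a Configuration; ReflectionUtils.newInstance injects it.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        FileOutputStream fileOut = new FileOutputStream("example.txt.gz"); // hypothetical output path
        CompressionOutputStream out = codec.createOutputStream(fileOut);
        try {
            out.write("hello, world\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            // The codec created the stream, but closing it is the caller's job.
            out.close();
        }
    }
}
```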
> Is there a good alternative?

An alternative to calling `close()` explicitly? You could manage a stream created by `GzipCodec` with try-with-resources. But if you are asking whether there is a way to avoid managing the streams properly, then the answer is no.
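A minimal sketch of the try-with-resources form, under the same assumptions as above (standalone setup, hypothetical output path):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class TryWithResourcesExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Both streams are closed automatically when the try block exits,
        // even if write() throws.
        try (FileOutputStream fileOut = new FileOutputStream("example.txt.gz"); // hypothetical output path
             CompressionOutputStream out = codec.createOutputStream(fileOut)) {
            out.write("hello, world\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```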
If you are actually encountering a storage leak that you think is due to `saveAsHadoopFile` not closing the streams that it opens, please provide a minimal reproducible example that we can look at. It could be a bug in Hadoop ... or you could be using it incorrectly.