We are seeing an issue with executors being evicted in our Spark Structured Streaming application. We are doing native Spark aggregation from a Kafka stream using watermarking:
Dataset<Row> deduped = spark.readStream()
    .format("kafka")
    .load()                                     // Kafka options and column extraction elided
    .withWatermark("watermarked_timestamp", LATENESS_THRESHOLD)
    .dropDuplicates(METADATA_COL_MR_ID, "watermarked_timestamp");

deduped.writeStream()
    .option("checkpointLocation", LOC)
    .outputMode(OutputMode.Append())
    .foreachBatch(
This is a long-running stream, and we are seeing executors being evicted from the app. The executors are marked as KILLED and we get exit code 137 on the driver. The memory usage of the executor process grows over time. We have examined the heap and it seems to be fine: profiling the process, we see nothing untoward, we are not using all of the heap, and we are not getting OutOfMemoryErrors. It looks to us like Spark, or the kernel on the pod, is killing the executor process once it grows past a certain size.
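Our understanding is that exit code 137 is SIGKILL, which on a pod typically means the container hit its memory limit and was killed by the OOM killer, so the heap can look healthy while native/off-heap allocations push the process over the limit. For context, this is roughly the kind of setting we could adjust to give the executor more non-heap headroom (a sketch only; the values are illustrative and the app name is hypothetical, but spark.executor.memoryOverhead is the standard knob for non-heap memory in containerized deployments):

import org.apache.spark.sql.SparkSession;

// Sketch: size the container limit as heap + overhead so native allocations
// (RocksDB, direct buffers, JNI) have room before the OOM killer steps in.
SparkSession spark = SparkSession.builder()
    .appName("streaming-dedup")                      // hypothetical name
    .config("spark.executor.memory", "2g")           // matches our -Xmx
    .config("spark.executor.memoryOverhead", "3g")   // extra headroom for off-heap usage
    .getOrCreate();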
Here is a sample profile, and we see nothing untoward there:
The main problem is the growth in what appears to be the executor's memory usage outside of the heap. The executor process uses far more memory than the heap accounts for, and we have no idea where that memory is going. The process grows to 7G even though -Xmx is set to 2G.
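One way we could try to break down the non-heap usage is JVM Native Memory Tracking, enabled through the executor JVM options (a sketch; note that NMT only covers JVM-tracked allocations, so memory malloc'd by native libraries such as RocksDB will still only show up as RSS growth):

import org.apache.spark.sql.SparkSession;

// Sketch: enable Native Memory Tracking on executors; `jcmd <pid> VM.native_memory summary`
// can then be run inside the executor pod to break down JVM-side off-heap usage.
SparkSession spark = SparkSession.builder()
    .config("spark.executor.extraJavaOptions", "-XX:NativeMemoryTracking=summary")
    .getOrCreate();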
We are using Spark's native RocksDB state store implementation, so state directories are being written to local disk. Some SST files are being left around; we believe that when an executor is dropped, its SST files are not cleaned up.
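Since the state store is RocksDB, its block cache, memtables, and pinned SST metadata live outside the Java heap, which would match RSS growing well beyond -Xmx. Recent Spark versions expose settings to cap RocksDB memory per executor; a sketch, assuming Spark 3.5+ and that the config names below (taken from the Structured Streaming guide) match the version in use:

import org.apache.spark.sql.SparkSession;

// Sketch: use the RocksDB state store provider and bound its native memory usage.
// Verify these config names against the Spark version actually deployed.
SparkSession spark = SparkSession.builder()
    .config("spark.sql.streaming.stateStore.providerClass",
            "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
    .config("spark.sql.streaming.stateStore.rocksdb.boundedMemoryUsage", "true")
    .config("spark.sql.streaming.stateStore.rocksdb.maxMemoryUsageMB", "1024")
    .getOrCreate();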