I have a Spark job running on Kubernetes using the spark-on-k8s-operator. The job usually takes less than 5 minutes to complete, but sometimes it gets stuck because of lost executors, which I'm still investigating.
How can I specify a timeout in Spark so that the driver kills all the executors and itself if the execution exceeds the specified timeout?
spark.scheduler.excludeOnFailure.unschedulableTaskSetTimeout
from https://spark.apache.org/docs/latest/configuration.html: the timeout to wait to acquire a new executor and schedule a task before aborting a TaskSet that is unschedulable because all executors are excluded due to task failures.
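With a plain spark-submit you can pass it like any other Spark conf. A minimal sketch; the image, main class, jar path, and the 300s value are placeholders for illustration, not recommendations:

    # image, class, and jar path below are hypothetical placeholders
    spark-submit \
      --master k8s://https://kubernetes.default.svc \
      --deploy-mode cluster \
      --conf spark.kubernetes.container.image=my-registry/spark:3.5.0 \
      --conf spark.scheduler.excludeOnFailure.unschedulableTaskSetTimeout=300s \
      --class com.example.MyJob \
      local:///opt/spark/jars/my-job.jar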
As far as I'm aware, the spark-operator Helm chart doesn't expose the
spark.scheduler.excludeOnFailure.unschedulableTaskSetTimeout
configuration option. See https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/charts/spark-operator-chart/README.md
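Note that the Helm chart only configures the operator itself; per-application Spark settings go in the SparkApplication manifest, which has a sparkConf field for arbitrary Spark configuration. A minimal sketch, where the name, image, main class, and jar path are placeholders:

    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: my-job                    # placeholder name
      namespace: default
    spec:
      type: Scala
      mode: cluster
      image: my-registry/spark:3.5.0  # placeholder image
      mainClass: com.example.MyJob    # placeholder class
      mainApplicationFile: local:///opt/spark/jars/my-job.jar  # placeholder jar
      sparkConf:
        # abort unschedulable task sets after 5 minutes instead of the 120s default
        "spark.scheduler.excludeOnFailure.unschedulableTaskSetTimeout": "300s"
      driver:
        cores: 1
        memory: 512m
        serviceAccount: spark
      executor:
        instances: 2
        cores: 1
        memory: 512m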