I was running a long-running batch job on Dataproc Serverless. After a while, I realized that letting it run any longer was a waste of time and money, and I wanted to stop it.
I couldn't find a way to kill the job directly. However, there were two other options:
- Cancel the batch
- Delete the batch
Initially, I used the first option, and I cancelled the job using:
gcloud dataproc batches cancel BATCH --region=REGION
In the Dataproc Batches console, the job showed as cancelled, and I could also see the DCU and shuffle storage usage.
But surprisingly, even a day later, the Spark History Server still shows the job as running.
After this, I tried the second option and deleted the batch job with this command:
gcloud dataproc batches delete BATCH --region=REGION
This removed the batch entry from the Dataproc Batches console, but the job still appears as running in the Spark History Server.
My questions are:
- What is the best way to kill the job?
- Am I still being charged after cancelling the running job?
What you are observing is a known shortcoming of Spark and the Spark History Server. Spark marks only successfully finished Spark applications as completed and leaves failed/cancelled Spark applications in the in-progress/incomplete state (https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options).
To monitor the batch job state you need to use the Dataproc API. If the Dataproc API/UI shows that the state of the batch job is CANCELLED, it is no longer running, regardless of the Spark application status shown in the Spark History Server.
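For example, you can check the batch state from the command line (a minimal sketch; BATCH and REGION are placeholders for your batch ID and region, and the `--format` flag is used here just to print the state field on its own):

gcloud dataproc batches describe BATCH --region=REGION --format="value(state)"

If this reports CANCELLED, the batch is no longer running, whatever the Spark History Server shows.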