We are using the Airflow Kubernetes executor, and for the most part it works great. Normally pods are terminated and disappear after a task completes, but sometimes "something" happens and these completed pods stick around forever, or until we manually kill them.
When I look in our logs, I see entry after entry like the following for these stuck pods:
Failed to adopt pod ap127331workitemhistorystreamfilifilisit.5e10fd80bbda40df8e7af5c21da88fea. Reason: (422)
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"ap127331workitemhistorystreamfilifilisit.5e10fd80bbda40df8e7af5c21da88fea\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)
I can't seem to find any rhyme or reason why some pods work fine and others get stuck. It happens seemingly at random, across all DAGs and tasks.
Thanks so much for any help.
The service account assigned to your executor needs the "patch" permission on pods. I updated the role bound to the service account my Kubernetes executor pods run as to add the "patch" verb:
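A minimal sketch of what that Role (plus its RoleBinding) can look like; the namespace, role name, and service account name here are placeholders, so adjust them to match your deployment:

```yaml
# Hypothetical Role for the Airflow executor's service account.
# The important addition is the "patch" verb on pods, which the
# executor needs when adopting and cleaning up completed task pods.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-manager
  namespace: airflow
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "delete", "patch"]
---
# Bind the Role to the service account the executor pods run as
# (names are placeholders).
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow-pod-manager-binding
  namespace: airflow
subjects:
  - kind: ServiceAccount
    name: airflow-worker
    namespace: airflow
roleRef:
  kind: Role
  name: airflow-pod-manager
  apiGroup: rbac.authorization.k8s.io
```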
This allowed Airflow to clean up after its jobs, no longer leaving pods around after the tasks finished.
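As a quick sanity check (again with placeholder names), you can confirm the permission took effect with kubectl: `kubectl auth can-i patch pods --as=system:serviceaccount:airflow:airflow-worker -n airflow` should now report `yes`.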