Question Background
I have a Service backed by a Deployment with several replica pods, which receive tasks and handle them asynchronously.
I want each pod to handle only one task at a time, so I make my readinessProbe track a `status` variable. When `status` equals `BUSY`, the probe reports the pod as NOT READY, so no further requests are routed to it.
I also prefer that, when scaling down the deployment, pods that are not busy handling tasks are deleted first. So when I set `status` to `BUSY`, I also set the pod's `controller.kubernetes.io/pod-deletion-cost` annotation to 100, and when the task is finished I set it back to 1.
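For context, the deletion cost is updated from inside the pod roughly as follows; this is a minimal sketch rather than my exact code, and it assumes the official `kubernetes` Python client plus a service account that is allowed to patch pods.

from kubernetes import client, config
import os

# Patch this pod's controller.kubernetes.io/pod-deletion-cost annotation.
# Sketch only: assumes in-cluster credentials and RBAC permission to patch pods.
def set_deletion_cost(cost: int) -> None:
    config.load_incluster_config()  # use the pod's service account token
    body = {
        "metadata": {
            "annotations": {"controller.kubernetes.io/pod-deletion-cost": str(cost)}
        }
    }
    client.CoreV1Api().patch_namespaced_pod(
        name=os.getenv("POD_NAME"), namespace="default", body=body
    )

# set_deletion_cost(100)  # when a task starts (status -> BUSY)
# set_deletion_cost(1)    # when the task finishes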
When running the system, I find that pods running tasks are still deleted first. This is because ready/not ready has a higher priority than the pod deletion cost value in the k8s pod deletion decision logic. The implementation can be found here (lines 822 to 833).
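The effect is easiest to see with a small illustration. The following is a simplified, self-contained Python sketch of that ordering, not the actual controller code (which is Go): the criteria are compared left to right, so readiness is decided before `pod-deletion-cost` is ever consulted.

from dataclasses import dataclass

@dataclass
class PodInfo:
    name: str
    is_ready: bool
    deletion_cost: int

# Pods that sort "smaller" are deleted first; readiness is compared before cost.
def deletion_sort_key(p: PodInfo):
    return (p.is_ready, p.deletion_cost)

pods = [
    PodInfo("busy-pod", is_ready=False, deletion_cost=100),  # running a task
    PodInfo("idle-pod", is_ready=True, deletion_cost=1),
]
print([p.name for p in sorted(pods, key=deletion_sort_key)])
# ['busy-pod', 'idle-pod'] -> the busy (NOT READY) pod is still removed first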
How can I make the system work as expected? More specifically, I need to either "decide whether a pod can receive traffic from the Service by something other than ready/not ready" or "avoid using ready/not ready to decide which pod gets deleted".
Any solution is welcome as long as it meets the requirements:
- a pod handles one task at a time
- when scaling down, pods not busy are deleted first
Reproduce the problem
Here I use a simple Python Flask server as the app in the pod.
app.py
from flask import Flask, request
import os

app = Flask(__name__)

status = "AVAILABLE"  # Global variable to hold the status
pod_name = os.getenv('POD_NAME')

# Called by the task handler to mark this pod BUSY / AVAILABLE
@app.route('/setStatus', methods=['POST'])
def setStatus():
    data = request.get_json()
    global status
    status = data.get('status', 'AVAILABLE')
    return f'{pod_name} has been set to {status}', 200

# Readiness probe endpoint: fail while the pod is BUSY
@app.route('/readinessCheck', methods=['GET'])
def readinessCheck():
    if status == 'BUSY':
        return f'{pod_name} is busy', 502
    else:
        return f'{pod_name} is available', 200

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000, debug=True)
Dockerfile
FROM python:3.9-slim
WORKDIR /app
ADD app.py /app
RUN pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --no-cache-dir Flask==2.0.2 Werkzeug==2.0.2
EXPOSE 5000
CMD ["python", "app.py"]
Build the image with `docker build . -t lyudmilalala/pdc-app-img:1.0.0`.
Then use the image to configure a Deployment and a Service.
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdc-app-deploy
spec:
  replicas: 5
  selector:
    matchLabels:
      app: pdc-app
  template:
    metadata:
      labels:
        app: pdc-app
      annotations:
        controller.kubernetes.io/pod-deletion-cost: '1'
    spec:
      containers:
        - name: pdc-app-pod
          image: lyudmilalala/pdc-app-img:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5000
              protocol: TCP
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          readinessProbe:
            httpGet:
              path: /readinessCheck
              port: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: pdc-app-svc
spec:
  type: NodePort
  ports:
    - port: 5000
      protocol: TCP
      nodePort: 32000
  selector:
    app: pdc-app
Start the service.
$ kubectl get pods -n default
NAME READY STATUS RESTARTS AGE
pdc-app-deploy-8545d464db-4wd97 1/1 Running 0 27m
pdc-app-deploy-8545d464db-5pr5d 1/1 Running 0 14m
pdc-app-deploy-8545d464db-hxg94 1/1 Running 0 14m
pdc-app-deploy-8545d464db-knwtb 1/1 Running 0 27m
pdc-app-deploy-8545d464db-lrrhw 1/1 Running 0 14m
Currently, every pod's `pod-deletion-cost` is 1.
Note that the commands here are for Windows PowerShell; they may differ slightly in other shells.
$ kubectl get pod pdc-app-deploy-8545d464db-knwtb -o jsonpath="{.metadata.annotations['controller\.kubernetes\.io/pod-deletion-cost']}"
1
Update the `pod-deletion-cost` of some of the pods.
$ kubectl patch pod pdc-app-deploy-8545d464db-4wd97 -p '{\"metadata\":{\"annotations\":{\"controller.kubernetes.io/pod-deletion-cost\":\"120\"}}}'
pod/pdc-app-deploy-8545d464db-4wd97 patched
$ kubectl get pod pdc-app-deploy-8545d464db-4wd97 -o jsonpath="{.metadata.annotations['controller\.kubernetes\.io/pod-deletion-cost']}"
120
Assume our pod deletion cost values are as in the following table.
name | pod-deletion-cost |
---|---|
pdc-app-deploy-8545d464db-4wd97 | 40 |
pdc-app-deploy-8545d464db-95d5b | 1 |
pdc-app-deploy-8545d464db-knwtb | 20 |
pdc-app-deploy-8545d464db-lrrhw | 80 |
pdc-app-deploy-8545d464db-zpx52 | 40 |
When all pods are READY, change `replicas: 5` in `deployment.yaml` to `replicas: 4` and apply the change. We can see that the pod with the lowest `pod-deletion-cost` is deleted. Trying again with `replicas: 3` works the same way.
Then send a request to set one of the pods to NOT READY.
$ curl -X POST http://localhost:32000/setStatus -H "Content-Type: application/json" -d "{\"status\": \"BUSY\"}"
pdc-app-deploy-8545d464db-lrrhw has been set to BUSY
$ kubectl get pods -n default
NAME READY STATUS RESTARTS AGE
pdc-app-deploy-8545d464db-4wd97 1/1 Running 0 67m
pdc-app-deploy-8545d464db-lrrhw 0/1 Running 0 54m
pdc-app-deploy-8545d464db-zpx52 1/1 Running 0 20m
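At this point the NOT READY pod should also have been removed from the Service's ready endpoints. A quick way to confirm this is to inspect the Endpoints object; below is a small sketch using the official `kubernetes` Python client (it assumes a usable local kubeconfig, e.g. when run from the developer machine).

from kubernetes import client, config

# Sketch: list which pods are in the ready vs. not-ready address sets of the
# pdc-app-svc Endpoints (assumes a local kubeconfig with cluster access).
config.load_kube_config()
ep = client.CoreV1Api().read_namespaced_endpoints("pdc-app-svc", "default")
for subset in ep.subsets or []:
    ready = [a.target_ref.name for a in (subset.addresses or []) if a.target_ref]
    not_ready = [a.target_ref.name for a in (subset.not_ready_addresses or []) if a.target_ref]
    print("ready:", ready)
    print("not ready:", not_ready)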
Now decrease `replicas: 3` in `deployment.yaml` to `replicas: 2` and apply the change. Even though the NOT READY pod has a higher `pod-deletion-cost`, it is removed first.
$ kubectl get pods -n default
NAME READY STATUS RESTARTS AGE
pdc-app-deploy-8545d464db-4wd97 1/1 Running 0 68m
pdc-app-deploy-8545d464db-lrrhw 0/1 Terminating 0 55m
pdc-app-deploy-8545d464db-zpx52 1/1 Running 0 21m
Some more info
Why I do not use a Job
- The application has a long cold start time, so I prefer to reuse the existing pods when possible.
- The payload for each task is long, so sending it in an HTTP body is more convenient.
Why I do not use a third-party FaaS framework
In fact, I tried a number of FaaS frameworks such as OpenFaaS, OpenWhisk, and Knative, but gave up for the following reasons:
- They are complex systems that are difficult to maintain.
- They restrict how scaling rules can be customized.
- When the number of pods reaches the maximum limit, I expect the cluster to alert the task queue to stop pushing tasks, but I have not found such a feature.
I also read about proposed changes to the k8s scaling configuration such as this, but none of them seems to have landed yet.