Delete READY K8s pods first when scaling down, even when a NOT READY pod has a higher pod-deletion-cost


Question Background

I have a Service backed by a Deployment with multiple pod replicas, which receive tasks and handle them asynchronously.

I want each pod to handle only one task at a time, so I make the readinessProbe track a status variable: when status equals BUSY, the pod is reported NOT READY so that no further requests are routed to it.

I also want pods that are not busy handling tasks to be deleted first when the Deployment scales down, so when I set status to BUSY I also set the controller.kubernetes.io/pod-deletion-cost annotation to 100, and when the task is finished I set it back to 1.
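
In the real application the annotation is updated from inside the pod when status changes. Below is a minimal sketch of one way to do that, assuming the official kubernetes Python client is installed and the pod's ServiceAccount is allowed to patch pods (in the reproduction below I instead patch the annotation manually with kubectl):

# Sketch only: update this pod's controller.kubernetes.io/pod-deletion-cost.
# Assumes `pip install kubernetes` and RBAC permission to patch pods.
import os
from kubernetes import client, config

def set_deletion_cost(cost: int, namespace: str = "default"):
    config.load_incluster_config()  # authenticate with the pod's ServiceAccount
    v1 = client.CoreV1Api()
    body = {"metadata": {"annotations": {
        "controller.kubernetes.io/pod-deletion-cost": str(cost)}}}
    v1.patch_namespaced_pod(name=os.getenv("POD_NAME"), namespace=namespace, body=body)

# set_deletion_cost(100) when status becomes BUSY; set_deletion_cost(1) when the task finishes.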

When running the system, I find that pods that are running tasks are still deleted first. This is because the ready/not-ready status has a higher priority than the pod-deletion-cost value in the Kubernetes scale-down decision. The implementation can be found here (lines 822 to 833).
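
For reference, that ordering can be summarized roughly as the sort key below. This is only a Python paraphrase with hypothetical field names, not the actual Go implementation; the important point is that the ready/not-ready comparison (step 3) comes before the pod-deletion-cost comparison (step 4).

# Rough paraphrase of the ReplicaSet controller's scale-down ordering.
# Pods that sort first are deleted first.
# All fields are hypothetical illustrations, not real API fields.
def deletion_sort_key(pod):
    return (
        pod.assigned_to_node,     # 1. unassigned pods are deleted first
        pod.phase_rank,           # 2. Pending < Unknown < Running
        pod.is_ready,             # 3. NOT READY pods go before READY ones
        pod.deletion_cost,        # 4. lower pod-deletion-cost goes first
        -pod.colocated_replicas,  # 5. pods doubled up on a node go first
        pod.seconds_ready,        # 6. ready for a shorter time goes first
        -pod.restart_count,       # 7. more container restarts goes first
        pod.age_seconds,          # 8. newer pods go first
    )

# e.g. pods_to_remove = sorted(pods, key=deletion_sort_key)[:scale_down_count]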

How can I make the system work as expected? More specifically, I need to either "determine whether a pod can be routed to from a Service by something other than ready/not ready" or "avoid using ready/not ready to determine which pod will be deleted".

Any solution is welcome as long as it meets these requirements:

  1. a pod handles one task at a time
  2. when scaling down, pods that are not busy are deleted first

Reproduce the problem

Here I use a simple Python Flask server as the app in the pod.

app.py

from flask import Flask, request
import os

app = Flask(__name__)

status = "AVAILABLE" # Global variable to hold the status
pod_name = os.getenv('POD_NAME')

@app.route('/setStatus', methods=['POST'])
def setStatus():
    data = request.get_json()
    global status
    status = data.get('status', 'AVAILABLE')
    return f'{pod_name} has been set to {status}', 200

@app.route('/readinessCheck', methods=['GET'])
def readinessCheck():
    if status == 'BUSY':
        return f'{pod_name} is busy', 502
    else:
        return f'{pod_name} is available', 200
        

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000, debug=True)

Dockerfile

FROM python:3.9-slim
WORKDIR /app
ADD app.py /app
RUN pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --no-cache-dir Flask==2.0.2 Werkzeug==2.0.2
EXPOSE 5000
CMD ["python", "app.py"]

Build the image with docker build . -t lyudmilalala/pdc-app-img:1.0.0.

Then use the image to configure a Deployment and a Service.

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdc-app-deploy
spec:
  replicas: 5
  selector:
    matchLabels:
      app: pdc-app
  template:
    metadata:
      labels:
        app: pdc-app
      annotations:
        controller.kubernetes.io/pod-deletion-cost: '1'
    spec:
      containers:
      - name: pdc-app-pod
        image: lyudmilalala/pdc-app-img:1.0.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5000
          protocol: TCP
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        readinessProbe:
          httpGet:
            path: /readinessCheck
            port: 5000
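          # No probe timing fields are set, so the defaults apply
          # (periodSeconds: 10, failureThreshold: 3): it can take up to ~30s
          # after status becomes BUSY before the pod is marked NOT READY.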
 
---

apiVersion: v1
kind: Service
metadata:
  name: pdc-app-svc
spec:
  type: NodePort
  ports:
  - port: 5000
    protocol: TCP
    nodePort: 32000
  selector:
    app: pdc-app

Start the service.

$ kubectl get pods -n default
NAME                              READY   STATUS    RESTARTS   AGE
pdc-app-deploy-8545d464db-4wd97   1/1     Running   0          27m
pdc-app-deploy-8545d464db-5pr5d   1/1     Running   0          14m
pdc-app-deploy-8545d464db-hxg94   1/1     Running   0          14m
pdc-app-deploy-8545d464db-knwtb   1/1     Running   0          27m
pdc-app-deploy-8545d464db-lrrhw   1/1     Running   0          14m

Currently, all pods' pod-deletion-cost values are 1.

Note: the commands here are for Windows PowerShell; the quoting may be slightly different in other shells.

$ kubectl get pod pdc-app-deploy-8545d464db-knwtb -o jsonpath="{.metadata.annotations['controller\.kubernetes\.io/pod-deletion-cost']}"
1

Update pod-deletion-cost of some pods.

$ kubectl patch pod pdc-app-deploy-8545d464db-4wd97 -p '{\"metadata\":{\"annotations\":{\"controller.kubernetes.io/pod-deletion-cost\":\"120\"}}}'
pod/pdc-app-deploy-8545d464db-4wd97 patched
$ kubectl get pod pdc-app-deploy-8545d464db-4wd97 -o jsonpath="{.metadata.annotations['controller\.kubernetes\.io/pod-deletion-cost']}"
120

Assume the pod-deletion-cost values are as in the following table.

name                              pod-deletion-cost
pdc-app-deploy-8545d464db-4wd97   40
pdc-app-deploy-8545d464db-95d5b   1
pdc-app-deploy-8545d464db-knwtb   20
pdc-app-deploy-8545d464db-lrrhw   80
pdc-app-deploy-8545d464db-zpx52   40

When all pods are READY, change replicas: 5 to replicas: 4 in deployment.yaml and apply the change. We can see that the pod with the lowest pod-deletion-cost is deleted (here pdc-app-deploy-8545d464db-95d5b, with cost 1). Scaling down again to replicas: 3 works the same way.

Then send a request to set one of the pods to NOT READY.

$ curl -X POST http://localhost:32000/setStatus -H "Content-Type: application/json" -d "{\"status\": \"BUSY\"}"
pdc-app-deploy-8545d464db-lrrhw has been set to BUSY
$ kubectl get pods -n default
NAME                              READY   STATUS    RESTARTS   AGE
pdc-app-deploy-8545d464db-4wd97   1/1     Running   0          67m
pdc-app-deploy-8545d464db-lrrhw   0/1     Running   0          54m
pdc-app-deploy-8545d464db-zpx52   1/1     Running   0          20m

Now decrease replicas: 3 to replicas: 2 in deployment.yaml and apply the change. Even though the NOT READY pod has a higher pod-deletion-cost, it is removed first.

$ kubectl get pods -n default
NAME                              READY   STATUS        RESTARTS   AGE
pdc-app-deploy-8545d464db-4wd97   1/1     Running       0          68m
pdc-app-deploy-8545d464db-lrrhw   0/1     Terminating   0          55m
pdc-app-deploy-8545d464db-zpx52   1/1     Running       0          21m

Some more info

Why I do not use a Job

  1. The application has a long cold-start time, so I prefer to reuse existing pods when possible.
  2. The payload for a task is long; sending it in an HTTP body is more convenient.

Why I do not use a third-party FaaS framework

I tried a number of FaaS frameworks such as OpenFaaS, OpenWhisk, and Knative, but gave up for the following reasons:

  1. They are complex systems that are difficult to maintain.
  2. They restrict how scaling rules can be customized.
  3. When the number of pods reaches its maximum, I expect the cluster to alert the task queue to stop pushing tasks, but I have not found such a feature.

I also read about proposals for new Kubernetes scaling configuration, such as this one, but none of them seem to have landed yet.
