Question Background
I have a Service backed by a Deployment with several replica pods, which receive tasks and handle them asynchronously.
I want each pod to handle only one task at a time, so I make my readinessProbe track a `status` variable. When `status` equals `BUSY`, the probe reports the pod as NOT READY, so no further requests are routed to it.
I also prefer that, when scaling down the deployment, pods that are not busy handling tasks are deleted first. So when I set `status` to `BUSY`, I also set the pod's `controller.kubernetes.io/pod-deletion-cost` annotation to 100, and when the task is finished I set it back to 1.
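For context, the deletion cost is updated from inside the pod roughly as follows; this is a minimal sketch rather than my exact code, and it assumes the official `kubernetes` Python client plus a service account that is allowed to patch pods.

from kubernetes import client, config
import os

# Patch this pod's controller.kubernetes.io/pod-deletion-cost annotation.
# Sketch only: assumes in-cluster credentials and RBAC permission to patch pods.
def set_deletion_cost(cost: int) -> None:
    config.load_incluster_config()  # use the pod's service account token
    body = {
        "metadata": {
            "annotations": {"controller.kubernetes.io/pod-deletion-cost": str(cost)}
        }
    }
    client.CoreV1Api().patch_namespaced_pod(
        name=os.getenv("POD_NAME"), namespace="default", body=body
    )

# set_deletion_cost(100)  # when a task starts (status -> BUSY)
# set_deletion_cost(1)    # when the task finishes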
When running the system, I find that pods running tasks are still deleted first. This is because ready/not ready has a higher priority than the pod deletion cost value in the k8s pod deletion decision logic. The implementation can be found here (lines 822 to 833).
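The effect is easiest to see with a small illustration. The following is a simplified, self-contained Python sketch of that ordering, not the actual controller code (which is Go): the criteria are compared left to right, so readiness is decided before `pod-deletion-cost` is ever consulted.

from dataclasses import dataclass

@dataclass
class PodInfo:
    name: str
    is_ready: bool
    deletion_cost: int

# Pods that sort "smaller" are deleted first; readiness is compared before cost.
def deletion_sort_key(p: PodInfo):
    return (p.is_ready, p.deletion_cost)

pods = [
    PodInfo("busy-pod", is_ready=False, deletion_cost=100),  # running a task
    PodInfo("idle-pod", is_ready=True, deletion_cost=1),
]
print([p.name for p in sorted(pods, key=deletion_sort_key)])
# ['busy-pod', 'idle-pod'] -> the busy (NOT READY) pod is still removed first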
How can I make the system work as expected? More specifically, I need to either "decide whether a pod can receive traffic from the Service by something other than ready/not ready" or "avoid using ready/not ready to decide which pod gets deleted".
Any solution is welcome as long as it meets the requirements:
- a pod handles one task at a time
- when scaling down, pods not busy are deleted first
Reproduce the problem
Here I use a simple Python Flask server as the app in the pod.
app.py
from flask import Flask, request
import os

app = Flask(__name__)

status = "AVAILABLE"  # Global variable to hold the status
pod_name = os.getenv('POD_NAME')

# Called by the task handler to mark this pod BUSY / AVAILABLE
@app.route('/setStatus', methods=['POST'])
def setStatus():
    data = request.get_json()
    global status
    status = data.get('status', 'AVAILABLE')
    return f'{pod_name} has been set to {status}', 200

# Readiness probe endpoint: fail while the pod is BUSY
@app.route('/readinessCheck', methods=['GET'])
def readinessCheck():
    if status == 'BUSY':
        return f'{pod_name} is busy', 502
    else:
        return f'{pod_name} is available', 200

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000, debug=True)
Dockerfile
FROM python:3.9-slim
WORKDIR /app
ADD app.py /app
RUN pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --no-cache-dir Flask==2.0.2 Werkzeug==2.0.2
EXPOSE 5000
CMD ["python", "app.py"]
Build the image with `docker build . -t lyudmilalala/pdc-app-img:1.0.0`.
Then use the image to configure a Deployment and a Service.
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdc-app-deploy
spec:
  replicas: 5
  selector:
    matchLabels:
      app: pdc-app
  template:
    metadata:
      labels:
        app: pdc-app
      annotations:
        controller.kubernetes.io/pod-deletion-cost: '1'
    spec:
      containers:
        - name: pdc-app-pod
          image: lyudmilalala/pdc-app-img:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5000
              protocol: TCP
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          readinessProbe:
            httpGet:
              path: /readinessCheck
              port: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: pdc-app-svc
spec:
  type: NodePort
  ports:
    - port: 5000
      protocol: TCP
      nodePort: 32000
  selector:
    app: pdc-app
Start the service.
$ kubectl get pods -n default
NAME READY STATUS RESTARTS AGE
pdc-app-deploy-8545d464db-4wd97 1/1 Running 0 27m
pdc-app-deploy-8545d464db-5pr5d 1/1 Running 0 14m
pdc-app-deploy-8545d464db-hxg94 1/1 Running 0 14m
pdc-app-deploy-8545d464db-knwtb 1/1 Running 0 27m
pdc-app-deploy-8545d464db-lrrhw 1/1 Running 0 14m
Currently, every pod's `pod-deletion-cost` is 1.
Note that the commands here are for Windows PowerShell; they may differ slightly in other shells.
$ kubectl get pod pdc-app-deploy-8545d464db-knwtb -o jsonpath="{.metadata.annotations['controller\.kubernetes\.io/pod-deletion-cost']}"
1
Update the `pod-deletion-cost` of some of the pods.
$ kubectl patch pod pdc-app-deploy-8545d464db-4wd97 -p '{\"metadata\":{\"annotations\":{\"controller.kubernetes.io/pod-deletion-cost\":\"120\"}}}'
pod/pdc-app-deploy-8545d464db-4wd97 patched
$ kubectl get pod pdc-app-deploy-8545d464db-4wd97 -o jsonpath="{.metadata.annotations['controller\.kubernetes\.io/pod-deletion-cost']}"
120
Assume our pod deletion cost values are as in the following table.
name | pod-deletion-cost |
---|---|
pdc-app-deploy-8545d464db-4wd97 | 40 |
pdc-app-deploy-8545d464db-95d5b | 1 |
pdc-app-deploy-8545d464db-knwtb | 20 |
pdc-app-deploy-8545d464db-lrrhw | 80 |
pdc-app-deploy-8545d464db-zpx52 | 40 |
When all pods are READY, change `replicas: 5` in `deployment.yaml` to `replicas: 4` and apply the change. We can see that the pod with the lowest `pod-deletion-cost` is deleted. Trying again with `replicas: 3` works the same way.
Then send a request to set one of the pods to NOT READY.
$ curl -X POST http://localhost:32000/setStatus -H "Content-Type: application/json" -d "{\"status\": \"BUSY\"}"
pdc-app-deploy-8545d464db-lrrhw has been set to BUSY
$ kubectl get pods -n default
NAME READY STATUS RESTARTS AGE
pdc-app-deploy-8545d464db-4wd97 1/1 Running 0 67m
pdc-app-deploy-8545d464db-lrrhw 0/1 Running 0 54m
pdc-app-deploy-8545d464db-zpx52 1/1 Running 0 20m
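At this point the NOT READY pod should also have been removed from the Service's ready endpoints. A quick way to confirm this is to inspect the Endpoints object; below is a small sketch using the official `kubernetes` Python client (it assumes a usable local kubeconfig, e.g. when run from the developer machine).

from kubernetes import client, config

# Sketch: list which pods are in the ready vs. not-ready address sets of the
# pdc-app-svc Endpoints (assumes a local kubeconfig with cluster access).
config.load_kube_config()
ep = client.CoreV1Api().read_namespaced_endpoints("pdc-app-svc", "default")
for subset in ep.subsets or []:
    ready = [a.target_ref.name for a in (subset.addresses or []) if a.target_ref]
    not_ready = [a.target_ref.name for a in (subset.not_ready_addresses or []) if a.target_ref]
    print("ready:", ready)
    print("not ready:", not_ready)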
Now decrease `replicas: 3` in `deployment.yaml` to `replicas: 2` and apply the change. Even though the NOT READY pod has a higher `pod-deletion-cost`, it is removed first.
$ kubectl get pods -n default
NAME READY STATUS RESTARTS AGE
pdc-app-deploy-8545d464db-4wd97 1/1 Running 0 68m
pdc-app-deploy-8545d464db-lrrhw 0/1 Terminating 0 55m
pdc-app-deploy-8545d464db-zpx52 1/1 Running 0 21m
Some more info
Why I do not use a Job
- The application has a long cold start time, so I prefer to reuse the existing pods when possible.
- The payload for each task is long, so sending it in an HTTP body is more convenient.
Why I do not use a third-party FaaS framework
In fact, I tried a number of FaaS frameworks such as OpenFaaS, OpenWhisk, and Knative, but gave up for the following reasons:
- They are complex systems that are difficult to maintain.
- They restrict how scaling rules can be customized.
- When the number of pods reaches the maximum limit, I expect the cluster to alert the task queue to stop pushing tasks, but I have not found such a feature.
I also read about proposed changes to the k8s scaling configuration such as this, but none of them seems to have landed yet.