I'm using a rolling update strategy for deployment, using these commands:
kubectl patch deployment.apps/<deployment-name> -n <namespace> -p "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"date\":\"`date +'%s'`\"}}}}}"
kubectl apply -f ./kube.deploy.yml -n <namespace>
kubectl apply -f ./kube_service.yml -n <namespace>
YAML properties for rolling update:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: "applyupui-persist-service-deployment"
spec:
  # this replicas value is default
  # modify it according to your case
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 20%
  template:
    metadata:
      labels:
        app: "applyupui-persist-service-selector"
    spec:
      hostAliases:
      - ip: "xx.xx.xx.xxx"
        hostnames:
        - "kafka02.prod.fr02.bat.cloud"
      imagePullSecrets:
      - name: tpdservice-devops-image-pull-secret
      containers:
      - name: applyupui-persist-service
        image: gbs-bat-devops-preprod-docker-local.artifactory.swg-devops.com:443/applyupui-msg-persist-service:latest
        imagePullPolicy: Always
        env:
        - name: KAFKA_BROKER
          value: "10.194.6.221:9092,10.194.6.221:9093,10.194.6.203:9092"
        - name: SCYLLA_DB
          value: "scylla01.fr02.bat.cloud,scylla02.fr02.bat.cloud,scylla03.fr02.bat.cloud"
        - name: SCYLLA_PORT
          value: "9042"
        - name: SCYLLA_DB_USER_ID
          value: "kafcons"
        - name: SCYLLA_DB_PASSWORD
          value: "@%$lk*&we@45"
        - name: SCYLLA_LOCAL_DC_NAME
          value: "Frankfurt-DC"
        - name: DC_LOCATION
          value: "FRA"
        - name: kafka.consumer.retry.topic.timeout.interval
          value: "100"
        - name: kafka.consumer.retry.topic.max.retry.count
          value: "5"
        - name: kafka.consumer.dlq.topic.timeout.interval
          value: "100"
        - name: kafka.producer.timeout.interval
          value: "100"
        - name: debug.log.enabled
          value: "false"
        - name: is-application-intransition-phase
          value: "false"
        - name: is-grace-period
          value: "false"
        - name: SCYLLA_KEYSPACE
          value: "bat_tpd_pri_msg"
        readinessProbe:
          httpGet:
            path: /greeting
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
      nodeSelector:
        deployment: frankfurt
      # resources:
      #   requests:
      #     cpu: 100m
      #     memory: 100Mi
I tried changing the maxSurge and maxUnavailable parameters and different initialDelaySeconds values. Additionally, I tried adding a livenessProbe:

livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20

None of it worked. It gives a connection error indicating some pod is down, and hence there is downtime.
First of all, you need to make sure your YAML file is correct and all indentation is in place. After that, you need to set the values right in order to achieve a zero-downtime update. The examples below show correctly defined RollingUpdate strategies.
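Here is a minimal sketch of the first one, showing only the relevant strategy fields and assuming your 2-replica Deployment:

spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0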
In this example there would be one additional Pod (maxSurge: 1) above the desired number of 2, and the number of available Pods cannot go lower than that number (maxUnavailable: 0). Choosing this config, Kubernetes will spin up an additional Pod, then stop an "old" one. If there's another Node available to deploy this Pod, the system will be able to handle the same workload during deployment. If not, the Pod will be deployed on an already-used Node at the cost of resources from other Pods hosted on the same Node.
You can also try something like this:
With the example above there would be no additional Pods (maxSurge: 0) and only a single Pod at a time would be unavailable (maxUnavailable: 1). In this case, Kubernetes will first stop a Pod before starting up a new one. The advantage is that the infrastructure doesn't need to scale up, but the maximum workload it can handle during the update will be lower.
If you choose to use percentage values for maxSurge and maxUnavailable, you need to remember that:
maxSurge - the absolute number is calculated from the percentage by rounding up
maxUnavailable - the absolute number is calculated from the percentage by rounding down
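As a concrete illustration of the rounding with 2 replicas (the 25% for maxSurge is just an illustrative value; the 20% matches your Deployment):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # 25% of 2 replicas = 0.5, rounded up to 1 extra Pod
    maxUnavailable: 20%  # 20% of 2 replicas = 0.4, rounded down to 0 unavailable Pods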
With the RollingUpdate defined correctly, you also have to make sure your applications provide endpoints that Kubernetes can query to get the app's status. Below, a /greeting endpoint returns an HTTP 200 status when the app is ready to handle requests, and HTTP 500 when it's not:
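One possible shape of that probe, mirroring the values already in your Deployment:

readinessProbe:
  httpGet:
    path: /greeting
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1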
initialDelaySeconds - Time (in seconds) before the first check for readiness is done.
periodSeconds - Time (in seconds) between two readiness checks after the first one.
successThreshold - Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness. Minimum value is 1.
timeoutSeconds - Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
I have tested the above scenarios with success.
Please let me know if that helped.