We are using the Azure DevOps agent configured in an AKS cluster with KEDA ScaledJobs. The AKS node pool SKU is Standard_E8ds_v5 (1 instance), and we are using a persistent volume mounted on an Azure disk.
The ScaledJob spec is as follows:
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  annotations:
  name: azdevops-scaledjob
  namespace: ado
spec:
  failedJobsHistoryLimit: 5
  jobTargetRef:
    template:
      spec:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                - key: kubernetes.azure.com/mode
                  operator: In
                  values:
                  - mypool
                - key: topology.disk.csi.azure.com/zone
                  operator: In
                  values:
                  - westeurope-1
              weight: 2
        containers:
        - env:
          - name: AZP_URL
            value: https://azuredevops.xxxxxxxx/xxxxxxx/organisation
          - name: AZP_TOKEN
            value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
          - name: AZP_POOL
            value: az-pool
          image: xxxxxxxxxxxxxx.azurecr.io/vsts/dockeragent:xxxxxxxxx
          imagePullPolicy: Always
          name: azdevops-agent-job
          resources:
            limits:
              cpu: 1500m
              memory: 6Gi
            requests:
              cpu: 500m
              memory: 3Gi
          securityContext:
            allowPrivilegeEscalation: true
            privileged: true
          volumeMounts:
          - mountPath: /mnt
            name: ado-cache-storage
        volumes:
        - name: ado-cache-storage
          persistentVolumeClaim:
            claimName: azure-disk-pvc
  maxReplicaCount: 8
  minReplicaCount: 1
  pollingInterval: 30
  successfulJobsHistoryLimit: 5
  triggers:
  - metadata:
      organizationURLFromEnv: AZP_URL
      personalAccessTokenFromEnv: AZP_TOKEN
      poolID: "xxxx"
    type: azure-pipelines
But we noticed a strange behavior when trying to trigger a build. Error message in the pipeline:
"We stopped hearing from agent azdevops-scaledjob-xxxxxxx. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error".
The pipeline hangs and keeps running without reporting an error, but in the backend the pod is already in an error state. So each time this occurs we have to cancel the pipeline and initiate a new build, so that it gets scheduled onto an available pod.
When we describe the pod that is in the error state, we can identify this:
azdevops-scaledjob-6xxxxxxxx-b 0/1 Error 0 27h
The pod has the error below:
Annotations: <none>
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage. Container azdevops-agent-job was using 23001896Ki, which exceeds its request of 0.
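The eviction message shows the agent container consuming roughly 23 GiB of ephemeral storage while requesting none. A minimal sketch of declaring ephemeral-storage requests/limits on the agent container, so the scheduler and kubelet account for its disk usage; the sizes are placeholder assumptions to adjust for your builds:

          resources:
            requests:
              cpu: 500m
              memory: 3Gi
              ephemeral-storage: "10Gi"   # placeholder: disk reserved for workspace/image layers
            limits:
              cpu: 1500m
              memory: 6Gi
              ephemeral-storage: "25Gi"   # placeholder: exceeding this gets the pod evicted by the kubelet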
I have set safe-to-evict to false, so AKS won't relocate the pod/job because of a node downscale (see the sketch below). The drawback is that AKS can end up keeping more nodes than needed, so you must make sure the pod/job won't stay around forever.
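For reference, a minimal sketch of how that is typically done, assuming the standard cluster-autoscaler annotation on the job's pod template (the rest of the spec stays as in the manifest above):

  jobTargetRef:
    template:
      metadata:
        annotations:
          # Tell the cluster autoscaler not to evict this pod when scaling a node down.
          cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
      spec:
        # ... containers, volumes, affinity as in the manifest above ...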
Another possibility is to change the node downscale timeout. Terraform code:
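A sketch under the assumption that the cluster is managed with the azurerm provider; the resource name and the timeout values are placeholders:

resource "azurerm_kubernetes_cluster" "aks" {
  # ... existing cluster configuration ...

  auto_scaler_profile {
    # How long a node must be unneeded before the autoscaler scales it down.
    scale_down_unneeded        = "60m"
    # Extra delay after a scale-up before scale-down evaluation starts again.
    scale_down_delay_after_add = "30m"
  }
}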