Kubernetes KEDA ScaledJob is not responding


We are using Azure DevOps agents configured in an AKS cluster with KEDA ScaledJobs. The AKS node pool SKU is Standard_E8ds_v5 (1 instance), and we are using a persistent volume backed by an Azure disk.
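
For reference, the PVC referenced by the ScaledJob below would look roughly like this; the storage class and size here are illustrative, not taken from the actual setup:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-disk-pvc
  namespace: ado
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-csi   # AKS built-in Azure disk CSI class (assumed)
  resources:
    requests:
      storage: 100Gi              # illustrative size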

The ScaledJob spec is as below:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  annotations:
  name: azdevops-scaledjob
  namespace: ado
spec:
  failedJobsHistoryLimit: 5
  jobTargetRef:
    template:
      spec:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                - key: kubernetes.azure.com/mode
                  operator: In
                  values:
                  - mypool
                - key: topology.disk.csi.azure.com/zone
                  operator: In
                  values:
                  - westeurope-1
              weight: 2
        containers:
        - env:
          - name: AZP_URL
            value: https://azuredevops.xxxxxxxx/xxxxxxx/organisation
          - name: AZP_TOKEN
            value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
          - name: AZP_POOL
            value: az-pool
          image: xxxxxxxxxxxxxx.azurecr.io/vsts/dockeragent:xxxxxxxxx
          imagePullPolicy: Always
          name: azdevops-agent-job
          resources:
            limits:
              cpu: 1500m
              memory: 6Gi
            requests:
              cpu: 500m
              memory: 3Gi
          securityContext:
            allowPrivilegeEscalation: true
            privileged: true
          volumeMounts:
          - mountPath: /mnt
            name: ado-cache-storage
        volumes:
        - name: ado-cache-storage
          persistentVolumeClaim:
            claimName: azure-disk-pvc
  maxReplicaCount: 8
  minReplicaCount: 1
  pollingInterval: 30
  successfulJobsHistoryLimit: 5
  triggers:
  - metadata:
      organizationURLFromEnv: AZP_URL
      personalAccessTokenFromEnv: AZP_TOKEN
      poolID: "xxxx"
    type: azure-pipelines

But we noticed a strange behavior when trying to trigger a build. Error message in the pipeline:

"We stopped hearing from agent azdevops-scaledjob-xxxxxxx. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error".

The pipeline stays in a hung state and continues without reporting an error, but in the backend the pod is already in an error state. So each time this occurs we have to cancel the pipeline and initiate a new build, so that the pipeline gets scheduled onto an available pod.

On describing the pod that is in the error state, we could identify this:

azdevops-scaledjob-6xxxxxxxx-b   0/1     Error     0          27h

The pod has the error below:

Annotations:  <none>
Status:       Failed
Reason:       Evicted
Message:      The node was low on resource: ephemeral-storage. Container azdevops-agent-job was using 23001896Ki, which exceeds its request of 0.
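
The eviction reason shows the agent container wrote roughly 22 GiB to the node's ephemeral storage while declaring no request for it, so the kubelet evicted it under disk pressure. One mitigation (a sketch, not part of the original setup) is to declare ephemeral-storage alongside cpu and memory so the scheduler and kubelet account for it; the sizes are assumptions:

resources:
  requests:
    cpu: 500m
    memory: 3Gi
    ephemeral-storage: 10Gi   # reserve node disk for the agent workspace (illustrative)
  limits:
    cpu: 1500m
    memory: 6Gi
    ephemeral-storage: 25Gi   # evict only this pod, predictably, past this point (illustrative)

Alternatively, pointing the agent's work directory at the mounted Azure disk (/mnt above) would keep build output off the node disk entirely, if the agent image allows configuring it.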

1 Answer


I have set safe-to-evict to false, so AKS won't relocate the pod/job during a node downscale.

The drawback is that AKS can end up keeping more nodes than needed, so you must ensure the pod/job won't run there forever (one option is sketched after the example below).

spec:
  jobTargetRef:
    template:
      metadata:
        annotations:
          "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"

Another possibility is to change the node downscale timeout, i.e. how long a node must sit unneeded before the autoscaler may remove it.

Terraform code:

  auto_scaler_profile {
    scale_down_unneeded = "90m"
  }