ECONNREFUSED errors when Kubernetes pods terminate


We have a simple API service that has 2 parts

  1. API Gateway service
  2. Search Logic service

Our API Gateway service is exposed to the internet via a GKE Ingress and works fine. It performs authentication, validation and request aggregation before sending requests on to the second Search Logic service. Our requests take at most 2-3 seconds, occasionally 5 seconds, but usually only a few hundred milliseconds. We're handling around 100-300 requests per second, with approximately 6 and 11 pods of each service respectively.

However, whenever a pod belonging to the second Search Logic service terminates (e.g. due to a scale-down event or a rolling update), our API Gateway service gets random ECONNREFUSED errors when sending requests to that service.

We have checked the logs in our Search Logic service, and when these errors happen the service never actually receives those requests. We read up on this and added a preStop hook that runs sleep 60, to account for the time it takes for a terminating pod to be removed from the Service's endpoints (based on what we saw here). That should delay the SIGTERM, and while it did reduce the frequency of the errors, we still see them intermittently on scaling events, and very frequently when performing a rolling update.
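
For context, the shutdown behaviour we're aiming for in the Search Logic service looks roughly like the sketch below. This is a simplified illustration rather than our actual code; the Express setup and port are placeholders:

const express = require('express');

const app = express();
app.get('/healthz', (req, res) => res.sendStatus(200));
app.get('/livez', (req, res) => res.sendStatus(200));

const server = app.listen(8080);

// SIGTERM only arrives after the preStop sleep 60 finishes, so by then the
// pod should already have been removed from the Service's endpoints. Stop
// accepting new connections and let in-flight requests finish before exiting.
process.on('SIGTERM', () => {
  server.close(() => process.exit(0));
});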

Here are the relevant fields from our Search Logic service:

apiVersion: v1
kind: Service
metadata:
  name: search-api
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: search-api
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: search-api
spec:
  minReadySeconds: 5
  selector:
    matchLabels:
      app: search-api
  strategy:
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: search-api
    spec:
      containers:
      - image: image@sha256
        imagePullPolicy: Always
        lifecycle:
          preStop:
            exec:
              command:
              - sleep
              - "60"
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /livez
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 5
        name: search-api
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 5
        resources:
          limits:
            cpu: 1000m
            memory: 300Mi
          requests:
            cpu: 250m
            memory: 256Mi
      nodeSelector:
        iam.gke.io/gke-metadata-server-enabled: "true"
      serviceAccountName: my-service-account
      terminationGracePeriodSeconds: 90
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: search-api
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      - labelSelector:
          matchLabels:
            app: search-api
        maxSkew: 1
        topologyKey: node
        whenUnsatisfiable: ScheduleAnyway

And here's a sample of the code from our API Gateway service that calls the Search Logic service; it's a very simple HTTP call:

const axios = require('axios');

// Forward the aggregated request to the Search Logic service.
await axios({
    method: 'POST',
    url: `http://search-api:80/logic/endpoint`,
    data: payload,
    headers: {
        connection: 'close',
        'Content-Type': 'application/json',
        Accept: 'application/json',
        'X-Request-Id': request_id
    }
});

We added the connection: close header because we were worried this was being caused by keep-alive connections, but it doesn't seem to have solved the problem.
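
If it's relevant, one direction we've been sketching as a fallback is to disable keep-alive at the agent level (instead of via the header) and retry pure connection failures. The helper name and retry count below are placeholders, not code we actually run in production:

const axios = require('axios');
const http = require('http');

// Disable socket reuse so a request never goes out on a connection that was
// opened to a pod which has since started terminating.
const client = axios.create({
  httpAgent: new http.Agent({ keepAlive: false }),
});

async function callSearch(payload, request_id) {
  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      return await client.post('http://search-api:80/logic/endpoint', payload, {
        headers: {
          'Content-Type': 'application/json',
          Accept: 'application/json',
          'X-Request-Id': request_id,
        },
      });
    } catch (err) {
      // Only retry connection-level failures; surface application errors as-is.
      if (err.code !== 'ECONNREFUSED' || attempt === 3) throw err;
    }
  }
}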
