Pod restarting but I don't see any issues in the logs


We have an EKS cluster on 1.21 (we are upgrading it to 1.24 soon) where a pod seems to be getting restarted at regular intervals. I checked the logs and the memory usage, but I don't see anything that would point to a reason for the restarts.
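
(For reference, the logs of the previous, crashed container instance can be pulled with kubectl's --previous flag; the pod name is taken from the events below and nginx is the container name from the manifest:)

    kubectl logs backend-6cc49d746-ztnvv -c nginx --previous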

I see this in the events of the pod

LAST SEEN   TYPE      REASON      OBJECT                                     MESSAGE
52m         Warning   Unhealthy   pod/backend-6cc49d746-ztnvv   Readiness probe failed: Get "http://192.168.29.43:80/users/sign_in": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
52m         Warning   Unhealthy   pod/backend-6cc49d746-ztnvv   Readiness probe failed: Get "http://192.168.29.43:3000/users/sign_in": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
52m         Warning   Unhealthy   pod/backend-6cc49d746-ztnvv   Liveness probe failed: Get "http://192.168.29.43:3000/users/sign_in": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
52m         Warning   Unhealthy   pod/backend-6cc49d746-ztnvv   Liveness probe failed: Get "http://192.168.29.43:80/users/sign_in": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
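
(Per-pod events like these can be listed directly with a field selector; a sketch, assuming the pod is in the current namespace:)

    kubectl get events --field-selector involvedObject.name=backend-6cc49d746-ztnvv --sort-by=.lastTimestamp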

My readiness and liveness probes simply check the app by loading the sign-in page. This has been working for a long time, but suddenly we are noticing the restart count increasing.

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /users/sign_in
            port: 80
            scheme: HTTP
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        name: nginx
        ports:
        - containerPort: 80
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /users/sign_in
            port: 80
            scheme: HTTP
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5

I see this when I describe the pod while it's in the restart loop:

Containers:
  1:
    Container ID:   docker://cf5b2086db6d55f
    Image:          60
    Image ID:       1
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sun, 17 Sep 2023 17:01:21 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Sun, 17 Sep 2023 16:01:21 +0200
      Finished:     Sun, 17 Sep 2023 17:01:18 +0200
    Ready:          True
    Restart Count:  3

It looks like exit code 137 occurs when a container uses too much memory, but I have not specified any memory limit. What default is it using? Could memory be the issue here that is causing the restarts?
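
One note on telling the two cases apart: an out-of-memory kill normally shows up as Reason: OOMKilled in the container's last state, while Error (as above) is also what the kubelet reports when it kills a container after repeated liveness-probe failures. Both angles can be checked directly (a sketch; kubectl top requires metrics-server to be installed):

    # Why the previous container instance was terminated (OOMKilled vs Error)
    kubectl get pod backend-6cc49d746-ztnvv \
      -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

    # Current memory usage of the pod (requires metrics-server)
    kubectl top pod backend-6cc49d746-ztnvv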

I am not sure in which direction to investigate to resolve the issue; any help would be great.

1 Answer

Shivani:

As the error "Client.Timeout exceeded while awaiting headers" says, the probe was considered failed by Kubernetes because the endpoint didn't respond within the specified time.
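
Before changing anything, you can confirm that the endpoint is genuinely slow by timing it from inside the container (a rough check; it assumes curl is available in the image):

    # Anything near or above the configured timeoutSeconds (currently 5)
    # would explain the probe failures.
    kubectl exec backend-6cc49d746-ztnvv -c nginx -- \
      curl -s -o /dev/null -w '%{time_total}\n' http://localhost/users/sign_in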

All that needs to be done is to increase timeoutSeconds, for example to 10, for both the livenessProbe and the readinessProbe.

timeoutSeconds: This parameter is part of the configuration for both liveness and readiness probes. It specifies the number of seconds after which the probe times out. The default value is 1 second. If a probe doesn’t respond within the specified timeoutSeconds, Kubernetes considers the probe to have failed.
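
For example, the livenessProbe from the question would become the following; the same one-line change applies to the readinessProbe:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /users/sign_in
            port: 80
            scheme: HTTP
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 10   # raised from 5 so slow responses no longer fail the probe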