I am running workloads on an EKS cluster (v1.25.16-eks), and some of my pods get restarted after the liveness probe times out, even though the service inside them is running fine:
Warning Unhealthy 19m (x9 over 25h) kubelet Liveness probe failed: command "docker-healthcheck" timed out
timeoutSeconds is set to 1 second. This is the content of docker-healthcheck:
#!/bin/sh
set -e
if env -i REQUEST_METHOD=GET SCRIPT_NAME=/health SCRIPT_FILENAME=/health cgi-fcgi -bind -connect localhost:9000; then
    exit 0
fi
exit 1
My hotfix is to increase timeoutSeconds, which works well. But I cannot figure out why those probes time out, as they are simple exec probes that make a single FastCGI request to localhost. They certainly take much less than 1 second when run manually from inside the container.
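A few manual runs may miss rare slow ones, so the timing can also be sampled in a loop. The following is only a minimal sketch, assuming that GNU coreutils `timeout` is available in the container (it exits with status 124 when the limit is hit; other implementations may behave differently). The helper name `count_timeouts` is hypothetical and not part of my setup.

```shell
#!/bin/sh
# count_timeouts LIMIT RUNS CMD...
# Runs CMD RUNS times, each under `timeout LIMIT`, and prints how many
# runs were killed for exceeding the limit.
# Assumption: GNU coreutils `timeout`, which exits with 124 on timeout.
count_timeouts() {
    limit=$1
    runs=$2
    shift 2
    slow=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        status=0
        # `|| status=$?` keeps the loop alive even under `set -e`.
        timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            slow=$((slow + 1))
        fi
        i=$((i + 1))
    done
    echo "$slow"
}
```

Inside the container this would be called as `count_timeouts 1 1000 docker-healthcheck`. Note that it only measures the script itself, not whatever the kubelet and the container runtime add when they start the exec session, so a result of 0 does not rule out overhead outside the container.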
I am running out of explanations and cannot find any EKS-specific reason. I looked for documented kubelet overhead when executing exec probes but could not find any. What could explain this behavior?
Thanks!