Problem
I have a health check for a managed instance group on GCP that continuously times out. Therefore, the group manager thinks all instances are always unhealthy.
My guess is I've misconfigured the firewall, but I can't determine if that's the case or how I've misconfigured it.
Note: I'm not asking about the load balancer health checks. I'm asking about the managed instance group health checks for autohealing.
Error
In the "Errors" tab of the managed instance group, I always see the following error:
WAITING_FOR_HEALTHY_TIMEOUT_EXCEEDED
Waiting for HEALTHY state timed out (autohealingPolicy.initialDelay=300 sec) for instance projects/project-402019/zones/us-central1-f/instances/server-20bd and health check projects/project-402019/global/healthChecks/server-health-check.
Debug steps
- When I ssh into a VM and run `curl http://127.0.0.1:3001/hello`, I successfully get `http://127.0.0.1:3001/hello` (I'm running a simple echo service right now).
- I've set up the firewall to allow incoming traffic on port 3001 as described in the docs. See firewall setup below.
- Set the initialization period and autohealing initial delay to 5+ minutes in case it takes forever to start up (I can ssh into the VM after 30s and curl successfully).
- Set the health check timeout to 30 seconds (when I curl it responds instantly).
- Opened all ports on the firewall.
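One more check that can help separate a firewall problem from a listener problem: a `curl` against `127.0.0.1` only exercises the loopback interface, while health check probes arrive over the network. The following is a minimal sketch, not an official GCP tool. The helper names `can_connect` and `probe` are made up for illustration, and the internal IP you pass in is whatever your VM actually has.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe(internal_ip, port=3001):
    """Compare reachability over loopback with reachability on the VM's own IP.

    internal_ip is an assumption supplied by the caller (the VM's primary
    internal address); nothing here looks it up automatically.
    """
    return {
        "loopback": can_connect("127.0.0.1", port),
        "internal_ip": can_connect(internal_ip, port),
    }
```

Run on the VM itself, something like `probe("10.128.0.2")` (a placeholder address) would report both paths. If `loopback` is true while `internal_ip` is false, the service is most likely listening on the loopback interface only, since a connection from the VM to its own address should not depend on the VPC firewall rules.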
Setup
Firewall
Health check
I've also tried multiple different settings here:
- TCP vs. HTTP
- Intervals/timeouts ranging from 2-30 seconds.
- Healthy/unhealthy thresholds from 1-3.
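Turns out your binary needs to listen on `0.0.0.0`, not `127.0.0.1`. A socket bound to `127.0.0.1` only accepts connections that originate on the VM itself, which is why the local `curl` succeeded while the health check probes, which arrive over the network, timed out. Once I made that change to my code, everything else worked.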
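To illustrate the fix, here is a minimal sketch of an echo service that binds to all interfaces. It is not the original service, whose code and language are not shown. `EchoHandler` and `make_server` are hypothetical names, and the port default of 3001 is taken from the question.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Replies to any GET with the request path, as a stand-in echo service."""

    def do_GET(self):
        body = self.path.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging to keep the example quiet.
        pass


def make_server(host="0.0.0.0", port=3001):
    """Create the server.

    host="0.0.0.0" accepts connections on every IPv4 interface, including
    the one the health checker reaches. host="127.0.0.1" would accept
    connections from the VM itself only.
    """
    return HTTPServer((host, port), EchoHandler)
```

Calling `make_server().serve_forever()` starts the service on port 3001. Passing `host="127.0.0.1"` instead should reproduce the original symptom: a local `curl` succeeds, but probes from outside the VM never connect.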