I'm hoping this list of facts helps me focus on the fix.
- Three node, per private subnet ARM-based HashiCorp Vault cluster on AWS, deployed with Terraform in dev environment
- NLB
- I can run
vault status
on any of the nodes and a sealed app reports back immediately - Vault is thus configured OK, as the
vault status
on the boxes works fine - On the nodes, I haveVAULT_CACERT=/opt/vault/tls/vault-ca.pem
andVAULT_ADDR=https://127.0.0.1:8200
- On my local macbook at home the same command results in
Error checking seal status: Get "https://vault.example.com:8200/v1/sys/seal-status": net/http: TLS handshake timeout
- The CNAME resolves OK to the AWS A record created when the IaC dropped
- I can ssh to any box directly (using sshuttle to create a VPN from my machine to the private address range the cluster is in)
- I can
telnet vault.example.com 8200
from my mac OK - In the AWS console, the LB is listed as having the listener configured as I wrote it, on
TCP:8200
, and pointing to my target group, correctly with no security policy, no default TLS cert for passthrough withTCP
set on both the listener and the target group. - The target group lists three nodes in an unhealthy status on port 8200.
- The nodes are
Instance reachability check passed
- No security group on the LB thus allowing traffic
- I run a
tcpdump -i ens5 | egrep '10.0.11.3|10.0.17.126|10.0.10.59'
and I can see the LB IP sending SYN packets to the node I was on but not FIN packets returned, as the TLS convo never completes.
I've had a look at the AWS page for troubleshooting load balancers and cannot see how to check the specific return code from the LB's attempt to forward to the health check endpoint. I've used the CIS benchmarked ubuntu 22.04 marketplace image as a base image and used Packer to bake in the Vault binary and config. I've checked the CIS benchmark for this version and there's nothing specific in there suggesting something might have been deliberately blocked in the OS to prevent this connection from continuing.