My service in production lost database access around 2am CET this morning.
It had been running fine for 18 months.
I access a GCP PostgreSQL instance through an external IP, using a sidecar container as a Cloud SQL proxy on the VM that runs Hasura. The sidecar logs show:
2023/10/06 14:09:26 couldn't connect to "XXX:europe-west1:database-5jdu": dial tcp 34.76.132.64:3307: connect: connection timed out
hasura-sidecar-1 | 2023/10/06 14:09:26 New connection for "XXX:europe-west1:database-5jdu"
Here is the Docker Compose file:
version: '3.8'
services:
  graphql-engine:
    image: hasura/graphql-engine:v2.5.1
    deploy:
      replicas: 1
    restart: always
    networks:
      - nginx-proxy
      - cloud
    env_file:
      - .env
    depends_on:
      - sidecar
  nginx:
    image: nginx
    restart: always
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./data/logs:/var/log/nginx
      - ./data/letsencrypt:/etc/letsencrypt:ro
      - ./data/www:/var/www:ro
      - ./data/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./data/nginx/mime.types:/etc/nginx/mime.type:ro
      - ./data/nginx/snippets:/etc/nginx/snippets:ro
      - ./data/nginx/conf.d:/etc/nginx/conf.d:ro
    depends_on:
      - graphql-engine
    networks:
      - nginx-proxy
    command: "/bin/sh -c 'while :; do sleep 6h & wait $${!}; nginx -s reload; done & nginx -g \"daemon off;\"'"
  certbot:
    image: certbot/certbot
    restart: unless-stopped
    volumes:
      - ./data/letsencrypt:/etc/letsencrypt
      - ./data/www/letsencrypt:/var/www/letsencrypt
    entrypoint: "/bin/sh -c 'trap exit TERM; while :; do /usr/local/bin/certbot renew; sleep 12h & wait $${!}; done;'"
  sidecar:
    image: gcr.io/cloudsql-docker/gce-proxy:1.30.0
    restart: always
    networks:
      - cloud
    # ports:
    #   - 127.0.0.1:5432:5432
    command: "/cloud_sql_proxy -instances=XXX:europe-west1:database-5jdu=tcp:0.0.0.0:5432"
networks:
  nginx-proxy:
    name: nginx-proxy
  cloud:
Restarting the DB, the proxy, and the VM didn't change anything.
I insist: this problem happened out of the blue, and I hadn't touched the production setup in more than 12 months...
Does anyone know of something that changed in the way the managed PostgreSQL DB is accessed since last night?
Thanks for your help. My customers are stuck and keep calling me, and I have no way of helping them.
Serge
Usually, when the proxy prints the error "connect: connection timed out", it indicates that there is a firewall configuration blocking packets somewhere between the proxy and one of these 3 services:
- the Cloud SQL Admin API (sqladmin.googleapis.com)
- the OAuth 2.0 token endpoint used for credentials
- the instance itself (34.76.132.64 on port 3307, which is the connection that timed out in your log)
I imagine you have already checked whether any network firewall rules changed. You may also check the network configuration for that Cloud SQL instance in the Google Cloud console. I am not aware of any network changes within the Cloud SQL product itself today that would affect your instance.
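To narrow it down, you could test each of those paths directly from the VM. A rough sketch (the instance IP and names below are taken from your logs; the gcloud commands assume the CLI is available on the VM or in Cloud Shell):

# 1. Can the VM reach the instance's proxy port at all?
#    (nc may not be installed; any TCP connect test will do)
nc -vz 34.76.132.64 3307

# 2. Can the VM reach the Cloud SQL Admin API?
#    Any HTTP status code printed here means the endpoint is reachable.
curl -s -o /dev/null -w "%{http_code}\n" https://sqladmin.googleapis.com/

# 3. List VPC firewall rules that could block egress traffic from the VM.
gcloud compute firewall-rules list --filter="direction=EGRESS"

# 4. Confirm the instance still has the public IP you expect.
gcloud sql instances describe database-5jdu --project=XXX \
    --format="value(ipAddresses[].ipAddress)"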
Also, I noticed that you are using the older 1.30 version of the proxy. Would you consider upgrading to a newer version of the proxy, v2.7.0? It has some important improvements to both logging and authentication that may make it easier to find the problem.
Your sidecar container definition using v2.7.0 would look like this:
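(A sketch based on the v2 flags, keeping your existing network and port; the v2 image's entrypoint is the proxy binary itself, so command only needs the flags and the instance connection name, and the flag syntax differs from v1.)

  sidecar:
    image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.7.0
    restart: always
    networks:
      - cloud
    command:
      # v2 flags replace the v1 "-instances=...=tcp:0.0.0.0:5432" syntax
      - "--address=0.0.0.0"
      - "--port=5432"
      # structured logs make the proxy's errors easier to read and filter
      - "--structured-logs"
      - "XXX:europe-west1:database-5jdu"

Your current Compose file doesn't mount a credentials file, so I assume the proxy is using the VM's service account; that continues to work the same way with v2, and your Hasura connection string pointing at sidecar:5432 should not need to change.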
You can find more discussion on sidecar configuration in the Sidecar Examples.