Sidecar connection timeout suddenly this morning


My service in production lost database access around 2am CET this morning. It had been running fine for 18 months.

I access a GCP PostgreSQL (Cloud SQL) instance through its external IP, using a Cloud SQL Proxy sidecar container on the VM that runs Hasura.

2023/10/06 14:09:26 couldn't connect to "XXX:europe-west1:database-5jdu": dial tcp 34.76.132.64:3307: connect: connection timed out
hasura-sidecar-1  | 2023/10/06 14:09:26 New connection for "XXX:europe-west1:database-5jdu"

Here is the Docker Compose file:

version: '3.8'
services:
  graphql-engine:
    image: hasura/graphql-engine:v2.5.1
    deploy:
      replicas: 1
    restart: always
    networks:
      - nginx-proxy
      - cloud
    env_file:
      - .env
    depends_on:
      - sidecar

  nginx:
    image: nginx
    restart: always
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./data/logs:/var/log/nginx
      - ./data/letsencrypt:/etc/letsencrypt:ro
      - ./data/www:/var/www:ro
      - ./data/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./data/nginx/mime.types:/etc/nginx/mime.type:ro
      - ./data/nginx/snippets:/etc/nginx/snippets:ro
      - ./data/nginx/conf.d:/etc/nginx/conf.d:ro
    depends_on:
      - graphql-engine
    networks:
      - nginx-proxy
    command: "/bin/sh -c 'while :; do sleep 6h & wait $${!}; nginx -s reload; done & nginx -g \"daemon off;\"'"

  certbot:
    image: certbot/certbot
    restart: unless-stopped
    volumes:
      - ./data/letsencrypt:/etc/letsencrypt
      - ./data/www/letsencrypt:/var/www/letsencrypt
    entrypoint: "/bin/sh -c 'trap exit TERM; while :; do /usr/local/bin/certbot renew; sleep 12h & wait $${!}; done;'"

  sidecar:
    image: gcr.io/cloudsql-docker/gce-proxy:1.30.0
    restart: always
    networks:
      - cloud
#    ports:
#      - 127.0.0.1:5432:5432
    command: "/cloud_sql_proxy -instances=XXX:europe-west1:database-5jdu=tcp:0.0.0.0:5432"

networks:
  nginx-proxy:
    name: nginx-proxy
  cloud:

Restarting the database, the proxy, and the VM didn't change anything.

To be clear, this problem happened out of the blue; I hadn't touched the production setup in more than 12 months...

Does anyone know of something that changed in the way managed PostgreSQL instances are accessed since last night?

Thanks for your help; my customers are stuck and keep calling me, and I have no way of helping them.

Serge

1 Answer

Jonathan Hess

Usually, when the proxy prints the error connect: connection timed out, it indicates that a firewall configuration is blocking packets somewhere between the proxy and one of these three endpoints (see the connectivity checks sketched after this list):

  • the database's IP address
  • the GCP metadata server
  • the GCP SQL Admin API server
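
As a quick way to narrow down which hop is blocked, you can test reachability of each endpoint from the VM with standard tools. This is only a sketch: the IP and port 3307 come from your log, and the metadata and Admin API hostnames are the standard Google endpoints.

# Cloud SQL server-side proxy port on the instance's public IP (taken from the log above)
nc -vz -w 5 34.76.132.64 3307

# GCP metadata server (the proxy uses it to obtain credentials on a GCE VM)
curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email

# Cloud SQL Admin API endpoint (the proxy uses it to fetch instance metadata and certificates)
nc -vz -w 5 sqladmin.googleapis.com 443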

I imagine you have already checked whether any network firewall rules changed. You may also want to review the network configuration for that Cloud SQL instance in the Google Cloud console, for example with the gcloud commands sketched below. I am not aware of any network changes within the Cloud SQL product itself today that would affect your instance.
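
For example, assuming the gcloud CLI is installed and authenticated against the project (the project ID is redacted as XXX in your post, so substitute your own):

# VPC firewall rules in the project that hosts the VM
gcloud compute firewall-rules list

# Public IP settings and authorized networks of the Cloud SQL instance
gcloud sql instances describe database-5jdu --project XXX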

Also, I noticed that you are using the older 1.30 version of the proxy. Would you consider upgrading to a newer version of the proxy, such as v2.7.0? It has some important improvements to both logging and authentication that may make it easier to find the problem.

Your sidecar container definition using v2.7.0 would look like this:

  sidecar:
    image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.7.0
    # The v2 image's entrypoint is the proxy binary, so runtime flags go under command:
    command:
      # Enable structured logging with LogEntry format:
      - "--structured-logs"
      # Listen on all interfaces so graphql-engine can reach the proxy over the
      # compose network (v2 defaults to 127.0.0.1, unlike the v1 tcp:0.0.0.0:5432 setup)
      - "--address=0.0.0.0"
      - "--port=5432"
      - "XXX:europe-west1:database-5jdu"
    restart: always
    networks:
      - cloud
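
For completeness, the graphql-engine container would keep pointing at the proxy by its compose service name; since your .env isn't shown, the user, password, and database name below are placeholders:

HASURA_GRAPHQL_DATABASE_URL=postgres://USER:PASSWORD@sidecar:5432/DBNAME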

You can find more discussion on sidecar configuration in the Cloud SQL Proxy repository's sidecar examples.