We have 6 cloud run containers, all running with 2GiB of memory, 1 CPU, and 1 minimum instance. All of them have very similar dockerfiles that essentially start up simple express services.
We use a Postgres database running on Cloud SQL with 2 vCPUs, 8GB memory that all 6 cloud run containers have access to.
We use Prisma (https://github.com/prisma/prisma) to access the database.
One of our cloud run containers is used for us to process simple jobs on demand using BullMQ (a redis-based queue). In that container that, the infra is essentially the same as the others; however, occasionally, prisma is unable to reach our database. This generally only happens on the first attempt to connect to the database after a restart, but the amount of time after that restart can be any amount (even many hours). There is almost no activity on the other cloud run containers as well so there is only a very light volume of usage of the database at any given time.
The exact error message we see is Timed out fetching a new connection from the connection pool. More info: http://pris.ly/d/connection-pool (Current connection pool timeout: 10, connection limit: 5) Since this is the first connection, it means that there are definitely not a lack of connections available in the pool. Whenever everything is working, connections are obtained very fast (milliseconds). On failures, it takes the full 10 seconds and then times out.
We've tried many ways of solving this including:
- Bumping prisma's timeout to 20 seconds and its connection limit to 10 connections. This still causes failures
- Adding retries on the connection. We added 3 and sometimes it would still fail after 3 attempts. Other times it would only fail 0-2 times. However, if we ever manually retried a 4th time, it worked.
- Bumping prisma's timeout to infinite. So far we haven't seen any failures when doing that but it may just be low volume of attempts - and all the connections still happened in milliseconds.
I can't imagine that this issue is caused by Prisma given:
- its only happening on this one cloud run container. we use the same prisma configuration (though different connection pools) in all the containers
- its happening on both our staging and production instances on the same jobs cloud run container. So this feels like it has something to do with the setup of this cloud run instance.
Attempts at reproduction:
- wait many hrs with 0 requests and retry. haven't been able to repro
- manually deploy a new cloud run instance and try after that one is taking all the traffic as the first request. haven't been able to repro (even though most, if not all, of the failures have been on the first request)
More info exists on the prisma page question I asked: https://github.com/prisma/prisma/discussions/22252