We have a public gRPC API. One client consumes it following the REST paradigm of creating a connection (channel) for every request, and we suspect they are not closing these channels once each request has completed.
On the server side, everything functions fine for a while, then some resource appears to become exhausted: requests back up on the servers and are not processed, which causes our proxy to time out and return an unavailable response. Restarting the server fixes the issue, and I can see the backed-up requests being flushed in the logs as the servers shut down.
Unfortunately, there seems to be no way to monitor what is happening on the server side and prune these connections. We have the following keepalive settings, but they don't appear to have any impact:
grpc.KeepaliveParams(keepalive.ServerParameters{
    MaxConnectionIdle:     time.Minute * 5,  // close a connection after 5m with no active RPCs
    MaxConnectionAge:      time.Minute * 15, // close any connection older than 15m
    MaxConnectionAgeGrace: time.Minute * 1,  // give in-flight RPCs 1m to finish first
    Time:                  time.Second * 60, // ping an idle client after 60s
    Timeout:               time.Second * 10, // drop the connection if the ping isn't acked in 10s
})
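For context, these parameters are passed as a server option when the server is constructed. A minimal wiring sketch, assuming grpc-go; the enforcement policy shown is an addition for illustration, not part of the original configuration:

srv := grpc.NewServer(
    grpc.KeepaliveParams(keepalive.ServerParameters{ /* values as above */ }),
    // Without an enforcement policy, client keepalive pings arriving more
    // often than every 5 minutes (grpc-go's default MinTime) are answered
    // with GOAWAY, which itself churns connections.
    grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
        MinTime:             time.Second * 20, // tolerate client pings every 20s
        PermitWithoutStream: true,             // even when no RPC is in flight
    }),
)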
We have also tried raising MaxConcurrentStreams from the default of 250 to 1000, but the pods end up in the same state.
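In grpc-go that limit is also a server option; for reference, a one-line sketch:

srv := grpc.NewServer(grpc.MaxConcurrentStreams(1000)) // per-connection HTTP/2 stream cap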
Is there any way that we can monitor channel creation, usage, and destruction on the server side, if only to prove or disprove that the client's method of consumption is causing the problem?
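One way to get this visibility in grpc-go is a stats.Handler, which is invoked every time a transport-level connection opens or closes. A minimal sketch (the connCounter type and log format are illustrative, not an existing API):

package main

import (
    "context"
    "log"
    "sync/atomic"

    "google.golang.org/grpc"
    "google.golang.org/grpc/stats"
)

// connCounter implements stats.Handler and logs each connection open/close.
type connCounter struct {
    open int64
}

func (c *connCounter) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
    return ctx
}
func (c *connCounter) HandleRPC(context.Context, stats.RPCStats) {}
func (c *connCounter) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
    return ctx
}

func (c *connCounter) HandleConn(_ context.Context, s stats.ConnStats) {
    switch s.(type) {
    case *stats.ConnBegin:
        log.Printf("conn opened; open=%d", atomic.AddInt64(&c.open, 1))
    case *stats.ConnEnd:
        log.Printf("conn closed; open=%d", atomic.AddInt64(&c.open, -1))
    }
}

func main() {
    srv := grpc.NewServer(grpc.StatsHandler(&connCounter{}))
    _ = srv // register services and call Serve as usual
}

A steadily climbing open count with few ConnEnd events would support the theory that the client is leaking channels.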
Verbose logging has not been helpful, as it seems to only log the server's own client activity (i.e. the server consuming pub/sub and logging as a client). I have also looked at channelz, but we use mutual TLS auth and I have been unable to get it working on our production pods.
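One possible workaround for the mTLS obstacle, sketched here as an assumption rather than anything from the original setup: register channelz on a second, plaintext gRPC server bound to loopback only, so it is reachable via kubectl port-forward without touching the mutually-authenticated public port (the port number is arbitrary):

import (
    "log"
    "net"

    "google.golang.org/grpc"
    channelzsvc "google.golang.org/grpc/channelz/service"
)

func serveChannelz() {
    lis, err := net.Listen("tcp", "127.0.0.1:9091")
    if err != nil {
        log.Fatalf("channelz listen: %v", err)
    }
    debug := grpc.NewServer() // no TLS: only reachable from inside the pod
    channelzsvc.RegisterChannelzServiceToServer(debug)
    go func() {
        if err := debug.Serve(lis); err != nil {
            log.Printf("channelz server stopped: %v", err)
        }
    }()
}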
We have instructed our client to use a single channel or, if that is not possible, to close the channels they create, but they are a large corporate and move very slowly. We have also not been able to examine their code; we only know that they are developing with dotnet. Nor have we been able to replicate the behaviour running our own Go client at similar volumes.
The culprit is MaxConnectionIdle: it will always create a new http2Server after the specified amount of time, and eventually your service will crash due to a goroutine leak. Remove MaxConnectionIdle and MaxConnectionAge, then (preferably) make sure both ServerParameters and ClientParameters are using the same Time and Timeout.
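For illustration, a sketch of the resulting configuration, mirroring the Time/Timeout values from the question on a Go client (the actual client here is dotnet, so this only shows the shape of the matched settings):

// Server side: idle/age limits removed, only the probe settings kept.
serverOpt := grpc.KeepaliveParams(keepalive.ServerParameters{
    Time:    time.Second * 60, // probe the client after 60s of inactivity
    Timeout: time.Second * 10, // drop the connection if the ping goes unanswered
})

// Client side: the same Time and Timeout, as recommended above.
clientOpt := grpc.WithKeepaliveParams(keepalive.ClientParameters{
    Time:    time.Second * 60,
    Timeout: time.Second * 10,
})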