I've got the following pod which consistently failed its liveness probes with 'connection reset by peer' and was marked unhealthy, but was not restarted:
146 Unhealthy: (combined from similar events): Liveness probe failed: Get "http://[ip address]:8080/q/health/live": read tcp [local node ip address]:54580->[ip address]:8080: read: connection reset by peer
The pod was only restarted by the cluster once the probe returned HTTP status 503. Is this expected behavior for Kubernetes liveness probes, and is there a way to adjust the configuration so that pods that are not accepting connections get restarted?
One thing to note: the liveness probe was configured with a long interval, only one check every 30 seconds; it has since been made more frequent. No successful probes were recorded during the 5-hour window in which the pod was not restarted.
livenessProbe:
  failureThreshold: 5
  httpGet:
    path: /q/health/live
    port: http
    scheme: HTTP
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 30
  initialDelaySeconds: 30
readinessProbe:
  failureThreshold: 5
  httpGet:
    path: /q/health/ready
    port: http
    scheme: HTTP
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 30
  initialDelaySeconds: 15
The pod is part of a Deployment, not a bare ReplicaSet, and no PodDisruptionBudget is present.
Expected: the pod is restarted after being marked unhealthy through the liveness config. What happened: the failing pod was kept around for 5 hours.
This error occurs if the TCP connection with a peer across the network is closed by the peer, or is closed unexpectedly.
To resolve this issue:
If you are trying to perform background work while CPU is being throttled, try using the "CPU is always allocated" CPU allocation setting.
Ensure that you stay within the outbound request timeouts. If your application keeps any connection idle beyond this threshold, the gateway has to reap the connection.
By default, the TCP keepalive socket option is disabled for Cloud Run. There is no direct way to configure keepalive at the service level, but you can enable it for each socket connection by setting the appropriate socket options when opening a new TCP connection, depending on the client library your application uses.
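As an illustration (assuming a plain java.net.Socket client; the host, port, and connect timeout below are placeholders), keepalive can be enabled per connection like this:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class KeepAliveConnection {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket()) {
            // Enable TCP keepalive before connecting, so the OS sends
            // periodic probes on an otherwise idle connection.
            socket.setKeepAlive(true);
            // Placeholder host and port; use your upstream service here.
            socket.connect(new InetSocketAddress("upstream.example.com", 8080), 5_000);
            // ... use the socket as usual ...
        }
    }
}

Most client libraries expose the same SO_KEEPALIVE option through their own configuration; check the documentation of the library your application uses.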
Occasionally, outbound connections are reset due to infrastructure updates. If your application reuses long-lived connections, we recommend configuring it to re-establish connections rather than reuse a dead connection.
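A rough sketch of that idea in Java, assuming the JDK's built-in java.net.http.HttpClient (the endpoint URL and timeouts are placeholders): if a request over a pooled connection fails with an I/O error, retry it on a freshly built client so a new connection is opened instead of the dead one being reused.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ReconnectOnReset {
    // Placeholder endpoint; stands in for any long-lived upstream connection.
    private static final URI TARGET = URI.create("http://upstream.example.com:8080/api/data");

    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(TARGET)
                .timeout(Duration.ofSeconds(5))
                .build();
        HttpClient client = newClient();
        try {
            send(client, request);
        } catch (IOException e) {
            // The pooled connection may have been reset underneath us
            // (for example after an infrastructure update). Build a fresh
            // client so the retry opens a new connection.
            send(newClient(), request);
        }
    }

    private static void send(HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
    }

    private static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
    }
}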