Intermittent Redis connectivity problem from GKE to Redis Memorystore


Experiencing frequent Redis connection drops/timeouts since 13 December, 2 PM GMT+8, with error messages like:

  • RedisException: Redis server 10.X.X.X:6379 went away
  • RedisException: Connection timed out
  • RedisException: read error on connection to 10.X.X.X:6379
  • ErrorException: Redis::get(): send of 43 bytes failed with errno=32 Broken pipe
  • ErrorException: Redis::lPush(): send of 6076 bytes failed with errno=32 Broken pipe

Steps to reproduce: It happens intermittently on certain connections to Redis. From the stack traces it doesn't look like an application bug.
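While the root cause is investigated, a common client-side mitigation for transient "went away"/"broken pipe" errors is to retry idempotent Redis calls with backoff. The sketch below is illustrative Python rather than the app's PHP/php-redis code; the function name, exception types, and backoff parameters are assumptions:

```python
import time

def with_retries(op, attempts=3, base_delay=0.1,
                 retryable=(ConnectionError, TimeoutError)):
    """Run op(), retrying on transient connection errors with exponential backoff.

    Only safe for idempotent operations (GET, EXISTS, ...), since a
    broken pipe can occur after the server already processed the command.
    """
    for attempt in range(attempts):
        try:
            return op()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
```

This masks brief blips but should not replace fixing the underlying network issue, and non-idempotent commands (e.g. `lPush`) need more care before being retried.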

Other information (workarounds you have tried, documentation consulted, etc):

PHP Laravel application running on GKE Autopilot pods, connecting to Redis using the php-redis driver. There were no issues connecting to Redis before this started, and there have been no new deployments or code changes in the past 4 days.

Checked that the Redis servers are all healthy, with a >60% buffer between actual usage and the CPU/memory limits. The GKE workloads also have a reasonable CPU and memory buffer.

Tried redeploying the application and restarting the Pods in GKE, but the same problem persists.

Occasionally experience high latency when connecting to Redis manually with redis-cli from the GKE pods: it took 4-5 s just to establish a connection, which is abnormal.
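The slow redis-cli connects can be narrowed down by timing the TCP handshake and the Redis PING round trip separately: a slow handshake points at the network path, while a slow PING points at the server itself. A minimal probe sketch in Python (host/port values are placeholders; it speaks the Redis inline-command form of PING directly over a socket):

```python
import socket
import time

def probe(host, port=6379, timeout=5.0):
    """Return (tcp_connect_seconds, ping_rtt_seconds) for a Redis endpoint."""
    t0 = time.monotonic()
    sock = socket.create_connection((host, port), timeout=timeout)
    connect_s = time.monotonic() - t0          # TCP handshake time
    try:
        sock.settimeout(timeout)
        t1 = time.monotonic()
        sock.sendall(b"PING\r\n")              # inline command; Redis replies +PONG\r\n
        reply = sock.recv(64)
        ping_s = time.monotonic() - t1         # server round-trip time
        if not reply.startswith(b"+PONG"):
            raise RuntimeError(f"unexpected reply: {reply!r}")
        return connect_s, ping_s
    finally:
        sock.close()
```

Running this from an affected pod during a latency spike shows whether the 4-5 s is spent before or after the TCP connection is established.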

Suspect it could be either:

  1. GKE cluster problem
  2. GKE network connectivity problem to Redis
  3. Redis memorystore problem

There is 1 answer below.


Connectivity issues usually indicate that access to the instance is blocked (egress firewall rules, wrong VPC network, a networking outage/partition, etc.).

Check the possible causes below:

1) Connecting from GKE: You cannot connect to a Memorystore for Redis instance from a GKE cluster unless VPC-native/IP aliasing is enabled. It is easiest to enable this during cluster creation: select VPC Native under advanced options. See Creating VPC-native clusters using Alias IPs for more information.

2) Connecting from a different VPC network: The instance is only reachable from within the network it was provisioned in (the authorized or default network). Verify that you are connecting from that same VPC network.

A common misconception is that the instance resides in the user project. It actually resides in a Google tenant project and is made available to the user project via VPC peering. You cannot connect to the instance from another VPC, even one peered to the provisioned VPC network, because transitive peering is not supported.

3) Network peering deleted: Internally, Memorystore for Redis runs Redis on a VM in a tenant project (owned by Memorystore) and uses VPC network peering to allow customers to connect.

Check whether you may have deleted the VPC network peering for the network. If so, the simplest fix is to create another instance using the same authorized_network, which re-establishes the peering; once that is done you can delete the new instance.

4) Egress firewall rules: Creating firewall rules in your project is not necessary. Verify that you have not created any egress firewall rules in your project that block traffic to the instance's private IP endpoint.

5) Connecting from GCE: No special configuration should be required to connect to Redis from a GCE VM, provided it is created in the same VPC network and region as the instance.

6) Connecting from on-premises: Accessing a Redis instance from on-premises networks over VPN is supported with the Private_Service_Access connect mode only.

7) VPC Service Controls: Check whether the service project and host project are in the same VPC Service Controls perimeter; a perimeter mismatch can block connectivity.

For intermittent connectivity issues, check the following:

1) Verify that no instances are down.

2) Check for Memorystore Redis issues:

  • View the Node's health breakdown (from HM) graph to determine whether Redis was unhealthy during the periods of intermittent connectivity. Automatic repairs should fix most cases of instance unhealthiness.
  • For Standard Tier instances, view the Mastership per Node graph to determine whether Redis is failing over to the replica during the periods of intermittent connectivity. A value of 1 on this graph indicates that the node is the primary; nodes switching values indicate a failover. Instances may experience a brief period (seconds) of unavailability during failover, which can lead to timeouts.
  • View the nf_conntrack count/max ratio graph to determine whether the nf_conntrack table is full and packets are being dropped. The value reported is a ratio, so a value of 1 indicates the table is full; when the table is full, incoming connections are dropped. This can be confirmed by searching for "nf_conntrack: table full" in the tenant project logs. If this is the case, file a bug with the product team explaining the issue.
  • View the Command Latency (usec) per Node, per Command graph to check the latency of the commands Redis is executing. High latencies may cause connections to time out. Verify that your client configuration (dependent on the library you are using) is not itself causing timeouts.
  • View the Connected Clients per Node graph to see whether Redis has reached the connected-clients-per-instance limit.
  • View the Redis CPU usage per Node graph to see whether the CPU is saturated.
  • View the Used Memory Ratio per Node, RSS Memory Ratio per Node, and OOM-prevention duration per Node graphs to see whether there might be an OOM situation.
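On the client side, the timeout checks above usually come down to two explicit settings: a connect timeout and a read timeout. Keeping both short makes a failover blip surface as a fast, retryable error instead of a multi-second hang. A raw-socket sketch in Python (the timeout values are illustrative, not recommendations):

```python
import socket

def connect_with_timeouts(host, port=6379, connect_timeout=1.0, read_timeout=2.0):
    """Open a TCP connection with an explicit connect timeout, then switch
    to a separate read timeout for subsequent socket reads."""
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    sock.settimeout(read_timeout)  # applies to recv() from here on
    return sock
```

In php-redis, the equivalent knobs are the timeout argument of Redis::connect() and the Redis::OPT_READ_TIMEOUT option.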

3) Search the tenant project logs for anything suspicious (dropped packets, timeouts, etc.).