I'm using a Helm chart to deploy my Kubernetes 1.27 cluster across 4 RHEL 9 Minimal Build VMs (one controller, three workers). The cluster seems to deploy, but all the pods are in a crash loop with connectivity issues, and Redis cannot initialize. The same cluster works fine on RHEL 8 and RHEL 7 VMs. Redis 6.2.12 errors:
Initializing config..
/readonly-config/init.sh: line 84: Could not resolve the announce ip for this pod: not found
Error from server (BadRequest): container "sentinel" in pod "xio-redis-ha-server-0" is waiting to start: PodInitializing
*** FATAL CONFIG FILE ERROR (Redis 6.2.12) ***
Reading the configuration file, at line 2
>>> 'sentinel down-after-milliseconds mymaster 10000'
No such master with specified name.
General connectivity errors from other pods:
Caused by: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: name-redis-ha.default.svc.cluster.local/10.42.0.22:6379
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: name-redis-ha.default.svc.cluster.local/10.42.2.27:6379
Caused by: java.net.ConnectException: Connection refused
Caused by: org.redisson.client.RedisConnectionException: Unable to connect to Redis server: name-redis-ha.default.svc.cluster.local/10.42.2.27:6379
I've tried opening all ports, rebooting the servers, restarting the docker service, and 20 other things I've found on various blogs and posts. Curling services from within pods works intermittently; restarting firewalld allows curling from within pods, but pods still cannot connect to one another. I've also tried changing the FirewallBackend setting in firewalld.conf to each of the available options, in case the different firewall interfaces were conflicting with each other. The cluster's canal pods report that they auto-detect the firewall backend.
I struggled with this for several days, so I ended up making a list of all the settings I had to change on RHEL 9 to run a Kubernetes cluster with Redis successfully. Here's the list! I was running this in an on-premises VMware environment:
FIREWALLD
firewalld is known to conflict with the cluster (known issue REF1). If pods are crashing post-install, restarting firewalld and docker on all nodes and then deleting the crashing pods may resolve the issue:
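A sketch of those steps, assuming docker is the container runtime (as in my setup) and kubectl is configured on the controller; the label selector is a placeholder you'd replace with one matching your Redis release:

```shell
# On every node: restart firewalld, then the container runtime
sudo systemctl restart firewalld
sudo systemctl restart docker

# From the controller: delete the crashing pods so they get rescheduled
# (adjust the label selector and namespace to match your release)
kubectl delete pod -l app=redis-ha -n default
```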
If the issue persists, it may be necessary to stop firewalld on all nodes and restart the docker service:
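Roughly, on every node (disabling firewalld is optional but keeps it off across reboots):

```shell
# Stop firewalld entirely, then restart docker so its iptables
# rules are recreated without firewalld's interference
sudo systemctl stop firewalld
sudo systemctl disable firewalld   # optional: persist across reboots
sudo systemctl restart docker
```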
If other steps from this troubleshooting guide are implemented, these firewalld steps will need to be repeated afterwards.
SELINUX
If SELinux is enabled, it may cause connectivity issues and crashing pods. It may be necessary to disable SELinux, or set it to permissive. Be sure to overwrite the original SELINUX= value in /etc/selinux/config so the change persists across reboots:
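For example (the sed assumes the file currently has SELINUX=enforcing; adjust if yours differs):

```shell
# Switch to permissive immediately (does not survive a reboot)
sudo setenforce 0

# Persist the change: set SELINUX=permissive (or disabled) in the config file
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

# Verify the current and configured modes
getenforce
sestatus
```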
SECUREBOOT
The VMware Secure Boot feature may interfere with operation of the cluster:
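You can check whether Secure Boot is active from inside the guest (EFI VMs only):

```shell
# Reports the current Secure Boot state of the firmware
mokutil --sb-state
```

In my environment, turning it off meant powering the VM down and unchecking Secure Boot under the VM's Boot Options in vSphere.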
NETWORK MANAGER
NetworkManager is known to conflict with the RKE cluster (known issue REF1) and may cause connectivity issues and crashing pods. It may be necessary to create the /etc/NetworkManager/conf.d/canal.conf file with the following contents on each node, then reload NetworkManager and reboot each node:
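The file contents match the Rancher-documented workaround: they tell NetworkManager to leave the Calico (cali*) and flannel interfaces unmanaged so it doesn't remove their routes. On each node:

```shell
# Create the config telling NetworkManager to ignore CNI-managed interfaces
sudo tee /etc/NetworkManager/conf.d/canal.conf <<'EOF'
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:flannel*
EOF

# Reload NetworkManager, then reboot the node
sudo systemctl reload NetworkManager
sudo reboot
```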
SYSCTL RP_FILTER
It may be necessary to create the /etc/sysctl.d/90-override.conf file with the following contents on each node and then reboot each node (this overrides the breaking STIG setting in the /etc/sysctl.d/99-sysctl.conf file, stipulated by CCE-84008-2):
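A hedged sketch of that override: the STIG baseline forces strict reverse-path filtering (rp_filter = 1), which can drop cross-node pod traffic; the values below relax it to loose mode (2). Some setups need 0 (disabled) instead, so verify the value your CNI requires. On each node:

```shell
# Override the STIG rp_filter setting; 2 = loose reverse-path filtering.
# Some environments may need 0 (disabled) instead -- verify for your CNI.
sudo tee /etc/sysctl.d/90-override.conf <<'EOF'
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.rp_filter = 2
EOF

# Apply now (the file also persists the setting), then reboot
sudo sysctl --system
sudo reboot
```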
NM-CLOUD SERVICE AND TIMER
On each node:
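This matches the documented workaround for the nm-cloud-setup known issue on newer RHEL releases: disable both the service and its timer, then reboot so the routing rules it created are removed.

```shell
# Disable the nm-cloud-setup service and timer, then reboot
sudo systemctl disable --now nm-cloud-setup.service nm-cloud-setup.timer
sudo reboot
```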
REFERENCES: