Kubernetes 1.27 cluster connectivity issues on RHEL 9 Minimal Build?


I'm using a Helm chart to deploy my Kubernetes 1.27 cluster across four RHEL 9 Minimal Build VMs (one controller, three workers). The cluster seems to deploy, but all the pods are crash-looping with connectivity issues, and Redis cannot initialize. The same cluster works fine on RHEL 8 and RHEL 7 VMs. Redis 6.2.12 errors:

Initializing config..
/readonly-config/init.sh: line 84: Could not resolve the announce ip for this pod: not found

Error from server (BadRequest): container "sentinel" in pod "xio-redis-ha-server-0" is waiting to start: PodInitializing

*** FATAL CONFIG FILE ERROR (Redis 6.2.12) ***
Reading the configuration file, at line 2
>>> 'sentinel down-after-milliseconds mymaster 10000'
No such master with specified name.

General connectivity errors from other pods:

Caused by: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: name-redis-ha.default.svc.cluster.local/10.42.0.22:6379
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: name-redis-ha.default.svc.cluster.local/10.42.2.27:6379
Caused by: java.net.ConnectException: Connection refused
Caused by: org.redisson.client.RedisConnectionException: Unable to connect to Redis server: name-redis-ha.default.svc.cluster.local/10.42.2.27:6379

I've tried opening all ports, rebooting the servers, restarting the docker service, and twenty other things I've found on various blogs and posts. Curling services from within pods works intermittently. Restarting firewalld allows curling from within pods, but the pods still cannot connect to one another. I've also tried changing iptablesBackend in firewalld.conf to each of the available options, in case the various firewall interfaces are conflicting with one another. The cluster's canal pods report that they are set to auto-detect the firewall backend.
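
For reference, here's roughly how I've been testing connectivity from inside the cluster; the pod and service names match the logs above, and busybox's nc supports -z in most builds:

    # does the Service have any ready endpoints behind it?
    kubectl get endpoints name-redis-ha -n default

    # resolve and probe the Redis port from a throwaway pod
    kubectl run nettest --rm -it --image=busybox:1.36 --restart=Never -- \
        sh -c 'nslookup name-redis-ha.default.svc.cluster.local && \
               nc -zv name-redis-ha.default.svc.cluster.local 6379'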

BEST ANSWER

I struggled with this for several days, so I ended up making a list of all the settings I had to change on RHEL 9 to run a Kubernetes cluster with Redis successfully. Here's the list! I was running this in an on-premises VMware environment:

FIREWALLD

firewalld is known to conflict with the cluster (known issue [1]). If pods are crashing post-install, restarting firewalld and docker on all nodes and then deleting the crashing pods may resolve the issue:

    sudo service firewalld status
    sudo service firewalld restart
    sudo service docker restart

If the issue persists, it may be necessary to stop firewalld on all nodes and restart the docker service:

    sudo service firewalld stop
    sudo service docker restart

If other steps from this troubleshooting guide are implemented, these firewalld steps will need to be repeated afterwards.
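
If you'd rather keep firewalld running, opening the ports the cluster needs on every node may work instead. A sketch using the standard Kubernetes and canal/flannel ports; confirm the full list against your distribution's docs:

    sudo firewall-cmd --permanent --add-port=6443/tcp    # Kubernetes API server
    sudo firewall-cmd --permanent --add-port=10250/tcp   # kubelet
    sudo firewall-cmd --permanent --add-port=8472/udp    # canal/flannel VXLAN overlay
    sudo firewall-cmd --permanent --add-masquerade
    sudo firewall-cmd --reload

    # then force the crashing pods to be recreated, for example:
    kubectl delete pod xio-redis-ha-server-0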

SELINUX

If SELinux is enforcing, it may cause connectivity issues and crashing pods. It may be necessary to set it to permissive or disable it. Be sure to overwrite the original value of the SELINUX setting:

    sudo vi /etc/selinux/config
    # change the SELINUX line to:
    SELINUX=permissive
    sudo reboot
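
You can check the current mode and switch to permissive at runtime; this takes effect immediately, but the config change above is still needed so the setting survives the reboot:

    getenforce            # prints Enforcing, Permissive, or Disabled
    sudo setenforce 0     # permissive mode until the next reboot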

SECUREBOOT

The VMware Secure Boot feature may interfere with the operation of the cluster. To disable it:

  • Stop each VM: Actions > Power > Power Off.
  • Disable Secure Boot: Summary Tab > VM Hardware > Edit > VM Options > Boot Options > deselect Secure Boot > OK.
  • Restart each VM: Actions > Power > Power On.
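
To confirm Secure Boot is actually off from inside the guest (assuming the mokutil package is installed):

    mokutil --sb-state    # should report: SecureBoot disabled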

NETWORK MANAGER

NetworkManager is known to conflict with the RKE cluster (known issue [1]) and may cause connectivity issues and crashing pods. It may be necessary to create the /etc/NetworkManager/conf.d/canal.conf file with the following contents on each node, then reload NetworkManager and reboot each node:

    sudo systemctl status NetworkManager
    sudo vi /etc/NetworkManager/conf.d/canal.conf
    # file contents:
    [keyfile]
    unmanaged-devices=interface-name:cali*;interface-name:flannel*
    sudo systemctl reload NetworkManager
    sudo reboot
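
After the reboot, the CNI interfaces should show as unmanaged:

    nmcli device status | grep -E 'cali|flannel'
    # the STATE column should read "unmanaged" for these interfaces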

SYSCTL RP_FILTER

It may be necessary to create the /etc/sysctl.d/90-override.conf file with the following contents on each node and then reboot each node (this overrides the breaking STIG setting in the /etc/sysctl.d/99-sysctl.conf file, stipulated by CCE-84008-2):

    sudo vi /etc/sysctl.d/90-override.conf
    # file contents:
    net.ipv4.conf.all.rp_filter = 0
    net.ipv4.conf.default.rp_filter = 0
    net.ipv4.conf.eth0.rp_filter = 0
    net.ipv4.conf.lo.rp_filter = 0
    sudo reboot
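
To apply and verify the override without waiting for the reboot:

    sudo sysctl --system                    # reload all sysctl configuration files
    sysctl net.ipv4.conf.all.rp_filter     # should print 0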

NM-CLOUD SERVICE AND TIMER

On each node:

    sudo systemctl disable --now nm-cloud-setup.service nm-cloud-setup.timer
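
Verify both units are off (on a non-cloud VMware image the units may not exist at all, in which case these commands simply report that):

    systemctl is-enabled nm-cloud-setup.service nm-cloud-setup.timer
    systemctl is-active nm-cloud-setup.service nm-cloud-setup.timer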

REFERENCES:

  1. https://docs.rke2.io/known_issues