Kubernetes Multus: No macvlan connectivity between pods on different nodes (can't ping)


I have a problem where I have a Kubernetes cluster with two worker nodes and one master. Let's label them W1, W2, and M. I have a deployment that creates a set of CentOS 7 pods, some on each worker. I use Multus so that each pod gets an extra net1 interface that is mapped to eth1 on the workers. All the pods have net1 connected to the same macvlan named 'up-net'.

On both W1 and W2 I can ping between pods that run on the same node, but a pod on W1 can't ping one on W2 and vice versa. Pinging over the standard kube network on eth0 works in all cases; it's only the macvlan that has this problem.

That's the problem in short. So let me now describe the setup we're using in more detail.

We have a lab with three physical servers, on which we've deployed Kolla (OpenStack installed on Kubernetes). Inside this OpenStack installation I'm in turn setting up a Kubernetes cluster whose master and worker nodes are hosted in OpenStack virtual machines, i.e. W1, W2, and M are VMs running in OpenStack. That means we have three layers of virtualization in total; I mention it only in case it suggests a lead to anyone, but I haven't hit any problem I think is related to the virtualization itself. Each VM has two interfaces, eth0 and eth1, and eth1 is the device I want the macvlan on. Lastly, the operating system is CentOS 7 on both the VMs and the physical servers.

About the Kubernetes installation:

  • The Kubernetes (overcloud) cluster was installed using Kubespray.
  • I edited the host files to make node1 the master (M), node2 W1, and node3 W2.
  • I set kube_network_plugin_multus to true (see the snippet after this list).
  • Whereabouts is used to assign IP addresses to the net1 interfaces.
  • I use Calico as the networking driver.
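
For reference, the Multus and Calico settings are ordinary Kubespray inventory variables; roughly what I set looks like the sketch below (the exact group_vars path varies between Kubespray versions, so treat it as an illustration):

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml (path differs between Kubespray versions)
kube_network_plugin: calico          # primary CNI
kube_network_plugin_multus: true     # deploy Multus as a meta-plugin on top of Calico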

Here's the configuration used for the macvlan network:

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: up-net
spec:
  config: '{
      "cniVersion": "0.3.0",
      "name": "up-net",
      "type": "macvlan",
      "master": "eth1",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "datastore": "kubernetes",
        "kubernetes": { "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig" },
        "range": "192.168.3.225/28",
        "log_file" : "/tmp/whereabouts.log",
        "log_level" : "debug"
      }
    }'
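
For completeness, applying and inspecting the definition looks roughly like this (assuming the manifest is saved as up-net.yaml):

# Apply the attachment definition and confirm the Multus CRD object exists
kubectl apply -f up-net.yaml
kubectl get network-attachment-definitions.k8s.cni.cncf.io up-net -o yaml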

Here is the configuration for the pods:

apiVersion: apps/v1
kind: Deployment
metadata:
    name: sample
    labels:
        app: centos-host
spec:
    replicas: 4
    selector:
        matchLabels:
            app: centos-host
    template:
        metadata:
            labels:
                app: centos-host
            annotations:
                k8s.v1.cni.cncf.io/networks: up-net
        spec:
            containers:
              - name: centos-container
                image: centos:7
                command: ["/bin/sleep", "infinity"]
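
Once the deployment is up, a quick way to check that Multus actually attached net1 and that whereabouts handed out an address is something like this (the pod name is from my cluster and will differ in yours):

# Show the net1 interface and its whereabouts-assigned address inside one of the pods
kubectl exec sample-7b9755db48-gxq5m -- ip -o addr show dev net1
# The attachment is also recorded in the pod's annotations
# (named network-status or networks-status depending on the Multus version)
kubectl describe pod sample-7b9755db48-gxq5m | grep -A 12 network-status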

I haven't specified explicitly which worker they end up on, but usually the scheduler distributes the four pods evenly across the two workers.

Also, here are the Kube system pods:

[centos@node1 ~]$ kubectl get pods -n kube-system
NAME                                      READY   STATUS    RESTARTS   AGE
calico-kube-controllers-8b5ff5d58-msq2m   1/1     Running   1          29h
calico-node-2kg2l                         1/1     Running   1          29h
calico-node-4fxwr                         1/1     Running   1          29h
calico-node-m4l67                         1/1     Running   1          29h
coredns-85967d65-6ksqx                    1/1     Running   1          29h
coredns-85967d65-8nbgq                    1/1     Running   1          29h
dns-autoscaler-5b7b5c9b6f-567vz           1/1     Running   1          29h
kube-apiserver-node1                      1/1     Running   1          29h
kube-controller-manager-node1             1/1     Running   1          29h
kube-multus-ds-amd64-dzmj5                1/1     Running   1          29h
kube-multus-ds-amd64-mvfpc                1/1     Running   1          29h
kube-multus-ds-amd64-sbw8n                1/1     Running   1          29h
kube-proxy-6jgvn                          1/1     Running   1          29h
kube-proxy-tzf5t                          1/1     Running   1          29h
kube-proxy-vgmh8                          1/1     Running   1          29h
kube-scheduler-node1                      1/1     Running   1          29h
nginx-proxy-node2                         1/1     Running   1          29h
nginx-proxy-node3                         1/1     Running   1          29h
nodelocaldns-27bct                        1/1     Running   1          29h
nodelocaldns-75cgg                        1/1     Running   1          29h
nodelocaldns-ftvn9                        1/1     Running   1          29h
whereabouts-4tktv                         1/1     Running   0          28h
whereabouts-nfwkz                         1/1     Running   0          28h
whereabouts-vxgwr                         1/1     Running   0          28h

Now that the setup is explained, on to the experiments I've run.

Consider the pods P1a and P1b on worker 1 (W1), and P2a and P2b on worker 2 (W2). I use ping and tcpdump to assess connectivity.

Pinging from P1a to P1b works fine, and tcpdump shows ICMP traffic on W1's eth1 device. The same goes for W2.
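
For reference, the captures on the workers were done roughly like this (run on the node itself, not inside a pod):

# On W1 (and likewise W2): watch the macvlan parent interface for the pods' ICMP traffic
sudo tcpdump -eni eth1 icmp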

However, when I ping P2a from P1a it looks as follows:

[root@sample-7b9755db48-gxq5m /]# ping -c 2 192.168.3.228
PING 192.168.3.228 (192.168.3.228) 56(84) bytes of data.
From 192.168.3.227 icmp_seq=1 Destination Host Unreachable
From 192.168.3.227 icmp_seq=2 Destination Host Unreachable

--- 192.168.3.228 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1000ms
pipe 2

An interesting lead, however, is that in this case the ICMP packets end up on the lo interface of the pod:

[root@sample-7b9755db48-gxq5m /]# tcpdump -vnes0 -i lo
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
12:51:57.261003 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 126: (tos 0xc0, ttl 64, id 32401, offset 0, flags [none], proto ICMP (1), length 112)
    192.168.3.227 > 192.168.3.227: ICMP host 192.168.3.228 unreachable, length 92
        (tos 0x0, ttl 64, id 39033, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.3.227 > 192.168.3.228: ICMP echo request, id 137, seq 1, length 64
12:51:57.261019 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 126: (tos 0xc0, ttl 64, id 32402, offset 0, flags [none], proto ICMP (1), length 112)
    192.168.3.227 > 192.168.3.227: ICMP host 192.168.3.228 unreachable, length 92
        (tos 0x0, ttl 64, id 39375, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.3.227 > 192.168.3.228: ICMP echo request, id 137, seq 2, length 64
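
Since a locally generated "Destination Host Unreachable" usually means that ARP resolution on net1 failed, the neighbour table can be checked from inside the pod with something like this (addresses are from my setup; arping may need to be installed in the container):

# Inside P1a: the entry for the remote pod stays INCOMPLETE/FAILED if ARP gets no reply
ip neigh show dev net1
# Probe ARP directly towards P2a's net1 address
arping -I net1 -c 3 192.168.3.228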


Do you think there might be a problem with my routing table? I can't see anything wrong, but I'm a bit new to networking:

[root@sample-7b9755db48-gxq5m /]# ip route
default via 169.254.1.1 dev eth0 
169.254.1.1 dev eth0 scope link 
192.168.3.224/28 dev net1 proto kernel scope link src 192.168.3.227
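
For what it's worth, the route the kernel actually picks for the remote pod's address can be checked from inside the pod like this; given the connected 192.168.3.224/28 route it should select net1, which points the suspicion away from routing and towards layer 2:

# Show which route (and source address) is used towards P2a
ip route get 192.168.3.228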

Finally, a list of things I've tried that didn't work:

  • Set eth1 to promiscuous mode on W1, W2, and M.
  • Disabled rp_filter for IPv4 (as I suspected that macvlan does strange things with the MAC addresses). Both changes were applied roughly as in the sketch below.
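
Roughly how those two changes were applied on the VMs (a sketch; the sysctl keys are the standard IPv4 ones):

# On W1, W2, and M: put the macvlan parent interface into promiscuous mode
sudo ip link set eth1 promisc on
# Relax IPv4 reverse-path filtering globally and on eth1
sudo sysctl -w net.ipv4.conf.all.rp_filter=0
sudo sysctl -w net.ipv4.conf.eth1.rp_filter=0

Neither change made a difference.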
Accepted answer:

To conclude, I managed to find the answer myself. It turned out that the OpenStack security groups and port security were causing the problem. All I needed to change to get things running was to disable port security on all of the eth1-network ports. This is the command I used for every such port:

openstack port set --no-security-group --disable-port-security <id or name of the neutron port>

After that, the machines were reachable. No need to restart servers or services or the like.
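
If there are many such ports, a small loop handles them all; this is only a sketch that assumes the Neutron network attached to eth1 is called "eth1-net", so substitute your own network name:

# Disable port security on every port of the (hypothetical) eth1-net network
for PORT in $(openstack port list --network eth1-net -f value -c ID); do
    openstack port set --no-security-group --disable-port-security "$PORT"
done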

I do find it a bit strange that this issue only occurred on the secondary network, though. In any case, I hope this helps someone else trying to run Kubernetes in OpenStack VMs.