What is Kubernetes HA cluster failure behaviour in split-brain scenarios between racks?


I am interested in the behaviour of multi-master Kubernetes in the event of different types of failure, particularly if the masters are on different racks.

  • Scenario:
    • 2 racks: R1, R2.
    • API masters: M1 on R1, M2 on R2.
    • Worker nodes: W1 on R1, W2 on R2.
    • Etcd: a completely separate HA etcd cluster of 3 nodes (i.e. it is not running on the API master nodes).

My failure questions are basically around split-brain scenarios:

What happens if M1 is the active master and R1 loses its connection to etcd and R2, but R2/M2 still has connectivity to etcd? I.e., what specifically triggers a leadership election?

If there is a pod P1 on R1/W1, M1 is the active master, and R1 becomes disconnected from R2 and etcd, what happens? Does P1 keep running, or is it killed? Does M2 start a separate instance of the pod (P2) on R2? If so, can P1 and P2 both be running at the same time?

If there is a pod P2 on R2/W2 and M1 is the active master (i.e. the pod is on a different rack from the master), and R1 loses connection to R2 and etcd, what happens to P2? Does it keep running while M2 takes over?


There is 1 answer below.

Best answer:

The active master holds a lease in etcd. If the lease expires, the active master exits its process (it expects to be restarted). The other master observes the lease expiring and attempts to acquire it. As long as M2 can reach etcd and etcd has quorum, the second master then takes over.
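
For reference, the controller-manager and scheduler implement this with a lease-based leader-election helper. Below is a minimal sketch using client-go's leaderelection package; the lease name, namespace, and durations are illustrative, and in current clusters the lease is stored as a coordination.k8s.io Lease object via the API server (which itself persists to etcd) rather than by talking to etcd directly. It shows the pattern the answer describes: do work only while holding the lease, and stop (expecting a restart) when the lease cannot be renewed.

```go
package election

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection blocks while contending for a lease. OnStartedLeading
// runs only while this process holds the lease; OnStoppedLeading fires when
// the lease cannot be renewed (e.g. this rack lost connectivity), after which
// the process is expected to exit and be restarted by its supervisor.
func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, id string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-controller", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second, // how long a lease is valid without renewal
		RenewDeadline:   10 * time.Second, // leader must renew within this window
		RetryPeriod:     2 * time.Second,  // how often candidates re-check the lease
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run controller loops only while we hold the lease.
			},
			OnStoppedLeading: func() {
				// Lost the lease or could not renew it: stop doing work.
				// The real controller-manager exits here, as described above.
			},
		},
	})
}
```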

As far as competing masters go, Kubernetes still uses etcd to perform consistent updates: even two masters active at the same time are contending to make the same change in etcd, which is strongly consistent, so the usual outcome is simply that one of the updates fails. One case where that is not enough is DaemonSets and ReplicaSets: two active masters may each create pods, and then scale the excess back down once they observe there are too many per node or more than the desired replica count. But since neither DaemonSets nor ReplicaSets guarantee exactness anyway (a ReplicaSet can have more than the desired number of pods running at any time, and a DaemonSet can briefly have two pods on a node), this is not broken per se. A toy sketch of that reconcile pattern follows.
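
To make the over-creation point concrete, here is a toy, illustrative reconcile loop (not the real ReplicaSet controller; all names are made up): each active controller independently compares its possibly stale view of existing pods with the desired count, so two leaders can both create pods for the same shortfall, and the surplus is deleted on a later pass.

```go
package reconcile

// cluster is a stand-in for the API the controller talks to.
type cluster interface {
	listPods(selector string) ([]string, error) // current pod names matching selector
	createPod(selector string) error
	deletePod(name string) error
}

// reconcile drives observed state toward the desired replica count.
func reconcile(c cluster, selector string, desired int) error {
	pods, err := c.listPods(selector)
	if err != nil {
		return err
	}
	switch {
	case len(pods) < desired:
		// Two controllers that both observed a shortfall will both create
		// pods here, briefly producing more than `desired` replicas.
		for i := len(pods); i < desired; i++ {
			if err := c.createPod(selector); err != nil {
				return err
			}
		}
	case len(pods) > desired:
		// Excess pods (including duplicates created by a competing
		// controller) are scaled back down on a later pass.
		for _, name := range pods[desired:] {
			if err := c.deletePod(name); err != nil {
				return err
			}
		}
	}
	return nil
}
```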

If you need at-most-X-pods behavior, only StatefulSets provide that guarantee today.
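
For completeness, a minimal StatefulSet sketch using the client-go API types (the names and image are placeholders): each replica gets a stable ordinal identity (web-0, web-1), and the controller will not start a replacement pod for an ordinal until the old pod with that identity is confirmed deleted, which is what gives the at-most-one-per-identity behavior.

```go
package statefulset

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// web is an illustrative StatefulSet with two replicas. Pods are named
// web-0 and web-1; a replacement for a given ordinal is only created once
// the previous pod with that identity is confirmed gone (which, on a
// partitioned node, may require the pod to be force-deleted).
var web = &appsv1.StatefulSet{
	ObjectMeta: metav1.ObjectMeta{Name: "web", Namespace: "default"},
	Spec: appsv1.StatefulSetSpec{
		ServiceName: "web",
		Replicas:    int32Ptr(2),
		Selector:    &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
		Template: corev1.PodTemplateSpec{
			ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "web"}},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{Name: "web", Image: "nginx"}},
			},
		},
	},
}
```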