Is it possible to restore an etcd v3.5 cluster after quorum is lost by just restarting the nodes?


The documentation calls this case disaster recovery and tells you to use etcdctl snapshot restore for it. However, even when a cluster loses quorum (nodes 1 and 2 out of 3 went down), restarting nodes 1 and 2 brings quorum back and the cluster operates correctly.

So why do we even need snapshots if etcd can self-heal like this?

Thanks in advance ;)


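When an etcd v3.5 cluster loses quorum, a majority of its members are unavailable or unreachable, so the cluster cannot commit writes. Whether restarting the nodes is enough depends on why they went down. If the failed members come back with their data directories intact, they rejoin under their existing member IDs, a leader is elected again, and the cluster resumes without a restore. A plain restart is not sufficient when a majority of members have lost their data or cannot be brought back at all. That permanent loss of quorum is the disaster recovery case the documentation covers with etcdctl snapshot restore, and it is why snapshots are still needed.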
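The majority rule can be made concrete with a little arithmetic. The sketch below is only an illustration; the function names are hypothetical and not part of etcd.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def has_quorum(members: int, reachable: int) -> bool:
    """True when enough members are reachable to form a majority."""
    return reachable >= quorum_size(members)


if __name__ == "__main__":
    # The 3-member cluster from the question: two members down leaves
    # one reachable member, which is below the majority of two.
    for reachable in range(4):
        print(f"3 members, {reachable} reachable: quorum={has_quorum(3, reachable)}")
```

In the scenario from the question, restarting either of the two stopped members brings the count back to two of three, which is a majority again.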

To restore an etcd cluster after quorum has been lost, you would typically need to perform the following steps:

Identify the cause of quorum loss: Determine why the quorum was lost in the first place. It could be due to network issues, server failures, or other factors. Addressing the underlying cause is important to prevent recurrence.

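Restore the unreachable or failed etcd nodes: Bring the unreachable or failed members back online with their existing data directories wherever possible. This could involve fixing network connectivity issues or resolving hardware or software failures. Replacing a member with a newly provisioned node is a membership change, and membership changes themselves require quorum, so it is only possible once a majority of the existing members is running again.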

Re-establish communication and connectivity: Ensure that all etcd cluster members can communicate with each other. Verify that network connectivity is restored, and the nodes can communicate over the required ports and protocols.

Verify cluster health and quorum: Once the unreachable or failed nodes are back online and connectivity is restored, verify the health of the cluster. Ensure that the nodes can form a quorum, meaning that a majority of the etcd members are operational and can communicate with each other.
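A health check along these lines could be scripted roughly as follows. This is a sketch under assumptions: the endpoint addresses are placeholders, etcdctl is assumed to be on the PATH, and the JSON shape (a list of objects with a boolean "health" field) is what etcdctl endpoint health -w json is expected to print, so verify it against your version.

```python
import json
import subprocess


def health_command(endpoints):
    """Build the argv for `etcdctl endpoint health` with JSON output."""
    return ["etcdctl", "endpoint", "health",
            "--endpoints=" + ",".join(endpoints), "-w", "json"]


def count_healthy(report: str) -> int:
    """Count entries whose "health" field is true (assumed output shape)."""
    if not report.strip():
        return 0
    return sum(1 for entry in json.loads(report) if entry.get("health"))


def cluster_has_quorum(endpoints) -> bool:
    """Run the health check and compare healthy members to a majority."""
    # No check=True: etcdctl may exit non-zero when any endpoint is
    # unhealthy, and the partial report is still what we want to inspect.
    result = subprocess.run(health_command(endpoints),
                            capture_output=True, text=True)
    return count_healthy(result.stdout) >= len(endpoints) // 2 + 1


if __name__ == "__main__":
    # Placeholder endpoints; replace with the cluster's client URLs.
    print(" ".join(health_command(["10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"])))
```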

Perform recovery and synchronization: Members that merely fell behind catch up from the leader automatically once they rejoin. If the members cannot rejoin with their existing data (for example, the data directories were lost or corrupted on a majority of nodes), restore a new cluster from a snapshot with etcdctl snapshot restore, if a backup is available.
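For the snapshot path, each member is restored from the same snapshot file into a fresh data directory. The sketch below only assembles the etcdctl snapshot restore command line for one member; the member names, URLs, paths, and token are placeholders, and the helper itself is hypothetical. In v3.5 the same restore is also offered by etcdutl, so check which tool your installation recommends.

```python
def restore_command(snapshot, name, peer_url, initial_cluster, data_dir, token):
    """Build the argv for restoring one member from a snapshot file."""
    # --initial-cluster expects "name=peer-url" pairs joined by commas.
    cluster = ",".join(f"{member}={url}" for member, url in initial_cluster.items())
    return ["etcdctl", "snapshot", "restore", snapshot,
            "--name", name,
            "--initial-cluster", cluster,
            "--initial-cluster-token", token,
            "--initial-advertise-peer-urls", peer_url,
            "--data-dir", data_dir]


if __name__ == "__main__":
    # Placeholder topology; every value here is an assumption.
    members = {
        "m1": "http://10.0.0.1:2380",
        "m2": "http://10.0.0.2:2380",
        "m3": "http://10.0.0.3:2380",
    }
    for member, url in members.items():
        argv = restore_command("snapshot.db", member, url, members,
                               f"/var/lib/etcd-{member}", "restored-cluster")
        print(" ".join(argv))
```

Each member is then started with its restored data directory and the same initial cluster settings, which forms a new logical cluster from the snapshot's data.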