How does the Raft protocol deal with concurrent requests?


I am new to the Raft consensus protocol and I wonder how practical Raft implementations deal with the following case:

  1. The leader receives a request (say, proposed at index 1) and sends the log entry to all followers via AppendEntries RPCs.
  2. Before the leader receives the responses from all followers, another request arrives and the leader proposes it at index 2.

In this case, as far as I know, the leader has to send both log entries (index 1 and index 2) in a single AppendEntries RPC call, since nextIndex has not yet advanced to 2. But that seems to waste network bandwidth.
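For concreteness, here is a minimal sketch of how a leader bundles every entry from a follower's nextIndex onward into one AppendEntries payload. Field names follow the Raft paper; the function and the dict-based log representation are invented for illustration, not taken from any real implementation.

```python
# Sketch: the leader builds one AppendEntries payload per follower, carrying
# every entry from nextIndex onward. Field names follow the Raft paper.

def build_append_entries(log, next_index, current_term, commit_index):
    """Bundle all entries from next_index onward into a single RPC payload."""
    prev_log_index = next_index - 1
    # Term of the entry preceding the new ones (0 if there is none).
    prev_log_term = log[prev_log_index - 1]["term"] if prev_log_index > 0 else 0
    return {
        "term": current_term,
        "prevLogIndex": prev_log_index,
        "prevLogTerm": prev_log_term,
        # Both the entry at index 1 and the one at index 2 ride in one RPC:
        "entries": log[next_index - 1:],
        "leaderCommit": commit_index,
    }

log = [{"index": 1, "term": 3, "data": "a"},
       {"index": 2, "term": 3, "data": "b"}]
rpc = build_append_entries(log, next_index=1, current_term=3, commit_index=0)
print(len(rpc["entries"]))  # both pending entries travel in one call
```

So sending two entries in one RPC is not an anomaly: the `entries` field is a list precisely so that pending entries can be batched.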

I wonder how practical Raft implementations solve this problem in order to achieve high performance. Or have I just misunderstood how the Raft protocol processes requests?



Answer by AndrewR:

In addition to the usual Raft RPCs, a practical system would expose an API for clients: append(data). When a client wants to append some data to the log, it calls .append. It is worth noting that this method has nothing to do with Raft indexes.

Depending on the implementation, the .append method acts differently when called on a leader or on a follower. If it is called on a leader, it will try to append the data. If it is called on a follower, the result may be an error, a redirect to the leader, or even a proxy call to the leader.
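The leader/follower split described above could look like the sketch below. All names (RaftNode, NotLeaderError, leader_hint) are hypothetical; a real implementation would replicate to a majority before acknowledging, and might proxy instead of redirecting.

```python
# Hypothetical sketch of an .append entry point that behaves differently on
# a leader and on a follower.

class NotLeaderError(Exception):
    """Raised by a follower; carries a hint about who the leader is."""
    def __init__(self, leader_hint):
        super().__init__(f"not the leader; try {leader_hint}")
        self.leader_hint = leader_hint

class RaftNode:
    def __init__(self, role, leader_hint=None):
        self.role = role              # "leader" or "follower"
        self.leader_hint = leader_hint
        self.log = []

    def append(self, data):
        if self.role == "leader":
            self.log.append(data)     # real code replicates before acking
            return len(self.log)      # index the entry was proposed at
        # A follower could also proxy the call; here we just redirect.
        raise NotLeaderError(self.leader_hint)

leader = RaftNode("leader")
print(leader.append("x"))  # proposed at index 1
```

The redirect variant keeps followers simple; the proxy variant hides leadership changes from clients at the cost of an extra network hop.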

When .append is called on a leader, more or less two things may happen (assuming the leader really is the leader and the network is in a stable state):

  • 1: if there is no other append request in progress, the leader calls AppendEntries as usual
  • 2:a if there is a request in flight (a request waits for a majority, not for ALL followers, to accept), the .append request is buffered. Several .append requests may be buffered together, and on the next AppendEntries iteration the leader pushes all those values down to the followers.
  • 2:b if too many .append requests are waiting, the leader could throttle some of them
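Cases 2:a and 2:b together amount to a bounded buffer that is drained once per replication round. A minimal sketch, with invented names and a fixed limit chosen purely for illustration:

```python
# Sketch of cases 2:a and 2:b: buffer .append requests while an AppendEntries
# round is in flight, and throttle when the buffer grows too large.

class AppendBuffer:
    def __init__(self, max_pending=128):
        self.pending = []
        self.max_pending = max_pending

    def append(self, data):
        if len(self.pending) >= self.max_pending:
            # Case 2:b - push back on the client instead of growing unbounded.
            raise RuntimeError("throttled: too many pending appends")
        # Case 2:a - queue the value for the next replication round.
        self.pending.append(data)

    def next_batch(self):
        """Called once the previous round has reached a majority; the whole
        batch goes out to followers in one AppendEntries call."""
        batch, self.pending = self.pending, []
        return batch
```

Batching this way trades a little latency for far fewer RPCs, which is where the bandwidth concern from the question gets resolved in practice.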

Edit based on comments:

  • has to wait for the response of the majority before replicating the new and buffered requests

The log itself could be used as a buffer: when .append is called, the leader decides either to put the entry into the log, or to throttle it if the number of not-yet-confirmed entries is too large.

If the implementation uses one thread per follower (blocking IO), it is possible for some of those not-yet-accepted values to be bundled together; that may give a better latency-vs-throughput trade-off.

I can see a point in not waiting for previous requests to be accepted before issuing new ones; that won't break the protocol, but it will create much more internal traffic. That is not necessarily a bad thing if the local, inter-node network is much faster than the connection to clients.

  • what is the typical number of servers used in a raft cluster for a real-world configuration

That's a great question!

Clearly, at the very least we need three nodes. That allows the system to stay available if one of the nodes is down. With three nodes, each piece of data is stored/processed three times.

The question is whether a three-node cluster is highly available. One node can go offline for things like software/system upgrades or other maintenance. With only three nodes, these events put the system at risk: if any other node fails, the whole system goes offline.

To mitigate the previous issue, we could bump the node count to five; now we can survive one node being in maintenance plus one additional failure. That seems like a much more stable system, but the cost goes up: now every piece of data is processed and stored five times.
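The 3-vs-5 trade-off above follows directly from the quorum arithmetic: a cluster of n nodes needs a majority of floor(n/2) + 1, so it tolerates n minus that majority in failures. A quick check:

```python
# Quorum arithmetic behind the 3-vs-5 choice: an n-node Raft cluster needs a
# majority of floor(n/2) + 1 nodes, so it tolerates n - majority failures.

def fault_tolerance(n):
    majority = n // 2 + 1
    return n - majority

for n in (3, 5, 7):
    print(f"{n} nodes: majority {n // 2 + 1}, tolerates {fault_tolerance(n)} failure(s)")
```

With 3 nodes you tolerate 1 failure, so a maintenance window consumes your entire margin; with 5 you tolerate 2, covering maintenance plus one crash. Note also that even sizes buy nothing: 4 nodes still tolerate only 1 failure.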

In the end, the choice between three, five, and so on is about how much availability the business needs, what the RTO (recovery time objective) and RPO (recovery point objective) are, and how much cost is acceptable.