As part of my distributed systems learning, I'm building a chat application. My current design has each server track the clients connected to it (this is the state that would be replicated using a consensus algorithm).
There is a load balancer that the client initially connects to, and the load balancer responds with the server the client should subsequently talk to. Subsequent commands from the client go directly to the instance it has been assigned to. To manage the state, I'm thinking of using the Raft algorithm for consensus.
Not sure why you would implement a consensus algorithm like Raft here. Raft is typically used to elect a leader and keep a replicated log consistent across a small cluster of nodes. It does not sound like you need that. Something like this would do:
client > load balancer (HAProxy) > pool of chat servers
HAProxy (the load balancer) can perform health checks against your server pool. If a server dies, it is removed from the pool. When a server becomes hot/stressed, it can deliberately 'fail' health checks to be taken out of the pool (the backend server should return an HTTP 503 status to the health check). When traffic dies down, the server is added back to the pool. You can monitor and alert on the number of healthy chat server pool members.
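To make the "fail the health check when stressed" idea concrete, here is a minimal sketch of a health endpoint using only the Python standard library. The `/health` path, the `MAX_CONNECTIONS` limit and the `active_connections` counter are assumptions for illustration; a real chat server would pick its own load signal (connections, CPU, queue depth).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_CONNECTIONS = 1000  # hypothetical capacity limit for one chat server
active_connections = 0  # hypothetical counter, updated by the real chat code


def health_status(active: int, limit: int) -> int:
    """Return 200 while under the limit, 503 once the server is saturated."""
    return 200 if active < limit else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":  # hypothetical health-check path
            self.send_error(404)
            return
        status = health_status(active_connections, MAX_CONNECTIONS)
        body = b"ok\n" if status == 200 else b"overloaded\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the health endpoint until interrupted (blocks the caller)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. On the HAProxy side, `option httpchk GET /health` in the backend section would poll it, and a 503 response marks the server as down until it starts answering 200 again.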
Handle errors on the client side. If an error is detected, reconnect to the load balancer and grab a new server. No chat state should be kept on the ephemeral chat server instances; keep it in some kind of global data store like Redis.
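The client-side reconnect could look roughly like the sketch below. The two callables, `ask_load_balancer` and `open_connection`, are hypothetical placeholders for however your client actually talks to the load balancer and to a chat server; the retry count and backoff values are arbitrary.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_failover(
    ask_load_balancer: Callable[[], str],
    open_connection: Callable[[str], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Ask the load balancer for a server and connect, retrying on failure."""
    last_error: Optional[OSError] = None
    for attempt in range(max_attempts):
        server = ask_load_balancer()  # a fresh assignment on every attempt
        try:
            return open_connection(server)
        except OSError as exc:  # includes ConnectionRefusedError, timeouts
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise ConnectionError(
        f"no chat server reachable after {max_attempts} attempts"
    ) from last_error
```

Because the chat state lives in the shared data store rather than on the chat server, the client can run the same function again whenever an established connection drops and carry on with whichever server it is handed.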
This allows you to be highly scalable. At extreme scale you might have data store issues with Redis, but that can be mitigated with Redis Cluster or by sharding your chats.
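If you shard by hand, one simple approach is to map each chat ID to a data store instance with a stable hash. This is only a sketch: the shard names are made up, and `shard_for_chat` is a hypothetical helper, not part of any Redis client.

```python
import hashlib
from typing import Sequence


def shard_for_chat(chat_id: str, shards: Sequence[str]) -> str:
    """Pick the shard for a chat ID using a hash that is stable across runs."""
    if not shards:
        raise ValueError("at least one shard is required")
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and would send the same chat to different shards.
    digest = hashlib.sha256(chat_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    shards = ["redis-a", "redis-b", "redis-c"]  # hypothetical instance names
    print(shard_for_chat("room-42", shards))
```

Plain modulo hashing moves most chats to a different shard when the number of shards changes. Redis Cluster avoids that by mapping keys to a fixed set of hash slots and moving slots between nodes, so it may be the easier option if you expect to resize.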