My application runs on a Kubernetes cluster of 3 nodes and uses Kafka to stream data. I am testing my system's ability to recover from node failure, so I deliberately take one of the nodes down for 1 minute.
Around 50% of the time, I experience the loss of a single data record after the node failure. If the controller Kafka broker was running on the failed node, I see that a new controller broker is elected as expected. When the data loss occurs, I see the following error in the new controller broker's log:
ERROR [Controller id=2 epoch=13] Controller 2 epoch 13 failed to change state for partition __consumer_offsets-45 from OfflinePartition to OnlinePartition (state.change.logger) [controller-event-thread]
I am not sure if that's the problem, but searching the web for information about this error made me suspect that I need to configure Kafka to have more than 1 replica for each topic.
This is what my topics/partitions/replicas configuration looks like:
My questions: Is my suspicion that more replicas are required correct?
If yes, how do I increase the number of replicas per topic? I played around with a few broker parameters such as default.replication.factor and replication.factor, but I did not see the number of replicas change.
If no, what is the meaning of this error log?
Thanks!
Yes, your suspicion is correct. With a replication factor of 1, if the broker hosting the single replica of a partition goes down, you can expect that partition to go offline, which is what the controller error is reporting: it cannot move __consumer_offsets-45 back to OnlinePartition because its only replica lives on the failed node. If you have unclean leader election disabled, however, you shouldn't lose data that has already been persisted to the broker.
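For reference, here is a minimal sketch of broker settings (server.properties) that relate to this failure mode. The values are assumptions for a 3-broker cluster, and the replication settings only take effect for topics created after the change, not for existing ones:

# Sketch only -- example values for a 3-broker cluster, adjust to your setup
default.replication.factor=3          # replicas for newly / auto-created topics
offsets.topic.replication.factor=3    # replicas for the internal __consumer_offsets topic
min.insync.replicas=2                 # acks=all writes need at least 2 in-sync replicas
unclean.leader.election.enable=false  # never elect an out-of-sync replica as leader (the default)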
To modify existing topics, you must use the kafka-reassign-partitions tool, not the broker settings, since those only apply to brand-new topics (see "Kafka | Increase replication factor of multiple topics"); a sketch is shown below. Ideally, you should also disable auto topic creation, to force clients to use Strimzi KafkaTopic CRD resources that include a replication factor, and you can use other k8s tools to verify that they have values greater than 1.
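A minimal sketch of the reassignment, assuming a hypothetical topic named my-topic with a single partition 0 and brokers with ids 0, 1 and 2 available; adjust the topic name, partition numbers and broker ids to your cluster, and note that older Kafka releases take --zookeeper instead of --bootstrap-server:

# write a reassignment plan listing the desired replica set per partition
cat > increase-rf.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "my-topic", "partition": 0, "replicas": [0, 1, 2] }
  ]
}
EOF

# run the reassignment
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file increase-rf.json --execute

# check that it completed
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file increase-rf.json --verify

The same approach can be applied to the internal __consumer_offsets partitions, which is the topic your error refers to.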
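And a sketch of a Strimzi KafkaTopic resource with a replication factor greater than 1. The topic and cluster names here are assumptions, and the apiVersion depends on your Strimzi release:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic                      # hypothetical topic name
  labels:
    strimzi.io/cluster: my-cluster    # must match the name of your Kafka cluster resource
spec:
  partitions: 3
  replicas: 3

You can then check the requested replica count with something like kubectl get kafkatopics -o yaml and confirm that spec.replicas is greater than 1 for every topic.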