Kafka deep healthcheck

819 Views Asked by At

We've recently run into an issue where a Kafka broker encountered a kernel issue which blocked IO (but was able to heartbeat back to zookeeper I guess). The result of this is that the Kafka broker stayed in the ISR set but was actually unable to complete any tasks.

The question is: 1) Is there any document on what Kafka checks before it emits a heartbeat, or is it just dumbing emitting heartbeats (I see https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol, but it seems to just mention the protocol without talking about what is actually checked before the heartbeat).

2) In my case, Kafka was heartbeating even though all requests are failing. Is there any way to employ deeper heartbeat within Kafka to check request success rate, etc? Or do we need to use external tools like https://github.com/pinterest/doctorkafka, https://www.slideshare.net/JiangjieQin/introduction-to-kafka-cruise-control-68180931 or https://github.com/yahoo/kafka-manager

2

There are 2 best solutions below

1
On

For your 2nd point, Kafka brokers emit a large number of metrics. If in your case brokers stopped processing requests, it should have been obvious from a number of metrics, like the basic bytes in/out per sec or network/disk IO.

It is essential to always monitor your Kafka clusters in order to be able to understand what's going on when things stop working. There are severals good articles online that list the most important Kafka, host and JVM metrics, for example:

Regarding your first question, I'm not sure what you're asking. The page you linked to is the Kafka protocol. It only details how Kafka clients and brokers interact. It doesn't cover any of the interaction between Kafka and Zookeeper.

0
On

We have built following healthcheck mechanism for our Kafka in kubernetes which kind of solved lot of our previous problems.

startupProbe:
exec:
  command:
    - sh
    - '-c'
    - >
      [[ "$(kafka-broker-api-versions --bootstrap-server
      localhost:9092 | grep 9092 | wc -l)" == "3" ]] || { echo >&2
      "Broker count is not 3"; exit 1; } && [[ "$(curl -s
      kafka-controller-headless.my-namespace:9102/metrics
      | grep
      kafka_controller_kafkacontroller_preferredreplicaimbalancecount
      | tail -n 1 | grep -o '[^ ]\+$')" == "0.0" ]] || { echo >&2
      "Replica Imbalance not 0"; exit 1; }

Here the "kafka-controller-headless.my-namespace" is the kubernetes service name of the "controller" (KRAFT)

What this does is that:

  • It will first check whether the broker count is matching (which is 3 for our setup)
  • It then also check for the "replica imbalance" to be zero - this prevents the brokers to not go down till the cluster is in balance