We've recently run into a problem where a Kafka broker hit a kernel issue that blocked IO (though I guess it was still able to heartbeat back to ZooKeeper). As a result, the broker stayed in the ISR set but was actually unable to complete any tasks.
My questions are: 1) Is there any documentation on what Kafka checks before it emits a heartbeat, or is it just dumbly emitting heartbeats? (I found https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol, but it seems to just describe the protocol without saying what is actually checked before the heartbeat.)
2) In my case, Kafka was heartbeating even though all requests were failing. Is there any way to employ a deeper heartbeat within Kafka that checks request success rate, etc.? Or do we need to use external tools like https://github.com/pinterest/doctorkafka, https://www.slideshare.net/JiangjieQin/introduction-to-kafka-cruise-control-68180931 or https://github.com/yahoo/kafka-manager?
For your 2nd point, Kafka brokers emit a large number of metrics. If in your case brokers stopped processing requests, it should have been obvious from a number of metrics, like the basic bytes in/out per sec or network/disk IO.
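As a rough illustration of that kind of check, here is a minimal Python sketch that flags a broker as possibly stalled when several consecutive metric scrapes show no traffic. The names BrokerSample and looks_stalled, the threshold and the sample count are all hypothetical; in practice the values would come from whatever you use to scrape the broker's JMX metrics (for example the kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec MBean), and an idle but healthy broker would also trip this check, so treat it as a starting point rather than a definitive health test.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BrokerSample:
    """One scrape of a broker's throughput metrics (hypothetical field names)."""
    bytes_in_per_sec: float
    bytes_out_per_sec: float


def looks_stalled(
    samples: Sequence[BrokerSample],
    min_samples: int = 3,   # assumed: how many consecutive quiet scrapes to require
    floor: float = 1.0,     # assumed: bytes/sec below which traffic counts as "none"
) -> bool:
    """Return True when the last `min_samples` scrapes all show traffic below `floor`."""
    if len(samples) < min_samples:
        # Not enough history yet to say anything.
        return False
    recent = samples[-min_samples:]
    return all(
        s.bytes_in_per_sec < floor and s.bytes_out_per_sec < floor
        for s in recent
    )


if __name__ == "__main__":
    history = [
        BrokerSample(5_000.0, 7_500.0),  # healthy
        BrokerSample(0.0, 0.0),          # quiet
        BrokerSample(0.0, 0.0),          # quiet
        BrokerSample(0.0, 0.0),          # quiet
    ]
    if looks_stalled(history):
        print("broker may be stalled: no traffic in the last 3 scrapes")
```

An alert built on something like this would have fired in the scenario described, because the broker kept its ZooKeeper session alive while its throughput dropped to zero.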
It is essential to always monitor your Kafka clusters so that you can understand what's going on when things stop working. There are several good articles online that list the most important Kafka, host and JVM metrics.
Regarding your first question, I'm not sure what you're asking. The page you linked to describes the Kafka protocol. It only details how Kafka clients and brokers interact. It doesn't cover any of the interaction between Kafka and ZooKeeper.