Explain why Metricbeat Kafka partition metric has a higher count than consumer metric


The problem

Hi, I am trying to visualize Kafka consumer lag in Grafana. The Metricbeat version I am using does not report lag directly (support has been added recently), so I log the offsets with Metricbeat and do the math myself. Instead of using max(partition.offset.newest) - max(consumergroup.offset), I calculate the lag as sum(partition.offset.newest) - sum(consumergroup.offset), filtered on a particular kafka.topic.name. However, the sums do not tally. On further investigation, I found that even the counts do not tally: the count for partition offsets is 30 per 10s, while the count for consumergroup offsets is 12 per 10s. I expected both counts to be the same.

[Image: topic offsets vs consumer offsets]

I do not understand why Metricbeat logs more partition events than consumergroup events. At first I thought it was because my Metricbeat configuration had 2 host groups defined, which might have caused the events to be logged multiple times. However, after changing my configuration, the count just dropped by half.

[Image: topic offsets vs consumer offsets with 1 host]

TL;DR

Why are the Metricbeat counts for partition and consumergroup different?

Setup

  1. Kafka 2 brokers
  2. Kafka topic partitions:
Topic: xxx     PartitionCount:3        ReplicationFactor:2     Configs:
Topic: xxx     Partition: 0    Leader: 2       Replicas: 2,1   Isr: 2,1
Topic: xxx     Partition: 1    Leader: 1       Replicas: 1,2   Isr: 1,2
Topic: xxx     Partition: 2    Leader: 2       Replicas: 2,1   Isr: 2,1
  3. Metricbeat config (modules.d/kafka.yml):
- module: kafka
  #metricsets:
  #  - partition
  #  - consumergroup
  period: 10s
  hosts: ["xxx.yyy:9092"]

Versions

  • Kafka 2.11-0.11.0.0
  • Elasticsearch-7.2.0
  • Kibana-7.2.0
  • Metricbeat-7.2.0

There are 1 best solutions below


After much debugging, I figured out what was wrong:

  1. For some reason, my Kafka broker 1 reports only the producer (partition) metric and no consumer metric. Connecting to broker 2 solved this problem. Connecting to both brokers adds the metrics from both together.
  2. The Lucene query on the plain field is not an exact match (presumably because the field is analyzed as text), so my data included some other consumer groups as well. For exact matching, use kafka.partition.topic.keyword: 'xxx' instead. This brought the ratio of my Kafka producer offset count to consumer offset count to 2:1.
  3. Metricbeat logs the replicas as well, so I need to add NOT kafka.partition.partition.is_leader: false to keep only the partition leaders. This made the consumer to partition ratio 1:1.
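The effect of the leader filter on the counts can be checked with a small Python sketch. The event dictionaries are hypothetical, flattened stand-ins for Metricbeat documents (the key names are made up for illustration), shaped after the topic in the question: 3 partitions, replication factor 2, leaders on brokers 2, 1 and 2. It assumes a single consumer group and a single fetch period.

```python
# Hypothetical partition events for one fetch period:
# one event per replica of each partition (3 partitions x 2 replicas).
partition_events = [
    {"partition": p, "broker": b, "is_leader": b == leader}
    for p, leader in [(0, 2), (1, 1), (2, 2)]
    for b in (1, 2)
]

# Hypothetical consumergroup events: one per partition for a single group.
consumer_events = [{"partition": p} for p in range(3)]

# Equivalent of the "NOT is_leader: false" filter: keep leaders only.
leaders_only = [e for e in partition_events if e["is_leader"]]

print(len(partition_events), len(consumer_events))  # 6 3 -> ratio 2:1
print(len(leaders_only), len(consumer_events))      # 3 3 -> ratio 1:1
```

With every replica counted, each partition appears twice on the partition side, which matches the 2:1 ratio seen before the filter was applied.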

After these 3 steps are done, I can use the formula sum(partition.offset.newest) - sum(consumergroup.offset) to get the lag.
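As a rough illustration of why the sum formula only works once replicas are excluded, here is a minimal Python sketch. All offsets and key names are made up for the example, and it assumes the events come from the same fetch period and a single consumer group.

```python
def total_lag(partition_events, consumer_events):
    """Sum of newest offsets on partition leaders minus sum of consumer offsets."""
    newest = sum(e["offset_newest"] for e in partition_events if e["is_leader"])
    committed = sum(e["offset"] for e in consumer_events)
    return newest - committed


# Hypothetical data: each partition is reported once per replica.
partition_events = [
    {"partition": 0, "is_leader": True, "offset_newest": 100},
    {"partition": 0, "is_leader": False, "offset_newest": 100},
    {"partition": 1, "is_leader": True, "offset_newest": 250},
    {"partition": 1, "is_leader": False, "offset_newest": 250},
    {"partition": 2, "is_leader": True, "offset_newest": 80},
    {"partition": 2, "is_leader": False, "offset_newest": 80},
]
consumer_events = [
    {"partition": 0, "offset": 90},
    {"partition": 1, "offset": 240},
    {"partition": 2, "offset": 80},
]

# Naive sum over all replicas double-counts every partition.
naive = sum(e["offset_newest"] for e in partition_events) - sum(
    e["offset"] for e in consumer_events
)

print(naive)                                         # 450
print(total_lag(partition_events, consumer_events))  # 20
```

The naive sum is inflated by the replica events, while the leader-only sum gives the per-partition lags (10 + 10 + 0) added together.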

However, I do not know why broker 1 doesn't have the consumer information.