I am setting up Strimzi Kafka MirrorMaker 2 in our test environment, which receives on average 100k messages per 5 minutes. We have around 25 topics with 900 partitions in total. The default configuration I set up mirrors only 60k messages per 5 minutes to the DR cluster. I am trying to optimize this configuration for better throughput and latency.
apiVersion: v1
items:
- apiVersion: kafka.strimzi.io/v1beta2
  kind: KafkaMirrorMaker2
  spec:
    clusters:
    - alias: source
      authentication:
        certificateAndKey:
          certificate: user.crt
          key: user.key
          secretName: mirrormaker1
        type: tls
      bootstrapServers: bootstrap1:443
      tls:
        trustedCertificates:
        - certificate: ca.crt
          secretName: cert-source
    - alias: target
      authentication:
        certificateAndKey:
          certificate: user.crt
          key: user.key
          secretName: mirrormaker-dr
        type: tls
      bootstrapServers: bootstrap2:443
      config:
        offset.flush.timeout.ms: 120000
      tls:
        trustedCertificates:
        - certificate: ca.crt
          secretName: dest-cert
    connectCluster: target
    livenessProbe:
      initialDelaySeconds: 40
      periodSeconds: 40
      timeoutSeconds: 30
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          key: mm2-metrics-config.yaml
          name: mm2-metrics
    mirrors:
    - checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 3
        tasksMax: 10
      groupsPattern: .*
      heartbeatConnector:
        config:
          heartbeats.topic.replication.factor: 3
      sourceCluster: source
      sourceConnector:
        config:
          consumer.request.timeout.ms: 150000
          offset-syncs.topic.replication.factor: 3
          refresh.topics.interval.seconds: 60
          replication.factor: 3
          source.cluster.producer.enable.idempotence: "true"
          sync.topic.acls.enabled: "true"
          target.cluster.producer.enable.idempotence: "true"
        tasksMax: 60
      targetCluster: target
      topicsPattern: .*
    readinessProbe:
      initialDelaySeconds: 40
      periodSeconds: 40
      timeoutSeconds: 30
    replicas: 4
    resources:
      limits:
        cpu: 9
        memory: 30Gi
      requests:
        cpu: 5
        memory: 15Gi
    version: 2.8.0
With the above config I don't see any errors in the log files.
I tried to fine-tune the sourceConnector config for higher throughput and lower latency as follows:
consumer.max.partition.fetch.bytes: 2097152
consumer.max.poll.records: 1000
consumer.receive.buffer.bytes: 131072
consumer.request.timeout.ms: 200000
consumer.send.buffer.bytes: 262144
offset-syncs.topic.replication.factor: 3
producer.acks: 0
producer.batch.size: 20000
producer.buffer.memory: 30331648
producer.linger.ms: 10
producer.max.request.size: 2097152
producer.message.max.bytes: 2097176
producer.request.timeout.ms: 150000
I am now seeing the following error in the logs, but data is still flowing and the number of messages increased slightly, to around 65k per 5 minutes. I also increased tasksMax from 60 to 800 and replicas from 4 to 8, but I don't see any difference. The network bytes in is around 20 MiB/s. Even after I further increased consumer.request.timeout.ms, the error below didn't disappear.
2022-04-26 04:09:51,223 INFO [Consumer clientId=consumer-null-1601, groupId=null] Error sending fetch request (sessionId=629190882, epoch=65) to node 4: (org.apache.kafka.clients.FetchSessionHandler) [task-thread-us-ashburn-1->us-phoenix-1-dr.MirrorSourceConnector-759] org.apache.kafka.common.errors.DisconnectException
Is there anything I can do to increase the throughput and decrease the latency?
I haven't configured Strimzi Kafka MirrorMaker before, but at first look, the producer and consumer configs seem to be the same as those exposed by the kafka-clients library. Assuming that is the case, the producer's batch.size, which is set to 20000, is not a number of records. It is in bytes, which means that with this config the producer will put at most about 20 kilobytes into each batch it sends. Try increasing it to 65536 (64 kilobytes) or higher. If the throughput still doesn't increase, increase linger.ms to 100 or higher, so that the producer waits longer for each batch to fill up before triggering a send.
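As a rough sketch, the suggestion would look something like this in the sourceConnector section of the KafkaMirrorMaker2 resource. The key names simply reuse the `producer.` prefix from the question's tuned config; whether that prefix actually reaches the producer writing to the target cluster is an assumption here and may depend on the MirrorMaker 2 and Kafka Connect version, so treat the values as a starting point to experiment with rather than a verified fix.

```yaml
# Fragment of spec.mirrors[0].sourceConnector; the rest of the resource is unchanged.
# ASSUMPTION: the "producer." prefix is copied from the question and is not
# confirmed to be the effective override path in every MirrorMaker 2 version.
sourceConnector:
  config:
    producer.batch.size: 65536   # upper bound per batch in bytes, not a record count (was 20000)
    producer.linger.ms: 100      # wait up to 100 ms for a batch to fill before sending (was 10)
  tasksMax: 60
```

A larger linger.ms trades some latency for fuller batches, so if latency matters as much as throughput, it may be worth raising batch.size first and only then increasing linger.ms in small steps.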