Kafka Mirrormaker2 config optimization


I am setting up Strimzi Kafka MirrorMaker2 in our test environment, which receives on average 100k messages per 5 minutes. We have around 25 topics with about 900 partitions in total across them. With the default configuration I set up, only about 60k messages per 5 minutes are mirrored to the DR cluster. I am trying to tune this configuration for better throughput and lower latency.

apiVersion: v1
items:
- apiVersion: kafka.strimzi.io/v1beta2
  kind: KafkaMirrorMaker2
  spec:
    clusters:
    - alias: source
      authentication:
        certificateAndKey:
          certificate: user.crt
          key: user.key
          secretName: mirrormaker1
        type: tls
      bootstrapServers: bootstrap1:443
      tls:
        trustedCertificates:
        - certificate: ca.crt
          secretName: cert-source
    - alias: target
      authentication:
        certificateAndKey:
          certificate: user.crt
          key: user.key
          secretName: mirrormaker-dr
        type: tls
      bootstrapServers: bootstrap2:443
      config:
        offset.flush.timeout.ms: 120000
      tls:
        trustedCertificates:
        - certificate: ca.crt
          secretName: dest-cert
    connectCluster: target
    livenessProbe:
      initialDelaySeconds: 40
      periodSeconds: 40
      timeoutSeconds: 30
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          key: mm2-metrics-config.yaml
          name: mm2-metrics
    mirrors:
    - checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 3
        tasksMax: 10
      groupsPattern: .*
      heartbeatConnector:
        config:
          heartbeats.topic.replication.factor: 3
      sourceCluster: source
      sourceConnector:
        config:
          consumer.request.timeout.ms: 150000
          offset-syncs.topic.replication.factor: 3
          refresh.topics.interval.seconds: 60
          replication.factor: 3
          source.cluster.producer.enable.idempotence: "true"
          sync.topic.acls.enabled: "true"
          target.cluster.producer.enable.idempotence: "true"
        tasksMax: 60
      targetCluster: target
      topicsPattern: .*
    readinessProbe:
      initialDelaySeconds: 40
      periodSeconds: 40
      timeoutSeconds: 30
    replicas: 4
    resources:
      limits:
        cpu: 9
        memory: 30Gi
      requests:
        cpu: 5
        memory: 15Gi
    version: 2.8.0

With the above config I don't see any errors in the log files.

I tried to fine-tune the config for higher throughput and lower latency as follows:

      consumer.max.partition.fetch.bytes: 2097152
      consumer.max.poll.records: 1000
      consumer.receive.buffer.bytes: 131072
      consumer.request.timeout.ms: 200000
      consumer.send.buffer.bytes: 262144
      offset-syncs.topic.replication.factor: 3
      producer.acks: 0
      producer.batch.size: 20000
      producer.buffer.memory: 30331648
      producer.linger.ms: 10
      producer.max.request.size: 2097152
      producer.message.max.bytes: 2097176
      producer.request.timeout.ms: 150000
      

I am now seeing the following errors in the logs, but data is still flowing, and the message rate increased slightly to around ~65k/5 mins. I also increased tasksMax from 60 to 800 and replicas from 4 to 8, but neither change made a difference. Network bytes in is around ~20 MiB/s. Even after further increasing consumer.request.timeout.ms, the error below did not disappear.

2022-04-26 04:09:51,223 INFO [Consumer clientId=consumer-null-1601, groupId=null] Error sending fetch request (sessionId=629190882, epoch=65) to node 4: (org.apache.kafka.clients.FetchSessionHandler) [task-thread-us-ashburn-1->us-phoenix-1-dr.MirrorSourceConnector-759] org.apache.kafka.common.errors.DisconnectException

Is there anything I can do to increase the throughput and decrease the latency?

There is 1 answer below.
I haven't configured Strimzi Kafka MirrorMaker before, but at first glance the producer and consumer configs appear to be the same ones exposed by the kafka-clients library. Assuming that is the case, the producer's batch.size, which is set to 20000, is not a number of records: it is measured in bytes, which means that with this config the producer will transmit at most ~20 kilobytes per batch. Try increasing it to 65536 (64 kilobytes) or higher. If throughput still doesn't increase, raise linger.ms to 100 or higher, so that the producer waits longer for each batch to fill up before triggering a send.
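Assuming the Strimzi sourceConnector config passes `producer.*` keys through to the underlying producer (as the keys in the question suggest), the suggested change would look something like this. The values are illustrative starting points, not tested settings, and the buffer.memory bump is my own assumption to give larger batches headroom:

```yaml
sourceConnector:
  config:
    # batch.size is in bytes, not records: 65536 allows ~64 KB per batch
    producer.batch.size: 65536
    # wait up to 100 ms for each batch to fill before sending
    producer.linger.ms: 100
    # assumption: more buffer space so larger batches don't block sends
    producer.buffer.memory: 67108864
  tasksMax: 60
```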