Kafka Connect constant rebalancing


I am deploying a Kafka Connect cluster of 4 workers using Docker Swarm. In some cases at the initial deployment (when no other Kafka Connect cluster has ever existed in the environment), and only then so far, the workers cannot communicate with each other and a constant rebalancing takes place.

The following log lines are produced repeatedly:

[2023-07-19T10:53:34.399Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Rebalance started
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] (Re-)joining group
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Successfully joined group with generation Generation{generationId=11, memberId='connect-1-a0bd7da2-7235-4fcc-a0f9-83921b3e5a0c', protocol='sessioned'}
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Successfully synced group in generation Generation{generationId=11, memberId='connect-1-a0bd7da2-7235-4fcc-a0f9-83921b3e5a0c', protocol='sessioned'}
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Joined group at generation 11 with protocol version 2 and got assignment: Assignment{error=0, leader='connect-1-1c92ee2f-e894-475f-b330-ec2215e4611b', leaderUrl='http://10.0.50.95:8083/', offset=819, connectorIds=[], taskIds=[], revokedConnectorIds=[], revokedTaskIds=[], delay=0} with rebalance delay: 0
[2023-07-19T10:53:34.401Z] WARN [Worker clientId=connect-1, groupId=connect-cluster] Catching up to assignment's config offset.
[2023-07-19T10:53:34.401Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Current config state offset -1 is behind group assignment 819, reading to end of config log
[2023-07-19T10:53:34.402Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Finished reading to end of log and updated config snapshot, new config log offset: -1
[2023-07-19T10:53:34.402Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Current config state offset -1 does not match group assignment 819. Forcing rebalance.
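
From these lines, the leader reports a group config offset of 819, but the worker then reads the _connect-configs topic to its end and still ends up at offset -1, i.e. it apparently never sees any records there, so it forces another rebalance. A quick sanity check of that topic (the broker address and the SSL client properties file below are placeholders):

# Partition count and replication of the Connect config topic
kafka-topics --bootstrap-server <kafka_broker> --command-config client-ssl.properties \
  --describe --topic _connect-configs

# It should also be compacted (cleanup.policy=compact)
kafka-configs --bootstrap-server <kafka_broker> --command-config client-ssl.properties \
  --describe --entity-type topics --entity-name _connect-configs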

I have seen the question here and several others. All of them mention that something is wrong with the connect configs topic (either it does not have the proper configuration or it has more than one partition). However, in my case this is not the issue: my connect configs topic has only one partition, and if I redeploy the cluster with a new group id (without deleting the Kafka Connect topics first), it works.

I know this shouldn't be a major issue, since it only happens at the initial deployment. However, I am trying to find the root cause, since I am afraid it might also happen on a later restart of the cluster. In that case I could not create a new cluster with a new group id from scratch, since that may lead to losing the offsets of my deployed connectors and jeopardise my data integrity.
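
For reference, if recreating the cluster ever became necessary, the source-connector offsets are just records in the _connect-offsets topic, so they could at least be dumped for inspection or backup beforehand with a plain console consumer (sink connectors track their progress in regular consumer groups instead). A rough sketch, with the broker address and SSL client properties file as placeholders:

# Keys identify the connector and source partition, values hold the committed source offsets
kafka-console-consumer --bootstrap-server <kafka_broker> --consumer.config client-ssl.properties \
  --topic _connect-offsets --from-beginning --property print.key=true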

Update: This is the configuration we use. It is the part of our docker-compose.yml for one worker instance; the same applies to the rest of the workers, which are deployed as separate services.

kafka-connect-worker-1:
  networks:
    - monitoring
  image: <custom_kafka_connect_image>:<version>
  entrypoint: /etc/confluent/docker/entrypoint.sh
  hostname: "kafka-connect-worker-1"
  environment:
    CONNECT_BOOTSTRAP_SERVERS: <kafka_brokers_list>
    CONNECT_SECURITY_PROTOCOL: SSL
    CONNECT_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM:
    CONNECT_SSL_TRUSTSTORE_LOCATION: /usr/lib/jvm/jre/lib/security/cacerts
    CONNECT_PRODUCER_SECURITY_PROTOCOL: SSL
    CONNECT_PRODUCER_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM: 
    CONNECT_PRODUCER_SSL_TRUSTSTORE_LOCATION: /usr/lib/jvm/jre/lib/security/cacerts
    CONNECT_CONSUMER_SECURITY_PROTOCOL: SSL
    CONNECT_CONSUMER_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM:
    CONNECT_CONSUMER_SSL_TRUSTSTORE_LOCATION: /usr/lib/jvm/jre/lib/security/cacerts
    CONNECT_REST_PORT: 8083
    CONNECT_GROUP_ID: connect-cluster
    CONNECT_KEY_CONVERTER: org.apache.kafka.connect.storage.StringConverter
    CONNECT_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
    CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE: false
    CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: false
    CONNECT_OFFSET_STORAGE_TOPIC: _connect-offsets
    CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 2
    CONNECT_CONFIG_STORAGE_TOPIC: _connect-configs
    CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 2
    CONNECT_STATUS_STORAGE_TOPIC: _connect-status
    CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 2
    CONNECT_CONSUMER_AUTO_OFFSET_RESET: latest
    CONNECT_PLUGIN_PATH: /usr/local/share/kafka/plugins
    CONNECT_CONSUMER_MAX_POLL_RECORDS: 1000
    CONNECT_REST_ADVERTISED_HOST_NAME: "kafka-connect-worker-1"
  deploy:
    replicas: 1
    placement:
      constraints:
        - node.role==worker
    update_config:
      parallelism: 1
      order: start-first
    resources:
      limits:
        cpus: '1.5'
        memory: 2G
      reservations:
        cpus: '1'
        memory: 1G
  volumes:
    - ./connector-plugins:/usr/local/share/kafka/plugins

Note: <custom_kafka_connect_image> is our custom Docker image built on confluentinc/cp-kafka-connect:7.4.0. It differs from the base image in that it exposes JMX Prometheus metrics, adds a custom logging configuration, changes some JVM arguments and exports some secrets as environment variables in the entrypoint.sh script.
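
Since the issue only shows up on the very first deployment, when the workers auto-create the internal topics themselves, one way to take auto-creation out of the picture would be to pre-create the three topics with the expected settings before starting the stack. A rough sketch (broker address and SSL client properties file are placeholders; 25 and 5 are the Connect defaults for the offsets and status topics):

kafka-topics --bootstrap-server <kafka_broker> --command-config client-ssl.properties \
  --create --topic _connect-configs --partitions 1 --replication-factor 2 --config cleanup.policy=compact
kafka-topics --bootstrap-server <kafka_broker> --command-config client-ssl.properties \
  --create --topic _connect-offsets --partitions 25 --replication-factor 2 --config cleanup.policy=compact
kafka-topics --bootstrap-server <kafka_broker> --command-config client-ssl.properties \
  --create --topic _connect-status --partitions 5 --replication-factor 2 --config cleanup.policy=compact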
