I am deploying a Kafka Connect cluster consisting of 4 workers using Docker Swarm. In some cases at the initial deployment (when no other Kafka Connect cluster has ever existed within the environment), and only then so far, the workers cannot communicate with each other and constant rebalancing takes place.
The following logs are produced repeatedly:
[2023-07-19T10:53:34.399Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Rebalance started
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] (Re-)joining group
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Successfully joined group with generation Generation{generationId=11, memberId='connect-1-a0bd7da2-7235-4fcc-a0f9-83921b3e5a0c', protocol='sessioned'}
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Successfully synced group in generation Generation{generationId=11, memberId='connect-1-a0bd7da2-7235-4fcc-a0f9-83921b3e5a0c', protocol='sessioned'}
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Joined group at generation 11 with protocol version 2 and got assignment: Assignment{error=0, leader='connect-1-1c92ee2f-e894-475f-b330-ec2215e4611b', leaderUrl='http://10.0.50.95:8083/', offset=819, connectorIds=[], taskIds=[], revokedConnectorIds=[], revokedTaskIds=[], delay=0} with rebalance delay: 0
[2023-07-19T10:53:34.401Z] WARN [Worker clientId=connect-1, groupId=connect-cluster] Catching up to assignment's config offset.
[2023-07-19T10:53:34.401Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Current config state offset -1 is behind group assignment 819, reading to end of config log
[2023-07-19T10:53:34.402Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Finished reading to end of log and updated config snapshot, new config log offset: -1
[2023-07-19T10:53:34.402Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Current config state offset -1 does not match group assignment 819. Forcing rebalance.
I have seen the question here and several others. All of them mention that there is something wrong with the connect configs topic (either it does not have the proper configuration or it has more than 1 partition). However, in my case this is not the issue: my connect configs topic has only 1 partition, and if I redeploy the cluster with a new group id (without deleting the Kafka Connect topics beforehand), it works. I know this shouldn't be a major issue, since it only happens at the initial deployment. However, I am trying to find the root cause, because I am afraid that this might also happen on a later restart of the cluster. In that case, I could not create a new cluster with a new group id from scratch, since that could mean losing the offsets related to my deployed connectors and jeopardising my data integrity.
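For completeness, the internal config topic can be inspected with something along these lines (a sketch; the broker address and the SSL client properties file are placeholders for our actual values). The describe output should report PartitionCount: 1, and reading the topic shows which config records and offsets actually exist:

# Describe the config topic; it should have exactly 1 partition and a compact cleanup policy
kafka-topics --bootstrap-server <kafka_broker>:9093 \
  --command-config /path/to/client-ssl.properties \
  --describe --topic _connect-configs

# Read the config topic from the beginning to see the records it actually contains
kafka-console-consumer --bootstrap-server <kafka_broker>:9093 \
  --consumer.config /path/to/client-ssl.properties \
  --topic _connect-configs --from-beginning --property print.key=true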
Update: This is the configuration that we use. Below is the part of our docker-compose.yml for one worker instance; the same applies to the rest of the workers, which are deployed as separate services.
kafka-connect-worker-1:
  networks:
    - monitoring
  image: <custom_kafka_connect_image>:<version>
  entrypoint: /etc/confluent/docker/entrypoint.sh
  hostname: "kafka-connect-worker-1"
  environment:
    CONNECT_BOOTSTRAP_SERVERS: <kafka_brokers_list>
    CONNECT_SECURITY_PROTOCOL: SSL
    CONNECT_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM:
    CONNECT_SSL_TRUSTSTORE_LOCATION: /usr/lib/jvm/jre/lib/security/cacerts
    CONNECT_PRODUCER_SECURITY_PROTOCOL: SSL
    CONNECT_PRODUCER_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM:
    CONNECT_PRODUCER_SSL_TRUSTSTORE_LOCATION: /usr/lib/jvm/jre/lib/security/cacerts
    CONNECT_CONSUMER_SECURITY_PROTOCOL: SSL
    CONNECT_CONSUMER_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM:
    CONNECT_CONSUMER_SSL_TRUSTSTORE_LOCATION: /usr/lib/jvm/jre/lib/security/cacerts
    CONNECT_REST_PORT: 8083
    CONNECT_GROUP_ID: connect-cluster
    CONNECT_KEY_CONVERTER: org.apache.kafka.connect.storage.StringConverter
    CONNECT_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
    CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE: false
    CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: false
    CONNECT_OFFSET_STORAGE_TOPIC: _connect-offsets
    CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 2
    CONNECT_CONFIG_STORAGE_TOPIC: _connect-configs
    CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 2
    CONNECT_STATUS_STORAGE_TOPIC: _connect-status
    CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 2
    CONNECT_CONSUMER_AUTO_OFFSET_RESET: latest
    CONNECT_PLUGIN_PATH: /usr/local/share/kafka/plugins
    CONNECT_CONSUMER_MAX_POLL_RECORDS: 1000
    CONNECT_REST_ADVERTISED_HOST_NAME: "kafka-connect-worker-1"
  deploy:
    replicas: 1
    placement:
      constraints:
        - node.role==worker
    update_config:
      parallelism: 1
      order: start-first
    resources:
      limits:
        cpus: '1.5'
        memory: 2G
      reservations:
        cpus: '1'
        memory: 1G
  volumes:
    - ./connector-plugins:/usr/local/share/kafka/plugins
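Since the symptom is that the workers cannot reach each other, one sanity check is to call the other workers' advertised REST endpoints from inside a worker container (a sketch; it assumes the other services are named kafka-connect-worker-2, kafka-connect-worker-3, and so on, analogous to the service above):

# From inside the kafka-connect-worker-1 container: the root endpoint of a reachable worker
# returns its version, commit and Kafka cluster id
curl -s http://kafka-connect-worker-2:8083/
curl -s http://kafka-connect-worker-3:8083/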
Note: <custom_kafka_connect_image> is our custom Docker image based on confluentinc/cp-kafka-connect:7.4.0. The differences from the base image are that it exposes JMX Prometheus metrics, adds a custom logging configuration, changes some JVM arguments, and exports some secrets as env vars in the entrypoint.sh script.
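The entrypoint.sh itself is not included here; conceptually it is a thin wrapper along these lines (purely illustrative: the secret names, exporter paths and heap settings are placeholders, not our real values):

#!/bin/bash
set -e

# Export a secret mounted by Docker Swarm as an environment variable (placeholder secret name)
export CONNECT_SSL_TRUSTSTORE_PASSWORD="$(cat /run/secrets/truststore_password)"

# Attach the JMX Prometheus exporter java agent and adjust JVM arguments (placeholder paths/values)
export KAFKA_OPTS="${KAFKA_OPTS} -javaagent:/opt/jmx_prometheus_javaagent.jar=9404:/opt/jmx-exporter.yml"
export KAFKA_HEAP_OPTS="-Xms1g -Xmx1g"

# Hand over to the stock launcher from the confluentinc/cp-kafka-connect base image
exec /etc/confluent/docker/run "$@"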