I am running Confluent Platform (version 6.1.1) with the following components: 3 brokers, 3 ZooKeeper nodes, Schema Registry, 3 Kafka Connect workers, KSQL, and Confluent Control Center (CCC).
The CCC entered a failed state and I am having difficulty bringing it back.
To start from a clean slate, I created another EC2 instance (m4.2xlarge) and configured a new CCC on it, with the aim of connecting it to the existing cluster. The new CCC has exactly the same configuration as the failed one, except for a different confluent.controlcenter.id.
The new CCC starts and runs, and I can access the CCC UI, but it does not work properly: pages take too long to load, and the reported state of both the Connect cluster and the brokers keeps flipping between healthy and unhealthy.
For example, it looks like this (see screenshots):
After running for a while, it restarts automatically, and it keeps restarting every 5-7 minutes.
Each time it starts, I see a bunch of new topics created in the Kafka cluster.
After that, in control-center.log I see:
INFO [main] Setting offsets for topic=_confluent-monitoring (io.confluent.controlcenter.KafkaHelper)
INFO [main] found 12 topicPartitions for topic=_confluent-monitoring (io.confluent.controlcenter.KafkaHelper)
INFO [main] Setting offsets for topic=_confluent-metrics (io.confluent.controlcenter.KafkaHelper)
INFO [main] found 12 topicPartitions for topic=_confluent-metrics (io.confluent.controlcenter.KafkaHelper)
INFO [main] action=starting topology=command (io.confluent.controlcenter.ControlCenter)
INFO [main] waiting for streams to be in running state REBALANCING (io.confluent.command.CommandStore)
INFO [main] Streams state RUNNING (io.confluent.command.CommandStore)
INFO [main] action=started topology=command (io.confluent.controlcenter.ControlCenter)
INFO [main] action=starting operation=command-migration (io.confluent.controlcenter.ControlCenter)
INFO [main] action=completed operation=command-migration (io.confluent.controlcenter.ControlCenter)
INFO [main] action=starting topology=monitoring (io.confluent.controlcenter.ControlCenter)
INFO [main] action=started topology=monitoring (io.confluent.controlcenter.ControlCenter)
INFO [main] Starting Health Check (io.confluent.controlcenter.ControlCenter)
INFO [main] Starting Alert Manager (io.confluent.controlcenter.ControlCenter)
INFO [main] Starting Consumer Offsets Fetch (io.confluent.controlcenter.ControlCenter)
INFO [control-center-heartbeat-0] current clusterId=lCRehAk0RqmLR04nhXKHtA (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [control-center-heartbeat-0] broker id set has changed new={1001=[10.251.xx.xx:9093 (id: 1001 rack: null)], 1002=[10.251.xx.xx:9093 (id: 1002 rack: null)], 1003=[10.251.xx.xx:9093 (id: 1003 rack: null)]} removed={} (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [control-center-heartbeat-0] new controller=10.251.xx.xx:9093 (id: 1002 rack: null) (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [main] Initial capacity 128, increased by 64, maximum capacity 2147483647. (io.confluent.rest.ApplicationServer)
INFO [main] Adding listener: http://0.0.0.0:9021 (io.confluent.rest.ApplicationServer)
INFO [main] x509=X509@3a8ead9(ip-44-135-xx-xx.eu-central-1.compute.internal,h=[ip-44-135-xx-xx.eu-central-1.compute.internal],w=[]) for Server@7c8b37a8[provider=null,keyStore=file:///var/kafka-ssl/server.keystore.jks,trustStore=file:///var/kafka-ssl/client.truststore.jks] (org.eclipse.jetty.util.ssl.SslContextFactory)
INFO [main] x509=X509@3831f4c2(caroot,h=[eu-central-1.compute.internal],w=[]) for Server@7c8b37a8[provider=null,keyStore=file:///var/kafka-ssl/server.keystore.jks,trustStore=file:///var/kafka-ssl/client.truststore.jks] (org.eclipse.jetty.util.ssl.SslContextFactory)
INFO [main] jetty-9.4.38.v20210224; built: 2021-02-24T20:25:07.675Z; git: 288f3cc74549e8a913bf363250b0744f2695b8e6; jvm 11.0.13+8-LTS (org.eclipse.jetty.server.Server)
INFO [main] DefaultSessionIdManager workerName=node0 (org.eclipse.jetty.server.session)
INFO [main] No SessionScavenger set, using defaults (org.eclipse.jetty.server.session)
INFO [main] node0 Scavenging every 660000ms (org.eclipse.jetty.server.session)
INFO [main] Started o.e.j.s.ServletContextHandler@1ef5cde4{/,[jar:file:/usr/share/java/acl/acl-6.1.1.jar!/io/confluent/controlcenter/rest/static],AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler)
INFO [main] Started o.e.j.s.ServletContextHandler@5401c6a8{/ws,null,AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler)
INFO [main] Started NetworkTrafficServerConnector@5d6b5d3d{HTTP/1.1, (http/1.1)}{0.0.0.0:9021} (org.eclipse.jetty.server.AbstractConnector)
INFO [main] Started @36578ms (org.eclipse.jetty.server.Server)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.count type=monitoring cluster= value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.rate type=monitoring cluster= value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.timestamp type=monitoring cluster= value=NaN (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.min type=monitoring cluster= value=1.7976931348623157E308 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.count type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.rate type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.timestamp type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=NaN (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.min type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=1.7976931348623157E308 (io.confluent.controlcenter.util.StreamProgressReporter)
WARN [control-center-heartbeat-0] misconfigured topic=_confluent-command config=segment.bytes value=1073741824 expected=134217728 (io.confluent.controlcenter.healthcheck.HealthCheck)
WARN [control-center-heartbeat-0] misconfigured topic=_confluent-command config=delete.retention.ms value=86400000 expected=259200000 (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [control-center-heartbeat-0] misconfigured topic=_confluent-metrics config=min.insync.replicas value=1 expected=2 (io.confluent.controlcenter.healthcheck.HealthCheck)
WARN [control-center-heartbeat-1] Unable to fetch consumer offsets for cluster id lCRehAk0RqmLR04nhXKHtA (io.confluent.controlcenter.data.ConsumerOffsetsFetcher)
java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupDescriptions(ConsumerOffsetsDao.java:220)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupOffsets(ConsumerOffsetsDao.java:58)
at io.confluent.controlcenter.data.ConsumerOffsetsFetcher.run(ConsumerOffsetsFetcher.java:73)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=_confluent-ksql-eim_ksql_non_prodquery_CSAS_SDL_STMTS_GG_347 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.DisconnectException: Cancelled describeConsumerGroups request with correlation id 168 due to node 1001 being disconnected
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=connect-mongo-dci-grid-partner-test11 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeConsumerGroups
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=_confluent-ksql-eim_ksql_non_prodquery_CSAS_SDL_STMTS_UPWARD_GG_355 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: describeConsumerGroups
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=_eim_c3_non_prod-4 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: describeConsumerGroups
...
and so on...
WARN [control-center-heartbeat-1] Unable to fetch consumer offsets for cluster id lCRehAk0RqmLR04nhXKHtA (io.confluent.controlcenter.data.ConsumerOffsetsFetcher)
java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupDescriptions(ConsumerOffsetsDao.java:220)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupOffsets(ConsumerOffsetsDao.java:58)
at io.confluent.controlcenter.data.ConsumerOffsetsFetcher.run(ConsumerOffsetsFetcher.java:73)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
and so on...
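To make the health-check warnings in this log easier to read, here is a minimal Python sketch that extracts the "misconfigured topic" entries. The three sample lines are copied from the excerpt above; the names LOG_LINES, PATTERN and find_misconfigured are my own and purely illustrative. It only summarizes what the log already says. I do not know whether these mismatches are related to the restarts.

```python
import re

# Sample lines copied verbatim from the control-center.log excerpt above.
LOG_LINES = [
    "WARN [control-center-heartbeat-0] misconfigured topic=_confluent-command config=segment.bytes value=1073741824 expected=134217728 (io.confluent.controlcenter.healthcheck.HealthCheck)",
    "WARN [control-center-heartbeat-0] misconfigured topic=_confluent-command config=delete.retention.ms value=86400000 expected=259200000 (io.confluent.controlcenter.healthcheck.HealthCheck)",
    "INFO [control-center-heartbeat-0] misconfigured topic=_confluent-metrics config=min.insync.replicas value=1 expected=2 (io.confluent.controlcenter.healthcheck.HealthCheck)",
]

# Matches the key=value fields that follow the word "misconfigured".
PATTERN = re.compile(
    r"misconfigured topic=(?P<topic>\S+) config=(?P<config>\S+) "
    r"value=(?P<value>\S+) expected=(?P<expected>\S+)"
)


def find_misconfigured(lines):
    """Return (topic, config, value, expected) for each health-check mismatch."""
    results = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            results.append(
                (match["topic"], match["config"], match["value"], match["expected"])
            )
    return results


for topic, config, value, expected in find_misconfigured(LOG_LINES):
    print(f"{topic}: {config} is {value}, Control Center expects {expected}")
```

In my case this reports two settings on _confluent-command (segment.bytes, delete.retention.ms) and one on _confluent-metrics (min.insync.replicas).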
In the control-center-kafka.log I see:
INFO [control-center-heartbeat-1] Kafka version: 6.1.1-ce (org.apache.kafka.common.utils.AppInfoParser)
INFO [control-center-heartbeat-1] Kafka commitId: 73deb3aeb1f8647c (org.apache.kafka.common.utils.AppInfoParser)
INFO [control-center-heartbeat-1] Kafka startTimeMs: 1654853610852 (org.apache.kafka.common.utils.AppInfoParser)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-monitoring-message-rekey-store-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.2:9093 (id: 1002 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-monitoring-trigger-event-rekey-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.2:9093 (id: 1002 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-MonitoringStream-ONE_MINUTE-repartition-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.2:9093 (id: 1002 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-aggregatedTopicPartitionTableWindows-ONE_MINUTE-repartition-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.1:9093 (id: 1001 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
and so on ...
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-10-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1003: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-3] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-3-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1002: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-3-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1001: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-10] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-10-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1002: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=1478925475, epoch=1) to node 1003: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-6-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=1947312909, epoch=1) to node 1002: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
and so on ...
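For reference, the epoch-millisecond values in the two logs can be put on one timeline. The sketch below is only arithmetic on numbers copied from the excerpts above (startTimeMs, deadlineMs and the "timed out at" value). It assumes the "Kafka startTimeMs" line belongs to a client started in the same run as the failing describeConsumerGroups calls, which I have not verified. The constant and function names are mine.

```python
from datetime import datetime, timezone

# Values copied from the log excerpts above.
START_MS = 1654853610852      # "Kafka startTimeMs" in control-center-kafka.log
DEADLINE_MS = 1654853629184   # deadlineMs of the describeConsumerGroups calls
TIMED_OUT_MS = 1654853629224  # "timed out at" value of the same calls


def to_utc(epoch_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


# How long after the client start the calls were reported as timed out.
since_start_ms = TIMED_OUT_MS - START_MS
# How far past their deadline the calls were when the timeout was raised.
past_deadline_ms = TIMED_OUT_MS - DEADLINE_MS

print("client started:", to_utc(START_MS).isoformat())
print("calls timed out:", to_utc(TIMED_OUT_MS).isoformat())
print("ms between start and timeout:", since_start_ms)
print("ms past the deadline:", past_deadline_ms)
```

Under that assumption, the describeConsumerGroups calls were reported as timed out roughly 18 seconds after the client started, 40 ms past their deadline.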
Any ideas what could be wrong here?

