Kafka Connect Worker Fails to Update Session Key Until Exactly One Hour Later - How to Configure Retries?


I'm working with Kafka Connect 3.4.0 and I've encountered an issue that I'm struggling to understand. I have logs that show the Kafka Connect worker failing to read a session key from the config topic during startup. The curious part is that exactly one hour later, the session key gets updated successfully.

Here are the relevant logs:

Sep 11 16:01:23.422 [instance-id-masked] gtp-kafka-connect [2023-09-11 13:01:23,420] INFO [Worker clientId=connect-1, groupId=group-id-masked] Session key updated (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2151)
Sep 11 16:01:23.449 [instance-id-masked] gtp-kafka-connect org.apache.kafka.connect.runtime.rest.errors.ConnectRestException: This worker is still starting up and has not been able to read a session key from the config topic yet
Sep 11 17:01:23.291 [instance-id-masked] gtp-kafka-connect [2023-09-11 14:01:23,290] INFO [Worker clientId=connect-1, groupId=group-id-masked] Session key updated (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2151)

I'm trying to understand why the session key wasn't updated earlier but only got updated at an exact 1-hour interval.
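To put a number on the gap, here is a minimal Python sketch that parses the two bracketed worker timestamps from the "Session key updated" lines and computes the difference. The timestamp strings are copied from the logs above; the helper name `parse_ts` is only illustrative.

```python
from datetime import datetime

# Worker-side timestamps copied from the two "Session key updated" log lines.
FIRST_UPDATE = "2023-09-11 13:01:23,420"
SECOND_UPDATE = "2023-09-11 14:01:23,290"

# Log4j-style timestamp: date, time, then milliseconds after a comma.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(text):
    """Parse a worker log timestamp into a datetime (illustrative helper)."""
    return datetime.strptime(text, LOG_TS_FORMAT)


gap_seconds = (parse_ts(SECOND_UPDATE) - parse_ts(FIRST_UPDATE)).total_seconds()
# Distance from a full hour (3600 s), in milliseconds.
offset_from_hour_ms = round((3600 - gap_seconds) * 1000)

print(f"gap between updates: {gap_seconds} s")
print(f"offset from one hour: {offset_from_hour_ms} ms")
```

The two updates are within a fraction of a second of one hour apart, which looks more like a fixed timer or TTL than a retry with backoff.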

Questions:

  1. Is there a worker-level configuration for retrying operations like reading a session key?
  2. What could be the reason for this 1-hour delay in updating the session key?
  3. Are there any best practices or recommendations for handling such scenarios?

Any insights or suggestions would be greatly appreciated.

Thank you!

I've tried setting the max.retries and retry.backoff.ms configurations in connector.json as follows:

{
  "max.retries": 5,
  "retry.backoff.ms": 30000,
  // other configurations
}

I was expecting that these settings would apply to worker-level operations like reading a session key from the config topic. However, it seems like these settings are not affecting the worker's ability to update the session key, which still happens at an exact 1-hour interval.

Is there a worker-level configuration that I'm missing, or do these settings only apply to the tasks within the connector?
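One worker-level setting whose documented default is exactly one hour is `inter.worker.key.ttl.ms` (default 3600000 ms), which controls how long a session key for inter-worker request signing lives before it is rotated. Whether this is what produces the hourly "Session key updated" line in my case is an assumption I have not confirmed. The Python sketch below only shows how one might check which TTL a worker properties file would resolve to; the properties text and the topic name `connect-configs` are hypothetical, and `parse_properties` is a simplified illustrative parser, not part of Kafka.

```python
# Documented default for the worker setting inter.worker.key.ttl.ms (1 hour).
DEFAULT_KEY_TTL_MS = 3_600_000


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments.

    Simplified on purpose: no line continuations or escape handling.
    """
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def session_key_ttl_ms(props):
    """Return the configured session key TTL, or the default if unset."""
    return int(props.get("inter.worker.key.ttl.ms", DEFAULT_KEY_TTL_MS))


# Hypothetical worker properties; the key TTL is not set, so the default applies.
worker_props = parse_properties("""
# distributed worker settings (illustrative values)
group.id=group-id-masked
config.storage.topic=connect-configs
""")

ttl_ms = session_key_ttl_ms(worker_props)
print(f"session key TTL: {ttl_ms} ms ({ttl_ms / 60_000} minutes)")
```

If that setting is the cause, it would belong in the worker properties file rather than in connector.json, since connector-level keys such as max.retries are read by the connector and not by the worker.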
