Multi-Region Kafka Deployments using a Single Endpoint


I am trying to set up multi-region Kafka clusters that can easily scale by adding brokers and clusters. So that producers do not have to worry about new clusters being added, is there a way to expose Kafka through a single endpoint (or a few fixed endpoints), so that end users do not have to concern themselves with changes made in the background?

Currently the setup relies on the AWS MSK provisioned service, and I am trying to follow the setup described here: https://aws.amazon.com/blogs/big-data/how-goldman-sachs-builds-cross-account-connectivity-to-their-amazon-msk-clusters-with-aws-privatelink/. Rather than exposing the cluster to another AWS account, would it be possible to expose it to the internet using a public domain, with multiple clusters sitting behind a single URL? I am wondering whether I could control traffic to each cluster using Route 53, splitting it as required (e.g. 100/0 or 50/50). So far I have set up an NLB with a target group (with health checks enabled), but I have not been able to send any events to Kafka.

Please help me understand:

  1. If this setup is even possible
  2. What the best alternatives are if not (I do not want to use MirrorMaker to replicate data across two clusters, as it does not solve the problem of abstracting the client from the underlying clusters).

My fallback is to list brokers from every cluster in each application and to inform the teams managing those applications whenever additional clusters are added.

There is 1 solution below

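Yes, Route 53 can be used to bootstrap clients so that they do not need to list individual broker addresses. After that first request, however, clients must connect to the individual brokers via each broker's advertised.listeners address. You do not want to advertise the same NLB / Route 53 address on every broker: each connection would loop back through the load balancer and could land on any broker, not the specific one (for example, a partition leader) that the client needs.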
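To make the bootstrap-versus-advertised distinction concrete, here is a small simulation. It is only a sketch and does not talk to a real cluster: the `Broker` class, the helper functions, and all hostnames (`kafka.example.com`, `b-1.kafka.example.com`, ...) are made up for illustration. It shows that a client can only pick out one particular broker when each broker advertises its own address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Broker:
    broker_id: int
    advertised: str  # host:port this broker tells clients to connect to


def metadata(brokers):
    """Simulate a metadata response: broker id -> advertised address."""
    return {b.broker_id: b.advertised for b in brokers}


def can_target_each_broker(brokers):
    """True if every broker advertises a distinct address, so a client can
    reach one specific broker (e.g. a partition leader) directly."""
    addresses = list(metadata(brokers).values())
    return len(set(addresses)) == len(addresses)


# Hypothetical hostnames, for illustration only.
# The single Route 53 / NLB name is used for the first (bootstrap) request only.
BOOTSTRAP = "kafka.example.com:9092"

# Each broker advertises its own address: clients can reach any given broker.
per_broker = [Broker(i, f"b-{i}.kafka.example.com:9092") for i in (1, 2, 3)]

# Every broker advertises the shared name: clients cannot choose a broker.
shared = [Broker(i, BOOTSTRAP) for i in (1, 2, 3)]

print(can_target_each_broker(per_broker))  # True
print(can_target_each_broker(shared))      # False
```

In a real deployment the equivalent of `per_broker` is what each broker's advertised.listeners should look like, while the shared name goes only into the client's bootstrap.servers.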

With MSK, Confluent Cloud, or similar hosted solutions, you are already provided with such a single bootstrap address.

You should not create a "stretch cluster" that spans clouds or cloud regions (anything wider than availability zones), because the default timeouts do not handle the high network latency well, particularly if you use ZooKeeper. I am not even sure MSK lets you configure a cluster this way...