How to generate identities when the source of truth is Apache Kafka?


I am building a system consisting of a couple of microservices. They will follow the CQRS, ES (Event Sourcing) and DDD approaches. I want to use Apache Kafka as the "source of truth" - as Jay Kreps calls it in many materials on Confluent's and LinkedIn's engineering blogs.

My problem and primary question is:

How to generate identities for new entities when Apache Kafka is a source of truth?

Example:

I have an Order (in an online shop). Since Kafka is my source of truth, I want to put the data into Kafka first and then use the data from Kafka to populate some databases, e.g. MySQL or Elasticsearch. When a user places a new Order, I add a "newOrder" event to the log with the details of the Order (which articles and quantities were ordered, customer data, shipping address, etc.). I want the Order to have an ID so that I can use it when populating the data in MySQL and Elasticsearch.
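To make this concrete, here is a minimal sketch of the producing side using the plain Kafka Java producer. The topic name "orders" and the JSON field names are just placeholders, and orderId is exactly the part I don't know how to fill in:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class NewOrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The open question: what do I put here, before the event has even
            // reached MySQL or Elasticsearch?
            String orderId = "???";
            String event = "{\"type\":\"newOrder\",\"orderId\":\"" + orderId + "\","
                    + "\"customer\":\"jane.doe\",\"items\":[{\"sku\":\"A-42\",\"qty\":2}]}";
            producer.send(new ProducerRecord<>("orders", orderId, event));
        }
    }
}
```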

Do you know any techniques or best practices for how to assign an ID (an identity) to the Order?

This would be easy when using e.g. MySQL as the source of truth. In that case I would have an "id" column to which MySQL would assign an auto-increment integer.

Edit

I know the concept of a GUID, but I am looking for other approaches.


There are 2 best solutions below


There is one 'easy' solution, but as you will see, it gets more complicated in practice.

Each message in Kafka is assigned a unique offset (a 64-bit long) by the broker. It is a kind of SQL-like sequence, monotonically increasing with messages, and it is always sent to the client together with the actual payload (key/message). It is at the very core of the Kafka protocol (clients keep polling for data by sending the last seen offset to the brokers), so it is not going to disappear in new versions.

As long as you have a single partition which never fails, it is a perfect solution for your problem - nicely ordered artificial keys that can be put into a single db column and that will be resent exactly as expected if you replay the Kafka stream (you can then reconcile them against your database, perform upserts, or just fail on a PK violation). You probably don't really want to fail on PK duplication, because if your application crashes, Kafka will resend part of the already-seen messages, so some kind of upsert/reconciliation is better. In any case, it should work without issues.
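A minimal sketch of what this could look like on the consuming side, assuming a single-partition "orders" topic, a MySQL table keyed by the offset, and made-up connection details; the upsert (MySQL's ON DUPLICATE KEY UPDATE) is there so that replayed messages reconcile instead of failing on a PK violation:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class OffsetKeyedProjector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-projector");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection("jdbc:mysql://localhost/shop", "app", "secret")) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // The broker-assigned offset becomes the order id; the upsert
                    // keeps a replayed/redelivered message from failing the insert.
                    try (PreparedStatement ps = db.prepareStatement(
                            "INSERT INTO orders (id, payload) VALUES (?, ?) "
                                    + "ON DUPLICATE KEY UPDATE payload = VALUES(payload)")) {
                        ps.setLong(1, record.offset());
                        ps.setString(2, record.value());
                        ps.executeUpdate();
                    }
                }
            }
        }
    }
}
```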

Things get more complicated with more than one partition (which is quite common with Kafka). Offsets are unique only within a single partition and there is no numeric relation between partitions - so offset 1000 in partition 0 might be a lot 'later' than offset 5000 in partition 1 ('later' is in quotes because, when you think about partitions properly, you should not consider anything between them as ordered in time). This means that:

  • you need to enrich your primary key with the partition id (see the sketch after this list), which no longer looks that pretty

  • you lose the nice visual side effect of all the orders being properly ordered in time by their primary key
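A tiny sketch of that enrichment - the helper class and the key layout are only illustrations of the idea, not a fixed recipe:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

public final class CompositeOrderId {

    // Once the topic has several partitions, the offset alone is not unique,
    // so the partition id has to become part of the identity, e.g. "3-1000".
    // In the database this maps to a composite primary key such as
    // PRIMARY KEY (partition_id, kafka_offset).
    public static String of(ConsumerRecord<?, ?> record) {
        return record.partition() + "-" + record.offset();
    }
}
```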

This still works OK as long as your Kafka cluster has not failed catastrophically. If you ever need to do a full clean restart of your Kafka/ZooKeeper environment, all offsets will be reset to 0. I don't know of any way to have them start from a higher number (there are plenty of ways to change consumer offsets, but I haven't found any to bump producer/broker offsets). At that point your entire logic is broken and there is no easy way to recover from that state (except maybe changing your code and doing a trick like assuming partitionid = partitionid + 100 or something like that, effectively adding a third part to the primary key, a 'generation id').

From what I understand, the assumption is that Kafka is never supposed to fail in such a way - when properly configured, there will be replicas, failovers, rolling updates and so on. But are you going to bet your entire design on never hitting out-of-memory, out-of-disk-space, or plain-bug-in-a-new-version-of-Kafka-when-upgrading-from-an-obsolete-old-format kinds of issues?

You might want to speak to people with more Kafka experience - we have hit major issues requiring a clean installation a few times in development (always our fault - hitting the physical RAM limit and having the Linux OOM killer take out random things, running out of disk space, or messing up Docker startup/mounts). Possibly, with more effort, it would have been possible to recover the old state in every case; we just took the cheap way out and reset everything (it was dev and we were not externally dependent on the offsets).


@miglanc I think you should use an independent ID mechanism (as simple as a GUID). From a system perspective it shouldn't really matter what the ID looks like. Make sure that whatever you choose is cross-platform (it will work with MySQL, MSSQL, Elasticsearch).

If you plan to use the id in a RESTful web service and you want it to look more "natural", then you can also create your own ID scheme. With 6 alphanumeric characters (say, uppercase letters and digits) you have about two billion possibilities (36^6). Just create a service that does this for you.

IdentityService.NewId() => "X4ER4T". That way you can use it in your RESTful APIs and it feels "natural": GET api/order/X4ER4T

Also, you can create your own validation rules when generating them. You can generate them randomly or sequentially. You can keep a 'batch' of unused ids in memory to speed up the process (eliminating round trips to the database to check for the latest available ID).
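A rough sketch of such a service in Java - the name just mirrors the IdentityService.NewId() example above, and the database round trip for reserving or validating a batch of ids is only hinted at in a comment:

```java
import java.security.SecureRandom;
import java.util.ArrayDeque;
import java.util.Deque;

public class IdentityService {

    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final int LENGTH = 6;

    private final SecureRandom random = new SecureRandom();
    private final Deque<String> batch = new ArrayDeque<>();

    // Returns a short, URL-friendly id such as "X4ER4T".
    public synchronized String newId() {
        if (batch.isEmpty()) {
            refillBatch(100);
        }
        return batch.poll();
    }

    // Pre-generate a batch of ids; in a real system this is the single place
    // where you would reserve them / check for collisions in the database,
    // so handing out an id costs no extra round trip.
    private void refillBatch(int size) {
        for (int i = 0; i < size; i++) {
            StringBuilder sb = new StringBuilder(LENGTH);
            for (int j = 0; j < LENGTH; j++) {
                sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
            }
            batch.add(sb.toString());
        }
    }
}
```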

I did that for an insurance company (a similar scenario to yours) and they were quite happy.