I learned about using Kafka's topics as a changelog to avoid doing synchronous RPC, but I don't understand how we deal with consistency as topics are not persistent (retention policy).
i.e, I run an application, 2 microservices:
- The User Service, is used to update users' data in the system(address, First Name, phone number...).
- The Shipping Service, uses Users' data to create a shipping order and send it to the shipping company's system.
Each service has its own db to persist the Users' data.. To communicate any changes made on a User's data, the confluent's teacher proposed to create a topic and use it as a changelog. User Service inputs the changes, other microservices can consume.
But What if:
- User X changed his address 1 year ago
- the retention policy of the changelog is 6 months
- today we add BillingService to the system.
The BillingService won't know the User X's address, so its view is inconsistent. Should I run a one-time "Call UserService to copy its full DB" when a new service enters the system? Seems ugly solution.
More tricky and challenging:
- We have a changelog with a retention policy of time T
- A consumer service failed more than time T
Therefore, it will potentially miss some changelogs. How do we deal with that? We are never confident how the service knows everything it has to know about the users.
Did some research, but found nothing. I really think I don't have enough vocabulary yet to do good research, as the problem sounds pretty common to everyone. Sorry if it exists a source dedicated to this problem that I did not find!
If the changelog topic is covering entities that are of unbounded lifetime (like your users, hopefully), that strongly suggests that the retention period for that topic should be infinite. Chances are that topic is sufficiently low volume that infinite retention is viable (consider that it can probably be partitioned).
If for some reason that's not viable, you can arrange for producers to at some period shorter than the retention period publish out "this is the state of this entity" for every entity they own to the topic. For entities which don't change very much, this is pretty wasteful and duplicative (but for those a very long to infinite retention period is more viable), for entities which do change a lot, this is a rounding error in terms of volume.
That neatly solves the first case and eventually allows for the second to be solved. For the second, there is basically no solution, which means that you have to choose the retention period for a topic such that you can guarantee that no consumer of this topic will ever be down (or not deployed) for longer than the retention period: this typically means that a retention period shorter than, say, 7 days, should be really heavily scrutinized. Note that if you have a 1 week retention period and a consumer has been down for more than a few days, you can temporarily bump up the retention period to buy you time for the consumer to get fixed, and if there's a consumer which can be down for more than a week without anybody noticing, how important is that consumer, really?