I understand that topics are immutable.
Let's say your topic is in a bad state: sections of data are out of order, there are duplicate records, etc. What is the process for cleaning up that data, and how does that process impact downstream consumers?
I see a few different ways to handle this:
The consumers don't read from the original topic at all, but instead from a cleaned-up derivative topic.
Version the topic and rewrite the data with the de-dupe logic applied (roughly sketched below), then have the consumers switch to the new topic. But then I run into the situation where the rewritten old records are either buffered or end up interleaved with new records that continue to arrive while the rewrite runs.
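For reference, this is the kind of rewrite job I have in mind for the second approach, using plain Java clients. The topic names `orders-v1`/`orders-v2`, the key-based de-dupe, and the in-memory key set are all placeholders for my actual setup, not a finished implementation:

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TopicRewriteJob {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-rewrite-job");
        cProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Naive in-memory de-dupe state; a real job would need something durable.
        Set<String> seenKeys = new HashSet<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {

            consumer.subscribe(List.of("orders-v1"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Skip records whose key has already been written to the new topic.
                    if (record.key() != null && !seenKeys.add(record.key())) {
                        continue;
                    }
                    producer.send(new ProducerRecord<>("orders-v2", record.key(), record.value()));
                }
            }
        }
    }
}
```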
What are some other ways this situation is handled?
It sounds like your data flow architecture is not idempotent. Kafka itself never reorders or duplicates data within a partition; if you are seeing that, the problem is with the producer. Kafka automatically removes data from topics after the retention period, so if you are only worried about the existing bad data, you can simply wait for it to age out. Once data has been deleted, any consumer lagging in its reads (i.e. one that wants to read from a deleted offset) will have to set `auto.offset.reset` to `earliest` or `latest`, otherwise it will get an `OffsetOutOfRangeException`. Meanwhile, if you can afford to skip records, you can start polling from a specific partition and offset using `consumer.seek(partition, offset)`.
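A minimal sketch of both of those together. The broker address, topic name `events`, partition 0, and the target offset are placeholders you would replace with your own values:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SkipAheadConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // If the requested offset has already been deleted, fall back to the oldest
        // available offset; with auto.offset.reset=none the poll would fail with
        // OffsetOutOfRangeException instead.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("events", 0);
            consumer.assign(List.of(partition));
            // Skip everything before offset 42000 (placeholder) on this partition.
            consumer.seek(partition, 42_000L);
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(record ->
                        System.out.printf("offset=%d key=%s%n", record.offset(), record.key()));
            }
        }
    }
}
```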
The right solution will depend on your business logic and incoming data pattern, but you will be better off fixing the producer issues than trying to handle them in the consumer.
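Since the long-term fix is on the producer side, one thing worth checking is whether the producer has idempotence enabled, which lets the broker de-duplicate retried batches. A minimal sketch; the broker address, topic, key, and value are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // With idempotence on, the broker de-duplicates resent batches, so producer
        // retries no longer create duplicates or reorder writes within a partition.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders-v1", "order-123", "{\"status\":\"created\"}"));
        }
    }
}
```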