Process of cleaning data that is in a bad state


I understand that topics are immutable.

Let's say your topic is in a bad state: sections of data are out of order, there are duplicate records, and so on. What is the process for cleaning up that data, and how does the clean-up impact downstream consumers?

I see a few different ways to handle this:

  1. The consumers don't listen to that first topic, but rather listen to a cleaned-up derivative (a sketch of this follows the list).

  2. Version the topic and rewrite the data with the de-dupe logic applied, then have the consumers switch to the new topic. But then I run into the problem that, while the rewrite runs, newly arriving records either have to be buffered or end up interleaved with the older records.
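
For option 1, a minimal sketch of what the clean-up job could look like: it re-consumes the original topic, drops records whose key has already been seen, and republishes the rest to a derived topic that consumers switch to. The topic names ("orders-raw", "orders-clean"), the use of the record key as the de-dupe identity, and the in-memory cache are all illustrative assumptions, not a prescribed implementation.

    // Minimal sketch: republish a "bad" topic into a cleaned-up derivative.
    // Topic names ("orders-raw", "orders-clean") and using the record key as
    // the de-dupe identity are assumptions for illustration only.
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Properties;
    import java.util.Set;

    public class TopicRepublisher {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "cleanup-republisher");
            consumerProps.put("auto.offset.reset", "earliest");
            consumerProps.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("enable.idempotence", "true"); // don't re-introduce duplicates
            producerProps.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            // In-memory de-dupe cache; a real job would use a persistent state store
            // (e.g. Kafka Streams) so the set survives restarts.
            Set<String> seenKeys = new HashSet<>();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(Collections.singletonList("orders-raw"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // Forward only the first record seen for each key.
                        if (seenKeys.add(record.key())) {
                            producer.send(new ProducerRecord<>("orders-clean", record.key(), record.value()));
                        }
                    }
                }
            }
        }
    }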

What are some other ways this situation is handled?


1 answer below:


It sounds like the data flow architecture is not idempotent. Kafka itself does not reorder or duplicate records within a partition; if you are seeing that, the problem is with the producer. Kafka automatically removes data from topics after the retention period, so if you are only worried about existing data, you can simply wait for that period to pass. Once data is deleted by Kafka, any consumer lagging in its reads (i.e. one that wants to read from a deleted offset) will need auto.offset.reset set to earliest or latest; otherwise the consumer will raise an OffsetOutOfRange error.
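
As an illustration of that last point, here is a minimal consumer configuration sketch (Java client assumed; the group id and topic name are placeholders) showing where auto.offset.reset comes into play once retention has deleted the consumer's committed offset.

    // Sketch only: where auto.offset.reset matters once retention has deleted
    // the consumer's committed offset. Group id and topic name are placeholders.
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.util.Collections;
    import java.util.Properties;

    public class OffsetResetExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "lagging-consumer-group");
            // "earliest": resume from the oldest record still retained.
            // "latest":   skip ahead and read only new records.
            // "none":     poll() throws instead of resetting automatically.
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("orders-raw"));
                // Normal poll loop goes here; with "earliest"/"latest" the consumer
                // repositions itself when its stored offset no longer exists.
            }
        }
    }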

Meanwhile, if you can afford to skip records, you can start polling from a specific offset on a partition by using consumer.seek(partition, offset), as sketched below.
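
A hedged sketch with the Java client; the topic, partition number, and target offset are placeholders:

    // Sketch only: skip past a known-bad range by seeking one partition to a
    // chosen offset. Topic, partition number, and target offset are placeholders.
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class SeekPastBadRecords {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition partition = new TopicPartition("orders-raw", 0);
                // assign() instead of subscribe() so we control this partition directly.
                consumer.assign(Collections.singletonList(partition));
                long offsetAfterBadData = 42_000L; // placeholder: first offset past the bad range
                consumer.seek(partition, offsetAfterBadData);
                consumer.poll(Duration.ofMillis(500)).forEach(record ->
                        System.out.printf("%d: %s%n", record.offset(), record.value()));
            }
        }
    }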

The right solution depends on your business logic and the incoming data pattern, but you will be better off fixing the producer issues rather than handling them in the consumers.
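
If the duplicates and reordering come from producer retries, enabling Kafka's idempotent producer is usually the first fix on that side. A minimal sketch with the Java client (topic, key, and value are placeholders):

    // Sketch only: Kafka's idempotent producer prevents duplicates and
    // per-partition reordering caused by retries. Topic, key, and value are placeholders.
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class IdempotentProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            // Idempotence needs acks=all and at most 5 in-flight requests so the
            // broker can use sequence numbers to drop retried duplicates.
            props.put("enable.idempotence", "true");
            props.put("acks", "all");
            props.put("max.in.flight.requests.per.connection", "5");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders-raw", "order-123", "payload"));
                producer.flush();
            }
        }
    }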