Do we need to remove duplicates ourselves under at-least-once delivery?


Apache Storm and Samza guarantee at-least-once delivery, which means duplicates may appear during processing. Do we need to remove the duplicates ourselves, with deduplication logic in our own code? Take the word count problem as an example. Suppose the word 'boy' appears only once in the input, but Storm replays it because of a failure or latency, so the topology sees 'boy' twice. Is the resulting count for 'boy' two? Or does Storm remove the duplicate for us, so that the result is one?

zenbeni On

Storm won't remove duplicates for you. You have to check at the start of your stream (i.e. in your spout) whether you have already processed the root message, so that you don't emit it into your topology again and corrupt your counters. Without that check, a replayed 'boy' is counted twice.

The Idempotent Consumer pattern is what you should look at. One way to implement it is to store hashes of the most recently fetched events, so that you can ignore an event if it is accidentally sent again. An in-memory ConcurrentHashMap can do that, as can an external cache such as Redis. Don't forget to evict entries from these structures once you are certain there is no risk of receiving the event again.
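To make the idea concrete, here is a minimal in-memory sketch in plain Java. It is not Storm API: `DedupWordCounter`, its methods, and the retention window are hypothetical names made up for illustration, and it assumes every event carries a stable id (or a hash of its content) that stays the same when the event is replayed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of Storm): counts words and ignores replayed events. */
class DedupWordCounter {
    // Event id (or content hash) -> time it was first seen, in milliseconds.
    private final Map<String, Long> seen = new ConcurrentHashMap<>();
    // Word -> number of distinct events that carried it.
    private final Map<String, Long> counts = new ConcurrentHashMap<>();
    // Assumption: ids older than this can no longer be replayed.
    private final long retentionMillis;

    DedupWordCounter(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    /** Returns true if the event was counted, false if it was a duplicate. */
    boolean process(String eventId, String word, long nowMillis) {
        // putIfAbsent is atomic: only the first delivery of an id gets null back.
        if (seen.putIfAbsent(eventId, nowMillis) != null) {
            return false;
        }
        counts.merge(word, 1L, Long::sum);
        return true;
    }

    long count(String word) {
        return counts.getOrDefault(word, 0L);
    }

    /** Forgets ids older than the retention window so the map does not grow forever. */
    void evict(long nowMillis) {
        seen.entrySet().removeIf(e -> nowMillis - e.getValue() > retentionMillis);
    }

    int trackedIds() {
        return seen.size();
    }

    public static void main(String[] args) {
        DedupWordCounter counter = new DedupWordCounter(60_000L);
        counter.process("msg-1", "boy", 0L);
        counter.process("msg-1", "boy", 10L); // replay of the same event, ignored
        System.out.println(counter.count("boy")); // prints 1
    }
}
```

The retention window has to be longer than the period in which a replay can still happen, otherwise an evicted id could be counted again. With several workers, the in-memory map would have to be replaced by a shared store such as Redis, as mentioned above.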