A common use case in data processing is deduplication and aggregation/rollups. ClickHouse supports both (ReplacingMergeTree for deduplication and SummingMergeTree for aggregation). We are struggling to put the two together: Materialized Views cannot be used to move the data from the deduplicated table to the rollup table, because they trigger on insert, which happens before the deduplication (see the note here).
Is there a way to achieve deduplication first and then do a rollup in ClickHouse?
Approaches we have been thinking of:
1. Doing the deduplication on insert (e.g. a Materialized View which reads from Kafka). The already deduplicated data would be written to a SummingMergeTree table, which then does the rollup. The deduplication could be done with standard SQL techniques such as `GROUP BY`, `DISTINCT`, or a window function with `row_number` and filtering by `rownum = 1` afterwards. The downside of this approach is that deduplication is only applied within the blocks read from Kafka, but not across them; the deduplication window is not adjustable.
2. Use a ReplacingMergeTree table, letting ClickHouse do the deduplication, but additionally run an external, periodic scheduler to move the data into a SummingMergeTree table. "Moving" would be an `INSERT INTO ... SELECT` statement using `FINAL` (I know, it shouldn't be used) or some other SQL deduplication as outlined above (rough sketches of both approaches are below).
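A minimal sketch of approach 1, assuming a Kafka source with `id`, `category` and `value` columns (all table names, topic names and settings here are made up for illustration). The materialized view deduplicates by `id` within each consumed block via `GROUP BY`, then pre-aggregates into a SummingMergeTree rollup:

```sql
-- Hypothetical source: a Kafka engine table (broker/topic/format are placeholders).
CREATE TABLE kafka_events
(
    id String,
    category String,
    value UInt64
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'clickhouse-rollup',
         kafka_format = 'JSONEachRow';

-- Rollup target: rows with the same category are summed during merges.
CREATE TABLE rollup
(
    category String,
    value UInt64
) ENGINE = SummingMergeTree
ORDER BY category;

-- Runs once per consumed block: the inner query keeps one row per id
-- (so deduplication is limited to that block), the outer query
-- pre-aggregates by category before writing to the rollup table.
CREATE MATERIALIZED VIEW rollup_mv TO rollup AS
SELECT category, sum(value) AS value
FROM
(
    SELECT
        id,
        any(category) AS category,
        any(value) AS value
    FROM kafka_events
    GROUP BY id
)
GROUP BY category;
```

Approach 2 would then boil down to a statement like the following, run by an external scheduler; the `dedup_events` table, the `inserted_at` bound and the query parameters are again assumptions for illustration:

```sql
-- Periodically move already-deduplicated data into the rollup table.
-- FINAL forces ReplacingMergeTree deduplication at read time.
INSERT INTO rollup
SELECT category, sum(value) AS value
FROM dedup_events FINAL
WHERE inserted_at >= {from:DateTime} AND inserted_at < {to:DateTime}
GROUP BY category;
```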
In all the documentation, blog posts and YouTube videos I have read and watched so far, I haven't found a recommended (ideally ClickHouse-only) way to first deduplicate a Kafka stream by id and then perform an aggregation on the data.
If the duplicates occur within a batch, you could try making the target table for the inserts a ReplacingMergeTree. A materialized view would then trigger on this table and use FINAL to insert into a SummingMergeTree. I believe this is your option 1: your deduplication window is as large as the insert block size.
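A rough sketch of that layout, reusing the hypothetical `rollup` table from the earlier example. Note that the deduplication inside the materialized view below is expressed as a `GROUP BY` over the inserted block rather than FINAL, so it only removes duplicates that arrive in the same insert; the ReplacingMergeTree additionally collapses duplicate ids across blocks during background merges:

```sql
-- Staging table: inserts (e.g. from Kafka) land here first.
CREATE TABLE staging_events
(
    id String,
    category String,
    value UInt64
) ENGINE = ReplacingMergeTree
ORDER BY id;

-- Fires per insert block on staging_events and pre-aggregates into the rollup.
CREATE MATERIALIZED VIEW rollup_from_staging_mv TO rollup AS
SELECT category, sum(value) AS value
FROM
(
    SELECT id, any(category) AS category, any(value) AS value
    FROM staging_events
    GROUP BY id
)
GROUP BY category;
```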