A common use case in data processing is deduplication and aggregation/rollups. ClickHouse supports both (ReplacingMergeTree for deduplication and SummingMergeTree for aggregation). We are struggling to put the two together: Materialized Views cannot be used to move the data from the deduplicated table to the rollup table, because they trigger on insert, which happens before the deduplication (see the note here).
Is there a way to achieve deduplication first and then do a rollup in ClickHouse?
Approaches we have been thinking of:
1. Doing the deduplication on insert (e.g. a Materialized View which reads from Kafka). The already deduplicated data would be written to a SummingMergeTree table, which then does the rollup. The deduplication could be done with standard SQL techniques such as `GROUP BY`, `DISTINCT`, or a window function with `row_number` and filtering by `rownum = 1` afterwards. The downside of this approach is that deduplication is only applied within the blocks read from Kafka, but not across them; the deduplication window is not adjustable.
2. Use a ReplacingMergeTree table, letting ClickHouse do the deduplication, but additionally run an external, periodic scheduler to move the data into a SummingMergeTree table. "Moving" would be an `INSERT INTO ... SELECT` statement using `FINAL` (I know, it shouldn't be used) or some other SQL deduplication as outlined above (rough sketches of both approaches are below).
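A minimal sketch of approach 1, assuming a Kafka source with `id`, `category` and `value` columns (all table names, topic names and settings here are made up for illustration). The materialized view deduplicates by `id` within each consumed block via `GROUP BY`, then pre-aggregates into a SummingMergeTree rollup:

```sql
-- Hypothetical source: a Kafka engine table (broker/topic/format are placeholders).
CREATE TABLE kafka_events
(
    id String,
    category String,
    value UInt64
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'clickhouse-rollup',
         kafka_format = 'JSONEachRow';

-- Rollup target: rows with the same category are summed during merges.
CREATE TABLE rollup
(
    category String,
    value UInt64
) ENGINE = SummingMergeTree
ORDER BY category;

-- Runs once per consumed block: the inner query keeps one row per id
-- (so deduplication is limited to that block), the outer query
-- pre-aggregates by category before writing to the rollup table.
CREATE MATERIALIZED VIEW rollup_mv TO rollup AS
SELECT category, sum(value) AS value
FROM
(
    SELECT
        id,
        any(category) AS category,
        any(value) AS value
    FROM kafka_events
    GROUP BY id
)
GROUP BY category;
```

Approach 2 would then boil down to a statement like the following, run by an external scheduler; the `dedup_events` table, the `inserted_at` bound and the query parameters are again assumptions for illustration:

```sql
-- Periodically move already-deduplicated data into the rollup table.
-- FINAL forces ReplacingMergeTree deduplication at read time.
INSERT INTO rollup
SELECT category, sum(value) AS value
FROM dedup_events FINAL
WHERE inserted_at >= {from:DateTime} AND inserted_at < {to:DateTime}
GROUP BY category;
```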
In all the documentation, blog posts and YouTube videos I have read and watched so far, I haven't found a recommended (ideally ClickHouse-only) way to first deduplicate a Kafka stream by id and then perform an aggregation on the data.
If the duplicates occur within a batch, you could try making the target table for the inserts a ReplacingMergeTree. A materialized view would then trigger on this table and use FINAL to insert into a SummingMergeTree. I believe this is your option 1: your deduplication window is as large as the insert block size.
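A rough sketch of that layout, reusing the hypothetical `rollup` table from the earlier example. Note that the deduplication inside the materialized view below is expressed as a `GROUP BY` over the inserted block rather than FINAL, so it only removes duplicates that arrive in the same insert; the ReplacingMergeTree additionally collapses duplicate ids across blocks during background merges:

```sql
-- Staging table: inserts (e.g. from Kafka) land here first.
CREATE TABLE staging_events
(
    id String,
    category String,
    value UInt64
) ENGINE = ReplacingMergeTree
ORDER BY id;

-- Fires per insert block on staging_events and pre-aggregates into the rollup.
CREATE MATERIALIZED VIEW rollup_from_staging_mv TO rollup AS
SELECT category, sum(value) AS value
FROM
(
    SELECT id, any(category) AS category, any(value) AS value
    FROM staging_events
    GROUP BY id
)
GROUP BY category;
```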