I need your help to verify if RethinkDB fits my use case.
Use case
My team is building a generic real-time aggregation platform which needs to:
- join data from a lot of Kafka topics
- Joins need to be done on raw data
- Topics have the same key
- Data in topics is sometimes a “snapshot” (updatable) and sometimes an “event” (non-updatable)
- The destination of the joined data will be an analytical OLAP DB (ClickHouse, Druid, etc., depending on the case). These systems work with “deltas” (SCDs). Because of the “snapshots”, I need stateful processing.
- Updates for snapshots can come up to 7 days later
- Topics receive around 20k msg/s with peaks up to 200k msg/s
- Data in topics is JSON, from 100 bytes to 5 kB
- Data in topics can have duplicates
- Duplicates are deduplicated with a “version” JSON field which is part of every topic. Data should be processed only if new_version > old_version, or if old_version didn't exist (sketched below).
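To make the rule explicit, the per-key check is nothing more than this (a minimal Python sketch, field names as above):

```python
def should_process(new_version, old_version):
    # Process only if we have never seen this key before,
    # or the incoming version is strictly newer.
    return old_version is None or new_version > old_version
```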
I already have a POC with Cassandra that has five stages:
- Cassandra Inserter - consumes from all Kafka topics. It does inserts only, writing all topics into the same Cassandra table. Sharding is done on a column which holds the same key as the Kafka topics, so all messages with the same key end up in the same shard.
- For every Cassandra insert an InsertEvent is produced to Kafka
- Delta calculator - consumes InsertEvents and queries Cassandra by the sharding key, gets all raw data, then deduplicates and creates deltas. The state is saved in another Cassandra cluster by storing all the processed “versions”. The next time an InsertEvent comes, we use the saved state “versions” to pick only two events, previous and current, so we can create a DeltaEvent (see the sketch after this list).
- DeltaEvent is produced to Kafka
- ClickHouse / Druid ingest the data
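For reference, the Delta calculator logic is roughly the following (Python sketch; `load_state`, `fetch_raw_rows`, `produce_delta` and `save_state` are placeholders for the actual Cassandra queries and Kafka producer):

```python
def handle_insert_event(event):
    """Delta calculator sketch: triggered by one InsertEvent for a key."""
    key = event["key"]

    versions = load_state(key)     # previously processed versions for this key
    rows = fetch_raw_rows(key)     # all raw data stored under this sharding key

    rows.sort(key=lambda row: row["version"])
    current = rows[-1]
    last_version = versions[-1] if versions else None

    # Dedup rule: skip duplicates and older (out-of-order) messages
    if last_version is not None and current["version"] <= last_version:
        return

    # Previous = the row matching the last processed version, if still present
    previous = next(
        (row for row in reversed(rows) if row["version"] == last_version), None
    )

    produce_delta(key, previous, current)            # DeltaEvent to Kafka
    save_state(key, versions + [current["version"]])
```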
So it's basically a 50/50 insert/read workload without updates to Cassandra.
With 14 Cassandra data nodes and 8 state nodes it works OK up to 20k InsertEvents/s. With 25k InsertEvents/s the system begins to lag. Nodes have 16 GB RAM and disks are network storage backed by SSD (not ideal, I know, but I can't change it now). Network is 10 Gbit.
RethinkDB idea
I would like to do a new POC to try RethinkDB and use changefeeds to create deltas and to deduplicate. For this I would use a single table. The primary key / sharding key would be the Kafka key, and all Kafka data from all topics with the same key would be joined/upserted into a single document.
The workload would probably be 10/90 insert/update. I would use squash: true to avoid excessive reads and to reduce the number of DeltaEvents (see the sketch below).
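Concretely, the upsert and the changefeed consumer would look roughly like this (a sketch with the Python driver; the database, table and field names are just examples):

```python
from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect(host="localhost", port=28015, db="aggregation")

def upsert_message(topic, key, msg):
    """Upsert one Kafka message into the single joined document for this key,
    keeping the per-topic payload only if its version is newer."""
    return r.table("joined").insert(
        {"id": key, topic: msg},
        conflict=lambda _id, old, new: r.branch(
            old[topic]["version"].default(-1) < new[topic]["version"],
            old.merge(new),   # newer version: merge the topic payload in
            old,              # duplicate or older version: keep the document as is
        ),
    ).run(conn)

def delta_stream():
    """Yield (old_val, new_val) pairs used to build DeltaEvents.
    squash=True coalesces bursts of updates to the same document."""
    feed = r.table("joined").changes(squash=True).run(conn)
    for change in feed:
        yield change["old_val"], change["new_val"]
```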
- Do you think this is a good use case for RethinkDB?
- Will it scale up to 200k msg/s, which would be 20k inserts/s, 180k updates/s and around 150k reads/s via changefeeds?
- I will need to delete data older than 7 days; how will it affect the insert/update/query workload?
- Do you have a proposal for a system which would be a better fit for this use case?
Thanks a lot, Davor
PS: if you prefer reading a document, here it is: RethinkDB use case question.
IMHO, RethinkDB is a good fit for your use case.
From RethinkDB docs
Folks at RethinkDB have tested a similar scenario using workloads from the YCSB benchmark suite and reported their results.
Selecting workloads and hardware
Additionally, I suggest reading through the limitations in RethinkDB. I've copied some here.
Since yours is a real-time system, RethinkDB's memory requirements and crash recovery are also worth a read.
Furthermore, a delete performance benchmark is missing from those results.
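Regarding deleting data older than 7 days: as far as I know there is no built-in TTL, so you would run a periodic range delete yourself, e.g. over a secondary index on an update timestamp, and benchmark how it interacts with your insert/update load. A sketch (table and index names are assumptions):

```python
from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect(host="localhost", port=28015, db="aggregation")

SEVEN_DAYS = 7 * 24 * 60 * 60  # seconds

# Assumes a secondary index "updated_at" holding a ReQL time per document.
result = (
    r.table("joined")
    .between(r.minval, r.now() - SEVEN_DAYS, index="updated_at")
    .delete()
    .run(conn, durability="soft")
)
print(result["deleted"], "documents removed")
```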