Is RethinkDB a good fit for a generic Real-time aggregation platform?

I need your help to verify if RethinkDB fits my use case.

Use case

My team is building a generic Real-time aggregation platform which needs to:

  • Join data from many Kafka topics
  • Joins need to be done on raw data
  • Topics have the same key
  • Data in topics is sometimes a “snapshot” (updatable) and sometimes an “event” (non-updatable)
  • The destination of the joined data will be an analytical OLAP DB (ClickHouse, Druid, etc., depending on the case). These systems work with “deltas” (SCDs). Because of the “snapshots”, I need stateful processing.
  • Updates for snapshots can come up to 7 days later
  • Topics receive around 20k msg/s with peaks up to 200k msg/s
  • Data in topics is JSON, from 100 bytes to 5 kB per message
  • Data in topics can have duplicates
  • Duplicates are removed using the “version” JSON field, which is present in every topic. A message should be processed only if new_version > old_version, or if no old_version exists.
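The version rule in the last bullet can be sketched as a small filter. This is a minimal illustration, not the actual implementation: the `topic` and `key` field names, the choice to track versions per (topic, key) pair, and the assumption that versions are integers are all hypothetical; only the `version` field comes from the description above.

```python
from typing import Optional


def should_process(new_version: int, old_version: Optional[int]) -> bool:
    """Return True if a message is newer than the one already processed."""
    # No stored version means this key has not been seen before.
    if old_version is None:
        return True
    return new_version > old_version


def deduplicate(messages, state=None):
    """Yield only the messages that advance the stored version.

    `state` maps (topic, key) -> last processed version (assumed layout).
    """
    state = {} if state is None else state
    for msg in messages:
        slot = (msg["topic"], msg["key"])
        if should_process(msg["version"], state.get(slot)):
            state[slot] = msg["version"]
            yield msg
```

Passing an external `state` dict lets the caller keep the versions between batches; in the real system that state would live in a database rather than in memory.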

I already have a POC with Cassandra with five stages:

  1. Cassandra Inserter - consumes from all Kafka topics and does insert-only writes for all topics into the same Cassandra table. The table is sharded on the column that holds the key shared by all the Kafka topics, so all messages with the same key end up in the same shard.
  2. For every Cassandra insert an InsertEvent is produced to Kafka
  3. Delta calculator - consumes InsertEvents and queries Cassandra by the sharding key. It gets all the raw data, deduplicates it and creates deltas. The state (all the processed “versions”) is saved in another Cassandra cluster. The next time an InsertEvent arrives, we use the saved “version” to fetch only two events, the previous and the current one, so we can create a DeltaEvent
  4. DeltaEvent is produced to Kafka
  5. ClickHouse / Druid ingest the data
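To make step 3 concrete, here is a minimal sketch of turning a previous and a current record into a delta. The delta format is an assumption, since the question does not specify one: it uses a `sign` column in the style of ClickHouse's CollapsingMergeTree, where a -1 row cancels the previous state and a +1 row adds the current one. The record fields in the usage example are hypothetical.

```python
def make_delta(previous, current):
    """Build delta rows from two consecutive versions of a record.

    Assumed format: a list of rows carrying a `sign` column, with -1
    cancelling the previous state and +1 adding the current one.
    """
    rows = []
    if previous is not None:
        # Cancel what was sent downstream earlier for this key.
        rows.append({**previous, "sign": -1})
    rows.append({**current, "sign": 1})
    return rows


# Usage with hypothetical fields: the first version has no predecessor.
first = {"key": "k1", "version": 1, "amount": 10}
second = {"key": "k1", "version": 2, "amount": 15}
initial_rows = make_delta(None, first)
update_rows = make_delta(first, second)
```

Summing `sign * amount` over `update_rows` gives the net change of 5, which is the property an aggregating OLAP table relies on.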

So it's basically a 50/50 insert/read workload without updates to Cassandra.

With 14 Cassandra data nodes and 8 state nodes it works OK up to 20k InsertEvent/s. At 25k InsertEvent/s the system begins to lag. Nodes have 16 GB of RAM, and the disks are network storage backed by SSD (not ideal, I know, but I can't change it now). The network is 10 Gbit.

RethinkDB idea

I would like to do a new POC to try RethinkDB and use changefeeds to create deltas and to deduplicate. For this I would use a single table. Primary key / sharding key would be the Kafka key and all Kafka data from all topics with the same key would be joined/upserted in a single document.

The workload would probably be 10/90 insert/update. I would use squash: true to avoid excessive reads and reduce the number of DeltaEvents.
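The single-document idea can be modelled in memory to check the merge and version logic before involving a database. This sketch does not use the RethinkDB driver: the table is a plain dict, and the document layout (one sub-object per topic, each with its own `version`) and the `key`/`data` field names are assumptions. The `(old_doc, new_doc)` return value is meant to resemble the old/new pair a changefeed reports for a changed document. In RethinkDB itself, the `conflict` option of `insert` is one place where such a rule could live.

```python
def upsert(table, topic, message):
    """Merge one Kafka message into the joined document for its key.

    Returns (old_doc, new_doc) when the document changed, else None.
    """
    key = message["key"]
    old_doc = table.get(key)
    old_part = (old_doc or {}).get(topic)
    # Skip duplicates and stale versions for this topic.
    if old_part is not None and message["version"] <= old_part["version"]:
        return None
    # Copy so the returned old_doc still reflects the previous state.
    new_doc = dict(old_doc or {"id": key})
    new_doc[topic] = {"version": message["version"], "data": message["data"]}
    table[key] = new_doc
    return old_doc, new_doc
```

Messages from different topics with the same key land in the same document, so the table holds one joined document per key.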

  1. Do you think this is a good use case for RethinkDB?
  2. Will it scale up to 200k msg/s, which would be 20k inserts/s, 180k updates/s and around 150k reads/s via changefeeds?
  3. I will need to delete data older than 7 days. How will that affect the insert/update/query workload?
  4. Do you have a proposal for a system which would be a better fit for this use case?
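For question 3, a rough estimate of how much raw data passes through a 7-day window may help frame the purge workload. This is back-of-the-envelope arithmetic only: it assumes the 20k msg/s figure is a sustained average, takes 5 kB as 5,000 bytes, and ignores replication, indexes and storage overhead.

```python
SECONDS_PER_DAY = 86_400


def raw_volume_bytes(msgs_per_second, bytes_per_msg, days=7):
    """Raw bytes ingested over the retention window, before any overhead."""
    return msgs_per_second * bytes_per_msg * SECONDS_PER_DAY * days


low = raw_volume_bytes(20_000, 100)     # all messages at 100 bytes
high = raw_volume_bytes(20_000, 5_000)  # all messages at 5 kB
print(f"{low / 1e12:.2f} TB to {high / 1e12:.2f} TB over 7 days")
```

Under these assumptions the window holds between roughly 1.2 TB and 60 TB of raw messages, and in steady state the purge has to remove data at the same rate it arrives. With the single-document design the stored volume depends on the number of distinct keys instead, which the question does not state.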

Thanks a lot, Davor

PS: if you prefer reading a document, here it is: RethinkDB use case question.

There is 1 best solution below

IMHO, RethinkDB is a good fit for your use case.

From the RethinkDB docs:

...RethinkDB scales to perform 1.3 million individual reads per second. ...RethinkDB performs well above 100 thousand operations per second in a mixed 50:50 read/write workload - while at the full level of durability and data integrity guarantees. ...performed all benchmarks across a range of cluster sizes, scaling up from one to 16 nodes.

Folks at RethinkDB have tested a similar scenario using workloads from the YCSB benchmark suite and reported their results.

We found that in a mixed read/write workload, RethinkDB with two servers was able to perform nearly 16K queries per second (QPS) and scaled to almost 120K QPS while in a 16-node cluster. Under a read only workload and synchronous read settings, RethinkDB was able to scale from about 150K QPS on a single node up to over 550K QPS on 16 nodes. Under the same workload, in an asynchronous “outdated read” setting, RethinkDB went from 150K QPS on one server to 1.3M in a 16-node cluster.

Selecting workloads and hardware

...Out of the YCSB workload options, we chose to run workload A which comprises 50% reads and 50% update operations, and workload C which performs strictly read operations. All documents stored by the YCSB tests contain 10 fields with randomized 100 byte strings as values, with each document totaling about 1 KB in size.

Given the ease of scaling RethinkDB clusters across multiple instances, we deemed it necessary to observe performance when moving from a single RethinkDB instance to a larger cluster. We tested all of our workloads on a single instance of RethinkDB up to a 16-node cluster in varying increments of cluster size.
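As a rough sanity check, the quoted mixed-workload figures can be compared with the rates from the question. The sketch below treats "nearly 16K" and "almost 120K" as 16,000 and 120,000, takes the target as 200k writes/s plus 150k reads/s, and assumes throughput scales linearly with node count. That last assumption is mine and is unlikely to hold exactly; the benchmark hardware, the 1 KB documents and the 50/50 mix also differ from the question's setup, and changefeed reads are not the same as YCSB reads.

```python
import math


def per_node_qps(cluster_qps, nodes):
    """Average throughput per node for a quoted cluster result."""
    return cluster_qps / nodes


def nodes_needed(target_ops, qps_per_node):
    """Nodes required under a (hypothetical) linear-scaling assumption."""
    return math.ceil(target_ops / qps_per_node)


small = per_node_qps(16_000, 2)    # two-server result from the quote
large = per_node_qps(120_000, 16)  # 16-node result from the quote
target = 200_000 + 150_000         # peak writes/s + changefeed reads/s
print(small, large, nodes_needed(target, large))
```

The peak target is well above the largest published mixed-workload result, so a POC at the intended scale is the only reliable way to answer question 2.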

Additionally, I suggest reading through RethinkDB's documented limitations. I've copied some here.

  • There is a hard limit of 64 shards.
  • While there is no hard limit on the size of a single document, there is a recommended limit of 16MB for memory performance reasons.
  • The maximum size of a JSON query is 64M.
  • Primary keys are limited to 127 characters.
  • Secondary indexes do not store objects or null values.
  • Primary key strings may not include the null codepoint (U+0000).
  • By default, arrays on the RethinkDB server have a size limit of 100,000 elements. This can be changed on a per-query basis with the arrayLimit (or array_limit) option to run.
  • RethinkDB does not support Unicode collations, and does not normalize for identical characters with multiple codepoints (i.e, \u0065\u0301 and \u00e9 both represent the character “é” but RethinkDB treats them, and sorts them as, distinct characters).
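Since the Kafka key would become the primary key, the two primary-key limits above are worth checking before insert. A minimal sketch, assuming keys are strings and that the 127 limit is counted in characters as the list states (the server may count differently, so treat this as a pre-check only). The function name is hypothetical.

```python
MAX_PRIMARY_KEY_CHARS = 127  # limit quoted in the list above


def is_valid_primary_key(key: str) -> bool:
    """Check a Kafka key against the quoted primary-key limits."""
    # Reject keys that are too long or contain the null codepoint U+0000.
    return len(key) <= MAX_PRIMARY_KEY_CHARS and "\u0000" not in key
```

Keys that fail the check could be hashed or rejected upstream, depending on what the pipeline can tolerate.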

Since yours is a real-time system, the RethinkDB documentation on memory requirements and crash recovery is also worth a read.

Note also that the benchmark does not cover delete performance.