Data modelling ( secondary index vs clustering key )

533 Views Asked by At

I am trying to understand if it's going to be a performance issue if I choose OPTION 1: very high unique value column as partition key ( order_id), and create indexes on store_id and status. ( i can query on order_id | store_id | status | both store&status , and also ***update(important) status based on order_id)

Option 2: store_id as partition_key and very high unique value column as clustering key ( order_id) and create secondary index on status ( so that i can filter on status) ( I can query on store_id | store&order_id | store&status | and also **update status based on store&order_id )

I would like to know what will be the performance issues in above scenarios. which one will be a better option. Thank you very much for your help and time.

1

There are 1 best solutions below

0
On

Option 1 is interesting, but you need to be careful with your indices. See your other question for more information there (especially the bit concerning querying multiple secondary indices at the same time). That may be alleviated with tables purpose built for your index lookups (further discussed below).

The advantage of the highly unique partition key is that data will be more distributed around your cluster. The downside here is that when you perform a request with WHERE store_id = 'foo' all nodes in the cluster need to be queried as there is no limit on the partition key.

Option 2 you must be careful with. If your partition key is just store_id, then every order will be placed within this partition. For each order there will be n columns added to the single row for the store representing each attribute on the order. In regards to data location all orders for a given store will be placed on the same Cassandra node.

In both cases why not pursue a lookup table for orders by status? This will remove your need for a secondary index on that field. Especially given it's relatively small cardinality.

CREATE TABLE orders_by_store_id_status (
  store_id VARCHAR,
  status   VARCHAR,
  order_id VARCHAR,
  ... <additional order fields needed to satisfy your query> ...
  PRIMARY KEY ((store_id, status), order_id)
);

This would allow you to query for all orders with a given store_id and status.

SELECT * FROM orders_by_store_id_status WHERE store_id = 'foo' AND status = 'open';

The read is fast as the partition key limits the number of nodes we perform the query against.