Why might one want to use a clustered index in a Cassandra table?
For example, in a table like this:
CREATE TABLE blah (
key text,
a text,
b timestamp,
c double,
PRIMARY KEY ((key), a, b, c)
)
The clustered part is the a, b, c part of the PRIMARY KEY.
What are the benefits? What considerations are there?
Clustering keys do three main things.
1) They affect the available query pattern of your table.
2) They determine the on-disk sort order of your table.
3) They determine the uniqueness of your primary key.
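To illustrate the third point, here is a small sketch against the table from the question (the values are made up for illustration). Because every column of that table is part of the primary key, two writes only land on the same row when all four values match:

```cql
-- Illustrative values only; the table is the one defined in the question.
INSERT INTO blah (key, a, b, c) VALUES ('k1', 'x', '2015-01-01 00:00:00+0000', 1.0);

-- Same partition key and same clustering values: this overwrites the row
-- above (an upsert) rather than adding a second one.
INSERT INTO blah (key, a, b, c) VALUES ('k1', 'x', '2015-01-01 00:00:00+0000', 1.0);

-- A different value for clustering column c: this is a new, distinct row.
INSERT INTO blah (key, a, b, c) VALUES ('k1', 'x', '2015-01-01 00:00:00+0000', 2.0);

-- The partition 'k1' now holds two rows, sorted by a, then b, then c.
SELECT * FROM blah WHERE key = 'k1';
```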
Let's say that I run an ordering system and want to store product data on my website. Additionally, I have several distribution centers, as well as customer-contracted pricing. So when a certain customer is on my site, they can only access products that are:
Available in a distribution center (DC) in their geographic area.
Defined in their contract (so they may not necessarily have access to all products in a DC).
To keep track of those products, I'll create a table that looks like this:
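A minimal sketch of such a table follows. The table name `customer_dc_products`, the non-key columns, and the column types are assumptions for illustration. Only `customerid`, `dcid`, and `productid` come from the discussion here:

```cql
-- Hypothetical table name, column types, and non-key columns.
CREATE TABLE customer_dc_products (
    customerid   text,     -- partition key: one partition per customer
    dcid         text,     -- first clustering key: distribution center
    productid    text,     -- second clustering key: product
    productname  text,     -- assumed payload column
    productprice decimal,  -- assumed payload column (contracted price)
    PRIMARY KEY ((customerid), dcid, productid)
);
```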
For this example, if I want to see product 123, in DC 1138, for customer B-26354, I can use this query:
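A query of roughly this shape would do it, assuming a hypothetical table `customer_dc_products` with primary key `((customerid), dcid, productid)` and text-typed IDs:

```cql
-- Table name and ID types are assumptions.
-- Partition key plus both clustering keys: pins down a single row.
SELECT * FROM customer_dc_products
WHERE customerid = 'B-26354'
  AND dcid = '1138'
  AND productid = '123';
```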
Maybe I want to see products available in DC 1138 for customer B-26354:
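Under the same assumptions (a hypothetical table `customer_dc_products` keyed by `((customerid), dcid, productid)`), dropping the last clustering key widens the result to every product in that DC:

```cql
-- Table name and ID types are assumptions.
-- Partition key plus the first clustering key: all products in one DC.
SELECT * FROM customer_dc_products
WHERE customerid = 'B-26354'
  AND dcid = '1138';
```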
And maybe I just want to see all products in all DCs for customer B-26354:
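Again assuming a hypothetical table `customer_dc_products` keyed by `((customerid), dcid, productid)`, the partition key alone is enough:

```cql
-- Table name and ID types are assumptions.
-- Partition key only: the customer's whole partition, sorted by dcid,
-- then productid.
SELECT * FROM customer_dc_products
WHERE customerid = 'B-26354';
```

Clustering keys can only be restricted left to right, so filtering on `productid` without also giving `dcid` would be rejected unless `ALLOW FILTERING` is added.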
As you can see, the clustering keys of dcid and productid allow me to run high-performing queries on my partition key (customerid) that are as focused as I may need.
The drawback? If I want to query all products for a single DC, regardless of customer, I cannot. I'll need to build a different query table to support that. Even if I want to query just one product, I can't unless I also provide a customerid and dcid.
What if I want my data ordered a certain way? For this example, I'll take a cue from Patrick McFadin's article on Getting Started With Time Series Data Modeling, and build a table to keep track of the latest temperatures for weather stations.
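A sketch of what such a table and a per-station query might look like. The table name `latest_temperatures`, the `stationid` and `temperature` columns, the types, and the station value are all assumptions. Only `eventtime` and the descending order come from the discussion here:

```cql
-- Hypothetical table name, column names (other than eventtime), and types.
CREATE TABLE latest_temperatures (
    stationid   text,       -- partition key: one partition per weather station
    eventtime   timestamp,  -- clustering key: when the reading was taken
    temperature double,     -- assumed payload column
    PRIMARY KEY ((stationid), eventtime)
) WITH CLUSTERING ORDER BY (eventtime DESC);

-- Readings for one station come back newest first, with no ORDER BY needed,
-- because that is how the rows are stored within the partition.
SELECT eventtime, temperature
FROM latest_temperatures
WHERE stationid = '1234ABCD';
```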
By clustering on eventtime, and specifying a DESCending CLUSTERING ORDER BY in the table definition, I can query the recorded temperatures for a particular station by filtering on that station alone. When those values are returned, they will be in DESCending order by eventtime.
Of course, the one question that everyone (with an RDBMS background...so yes, everyone) wants to know is how to query all results ordered by eventtime. And again, you cannot. You can query for all rows by omitting the WHERE clause, but that won't return your data sorted in any meaningful order. It's important to remember that Cassandra can only enforce clustering order within a partition. If you don't specify a partition key, your data will not be ordered (at least, not in the way that you want it to be).
Let me know if you have any additional questions, and I'll be happy to explain.