How can Cassandra retrieve rows only by using the partition key?


BigTable-like databases store rows sorted by their keys.

Cassandra uses the combination of partition and clustering keys to keep the data distributed and sorted, yet you can select rows using only the partition key.

How is Cassandra architected to work this way?

For example, one workaround in RocksDB is to keep one column family keyed by the partition key and another keyed by the combined partition and clustering keys, then iterate over the sorted data and retrieve rows through the default column family. That ends up with very high space overhead.

Update: I guess Cassandra stores each column under a separate key. Each key starts with the partition key and is followed by the different "column names", perhaps combined with the values of the clustering columns. Refer to the picture of the underlying storage engine.

Example query: SELECT * FROM authors WHERE name = 'Tom Clancy' AND year = '1993', on a table where "name" is the partition key and "year" and "title" are the clustering columns.

Visualization of the Cassandra storage layer for the above query.


There are 2 best solutions below


All data in Cassandra is stored by partition, so when a query has a condition only on the partition key(s), it retrieves all rows that share that partition key; they are written one after another. You can find more information in the DSE Architecture guide.
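The idea of rows being written one after another inside a partition can be sketched with a toy in-memory model. This is only an illustration, not Cassandra's actual storage engine: the `Table` class, its method names, and the sample rows are all hypothetical.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy model: each partition key maps to rows kept sorted by clustering key."""

    def __init__(self):
        # partition key -> list of (clustering_key, row), sorted by clustering_key
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, row):
        rows = self._partitions[partition_key]
        keys = [c for c, _ in rows]
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, row)  # same primary key: overwrite (upsert)
        else:
            rows.insert(i, (clustering_key, row))  # keep the partition sorted

    def select_partition(self, partition_key):
        # One lookup by partition key, then a sequential read of its rows.
        return [row for _, row in self._partitions.get(partition_key, [])]


authors = Table()
# Inserted out of order on purpose; (year, title) is the clustering key.
authors.insert("Tom Clancy", (1993, "Without Remorse"), {"title": "Without Remorse"})
authors.insert("Tom Clancy", (1984, "The Hunt for Red October"), {"title": "The Hunt for Red October"})
authors.insert("Tom Clancy", (1987, "Patriot Games"), {"title": "Patriot Games"})

for row in authors.select_partition("Tom Clancy"):
    print(row["title"])
```

Selecting by partition key needs no knowledge of the clustering values: the partition key locates the group of rows, and the rows come back in clustering order.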


Cassandra has a partition key and a clustering key, as you mentioned.

A short, clear explanation of the subject, with good examples, is the DataStax article "The most important thing to know in Cassandra data modeling: The primary key".

The important takeaways from this document are:

The first element in our PRIMARY KEY is what we call a partition key. The partition key has a special use in Apache Cassandra beyond showing the uniqueness of the record in the database. The other purpose, and one that very critical in distributed systems, is determining data locality.

This explains how selecting rows by the partition key alone is part of Cassandra's design.
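As a rough sketch of what "determining data locality" means: the partition key is hashed, and the hash decides which node owns the partition. The code below is a deliberate simplification with hypothetical node names; it uses MD5 and a modulo purely as stand-ins, whereas Cassandra's default partitioner is Murmur3-based and assigns token ranges on a ring.

```python
import hashlib

# Hypothetical cluster; real clusters assign token ranges to nodes on a ring.
NODES = ["node-a", "node-b", "node-c"]


def node_for(partition_key: str) -> str:
    """Map a partition key to a node by hashing it (simplified stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


# The same partition key always maps to the same node, so a query that
# supplies the full partition key can be routed without scanning the cluster.
print(node_for("Tom Clancy"))
```

Because the clustering columns play no part in the hash, every row of a partition lands on the same node regardless of its clustering values.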

If the primary key has more than one column in its definition:

All columns listed after the partition key are called clustering columns. This is where we take a huge break from relational databases. Where the partition key is important for data locality, the clustering column specifies the order that the data is arranged inside the partition.

When clustering columns are designed well, read queries should take less time than they would without clustering columns.

Besides the link above, you can find a really good explanation and examples in this Stack Overflow question: "Difference between partition key, composite key and clustering key in Cassandra?".

Update:

The database stores and locates the data using a nested sort order. The data is stored in a hierarchy that the query must traverse. Rows with different clustering column values share the same partition key. Take a look here: Clustering columns
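A minimal sketch of the nested sort order, assuming the rows of one partition are already sorted by the clustering columns (year, title): restricting the first clustering column becomes a contiguous slice found by binary search. The function name and sample rows are hypothetical, and the year is modeled as an integer here rather than the string used in the question's query.

```python
import bisect

# Rows of a single partition (name = 'Tom Clancy'), sorted by (year, title).
partition_rows = [
    (1984, "The Hunt for Red October"),
    (1987, "Patriot Games"),
    (1993, "Without Remorse"),
    (1994, "Debt of Honor"),
]


def slice_by_year(rows, year):
    """Return the contiguous run of rows whose first clustering column equals year."""
    lo = bisect.bisect_left(rows, (year,))      # first row with year >= target
    hi = bisect.bisect_left(rows, (year + 1,))  # first row with year > target
    return rows[lo:hi]


print(slice_by_year(partition_rows, 1993))
```

Omitting the year restriction simply returns the whole list, which is the partition-key-only case; adding it narrows the read to a sub-range of the same sorted data, so no second copy of the data is needed.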