Cassandra order issues

33 Views Asked by At

I encounter a problem sorting data on Cassandra using CLUSTERING ORDER BY. The returned data is not sorted as expected with the following conf:

Any idea why the result is not sorted as expected ?

CREATE KEYSPACE IF NOT EXISTS foo WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };

CREATE TABLE IF NOT EXISTS foo.keyword (
keyword_id int,
keyword_text text,
PRIMARY KEY (keyword_text, keyword_id)
)WITH CLUSTERING ORDER BY(keyword_id DESC);

I add 5 records

INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (1,'bar1');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (2,'bar2');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (3,'bar3');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (4,'bar4');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (5,'bar5');

And SELECT :

SELECT keyword_id FROM foo.keyword;

Obtained result :

2
4
3
1
5

Expected :

5
4
3
2
1
1

There are 1 best solutions below

2
Aaron On BEST ANSWER

Any idea why the result is not sorted as expected ?

Yes. Cassandra supports ordering within a partition. When data is SELECTed without a WHERE clause, the result set is returned in order of the hashed token values of the partition key. In this case, the partition key is keyword_text.

Let's try altering that SELECT to execute the token function on keyword_text:

SELECT keyword_id, token(keyword_text) FROM foo.keyword;

 keyword_id | system.token(keyword_text)
------------+----------------------------
          2 |       -5699002630445609302
          4 |       -5246443235113384067
          3 |        1880591418861795474
          1 |        7350118602498431585
          5 |        7355007313553950035

(5 rows)

As we can see, the order is indeed indicated by the hashed token value of keyword_text. So, that's why the results are returned in this particular order.

Ok, so how do we fix it? Well, we'll need to choose a different partition key. I'm not sure about the underlying use case here, so we'll just use today's date to "bucket" the data into a partition named day_bucket.

CREATE TABLE keyword (
  day_bucket int,
  keyword_id int,
  keyword_text TEXT,
  PRIMARY KEY (day_bucket, keyword_id)
) WITH CLUSTerING ORDER BY (keyword_id DESC);

Next, we'll INSERT some data:

INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 1,'bar1');
INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 2,'bar2');
INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 3,'bar3');
INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 4,'bar4');
INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 5,'bar5');

And then rerun our query using today's date in the WHERE clasue:

SELECT keyword_id FROM keyword WHERE day_bucket = 20240502;

 keyword_id
------------
          5
          4
          3
          2
          1

(5 rows)

Note that we shouldn't put all data into a single partition. That's not good for data distribution and will lead to unbound partition growth. The idea, is to find a partition key that works well with the use case.