When doing a SELECT in Cassandra (Spark) by an SAI field. In what order are the rows returned?

126 Views Asked by At

If I have a table in Astra created like so:

CREATE TABLE rayven.mytable (
a text,
b text,
c timestamp,
PRIMARY KEY (a, c)
) WITH CLUSTERING ORDER BY (c DESC)

I then added the SAI index:

CREATE CUSTOM INDEX b_index ON mytable (b) USING 'StorageAttachedIndex';

When I query using ORDER BY:

select * from mytable where b='x' order by c desc;

I see

InvalidRequest: Error from server: code=2200 [Invalid query] message="ORDER BY with 2ndary indexes is not supported."

Since the original table is ordered by "c" descending. Can I assume that the result of the above SELECT will be in this order or there is no way of knowing or controling the order when selecting using an SAI index?

1

There are 1 best solutions below

1
On BEST ANSWER

To help illustrate this, I have created your table and inserted some data. I've then queried the table for a value of b, and included the token function on the partition key for this example.

Note: Not running in Astra, but on my local 4.0 rc1 instance. The principles remain the same, however.

Basically, all result sets are sorted by the hashed token values of the partition key, and then the CLUSTERING ORDER takes precedence within each partition:

> SELECT a, token(a), c FROM mytable WHERE b='b';

 a  | system.token(a)      | c
----+----------------------+---------------------------------
 a4 | -9170418876698302957 | 2021-05-03 14:38:42.708000+0000
 a5 |  -925545907721365710 | 2021-05-03 14:39:06.849000+0000
 a3 |   -96725737913093993 | 2021-05-03 14:40:30.942000+0000
 a3 |   -96725737913093993 | 2021-05-03 14:39:18.340000+0000
 a2 |  5060052373555595560 | 2021-05-03 14:40:30.938000+0000
 a2 |  5060052373555595560 | 2021-05-03 14:39:14.914000+0000
 a1 |  5693669818594506317 | 2021-05-03 14:38:54.426000+0000
 a1 |  5693669818594506317 | 2021-05-03 14:38:52.758000+0000

(8 rows)

As you can see here, the result set is not completely sorted by c. But initially sorted by the hashed token values of a, and then sorted by c within each partition (a).

So "no," you cannot count on the data automatically being completely sorted by c.