Cassandra follows which partitioning technique?

999 Views Asked by At

I am new to Cassandra and while reading about partitioning a database - vertical and horizontal, I got confused and would like to know whether Cassandra follows Horizontal partitioning (sharding) OR vertical partitioning technique?

Moreover, according to my understanding, as Cassandra is column oriented DB, it should follow Vertical partitioning technique. If this is not the case then can anyone please explain it in detail?

2

There are 2 best solutions below

1
On BEST ANSWER

as Cassandra is column oriented DB

This point has been discussed ad-nauseam on Stack Overflow, specifically in this answer. Cassandra is NOT a column oriented database. It is a partitioned row store. Data is organized and presented in "rows," similar to a relational database.

whether Cassandra follows Horizontal partitioning (sharding)

Technically, Cassandra is what you would call a "sharded" database, but it's almost never referred to in this way. Essentially, each node is responsible for a specific range of partitions. These partitions (tokens) are a numeric value, and with the Murmur3Partitioner range from -2^63 to +2^63-1.

In fact, in a scenario where a node is simplified to hold a single token range, you can compute the ranges based on the number of nodes in the cluster (data center) like this:

python -c 'print [str(((2**64 / 6) * i) - 2**63) for i in range(6)]'

['-9223372036854775808', '-6148914691236517206', '-3074457345618258604',
 '-2', '3074457345618258600', '6148914691236517202']

Of course with vNodes, a node is almost always responsible for multiple token ranges.

At operation time, the partition key is hashed into a token. This token tells Cassandra which node the data resides on. Consider this table:

SELECT token(studentid),studentid,fname,lname FROM student ;

 system.token(studentid) | studentid | fname | lname
-------------------------+-----------+-------+----------
    -5626264886876159064 | janderson | Jordy | Anderson
    -1472930629430174260 |   aploetz | Avery |   Ploetz
     8993000853088610283 |      mgin | Micah |      Gin

(3 rows)

As this table has a simple primary key definition of studentid, that is used as the partition key. The results of the token(studentid) function above indicate which partitions contain the data.

If there was another table which also used studentid as its partition key, that table's data would be stored on the same nodes as the student table.

In any case, this is a simplified version of what happens. Feel free to read up on vNodes (link above) as well as Cassandra: High Availability by Robbie Strickland. He has written (IMO) the best description of Cassandra's hashing and partition distribution process.

0
On

Cassandra implements partitioning on a hashing algorithm. Because of that Cassandra allows for efficient horizontal scaling (if the partition key is chosen correctly). In summary, when you create a table, you define the partitioning column(s). When you insert a record, Cassandra will take the values, hash it, and determine the node it belongs on. If you have RF configured > 1, then alternate replicas will also be chosen. How it works is no different than Oracle's hash partitions, except Oracle only does it at the storage layer, not the host layer (unless you're using Oracle sharding).