Cassandra data model - column-family


I checked some questions here, like Understanding Cassandra Data Model and Column-family concept and data model, and some articles about Cassandra, but I'm still not clear on what its data model is.

Cassandra follows a column-family data model, which is similar to a key-value data model. In a column family you have data in rows and columns, so a two-dimensional structure, and on top of that you have a grouping into column families? I suppose the data is organized into column families so that the database can be partitioned across several nodes?

How are rows and columns grouped into column families? Why do we have column families?

For example let's say we have database of messages, as rows:

id: 123, message: {author: 'A', recipient: 'X', text: 'asd'}
id: 124, message: {author: 'B', recipient: 'X', text: 'asdf'}
id: 125, message: {author: 'C', recipient: 'Y', text: 'a'}

How and why would we organize this around column-family data model?

NOTE: Please correct or expand on example if necessary.

Kinda the wrong question. Instead of modeling around your data, model around how you're going to query the data. What do you want to read? You create your data model around that, since the storage engine is strict about how you can access data. Most likely the id is not the key; if you want to read by author or recipient, you use that as the partition key, with the unique id (use a uuid, not an auto-increment) as a clustering column, i.e.:

CREATE TABLE message_by_recipient (
  author text,
  recipient text,
  id timeuuid,
  data text,
  PRIMARY KEY (recipient, id)
) WITH CLUSTERING ORDER BY (id DESC);

Then, to see the five newest messages to "bob":

select * from message_by_recipient where recipient = 'bob' limit 5;

Using timeuuid for id will guarantee uniqueness without an auto-increment bottleneck and also provides time-based sorting. You may duplicate writes on a new message, writing to multiple tables so that each read is a single lookup. If the message data can get large, you may want to replace it with a uuid (type 4) and store the payload in a blob store or distributed file system (e.g. S3), keyed by that uuid. That would reduce the impact on C* and also reduce the cost of the denormalization.
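To make the duplicated-write idea concrete, here is a sketch of what it might look like. The message_by_author table and the column values are assumptions for this example, not part of the original answer; the point is that one new message is written to both query tables:

-- hypothetical companion table, partitioned by author
CREATE TABLE message_by_author (
  author text,
  recipient text,
  id timeuuid,
  data text,
  PRIMARY KEY (author, id)
) WITH CLUSTERING ORDER BY (id DESC);

-- denormalized write: one message, two tables, one logged batch
BEGIN BATCH
  INSERT INTO message_by_recipient (author, recipient, id, data)
  VALUES ('A', 'X', now(), 'asd');
  INSERT INTO message_by_author (author, recipient, id, data)
  VALUES ('A', 'X', now(), 'asd');
APPLY BATCH;

In practice you would generate the timeuuid client-side and pass the same value to both inserts, since each call to now() can produce a different timeuuid; it's shown inline here only to keep the sketch self-contained.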