Cassandra Data Model - Column Families into CQL Table

606 Views Asked by At

Assume that I have a sample User table with fields (userid, username, dob, loc, email). I have different columns names like one user can have only userid, user name and second user can have userid, username, dob and third users may have other columns and so on (representing wide column store). Each user can have their own set of columns. How to represent this in a single CQL table as data is sparse across different columns. Do I need to go with multiple user tables to represent this info?

Please suggest how to convert sample table into CQL table.

1

There are 1 best solutions below

1
On

I see what you're asking. No, you should be able to build all columns into a single table. Let's say that I create a users table like this:

CREATE TABLE users(
  userid bigint,
  username text, 
  email text,
  dob text,
  loc text,
  mobile text,
  landlinenum text,
  PRIMARY KEY (userid));

And then I INSERT some rows like this:

INSERT INTO users (userid,username,email,loc,mobile) VALUES (0,'mreynolds','[email protected]','Minneapolis','111-555-1234');
INSERT INTO users (userid,username,email,loc,landlinenum,dob) VALUES (1,'jcobb','[email protected]','Minneapolis','111-555-3464','19620227');
INSERT INTO users (userid,username,email,loc,mobile) VALUES (2,'dbook','[email protected]','New York','111-555-2349');
INSERT INTO users (userid,username,email,loc,mobile,dob) VALUES (3,'stam','[email protected]','San Francisco','111-555-8899','19750416');
INSERT INTO users (userid,username,email,loc,dob) VALUES (4,'rtam','[email protected]','San Francisco','19810724');

If I run an unbound query to pull back all rows and columns, I see this:

aploetz@cqlsh:stackoverflow> SELECT * FROM users;

 userid | dob      | email                | landlinenum  | loc           | mobile       | username
--------+----------+----------------------+--------------+---------------+--------------+-----------
      2 |     null | [email protected] |         null |      New York | 111-555-2349 |     dbook
      3 | 19750416 |   [email protected] |         null | San Francisco | 111-555-8899 |      stam
      4 | 19810724 |   [email protected] |         null | San Francisco |         null |      rtam
      0 |     null |     [email protected] |         null |   Minneapolis | 111-555-1234 | mreynolds
      1 | 19620227 |   [email protected] | 111-555-3464 |   Minneapolis |         null |     jcobb

(5 rows)

So this should tell you a couple of things.

  1. Cassandra isn't really "schema-less" in the same sense that it was in the pre-CQL world. You do have to define each column in your table definition.

  2. With #1 being said, you do not have to provide a value for each column.

  3. Columns which have not been assigned a value display as "null." Keep-in-mind, that they are not true null values, they just display that way.

  4. In my INSERTs, I only specified the columns for which I had corresponding values. I did not set the others to empty or null. Explicitly setting a column value to null creates a tombstone.

Now to my point about understanding your query requirements, you usually want to create a table for each query required. I have the "users" table keyed by userid...but really, how useful is that? Applications typically query by things like email or loc.

To handle email, I would create a new table with the same column definitions called "users_by_email." The main difference (other than the name) would be the PRIMARY KEY definition: PRIMARY KEY (email,username)

Then I could query a user by email like this:

aploetz@cqlsh:stackoverflow> SELECT * FROM users_by_email WHERE email='[email protected]';

 email              | username | dob      | landlinenum | loc           | mobile | userid
--------------------+----------+----------+-------------+---------------+--------+--------
 [email protected] |     rtam | 19810724 |        null | San Francisco |   null |      4

(1 rows)

That's why knowing your query patterns makes a difference. In the first example, the "users" table cannot support a query by email.