Cassandra data model guidance


I have a question about Cassandra data modeling. Sorry for the rather long post.

I am using a hypothetical situation here. Say I have a master server that collects data from one or more machines on the network. The data for each machine includes things like machine details, status, whether it is connected, whether it is up or down, and whether it is a desktop or a laptop. I have queries like these:

  1. Given the master server id, get the list of machines connected to it.
  2. Given the machine id, get the machine details.
  3. Given the status of a machine (nothing else), get the list of machines (one or many).
  4. Given the flag is_connected, get the list of machines that are connected to the master server.
  5. Given the flag is_up, get the list of machines.

As I understand it, in Cassandra we should create (approximately) one column family per query. My worry is queries #3, #4 and #5: their WHERE clauses filter on status, is_connected and is_up respectively, so to satisfy them I must create tables that have these flags as either partition keys or clustering keys.

CREATE TABLE server (
    server_id text,
    server_name text,
    status text,
    -- ... other information
    PRIMARY KEY (server_id));

CREATE TABLE machine (
    machine_id text,
    machine_name text,
    status boolean,
    is_connected boolean,
    is_up boolean,
    -- ... other information
    PRIMARY KEY (machine_id));

CREATE TABLE machine_by_status (
    machine_id text,
    machine_name text,
    status boolean,
    is_connected boolean,
    is_up boolean,
    -- ... other information
    PRIMARY KEY (status, machine_id));

CREATE TABLE machine_by_connected (
    machine_id text,
    machine_name text,
    status boolean,
    is_connected boolean,
    is_up boolean,
    -- ... other information
    PRIMARY KEY (is_connected, machine_id));

CREATE TABLE machine_by_up_down (
    machine_id text,
    machine_name text,
    status boolean,
    is_connected boolean,
    is_up boolean,
    -- ... other information
    PRIMARY KEY (is_up, machine_id));

But the values of these flags may change over time, and the flags can take multiple values. If a flag is part of the partition key or the clustering key, I cannot update it in place. So once a record has been written with one flag value, how do I update that record when the value changes? Or, if I add a new record for the new value, how do I remove the old record? I want to avoid read before write.

The frequency of data collection may vary, so I cannot rely on a fixed TTL to have Cassandra expire the old value. I understand that the model above has problems, such as creating hotspots or an unbalanced cluster, and that is why I need guidance. How can I handle this situation? My client application queries by these flags on only a few pages, and it has no other data to query by.

So how can I design column families to satisfy queries #3, #4 and #5? Your guidance will help me come up with a good data model for this case.

Thank you in advance.

Best answer

The flags status, is_connected and is_up are all low-cardinality columns, so they make poor partition keys.

Say all of the machines are up: all of your data will then sit in a single partition (on the same node), which creates a hotspot and does not scale.
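
The skew can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the cluster size of 6 is hypothetical, md5 stands in for Cassandra's real partitioner, and replication and virtual nodes are ignored. It compares how 10,000 rows spread across nodes when partitioned by is_up versus by machine_id.

```python
import hashlib
from collections import Counter

NODES = 6  # hypothetical cluster size, chosen only for illustration


def node_for(partition_key: str) -> int:
    """Map a partition key to a node index.

    md5 is a stand-in for Cassandra's partitioner (an assumption of this
    sketch); the point is only that equal keys always land on the same node.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def rows_per_node(partition_keys):
    """Count how many rows each node would hold for the given keys."""
    counts = Counter(node_for(key) for key in partition_keys)
    return [counts.get(node, 0) for node in range(NODES)]


# 10,000 made-up machines, all of them up (the case described above).
machines = [{"machine_id": f"machine-{i}", "is_up": True} for i in range(10_000)]

by_flag = rows_per_node(str(m["is_up"]) for m in machines)
by_id = rows_per_node(m["machine_id"] for m in machines)

print("partitioned by is_up:     ", by_flag)
print("partitioned by machine_id:", by_id)
```

With is_up as the partition key every row hashes to the same node, while machine_id spreads the rows roughly evenly.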

So instead of using a low-cardinality column (status, is_connected, is_up) as the partition key, create a separate table for each flag value.

Example for is_up:

CREATE TABLE up_machines (
    machine_id text PRIMARY KEY,
    machine_name text
);

CREATE TABLE down_machines (
    machine_id text PRIMARY KEY,
    machine_name text
);

Now if you need the list of up machines, you can just select everything from the up_machines table. Create similar tables for the other flags.

Another thing: instead of selecting all records at once, use the driver's paging support:

https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
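
The paging pattern itself can be sketched without a driver. In the sketch below, fetch_page is a hypothetical callable (not a real driver API) that returns one page of rows plus an opaque state for the next page; fake_fetch_page is an in-memory stand-in for a query against up_machines. A real driver exposes its own paging calls, so treat this only as an illustration of the loop shape.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# (rows in this page, state to pass for the next page, or None when done)
Page = Tuple[List[str], Optional[int]]


def iter_rows(fetch_page: Callable[[Optional[int], int], Page],
              page_size: int) -> Iterator[str]:
    """Yield every row, holding only one page in memory at a time."""
    state = None
    while True:
        rows, state = fetch_page(state, page_size)
        yield from rows
        if state is None:  # no further pages
            return


# In-memory stand-in for the contents of the up_machines table.
UP_MACHINES = [f"machine-{i}" for i in range(7)]


def fake_fetch_page(state: Optional[int], page_size: int) -> Page:
    """Hypothetical page fetcher; the state is just an offset here."""
    start = state or 0
    end = start + page_size
    next_state = end if end < len(UP_MACHINES) else None
    return UP_MACHINES[start:end], next_state


all_ids = list(iter_rows(fake_fetch_page, page_size=3))
print(all_ids)
```

The caller never asks for the whole table in one request; it only ever holds page_size rows plus the state token.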

Note: if a machine's status changes, you have to delete it from one table and insert it into the other. Deleting records creates tombstones, so if this happens frequently a large number of tombstones can build up. See "About Deletes and Tombstones in Cassandra".
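
As a rough sketch of that delete-and-insert step, the helper below builds the two CQL statements for an is_up change against the up_machines and down_machines tables. The function name and the sample values are hypothetical, and the `?` placeholders assume the statements are prepared before execution. No read is needed first: the delete is issued blindly against the opposite table, and an INSERT in Cassandra overwrites any existing row with the same key.

```python
# Which table holds machines in each is_up state (tables from the example above).
TABLE_FOR_STATE = {True: "up_machines", False: "down_machines"}


def state_change_statements(machine_id: str, machine_name: str, is_up: bool):
    """Return (cql, params) pairs that move a machine to its new state table.

    Hypothetical helper: it only builds the statements, it does not run them.
    """
    target = TABLE_FOR_STATE[is_up]
    stale = TABLE_FOR_STATE[not is_up]
    return [
        # Blind delete from the table for the opposite state; this still
        # writes a tombstone even if no row was there.
        (f"DELETE FROM {stale} WHERE machine_id = ?", (machine_id,)),
        # INSERT behaves as an upsert, so repeating it is harmless.
        (f"INSERT INTO {target} (machine_id, machine_name) VALUES (?, ?)",
         (machine_id, machine_name)),
    ]


statements = state_change_statements("m-42", "build-host", is_up=True)
for cql, params in statements:
    print(cql, params)
```

Because the delete is written whether or not a row existed, every state change costs a tombstone, which is the trade-off the note above describes.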