DCE Cassandra 3.9 slow secondary index creation during joining existing cluster

380 Views Asked by At

We have cassandra cluster with 32 nodes, average node size is about 1TB. Node configuration 1xIntel Xeon E3-1271v3, 32GB ram, 2x3TB HDD. We have one DB with some small tables and one big table, that holds is about 90-95% of total cluster size.

I try to add additional nodes to this cluster, but suddenly find out, that adding one node to existing cluster take is about 13-14 days for joining to cluster. Build secondary indexes take most of this time and all this time i see that all compactor threads take all available CPU.

I have changed cassandra config to extends limits:

  • concurrent_compactors: 4
  • compaction_throughput_mb_per_sec: 0

Cassandra full config

Schema

Is about 1 year ago we also add new nodes to this cluster and extend it from 16 nodes to 32 nodes cluster, average node size was 1TB before cluster extends. Cassandra version was 2.1. One node joining time was 1-1.5days.

So the question how can we speed up this process ? Did we miss something ?

Thanks.

1

There are 1 best solutions below

0
On

This one is a bit longer so I can't put it into comment ... sorry.

I know that this sounds a bit strange, especially for a later stage of your project, but the thing is with the indexes the situation won't get any better over time. I would strongly recommend to start making your own tables instead of just putting index on following stuff. Depending on how often the data is accessed you can use "inverted indexes".

CREATE INDEX links_by_author_url_idx ON keyspace.links_by_author (url);


CREATE INDEX docs_url_idx ON keyspace.docs (url);


CREATE INDEX om_master_object_id_idx ON keyspace.om (master_object_id);


CREATE INDEX actions_pday_idx ON keyspace.actions (pday);


CREATE INDEX authors_yauid_idx ON keyspace.authors (yauid);

CREATE INDEX authors_login_lr_idx ON keyspace.authors (login_lr);

CREATE INDEX authors_login_idx ON keyspace.authors (login);

CREATE INDEX authors_email_idx ON keyspace.authors (email);

CREATE INDEX authors_name_idx ON keyspace.authors (name);

Basically every index that you have here enables you to "search" over base entities to find them by some condition. Most of the conditions are actually pretty narrow which is a good news. But the thing is the indexes will become massive (already did), especially on docs and authors. But I guess doc's is more problematic.

You should consider making separate tables for this. Every index that you create will be there on every node in the cluster and in the end you will hold far more data than you really need too because under the hood data is multiplied per node. When you add replication factor to this system is using a lot of space without you even being aware.

The problem with joining nodes is that when they receive new data all the data in the cluster needs to be rebuilt ... for every single node in the cluster and this is costing you a lot of time. So basically you loose all the benefits of "easy node joining" that cassandra has.

Now you might think that space will become problem when you write the data into your new schema that is denormalized ....

If space is the problem you can use a technique called inverted indexes where you just put the id of the information into the search table and then you make second load in the main table. I used this on some project where space was the issue but since you have all the main stuff indexed space will probably not be a problem because you are already using a lot more than you think. (my bet would be that you will also probably save significantly on space)

Anyway all the indexes should become tables ... if consistency is problem, use batches (don't use materialized views yet because you might loose data).

My honest tip is that you stay away from indexes. I know it's hell to refactor this plus it's hard to get the time to refactor :( But I think it should be manageable.