Our use case is loading bulk data into our live production Cassandra cluster; we have to bulk-load data into Cassandra on a daily basis. We came across sstableloader and have a few questions about it:
1: When we load bulk data into our live production cluster using sstableloader, is there a chance of dirty reads? (Does sstableloader load all the data at once, or does it keep updating as data streams in?) Dirty reads are not acceptable in our production environment.
2: When we load bulk data into our live production cluster, does it affect cluster availability? (Since we are loading a huge amount of data into a live cluster, does it affect performance? Do we need to add cluster nodes to keep it highly available during the bulk load?)
3: If dirty reads are possible with sstableloader on a live production cluster, please suggest an alternative tool that avoids this issue. We want all the bulk data to appear at once, not incrementally.
Thanks!
sstableloader loads the data incrementally. It will not make everything visible at once.
It will most definitely have an impact. How severe that impact is depends on the size of the data being streamed in, as well as many other factors. You can throttle the throughput with an sstableloader option, which might help in that regard. Run this use case on a test cluster and measure the impact sstableloader has with your dataset.
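A throttled invocation might look something like the sketch below. The hosts, the 100 Mbit/s cap, and the staging path are all illustrative assumptions, not values from your environment; the directory argument must be laid out as `keyspace/table` containing the SSTable files.

```shell
# Hedged sketch: stream pre-built SSTables into a live cluster while capping
# network throughput so the bulk load competes less with live traffic.
# -d/--nodes: initial contact points; -t/--throttle: cap in Mbit/s.
sstableloader \
  -d 10.0.0.1,10.0.0.2 \
  -t 100 \
  /var/lib/cassandra/staging/my_keyspace/my_table
```

Start with a conservative throttle on the test cluster, then raise it until you find the highest rate that leaves read/write latencies acceptable.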
There is no real way to make this work without at least a small window where the data is 'dirty', unless you are willing to take downtime.
For the more adventurous, you could, for example, copy the SSTables directly into the data folders on all of your nodes and run nodetool refresh. However, this will not be exactly simultaneous across nodes and is therefore still prone to dirty reads or failed reads for a short period of time.
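The copy-and-refresh alternative could be sketched roughly as follows. The host names, keyspace/table names, and paths are assumptions; note that the table's data directory carries a generated table-ID suffix that you would have to look up per cluster, and that the loop makes the non-simultaneity explicit: each node sees the new data only after its own refresh.

```shell
# Hedged sketch: place SSTable files into each node's data directory,
# then tell that node to pick them up without a restart.
# Replace hosts, keyspace, table, and the <table-id> suffix with real values.
for host in node1 node2 node3; do
  # copy the prepared SSTable files into the table's data directory
  scp my_table_sstables/* \
    "$host":/var/lib/cassandra/data/my_keyspace/my_table-<table-id>/
  # load the newly placed SSTables on that node
  ssh "$host" nodetool refresh my_keyspace my_table
done
```

Because the refresh runs node by node, different replicas briefly serve different views of the data, which is exactly the short dirty-read window described above.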