Update a column in a Cassandra table with huge data (80M+ rows)


I have a table in Cassandra with more than 80 million records (possibly many more). I updated the schema to add a new column to the table, and now I need to populate that column. I wrote a migration script for this using cassandra-driver and tried batching and token-range scans, but the data is so huge that the script runs for more than 3 hours and still doesn't update all the records (the process gets terminated after 2-3 hours). What is the best way to handle this type of migration? Is there any other way to achieve this?

Token example

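A token-range scan with the Python cassandra-driver looks roughly like the sketch below; the keyspace ks, table events, partition key id, and new column status are all placeholder names:

```python
# Walk the full Murmur3 token ring in fixed-size sub-ranges and
# update each row found. All names here are placeholders.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("ks")

# Murmur3Partitioner range; -2**63 itself is never assigned to a key.
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1
STEP = 2**50  # sub-range size; tune for your cluster

update = session.prepare("UPDATE events SET status = ? WHERE id = ?")

start = MIN_TOKEN
while start < MAX_TOKEN:
    end = min(start + STEP, MAX_TOKEN)
    select = SimpleStatement(
        "SELECT id FROM events"
        " WHERE token(id) > %s AND token(id) <= %s",
        fetch_size=1000)
    for row in session.execute(select, (start, end)):
        session.execute(update, ("migrated", row.id))
    start = end

cluster.shutdown()
```

Even a correct loop like this is slow when run synchronously over 80M+ rows; the writes at least should go through cassandra.concurrent.execute_concurrent_with_args, and the sub-ranges should be spread across several workers.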

1 Answer


Usually for such things it's easier to use Spark (although I'm not sure how it works with Amazon Keyspaces). It's quite hard to do a token-range scan correctly yourself - you need to handle edge cases such as the wrap-around range at the end of the ring, splitting ranges along token ownership, etc. (I have an example for the Java driver that uses the same algorithm as the Spark Cassandra Connector and DSBulk.)

You can use Python with Spark and the Spark Cassandra Connector to update your data - the complexity of the update will depend on your algorithm.
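For illustration, a minimal PySpark sketch (the spark-cassandra-connector package must be on the Spark classpath; keyspace, table, and column names are the same placeholders as above):

```python
# Minimal PySpark sketch using the Spark Cassandra Connector.
# Submit with something like:
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 backfill.py
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("backfill-status")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Read only the primary key; Spark splits the token ring into
# partitions automatically, which is the hard part done for you.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="ks", table="events")
      .load()
      .select("id"))

# Compute the new column - replace lit() with your real logic.
updated = df.withColumn("status", F.lit("migrated"))

# Appends are upserts in Cassandra, so this write only touches the
# primary key plus the new column.
(updated.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="ks", table="events")
 .mode("append")
 .save())
```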

Another approach is to put the logic into your application - if it reads back a null for the given column from Cassandra, it can compute the value on the fly and return it.
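A minimal sketch of that read-path fallback, where compute_status stands in for whatever derivation logic you have:

```python
# Hypothetical read-path fallback: derive the value lazily when the
# new column has not been backfilled yet for this row.
def get_status(session, event_id):
    row = session.execute(
        "SELECT status FROM events WHERE id = %s", (event_id,)).one()
    if row is None:
        return None                      # no such record
    if row.status is not None:
        return row.status                # column already populated
    status = compute_status(event_id)    # your own derivation logic
    # Optionally write the value back so the next read is a plain hit,
    # migrating the table gradually as rows are touched.
    session.execute(
        "UPDATE events SET status = %s WHERE id = %s", (status, event_id))
    return status
```

Written back lazily like this, the table migrates itself over time without a bulk job, at the cost of slightly slower first reads.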