I need to insert a huge amount of data using the Python DataStax driver for Cassandra, so execute( ) is too slow for me; execute_async( ) is much faster.
But I faced a problem of losing data when calling execute_async( ). If I use execute( ), everything is ok. But if I use execute_async( ) (for the SAME insert queries), only about 5-7% of my requests execute correctly (and no errors occur). And if I add time.sleep( 0.01 ) after every 1000 insert requests (still using execute_async( ) ), it is ok again.
No data loss (case 1):
for query in queries:
    session.execute( query )
No data loss (case 2):
counter = 0
for query in queries:
    session.execute_async( query )
    counter += 1
    if counter % 1000 == 0:
        time.sleep( 0.01 )
Data loss:
for query in queries:
    session.execute_async( query )
What could be the reason for this?
The cluster has 2 nodes
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
DataStax Python driver version 3.14.0
Python 3.6
Since execute_async( ) is a non-blocking call, your code does not wait for a request to complete before proceeding. The reason you probably observe no data loss when you add a 10 ms sleep after every 1000 executions is that it gives the requests enough time to be processed before you read the data back. You need something in your code that waits for completion of the requests before reading data back, i.e. collect the futures returned by execute_async( ) and block on each one.
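A minimal sketch of that idea, using the `session` and `queries` from the question (the helper name `insert_and_wait` is mine):

```python
# Collect the futures returned by execute_async and block on each one
# before reading data back. Calling result() also re-raises any error
# that occurred server-side, so failed inserts no longer pass silently.
def insert_and_wait(session, queries):
    futures = [session.execute_async(q) for q in queries]
    # Wait for every request to finish before proceeding.
    return [f.result() for f in futures]
```

This keeps the throughput benefit of firing off the requests asynchronously while still guaranteeing they have all completed before you move on.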
You may also want to evaluate using execute_concurrent (from cassandra.concurrent) for submitting many queries and having the driver manage the concurrency level for you.
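A sketch of what that could look like with execute_concurrent_with_args, assuming a prepared INSERT statement and a sequence of parameter tuples (the helper name, parameter names, and concurrency value here are illustrative):

```python
def insert_concurrently(session, insert_stmt, rows, concurrency=50):
    # Deferred import so this sketch stands alone; requires the
    # cassandra-driver package at call time.
    from cassandra.concurrent import execute_concurrent_with_args

    # rows is a sequence of parameter tuples, one per INSERT.
    # The driver keeps at most `concurrency` requests in flight at once,
    # instead of firing them all off unthrottled.
    results = execute_concurrent_with_args(
        session, insert_stmt, rows,
        concurrency=concurrency, raise_on_first_error=False)

    # Each result is a (success, result_or_exception) pair; return the
    # failures so the caller can see exactly which inserts did not land.
    return [res for ok, res in results if not ok]
```

Unbounded execute_async loops can overwhelm a 2-node cluster with in-flight requests, which is consistent with the symptoms described above; bounding concurrency this way is usually the more robust fix than sleeping.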