Cassandra execute_async requests lose data


I need to insert a huge amount of data using the DataStax Python driver for Cassandra. Plain execute() requests are too slow for this, so I switched to execute_async(), which is much faster.

But I ran into data loss when calling execute_async(). With execute(), everything is fine. With execute_async() (for the SAME insert queries), only about 5-7% of my requests execute correctly (and no errors occur). And if I add time.sleep(0.01) after every 1000 insert requests (still using execute_async()), everything is fine again.

No data loss (case 1):

for query in queries:
    session.execute(query)

No data loss (case 2):

counter = 0
for query in queries:
    session.execute_async(query)
    counter += 1
    if counter % 1000 == 0:
        time.sleep(0.01)

Data loss:

for query in queries:
    session.execute_async(query)

Is there any reason why this could happen?

The cluster has 2 nodes.

[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]

DataStax Python driver version 3.14.0

Python 3.6


1 Answer


Since execute_async() is a non-blocking call, your code does not wait for requests to complete before proceeding. The likely reason you observe no data loss when you add a 10 ms sleep after every 1000 executions is that this gives the requests enough time to be processed before you read the data back.

You need something in your code that waits for the requests to complete before reading data back, e.g.:

futures = []
for query in queries:
    futures.append(session.execute_async(query))

for f in futures:
    f.result()  # blocks until the query completes
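
Note that collecting every future first keeps all of them in memory, which can itself become a problem for very large inserts. One alternative is to bound the number of in-flight requests with a semaphore and use the driver's ResponseFuture.add_callbacks(). The sketch below is just one way to do this; the window size of 100 and the callback names are my own choices, not something prescribed by the driver:

import threading

MAX_IN_FLIGHT = 100  # assumed window size; tune for your cluster

sem = threading.Semaphore(MAX_IN_FLIGHT)

def on_success(rows):
    sem.release()

def on_error(exc):
    sem.release()
    print('Query failed:', exc)

for query in queries:
    sem.acquire()  # block while MAX_IN_FLIGHT requests are outstanding
    future = session.execute_async(query)
    future.add_callbacks(on_success, on_error)

# drain: wait until every outstanding request has released its permit
for _ in range(MAX_IN_FLIGHT):
    sem.acquire()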

You may want to evaluate using execute_concurrent for submitting many queries and having the driver manage the concurrency level for you.
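
For example, here is a minimal sketch using execute_concurrent_with_args from cassandra.concurrent; the contact point, keyspace, table, sample data, and concurrency value are placeholders for illustration:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('my_keyspace')  # placeholder keyspace

# prepare once, then let the driver run the inserts concurrently
insert = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")
parameters = [(i, 'value-%d' % i) for i in range(100000)]  # sample data

results = execute_concurrent_with_args(
    session, insert, parameters,
    concurrency=50, raise_on_first_error=False)

for success, result in results:
    if not success:
        print('Insert failed:', result)  # result is the exception here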