We have two CAS queries. They worked fine with 2 application containers per region. After we increased the containers from 2 to 3, we started seeing WriteTimeoutException. The traffic is the same as, or even lower than, during regular business hours. Cassandra runs in 3 regions, and each cluster has 3 hosts.
We are not sure what is causing these errors; the only change was adding one application container. Any help debugging this further would be appreciated.
UPDATE order_sequences USING TTL 10 SET instance_name = ? WHERE id_name = ? IF instance_name = null (ConsistencyLevel.QUORUM)
UPDATE order_sequences SET next_id = ? WHERE id_name = ? IF next_id = ? AND instance_name = ? (ConsistencyLevel.QUORUM)
Error stack:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during CAS write query at consistency SERIAL (7 replica were required but only 0 acknowledged the write)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:85)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:23)
    at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
    at com.datastax.driver.core.ChainedResultSetFuture.getUninterruptibly(ChainedResultSetFuture.java:59)
    at com.datastax.driver.core.NewRelicChainedResultSetFuture.getUninterruptibly(NewRelicChainedResultSetFuture.java:11)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
CAS write is a specialized metric that is recorded whenever a compare and set is performed. A lightweight transaction (LWT) is also known as compare and set (CAS): replica data is compared, and any data found to be out of date is set to the most consistent value.
In Cassandra, the process combines the Paxos protocol with normal read and write operations to accomplish the compare and set operation.
The Paxos protocol is implemented as a series of phases:
• Prepare/Promise
• Read/Results
• Propose/Accept
• Commit/Acknowledge
These four phases require four round trips between the node proposing a lightweight transaction and the cluster replicas involved in the transaction, so a lightweight transaction is noticeably slower than a normal write. Consequently, reserve lightweight transactions for situations where concurrency must be considered.
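To illustrate why adding one more client that updates the same row can hurt, here is a minimal toy sketch of the Prepare/Promise and Propose/Accept phases. It is not Cassandra's implementation: it models a single replica with no quorum, no Read or Commit phase, and no timeouts, and the names `Replica`, `prepare`, `propose`, and `interleave` are made up for this example. It only shows that a proposer whose ballot is overtaken between its two phases has to start over, which is the kind of contention that grows when more containers compete for the same `id_name`.

```java
import java.util.ArrayList;
import java.util.List;

public class PaxosContentionSketch {

    /** Toy replica (hypothetical): remembers only the highest ballot it has promised. */
    static final class Replica {
        private long promisedBallot = 0;

        /** Prepare/Promise: promise only if the ballot is newer than any seen so far. */
        boolean prepare(long ballot) {
            if (ballot > promisedBallot) {
                promisedBallot = ballot;
                return true;
            }
            return false;
        }

        /** Propose/Accept: accept only if no newer ballot was promised in the meantime. */
        boolean propose(long ballot) {
            return ballot >= promisedBallot;
        }
    }

    /** Interleaves two proposers (A and B) on the same row and records each outcome. */
    static List<String> interleave() {
        Replica replica = new Replica();
        List<String> log = new ArrayList<>();
        log.add("A prepare(1): " + replica.prepare(1));
        log.add("B prepare(2): " + replica.prepare(2)); // B overtakes A's ballot
        log.add("A propose(1): " + replica.propose(1)); // rejected, A must restart
        log.add("B propose(2): " + replica.propose(2));
        return log;
    }

    public static void main(String[] args) {
        interleave().forEach(System.out::println);
    }
}
```

In this interleaving A's proposal is rejected and A would have to begin again with a higher ballot. With more concurrent proposers the restarts become more frequent, and a real coordinator can run out of its timeout before any round completes.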
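Also avoid mixing lightweight transactions and normal operations on the same data. LWTs use a different timestamping mechanism from normal operations, so mixing the two can produce errors. For example, the following series of operations can fail:
DELETE ...
INSERT .... IF NOT EXISTS
SELECT ....
The following series of operations will work:
DELETE ... IF EXISTS
INSERT .... IF NOT EXISTS
SELECT .....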
I would strongly recommend checking the "CAS write latency" statistics in the output of the "nodetool proxyhistograms" command, which provides a histogram of network statistics, as seen by the node you run it on, at the time of the command.
Please let me know if you are still facing this error.
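If the CAS write latency does point to contention between the containers, one common client-side mitigation is to retry the conditional update with randomized exponential backoff, so that competing containers stop colliding in lockstep. The sketch below is only an illustration under assumptions: `CasRetrySketch`, `retryWithJitter`, and `backoffCeilingMillis` are hypothetical names, and the `BooleanSupplier` stands in for your own code that runs the UPDATE ... IF statement and reports whether it was applied. Keep in mind that after a CAS write timeout the outcome is unknown, so that code should re-read the row before deciding to retry.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

public class CasRetrySketch {

    /** Upper bound of the sleep window after a 0-based attempt: doubles, then caps. */
    static long backoffCeilingMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = baseMillis << Math.min(attempt, 20); // clamp shift for small bases
        return Math.min(ceiling, capMillis);
    }

    /**
     * Calls casAttempt until it reports success or maxAttempts is reached.
     * Returns the number of attempts used, or -1 if every attempt failed
     * or the thread was interrupted while waiting.
     */
    static int retryWithJitter(BooleanSupplier casAttempt, int maxAttempts,
                               long baseMillis, long capMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (casAttempt.getAsBoolean()) {
                return attempt + 1;
            }
            if (attempt + 1 < maxAttempts) {
                long ceiling = backoffCeilingMillis(attempt, baseMillis, capMillis);
                try {
                    // Full jitter: sleep a random time in [0, ceiling].
                    Thread.sleep(ThreadLocalRandom.current().nextLong(ceiling + 1));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

A caller might use `retryWithJitter(() -> tryClaimSequence(), 5, 20, 500)`, where `tryClaimSequence` is a placeholder for the application's own method that executes the first conditional UPDATE.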