Does DSBulk with maxErrors=0 retry failed queries?

I'm using dsbulk to load data into a Cassandra cluster. The configuration currently includes -maxErrors 0 to fail fast in case of any issue.

It's not clear to me how the retry strategy defined by

advanced.retry-policy.class = "com.datastax.oss.dsbulk.workflow.commons.policies.retry.MultipleRetryPolicy"

advanced.retry-policy.max-retries = 10

works with 0 allowed errors.

Will a failed query be retried 10 times before the entire operation is aborted, or will no retries be performed at all?

The entire load process is aborted as soon as at least one issue occurs, but it's not clear from the logs whether the failed query is retried or not.

There are 2 best solutions below

We need additional details to help triage this.

  1. What is the ./dsbulk --version output? If it is lower than 1.10.0, I'd strongly recommend downloading the latest version for your OS.
  2. What is the Cassandra® version against which you're running this load?
  3. What is the full command that was run?
  4. What do the console (or log file) errors say?
  5. What are the output and contents of the logs directory?
  6. Could you re-run the load command (i.e. ./dsbulk load) with --dsbulk.log.stmt.level EXTENDED included?

Having said that, DataStax Bulk Loader (DSBulk for short) offers a replay strategy: once you've fixed the underlying problem (e.g. a data problem, cluster instability, etc.), you can resume the failed operation and process the new and failed records by using the checkpoint file (the default one, or the one you set with --dsbulk.log.checkpoint.file string). The checkpoint information is available in both the console and the log file for convenience.

It depends on the type of failure since not all failures are retried.

The MultipleRetryPolicy will retry an operation only under the following conditions:

  1. On write timeout - retry the operation on the same coordinator.
  2. On unavailable - retry the operation on the next coordinator in the query plan.
  3. On aborted request - retry the operation on the next coordinator in the query plan, but only if the cause is a ClosedConnectionException or HeartbeatException.
  4. On error response - retry the operation on the next coordinator in the query plan, unless the error is a WriteFailureException.

For a bit of context:

  • A write timeout occurs when replicas do not reply to the coordinator within the write timeout period.
  • An unavailable error occurs when there are not enough replicas alive (because they were down, unresponsive, overloaded, etc.) to satisfy the required consistency level.
  • An aborted request occurs when an operation is aborted before the driver has heard back from the coordinator, usually because the app was terminated (shut down).
  • An error response is a recoverable error, such as an overloaded coordinator, so the query is retried on the next available host.

In all of these cases, an operation is retried only while the number of retry attempts is still below the maximum allowed (max-retries). Once max-retries has been reached, or if the failure is not one of the retryable cases, the operation is marked as failed.
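To make the rules above easier to follow, here is a rough Python model of the decision table as I've described it. This is only an illustrative sketch, not DSBulk's actual Java code: the function name decide, the failure labels, and the Decision enum are all made up for this example, and the default of 10 retries is simply the max-retries value from the question.

```python
from enum import Enum


class Decision(Enum):
    RETRY_SAME = "retry on the same coordinator"
    RETRY_NEXT = "retry on the next coordinator in the query plan"
    RETHROW = "do not retry; the operation is marked as failed"


# Hypothetical labels for the four failure categories listed above.
RETRYABLE_ABORTS = {"ClosedConnectionException", "HeartbeatException"}


def decide(failure, retry_count, max_retries=10, error_name=""):
    """Model of the retry rules described in this answer (not the real policy).

    failure     -- one of "write_timeout", "unavailable", "aborted", "error_response"
    retry_count -- how many retries have already been attempted
    error_name  -- simple class name of the underlying error, where relevant
    """
    # No retry once the allowed number of attempts has been used up.
    if retry_count >= max_retries:
        return Decision.RETHROW
    if failure == "write_timeout":
        return Decision.RETRY_SAME
    if failure == "unavailable":
        return Decision.RETRY_NEXT
    if failure == "aborted":
        # Only connection-level aborts are retried.
        return Decision.RETRY_NEXT if error_name in RETRYABLE_ABORTS else Decision.RETHROW
    if failure == "error_response":
        # Everything except a write failure is treated as recoverable.
        return Decision.RETHROW if error_name == "WriteFailureException" else Decision.RETRY_NEXT
    # Any other kind of failure is not retried.
    return Decision.RETHROW
```

For example, decide("write_timeout", 3) gives Decision.RETRY_SAME, while decide("write_timeout", 10) gives Decision.RETHROW because the 10 allowed retries are already used up.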

For details, see DSBulk's MultipleRetryPolicy on GitHub. Cheers!