In Cloud Spanner, from an API perspective, does Partitioned DML follow the same retry/timeout behavior?


From an API perspective, does Partitioned DML follow the same retry/timeout behavior? Are all partitions retried, or only the failed ones?

  • How often are retries attempted, and what is the maximum interval (1s, 5s, etc.)? Is this configurable?

How should this be managed from the application's perspective? I am aware of the idempotency requirement and of the fact that it's all or nothing, but if a partitioned DML statement fails, does the customer need to manage anything as a result?

From a troubleshooting perspective, which kinds of errors do we get at the API? In other words, how can we tell whether a specific partition has failed, and how do we fix the affected range of data? This is for a payment application, so consistency is important.

Naren Mehra

This is all documented in https://cloud.google.com/spanner/docs/dml-partitioned#execution-transactions

"If the partitioned DML statement succeeds, then Spanner ran the statement at least once against each partition of the key range."

Note the "at least once", which is why partitioned DML statements need to be idempotent, i.e. executing them multiple times must give the same result as executing them once. This severely limits the types of updates that can be performed.
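To make the "at least once" point concrete, here is a minimal sketch that uses Python's built-in sqlite3 purely as a local stand-in. It is not Spanner, and the `Accounts` table, its columns, and the `apply_n_times` helper are made up for illustration. It applies the same statement once and twice, as could happen to a partition that gets re-executed:

```python
import sqlite3

# Hypothetical statements against a hypothetical Accounts table.
# The first one adds 10 on every execution; the second one guards itself
# with a flag so that a repeated execution changes nothing.
NOT_IDEMPOTENT = "UPDATE Accounts SET balance = balance + 10 WHERE id > 0"
IDEMPOTENT = (
    "UPDATE Accounts SET balance = balance + 10, migrated = 1 "
    "WHERE migrated = 0"
)


def apply_n_times(statement: str, times: int) -> int:
    """Run `statement` `times` times on a fresh one-row table; return the balance."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Accounts (id INTEGER PRIMARY KEY, balance INTEGER, migrated INTEGER)"
    )
    conn.execute("INSERT INTO Accounts VALUES (1, 100, 0)")
    for _ in range(times):
        conn.execute(statement)
    (balance,) = conn.execute("SELECT balance FROM Accounts WHERE id = 1").fetchone()
    conn.close()
    return balance


if __name__ == "__main__":
    for name, stmt in [("not idempotent", NOT_IDEMPOTENT), ("idempotent", IDEMPOTENT)]:
        print(name, apply_n_times(stmt, 1), apply_n_times(stmt, 2))
```

Only the guarded statement ends with the same balance whether it runs once or twice, which is the property a partitioned DML statement needs.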

"If the execution of the statement causes an error, then execution stops across all partitions and Spanner returns that error for the entire operation."

This may mean that some partitions have succeeded and some have failed, and some have not even been attempted.

"Consistency is important."

Partitioned DML is only consistent and atomic within a partition, not across the whole table.

If you want to be sure of consistency, then consider writing code that performs batches of normal transactions, where each read/update transaction covers a set of rows and the application handles the failure, retry, and reporting logic. If you need consistency across the whole statement, use non-partitioned DML.
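A rough sketch of that batching approach is below. Everything in it is an assumption for illustration: `TransientError`, `BatchReport`, `update_in_batches`, and the `apply_batch` callback are hypothetical names, and `apply_batch` stands in for whatever runs one normal read/write transaction over a set of keys in your client library (which may already retry aborted transactions on its own).

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Sequence, Tuple


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure, e.g. an aborted transaction."""


@dataclass
class BatchReport:
    succeeded: List[Sequence[int]] = field(default_factory=list)
    failed: List[Tuple[Sequence[int], Exception]] = field(default_factory=list)


def chunked(keys: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `keys` with at most `size` elements each."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def update_in_batches(
    keys: Sequence[int],
    apply_batch: Callable[[Sequence[int]], None],
    batch_size: int = 100,
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> BatchReport:
    """Apply `apply_batch` to each chunk of keys, retrying transient failures.

    Each call to `apply_batch` is assumed to be one atomic transaction, so a
    batch is either fully applied or not applied at all.
    """
    report = BatchReport()
    for batch in chunked(keys, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                apply_batch(batch)
                report.succeeded.append(batch)
                break
            except TransientError as exc:
                if attempt == max_attempts:
                    report.failed.append((batch, exc))
                else:
                    # Exponential backoff between attempts.
                    time.sleep(base_delay * 2 ** (attempt - 1))
            except Exception as exc:
                # Non-retryable: record it and move on to the next batch.
                report.failed.append((batch, exc))
                break
    return report
```

Unlike a partitioned DML run, this keeps going after a failure and returns every failed key range together with its error, so the application knows exactly which rows still need attention.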

  • "Are all partitions retried, or only the failed ones?"
    For transient errors, such as row modification contention, only the failed partitions are retried.

  • "How often, and what is the maximum interval this would take place (1s, 5s, etc.)? Is this configurable?"
    There is an overall timeout of 1hr (IIRC) on the entire ExecutePartitionedDML operation; there is no way to specify a lower timeout or a timeout on each partition.

"From a troubleshooting perspective, which kinds of errors do we get at the API? In other words, how can the customer tell whether a specific partition has failed, and how do they fix a range of data?"

You will get an error from the first partition that fails, much as you would from a normal DML statement, and the error will include the reason. But because execution stops at the first error, this will be a very slow way of identifying problematic rows if you are expecting many errors.