I'm sorry that I can't post my code snippets. I have a Go script that scans through a DynamoDB table and modifies its items. Everything is done sequentially (no goroutines are involved). However, when I ran this against a large table, I got a ProvisionedThroughputExceededException. I'm running the script locally.

I'm using aws-sdk-go-v2, which is supposed to apply exponential backoff of up to 20 seconds when this error is triggered. Since provisioned write capacity is allocated on a per-second basis, shouldn't the SDK automatically make the script wait once the capacity is exhausted, until the next second when new capacity becomes available? I'm using the UpdateItem, PutItem, and DeleteItem operations.

One guess I have is that issuing many requests in a short amount of time consumes capacity from future seconds, while the table is still busy processing the earlier requests. However, I got the exception after only a few seconds of execution, far shorter than 20 seconds.

What's the proper way of handling this exception? Catching it, waiting a few seconds, and retrying (as sketched below) feels a bit arbitrary. I don't understand why the SDK isn't taking care of this already.
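
For reference, the kind of manual handling I mean would look roughly like this; the 5-attempt cap and 5-second sleep are arbitrary numbers I picked for illustration, which is exactly what bothers me:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// updateWithRetry is the kind of hand-rolled loop I'd rather not write:
// the 5-attempt cap and the 5-second sleep are numbers I picked arbitrarily.
func updateWithRetry(ctx context.Context, client *dynamodb.Client, in *dynamodb.UpdateItemInput) error {
	for attempt := 1; attempt <= 5; attempt++ {
		_, err := client.UpdateItem(ctx, in)
		if err == nil {
			return nil
		}
		var throttled *types.ProvisionedThroughputExceededException
		if !errors.As(err, &throttled) {
			return err // some other failure; don't retry
		}
		log.Printf("throttled on attempt %d, sleeping before retrying", attempt)
		time.Sleep(5 * time.Second)
	}
	return errors.New("gave up after repeated throttling")
}

func main() {} // placeholder so the snippet builds on its own
```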

BEST ANSWER

The Go API documentation (e.g., see https://github.com/aws/aws-sdk-go/blob/main/service/dynamodb/errors.go) states that "The Amazon Web Services SDKs for DynamoDB automatically retry requests that receive this exception [ProvisionedThroughputExceededException]. Your request is eventually successful, unless your retry queue is too large to finish." In your case there is no parallelism and only one outstanding request at a time, so the retry queue holds a single item. Given all of this, you are right: you should not be seeing ProvisionedThroughputExceededException at all, or at least not without a 20-second delay first.

My only guess as to why you're seeing it is the parameter DefaultMaxAttempts int = 3. My suspicion (which I can't back up with code; I'm not familiar with this Go library) is that the retries never reach a full 20-second wait: three attempts cover far less than 20 seconds of backoff. If that's the case, can you please try increasing this "max attempts" parameter and see if it helps (at least enough to stretch the retry period to the full 20 seconds)?
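
If you want to experiment with that, a minimal sketch of raising the limits when constructing the client might look like the following; the 10 attempts and the 60-second backoff cap are illustrative values, not recommendations:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
)

func main() {
	// Wrap the SDK's standard retryer so throttled requests are retried
	// more times, and with a larger backoff cap, than the defaults allow.
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRetryer(func() aws.Retryer {
			return retry.AddWithMaxBackoffDelay(
				retry.AddWithMaxAttempts(retry.NewStandard(), 10), // instead of the default 3
				60*time.Second, // cap each backoff sleep at 60s
			)
		}),
	)
	if err != nil {
		log.Fatal(err)
	}

	client := dynamodb.NewFromConfig(cfg)
	_ = client // use this client for your UpdateItem / PutItem / DeleteItem calls
}
```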

ANSWER

You can implement a token bucket system in your script to keep your RCU and WCU consumption within an acceptable range, based on your table configuration and on other clients' usage of the table. If you are processing every item and speed is not a concern, try not to exceed 1,000 WCU and 3,000 RCU per second, to ensure you won't get throttled at the partition level.
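
A rough sketch of that token bucket for the write side, using golang.org/x/time/rate (the 500 WCU/s budget and the assumption that every item costs 1 WCU are illustrative, not prescriptive):

```go
package main

import (
	"context"
	"log"

	"golang.org/x/time/rate"
)

func main() {
	// Self-imposed write budget, well under the 1,000 WCU/s per-partition
	// ceiling so other clients of the table keep some headroom.
	const writeBudget = 500 // WCU per second; illustrative value

	// Token bucket: refills at writeBudget tokens per second and holds at
	// most writeBudget tokens, so bursts never exceed one second's budget.
	limiter := rate.NewLimiter(rate.Limit(writeBudget), writeBudget)
	ctx := context.TODO()

	for i := 0; i < 10_000; i++ { // stand-in for iterating over the scanned items
		// Take one token per WCU the write is expected to consume
		// (1 WCU covers a write of an item up to 1 KB); here we assume
		// every item costs 1 WCU.
		if err := limiter.Wait(ctx); err != nil {
			log.Fatal(err)
		}
		// ... issue the UpdateItem / PutItem / DeleteItem call here ...
	}
}
```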

The reason this is not built into the SDK is that there is no universally best way to handle the situation. Having the SDK "wait" might mean there is too much work in the queue and it won't get processed before a Lambda timeout. Or the throttling is happening at the partition level rather than the table level, so the SDK should not wait, as future requests may not hit that partition. Or it is not clear how long to wait, because other clients are also consuming capacity. Or the throttling is happening at a GSI level, and future requests may not touch that GSI.