How to set proper timeout values for Cadence activities (local and regular activities, with or without retry)?


So there are so many timeout values:

For local activity:

  • ScheduleToClose timeout

For regular activity without retry:

  • ScheduleToStart timeout
  • ScheduleToClose timeout
  • StartToClose timeout
  • Heartbeat timeout

And then more values in retryOptions:

  • ExpirationInterval
  • InitialInterval
  • BackoffCoefficient
  • MaximumInterval
  • MaximumAttempts

And retryOptions can be applied to a local activity or a regular activity.

How do I use them together, and what behavior should I expect?


TL;DR

The easiest way of using timeouts:

Regular Activity with retry:

  1. Use StartToClose as timeout of each attempt
  2. Leave ScheduleToStart and ScheduleToClose empty
  3. If StartToClose is too large (like 10 mins), then set Heartbeat timeout to a smaller value like 10s. Call the heartbeat API inside the activity regularly.
  4. Use retryOptions.InitialInterval, retryOptions.BackoffCoefficient, retryOptions.MaximumInterval to control backoff.
  5. Use retryOptions.ExpirationInterval as overall timeout of all attempts.
  6. Leave retryOptions.MaximumAttempts empty.

Regular Activity without retry:

  1. Use ScheduleToClose for overall timeout
  2. Leave ScheduleToStart and StartToClose empty
  3. If ScheduleToClose is too large (like 10 mins), then set Heartbeat timeout to a smaller value like 10s. Call the heartbeat API inside the activity regularly.

LocalActivity without retry: Use ScheduleToClose for overall timeout

LocalActivity with retry:

  1. Use ScheduleToClose as timeout of each attempt.
  2. Use retryOptions.InitialInterval, retryOptions.BackoffCoefficient, retryOptions.MaximumInterval to control backoff.
  3. Use retryOptions.ExpirationInterval as overall timeout of all attempts.
  4. Leave retryOptions.MaximumAttempts empty.

More TL;DR

Because activities should be idempotent, every activity should set a retry policy. Temporal sets an infinite retry policy for any activity by default. Cadence should do the same IMO.

iWF also sets a default infinite retry for State APIs to match Temporal activities.

What and Why

Basics without Retry

Things are easier to understand in a world without retry, because that is where Cadence started.

  • ScheduleToClose timeout is the overall end-to-end timeout from a workflow's perspective.

  • ScheduleToStart timeout is the time allowed for an activity worker to pick up and start the activity. When it is exceeded, the activity returns a ScheduleToStart timeout error/exception to the workflow.

  • StartToClose timeout is the time allowed for the activity to run once started. When it is exceeded, the activity returns a StartToClose timeout error to the workflow.

  • Requirement and defaults:

    • Either ScheduleToClose is provided or both of ScheduleToStart and StartToClose are provided.
    • If only ScheduleToClose is provided, then ScheduleToStart and StartToClose default to it.
    • If only ScheduleToStart and StartToClose are provided, then ScheduleToClose = ScheduleToStart + StartToClose.
    • All of them are capped by workflowTimeout. (e.g. if workflowTimeout is 1 hour, setting ScheduleToClose to 2 hours will still give 1 hour: ScheduleToClose = Min(ScheduleToClose, workflowTimeout))
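As a rough illustration of the requirement and default rules above, here is a plain-Python model of them. It is not Cadence SDK or server code: the names `resolve_timeouts` and `Timeouts` and the use of seconds are assumptions made for this example, and filling in a single missing value from ScheduleToClose is my guess at how a partially specified set would be treated.

```python
from typing import NamedTuple, Optional


class Timeouts(NamedTuple):
    schedule_to_start: float
    start_to_close: float
    schedule_to_close: float


def resolve_timeouts(
    workflow_timeout: float,
    schedule_to_close: Optional[float] = None,
    schedule_to_start: Optional[float] = None,
    start_to_close: Optional[float] = None,
) -> Timeouts:
    """Model of the defaulting rules described above (all values in seconds)."""
    if schedule_to_close is None:
        # Without ScheduleToClose, both of the other two are required.
        if schedule_to_start is None or start_to_close is None:
            raise ValueError(
                "provide ScheduleToClose, or both ScheduleToStart and StartToClose"
            )
        schedule_to_close = schedule_to_start + start_to_close
    else:
        # Missing values default to ScheduleToClose.
        if schedule_to_start is None:
            schedule_to_start = schedule_to_close
        if start_to_close is None:
            start_to_close = schedule_to_close

    # Every timeout is capped by the workflow timeout.
    return Timeouts(
        schedule_to_start=min(schedule_to_start, workflow_timeout),
        start_to_close=min(start_to_close, workflow_timeout),
        schedule_to_close=min(schedule_to_close, workflow_timeout),
    )
```

For example, `resolve_timeouts(3600, schedule_to_close=7200)` caps all three values at 3600, which matches the 1 hour / 2 hours case in the last bullet.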

So why do all three exist?

You may notice that ScheduleToClose is only useful when ScheduleToClose < ScheduleToStart + StartToClose. If ScheduleToClose >= ScheduleToStart + StartToClose, the overall limit is already enforced by the combination of the other two, and ScheduleToClose becomes meaningless.

So the main use case for ScheduleToClose being less than the sum of the other two is that people want to limit the overall timeout of the activity while allowing more time for ScheduleToStart or StartToClose individually. This is an extremely rare use case.

Also, the main reason people want to distinguish ScheduleToStart from StartToClose is that the workflow may need to do some special handling for a ScheduleToStart timeout error. This is also a very rare use case.

That is why the TL;DR recommends using only ScheduleToClose and leaving the other two empty: you need them only in some rare cases. If you can't think of the use case, then you do not need them.

LocalActivity doesn't have ScheduleToStart/StartToClose because it is started directly inside the workflow worker, with no server scheduling involved.

Heartbeat timeout

Heartbeat is very important for long-running activities, to prevent them from getting stuck. Bugs are not the only thing that can cause an activity to get stuck; a regular deployment, host restart, or host failure can cause it too. Without heartbeats, the Cadence server cannot know whether the activity is still being worked on. See more details in "Solutions to fix stuck timers / activities in Cadence/SWF/StepFunctions".
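To make the benefit concrete, here is a simplified plain-Python model of when a heartbeat timeout would fire. It is only a sketch of the idea, not how the Cadence server is implemented; the function name `first_heartbeat_timeout` and its parameters are hypothetical, and all times are seconds since the activity started.

```python
from typing import Iterable, Optional


def first_heartbeat_timeout(
    started_at: float,
    heartbeat_times: Iterable[float],
    heartbeat_timeout: float,
    finished_at: float,
) -> Optional[float]:
    """Return the time at which a heartbeat timeout would fire, or None.

    finished_at is when the activity completes, or when StartToClose
    would otherwise fire if the worker silently died.
    """
    last_seen = started_at
    for t in sorted(heartbeat_times):
        if t - last_seen > heartbeat_timeout:
            # The gap since the last heartbeat is too long.
            return last_seen + heartbeat_timeout
        last_seen = t
    if finished_at - last_seen > heartbeat_timeout:
        # No more heartbeats arrived before the activity ended.
        return last_seen + heartbeat_timeout
    return None
```

Suppose StartToClose is 600s, the heartbeat timeout is 10s, and the worker host dies right after heartbeating at 5s and 12s. `first_heartbeat_timeout(0, [5, 12], 10, 600)` reports the problem at 22s, instead of the workflow waiting the full 600s.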

RetryOptions and Activity with Retry

First of all, RetryOptions here is for server-side backoff retry, meaning that the retry is managed automatically by Cadence without interacting with workflows. Because retry is managed by Cadence, the activity has to be handled specially in the Cadence history: the started event cannot be written until the activity is closed. For reference, see "Why an activity task is scheduled but not started?"

In fact, a workflow can also do client-side retry on its own. This means the workflow manages the retry logic. You can write your own retry function, or use a helper function in the SDK, like Workflow.retry in Cadence-java-client. Client-side retry shows all started events immediately, but retrying a single activity produces many events in the history. It is not recommended because of the performance cost.
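For readers unfamiliar with the pattern, a hand-written client-side retry loop looks roughly like the generic sketch below. It is plain Python, not Cadence SDK code: `retry_with_backoff` and its parameters are names I made up for illustration, and inside a real workflow you would have to use the SDK's own sleep/timer API rather than `time.sleep` to keep the workflow deterministic.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    initial_interval: float,
    backoff_coefficient: float,
    maximum_interval: float,
    maximum_attempts: int,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or maximum_attempts is reached."""
    interval = initial_interval
    for attempt in range(1, maximum_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == maximum_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            sleep(interval)
            # Grow the wait, but never beyond maximum_interval.
            interval = min(interval * backoff_coefficient, maximum_interval)
    raise ValueError("maximum_attempts must be at least 1")
```

Every call to `fn` here would be a separate activity invocation from the workflow's point of view, which is where the extra history events come from.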

So what do the options mean:

  • ExpirationInterval:

    • It replaces the ScheduleToClose timeout as the actual overall timeout of the activity across all attempts.
    • It is also capped by the workflow timeout, like the other three timeout options (e.g. ScheduleToClose = Min(ScheduleToClose, workflowTimeout)).
    • The timeout of each attempt is StartToClose, but StartToClose defaults to ScheduleToClose as explained above.
    • ScheduleToClose will be extended to ExpirationInterval: ScheduleToClose = Max(ScheduleToClose, ExpirationInterval), and this happens before ScheduleToClose is copied to ScheduleToStart and StartToClose.
  • InitialInterval: the interval before the first retry

  • BackoffCoefficient: self-explanatory

  • MaximumInterval: the maximum interval between retries

  • MaximumAttempts: the maximum number of attempts. If set together with ExpirationInterval, retry stops when either one of them is exceeded.

  • Requirements and defaults:

    • Either MaximumAttempts or ExpirationInterval is required. ExpirationInterval is set to workflowTimeout if not provided.
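The way these options combine can be sketched with a small plain-Python model that lists when each attempt would start. This is a simplification and not the server's algorithm: `attempt_start_times` is a hypothetical name, every attempt is assumed to fail after a fixed `attempt_duration`, the backoff is assumed to be counted from the end of the previous attempt, and times are seconds since the first schedule.

```python
from typing import List, Optional


def attempt_start_times(
    initial_interval: float,
    backoff_coefficient: float,
    maximum_interval: float,
    attempt_duration: float,
    maximum_attempts: Optional[int] = None,
    expiration_interval: Optional[float] = None,
) -> List[float]:
    """Start time of every attempt, assuming all attempts fail."""
    if maximum_attempts is None and expiration_interval is None:
        raise ValueError("set MaximumAttempts or ExpirationInterval")

    starts: List[float] = []
    now = 0
    interval = initial_interval
    while True:
        # Stop when either limit is exceeded.
        if maximum_attempts is not None and len(starts) >= maximum_attempts:
            break
        if expiration_interval is not None and now >= expiration_interval:
            break
        starts.append(now)
        # Next attempt starts after this one fails plus the backoff.
        now += attempt_duration + interval
        # The backoff grows by the coefficient, capped at maximum_interval.
        interval = min(interval * backoff_coefficient, maximum_interval)
    return starts
```

With InitialInterval=1, BackoffCoefficient=2, MaximumInterval=10, instantly failing attempts, and ExpirationInterval=60, the model gives attempts at 0, 1, 3, 7, 15, 25, 35, 45 and 55 seconds. Adding MaximumAttempts=3 cuts that down to 0, 1 and 3.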

ExpirationInterval is always there, and in fact it is more useful. Most of the time MaximumAttempts is harder to use, because it interacts badly with backoffCoefficient (e.g. when backoffCoefficient > 1, the end-to-end timeout can be really large if you are not careful). So I would recommend just using ExpirationInterval, unless you really need MaximumAttempts.
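A quick back-of-the-envelope calculation shows how fast this grows. The helper below is a plain-Python sketch with a hypothetical name (`worst_case_duration`); it assumes every attempt runs for its full StartToClose and that one backoff interval sits between consecutive attempts, with all values in seconds.

```python
def worst_case_duration(
    start_to_close: float,
    initial_interval: float,
    backoff_coefficient: float,
    maximum_interval: float,
    maximum_attempts: int,
) -> float:
    """Worst-case end-to-end time when only MaximumAttempts limits retries."""
    # Every attempt is assumed to use its whole StartToClose budget.
    total = start_to_close * maximum_attempts
    interval = initial_interval
    # There is one backoff wait between each pair of consecutive attempts.
    for _ in range(maximum_attempts - 1):
        total += interval
        interval = min(interval * backoff_coefficient, maximum_interval)
    return total
```

With StartToClose=60, InitialInterval=1, BackoffCoefficient=2 and MaximumInterval=3600, 10 attempts take up to 1111 seconds (about 18.5 minutes), while 15 attempts take up to 12195 seconds (about 3.4 hours). An ExpirationInterval states the overall budget directly instead.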