Letting Concourse retry a build which failed because of a flaky issue


According to the Concourse documentation:

If any step in the build plan fails, the build will fail and subsequent steps will not be executed

That makes sense. However, I'm wondering how to deal with flaky steps.

For instance if I have a pipeline with

  • a get step with trigger: true
  • and then a task which performs several operations, including an HTTP call to an external service.

If the HTTP call fails because of a temporary network error, then it makes sense that Concourse fails the build. But I would also like a way to tell Concourse that this type of error does not mean the current version is corrupted, and that it should automatically retry the build after some time.

I've looked in the Concourse documentation but couldn't find such a feature. Is it possible?


There are 2 best solutions below

0
On

Using attempts as explained in the other answer can be an option. But, before going that route, I would think more about the possible consequences and alternatives.

Attempts has two potential problems:

  1. it cannot know whether the failure is due to a flake or to a real error. If it is due to a real error, it will keep retrying the task, say, 10 times, potentially consuming compute resources (how much depends on how heavy the task is).
  2. it will work as expected only if the task is as focused as possible and idempotent. For example, if the flaky HTTP request you mention comes after other operations that change the external world, then you must ensure (when designing the task) that redoing those operations after a flake in the HTTP request is safe.

If you know that your task is not subject to these kinds of problems, then attempts can make sense.
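
To illustrate the second point, one possible way to keep the retried part focused is to split the work into two tasks and put attempts only on the one that is safe to repeat. This is just a sketch: the task names and files (prepare, call-service, foo/prepare.yml, foo/call-service.yml) are hypothetical, and the attempt count is arbitrary.

```yaml
plan:
- get: foo
  trigger: true
# Hypothetical task that changes the external world: run once, never retried.
- task: prepare
  file: foo/prepare.yml
# Hypothetical task containing only the flaky HTTP call: safe to repeat.
- task: call-service
  file: foo/call-service.yml
  attempts: 3
```

With this layout a flake in the HTTP call re-runs only call-service. A real error still fails the build, but only after 3 attempts instead of 10.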

On the other hand, this discussion suggests that the pipeline could perhaps be restructured to be more idiomatic for Concourse.

Since you mention an HTTP request, another option is to proxy that HTTP request via a Concourse resource (see https://concourse-ci.org/implementing-resource-types.html). Once done, the side-effect is visible in the pipeline (instead of being hidden in the task) and its success could be made optional with try or another hook modifier (see https://concourse-ci.org/try-step.html and https://concourse-ci.org/modifier-and-hook-steps.html).
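
A rough sketch of what that could look like follows. The resource type http-call, its image repository, the resource name external-service, and all URLs are made up for illustration; a real custom resource would define its own source and params.

```yaml
resource_types:
# Hypothetical custom resource type that wraps the HTTP call.
- name: http-call
  type: registry-image
  source:
    repository: example/http-call-resource

resources:
- name: foo
  type: git
  source:
    uri: https://example.com/foo.git
# Hypothetical resource representing the external service.
- name: external-service
  type: http-call
  source:
    url: https://service.example.com

jobs:
- name: build
  plan:
  - get: foo
    trigger: true
  - task: unit
    file: foo/unit.yml
  # try lets the build continue (and succeed) even if the put fails;
  # attempts retries the put a few times before giving up.
  - try:
      put: external-service
      attempts: 3
```

Note that try swallows the failure rather than rescheduling the build, so it fits only if the HTTP call is genuinely optional for that version.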

The trade-off in this case is the time needed to write your own Concourse resource (in case you don't find a community-provided one). Only you are in a position to make this decision. What I can say is that writing a resource is not that complicated once you are familiar with the concept. For some tricks on quick iterations during development, which apply to any Concourse resource, you can have a look at https://github.com/Pix4D/cogito/blob/master/CONTRIBUTING.md#quick-iterations-during-development.

0
On

Check out the attempts step modifier. Here is the example from the docs:

plan:
- get: foo
- task: unit
  file: foo/unit.yml
  attempts: 10

It will attempt to run the task 10 times before it declares the task failed.