Polly retry not always catching HttpRequestException

5.4k Views Asked by At

My .NET Core 3.1 app uses Polly 7.1.0 retry and bulkhead policies for http resilience. The retry policy uses HandleTransientHttpError() to catch possible HttpRequestException.

Now http requests fired with MyClient sometimes return an HttpRequestException. Around half of them are caught and retried by Polly. The other half however ends up in my try-catch-block and I have to retry them manually. This happens before the maximum number of retries is exhausted.

How did I manage to create a race condition preventing Polly from catching all exceptions? And how can I fix this?

I register the policies with the IHttpClientFactory as follows.

public void ConfigureServices(IServiceCollection services)
{
    services.AddHttpClient<MyClient>(c =>
    {
        c.BaseAddress = new Uri("https://my.base.url.com/");
        c.Timeout = TimeSpan.FromHours(5); // Generous timeout to accomodate for retries
    })
        .AddPolicyHandler(GetHttpResiliencePolicy());
}

private static AsyncPolicyWrap<HttpResponseMessage> GetHttpResiliencePolicy()
{
    var delay = Backoff.DecorrelatedJitterBackoffV2(medianFirstRetryDelay: TimeSpan.FromSeconds(1), retryCount: 5);

    var retryPolicy = HttpPolicyExtensions
            .HandleTransientHttpError() // This should catch HttpRequestException
            .OrResult(msg => msg.StatusCode == HttpStatusCode.NotFound)
            .WaitAndRetryAsync(
                sleepDurations: delay,
                onRetry: (response, delay, retryCount, context) => LogRetry(response, retryCount, context));

    var throttlePolicy = Policy.BulkheadAsync<HttpResponseMessage>(maxParallelization: 50, maxQueuingActions: int.MaxValue);

    return Policy.WrapAsync(retryPolicy, throttlePolicy);
}

The MyClient that is firing the http requests looks as follows.

public async Task<TOut> PostAsync<TOut>(Uri requestUri, string jsonString)
{
    try
    {
        using (var content = new StringContent(jsonString, Encoding.UTF8, "application/json"))
        using (var response = await httpClient.PostAsync(requestUri, content)) // This throws HttpRequestException
        {
            // Handle response
        }
    }
    catch (HttpRequestException ex)
    {
        // This should never be hit, but unfortunately is
    }
}

Here is some additional information, although I'm not sure that it's relevant.

  1. Since the HttpClient is DI-registered transiently, there are 10 instances of it flying around per unit of work.
  2. Per unit of work, the client fires ~400 http requests.
  3. The http requests are lenghty (5 min duration, 30 MB response)
1

There are 1 best solutions below

6
On

Retry and the HttpRequestException

Whenever we are talking about Polly policies then we can distinguish two different exceptions:

  • handled
  • unhandled.

Handled exception

  • It triggers a certain kind of behaviour of the given policy (in this case the HttpRequestException).
  • The handled exception is thrown again if the policy can't succeeded.
  • If there is an other policy then it may or may not handle that exception.

Unhandled exception

  • It does not induce any sort of reaction (for example WebException in our case).
  • The unhandled exception is flowed through the policy.
  • If there is an other policy then it may or may not handle that exception.

"Around half of them are caught and retried by Polly.
The other half however ends up in my try-catch-block"

This can happen if some of your retries run out of attempts. In other words there are some requests which could not succeeded in 6 attempts (5 retry and 1 initial attempt).

This can be easily verified with one of the following two tools:

  • onRetry + context
  • Fallback + context

onRetry + context

The onRetry is called when the retry policy is triggered but before the sleep duration. The delegate receives the retryCount. So to be able to connect / relate separate log entries of the same request you need to use some sort of correlation id. The simplest way to have one can be coded like this:

public static class ContextExtensions
{
    private const string Key = "CorrelationId";

    public static Context SetCorrelation(this Context context, Guid? id = null)
    {
        context[Key] = id ?? Guid.NewGuid();
        return context;
    }

    public static Guid? GetCorrelation(this Context context)
    {
        if (!context.TryGetValue(Key, out var id))
            return null;

        if (id is Guid correlation)
            return correlation;

        return null;
    }
}

Here is a simplified example:
The to be executed method

private async Task<string> Test() 
{ 
    await Task.Delay(1000); 
    throw new CustomException(""); 
}

The policy

var retryPolicy = Policy<string>
    .Handle<CustomException>()
    .WaitAndRetryAsync(5, _ => TimeSpan.FromSeconds(1),
        (result, delay, retryCount, context) =>
        {
            var id = context.GetCorrelation();
            Console.WriteLine($"{id} - #{retryCount} retry.");
        });

The usage

var context = new Context().SetCorrelation();
try
{
    await retryPolicy.ExecuteAsync(async (ctx) => await Test(), context);
}
catch (CustomException)
{
    Console.WriteLine($"{context.GetCorrelation()} - All retry has been failed.");
}

The sample output

3319cf18-5e31-40e0-8faf-1fba0517f80d - #1 retry.
3319cf18-5e31-40e0-8faf-1fba0517f80d - #2 retry.
3319cf18-5e31-40e0-8faf-1fba0517f80d - #3 retry.
3319cf18-5e31-40e0-8faf-1fba0517f80d - #4 retry.
3319cf18-5e31-40e0-8faf-1fba0517f80d - #5 retry.
3319cf18-5e31-40e0-8faf-1fba0517f80d - All retry has been failed.

Fallback

As it was being said whenever the policy can't succeed then it will re-throw the handled exception. In other words if a policy fails then it escalates the problem to the next level (next outer policy).

Here is a simplified example:
The policy

var fallbackPolicy = Policy<string>
    .Handle<CustomException>()
    .FallbackAsync(async (result, ctx, ct) =>
    {
        await Task.FromException<CustomException>(result.Exception);
        return result.Result; //it will never be executed << just to compile
    }, 
    (result, ctx) =>
    {
        Console.WriteLine($"{ctx.GetCorrelation()} - All retry has been failed.");
        return Task.CompletedTask;
    });

The usage

var context = new Context().SetCorrelation();
try
{
    var strategy = Policy.WrapAsync(fallbackPolicy, retryPolicy);  
    await strategy.ExecuteAsync(async (ctx) => await Test(), context);
}
catch (CustomException)
{
    Console.WriteLine($"{context.GetCorrelation()} - All policies failed.");
}

The sample output

169a270e-acf7-45fd-8036-9bd1c034c5d6 - #1 retry.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - #2 retry.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - #3 retry.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - #4 retry.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - #5 retry.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - All retry has been failed.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - All policies failed.