LazyCache concurrent requests return same bad result

I am trying LazyCache in C# and .NET 6.

It is failing one of our tests, so I made a smaller reproduction here.

I expect LazyCache to provide a thread-safe cache so that one failed request does not impact others. Specifically, I have set up a FailingCacher such that the first underlying call fails, and the value from that failed call should not be cached. Subsequent calls should succeed and be cached, and the value should come from the cache thereafter.

When I do 5 calls in sequence, this works as expected.

But when the 5 calls are made concurrently, this fails: the same bad value is returned for 4 or 5 of the calls, which should not happen.

I expected to get the same outcome as in the sequential case. Is it possible for LazyCache to do this? Do I need manual locking as well? But isn't that what LazyCache is supposed to provide?

The code is:

using LazyCache;

public class FailingCacher
{
    private int _callCount = 0;
    private const string CacheTokenKey = "abc123";
    private readonly IAppCache _appCache;

    public FailingCacher(IAppCache appCache)
    {
        _appCache = appCache;
    }

    public int CallCount => _callCount;

    public string GetTheValue()
    {
        try
        {
            return _appCache.GetOrAdd(CacheTokenKey, GetInternal);
        }
        catch (Exception)
        {
            return null;
        }
    }

    private string GetInternal()
    {
        // The first underlying call fails; all subsequent calls succeed.
        var currentCallCount = Interlocked.Increment(ref _callCount);

        if (currentCallCount < 2)
        {
            throw new Exception($"Call {currentCallCount} fails, subsequent calls succeed");
        }

        return $"Success at call {currentCallCount}";
    }

    public void ClearCache()
    {
        _appCache.Remove(CacheTokenKey);
    }
}

The main program is:

using LazyCache;
using System.Collections.Concurrent;

void VerifyResults(List<string> list, string testType)
{
    var allCount = list.Count;
    var failedCount = list.Count(x => x == null);
    var successCount = list.Count(x => x != null);

    var expectedValueCount = list.Count(x => x == "Success at call 2");

    string PassOrFail(bool cond) => cond ? "Pass" : "Fail";

    Console.WriteLine($"{testType} All result count {allCount}: {PassOrFail(allCount == 5)}");
    Console.WriteLine($"{testType} Failed count {failedCount}: {PassOrFail(failedCount == 1)}");
    Console.WriteLine($"{testType} Success count {successCount}: {PassOrFail(successCount == 4)}");
    Console.WriteLine($"{testType} Expected Value count {expectedValueCount}: {PassOrFail(expectedValueCount == 4)}");
}

void SequentialCallsReturnsFirstSuccessfulCachedValue()
{
    var cacher = new FailingCacher(new CachingService());
    cacher.ClearCache();

    var results = new List<string>();

    for (var i = 0; i < 5; i++)
    {
        var value = cacher.GetTheValue();
        results.Add(value);
    }

    Console.WriteLine($"Sequential inner call count: {cacher.CallCount}");
    VerifyResults(results, "Sequential");
}

async Task ConcurrentCallsReturnsFirstSuccessfulCachedValue()
{
    var cacher = new FailingCacher(new CachingService());
    cacher.ClearCache();

    var tasks = new List<Task>();
    var stack = new ConcurrentStack<string>();

    for (var i = 0; i < 5; i++)
    {
        var task = Task.Run(() =>
        {
            var value = cacher.GetTheValue();
            stack.Push(value);
        });
        tasks.Add(task);
    }

    await Task.WhenAll(tasks);

    var results = stack.ToList();

    Console.WriteLine($"Concurrent inner call count: {cacher.CallCount}");
    VerifyResults(results, "Concurrent");
}


SequentialCallsReturnsFirstSuccessfulCachedValue();
await ConcurrentCallsReturnsFirstSuccessfulCachedValue();
Console.WriteLine("Done");

and the .csproj file is:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net6.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="LazyCache" Version="2.4.0" />
  </ItemGroup>
</Project>

The output is:

Sequential inner call count: 2
Sequential All result count 5: Pass
Sequential Failed count 1: Pass
Sequential Success count 4: Pass
Sequential Expected Value count 4: Pass
Concurrent inner call count: 1
Concurrent All result count 5: Pass
Concurrent Failed count 5: Fail
Concurrent Success count 0: Fail
Concurrent Expected Value count 0: Fail
Done

As you can see, the Sequential part works as expected, but the Concurrent part does not. The concurrent output indicates that the failed value from the first thread is being returned to the other threads as well, even though it should not be cached.

There are definitely timing-dependent effects, as some runs show:

Concurrent inner call count: 2
Concurrent All result count 5: Pass
Concurrent Failed count 4: Fail
Concurrent Success count 1: Fail
Concurrent Expected Value count 1: Fail
Done

This is much simplified; the original test was fully async and used HTTP mocks, but the issue is exactly the same.

1 Answer

What should a cache do with multiple threads?

I'll try to expand on the comments above as an answer. If it's not correct, feel free to post a different answer.

Assume that 20 requests come in at the same time on different threads. All of them want a value, either from the cache or from the "underlying request" that populates the cache.

The cache is not populated, so the underlying request must be called, preferably once only. LazyCache offers "Guaranteed single evaluation of your factory delegate": Lazy<T> guarantees that the simultaneous threads block while the underlying request is called once, on one of those threads. It saves us the trouble of coding double-checked locking, which many people get wrong when hand-rolling it.
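
To make that concrete, here is a rough sketch of the Lazy<T>-in-a-cache pattern (my illustration, not LazyCache's actual source; the real library also handles eviction and the exception-removal behaviour described below):

using Microsoft.Extensions.Caching.Memory;

// Rough illustration of the pattern, not LazyCache's real implementation.
public class LazySingleEvaluationCache
{
    private readonly MemoryCache _cache = new(new MemoryCacheOptions());
    private readonly object _gate = new();

    public T GetOrAdd<T>(string key, Func<T> factory)
    {
        Lazy<T> lazy;
        lock (_gate)
        {
            // Only the cheap creation of the Lazy is serialized here.
            lazy = _cache.GetOrCreate(key, _ => new Lazy<T>(factory));
        }

        // The expensive factory runs outside the lock, inside Lazy<T>:
        // concurrent callers share one Lazy, so the factory runs at most
        // once and every waiting thread receives that single result.
        return lazy.Value;
    }
}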

If the underlying request succeeds, no problem: all threads get the success value, and it is cached for future calls. If it fails, we should try again: LazyCache "Stops you inadvertently caching an exception (for future calls) by removing Lazys that evaluate to an exception".

We expect the cache to either serve a value from cache, or call the underlying request to populate the cache for future calls, so it is unexpected if it does neither.

But what about the other currently-waiting threads? Handling them with sequential requests could have different outcomes:

  1. If the first underlying request fails quickly, and the second succeeds quickly, then there is benefit in sequential calls, keeping those threads waiting until there is a success value.

  2. But if the first underlying request fails, the second could fail as well, for the same reason. Failures can be slow. What happens if they are all failing with an HTTP timeout after 30 seconds each? If the underlying requests are executed sequentially, the 20th request to be processed could take 600 seconds (10 minutes) to fail.

The best case of #1 is better, but the worst case of #2 is a very bad outcome in multiple ways: it could lock up a large number of threads at that choke-point; and even if underlying request 20 somehow succeeds, the thread that has waited over 9 minutes is very unlikely to still be viable: the request that spawned it will have been cancelled, message lock durations and maximum function execution times will have been long exceeded, and so on. The problem would be compounded badly, and even diagnosing the root cause would be much harder.

So it is better to fail fast, by returning a result that is neither from the cache nor produced by the current thread: the outcome of the evaluation already in flight.

The behaviour chosen is:

Assuming the first call fails and the second succeeds: at T0 the cache is called and the first underlying request begins. It finishes at T1, and all other cache calls made between T0 and T1 receive that same (failed) result.

There is no one best way to handle all these cases at once, only trade-offs. This behaviour is a simple and reasonable choice.
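
The shared-result semantics are visible with a bare Lazy<T>, independent of LazyCache (a minimal sketch; the sleep just simulates a slow underlying request):

// One Lazy, one evaluation: all concurrent readers share the outcome.
var factoryCalls = 0;
var lazy = new Lazy<string>(() =>
{
    Interlocked.Increment(ref factoryCalls);
    Thread.Sleep(100); // simulate a slow underlying request
    throw new Exception("the single evaluation fails");
});

var readers = Enumerable.Range(0, 5)
    .Select(_ => Task.Run(() =>
    {
        try { return lazy.Value; }
        catch { return null; } // every reader observes the same failure
    }))
    .ToArray();

Task.WaitAll(readers);
Console.WriteLine($"Factory calls: {factoryCalls}"); // prints 1, not 5

All five readers get null from the one failed evaluation, which is exactly the concurrent output in the question.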

You can put retry logic in the "underlying request" method if you want to modify this behaviour. Lazy and LazyCache do not cap that method's execution time, but consider the context in which the call is made; there may be some limit, e.g. in a web app, how long do you have at most to return a response? Retries can also increase the blast radius of complete failure: if each underlying request takes 1 second, three attempts take 3 seconds, and if all of them fail, then every cache call from T0 to T3 waits until T3 and then receives that same failed result.
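
For example, a minimal retry sketch inside the factory delegate (GetInternalWithRetry, the attempt count, and the backoff are all illustrative, not part of the question's code):

// Illustrative retry wrapper around the underlying request; pass this
// to GetOrAdd instead of GetInternal. Attempt count and backoff are
// arbitrary choices for the sketch.
private string GetInternalWithRetry()
{
    const int maxAttempts = 3;
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return GetInternal(); // the original underlying request
        }
        catch when (attempt < maxAttempts)
        {
            // Back off and retry; the final attempt's exception
            // propagates, and LazyCache will evict the failed Lazy.
            Thread.Sleep(TimeSpan.FromMilliseconds(200 * attempt));
        }
    }
}

Note that this widens the window during which all concurrent callers wait on, and then share, the one in-flight result.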