Making I/O bound operation asynchronous

399 Views Asked by At

I am developing a web application that will read Excel files and make some validation.

I use ExcelDataReader library, which lacks async methods.

Read() method takes care of reading the next rows in a sheet, so it is mostly I/O bound (it just advances the stream to a certain position, so data from the row is available).

To increase my throughput, I wrapped it in Task.Run, like:

await Task.Run(() => excelDataReader.Read());

From benchmarks I can see it executes faster (or at least no longer) than regular synchronous call excelDataReader.Read()

From posts I read I see Task.Run is recommended only in UI scenarios, when we need to unblock the UI thread.

So, the question is does it make sense to wrap I/O bound operations in Task.Run ?

Or am I missing something?

EDIT:

benchmark results

enter image description here

So the second question would be - if my approach is just wrong, why it performs better?

ANOTHER EDIT:

Below is benchmark code that showed that asynchronous version is much faster:

using BenchmarkDotNet.Attributes;
using ExcelDataReader;
using System.Text;

namespace ExcelDataReaderTests;

public class ExcelDataReaderTester
{
    private const string DirectoryWithManyBigExcelFiles = @"Directory With Many Big Excel Files";

    [Benchmark]
    public async Task TestAsync()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var directory = Directory.GetFiles(DirectoryWithManyBigExcelFiles);

        var tasks = new List<Task>();
        foreach (var file in directory)
        {
            tasks.Add(UseExcelDataReadAsync(file));
        }
        await Task.WhenAll(tasks);
    }

    [Benchmark]
    public void Test()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        var directory = Directory.GetFiles(DirectoryWithManyBigExcelFiles);

        foreach (var file in directory)
        {
            UseExcelDataRead(file);
        }
    }

    private static void UseExcelDataRead(string filePath)
    {
        var fileStream = File.OpenRead(filePath);
        var excelDataReader = ExcelReaderFactory.CreateReader(fileStream);

        while (excelDataReader.Read())
        {
            var x = excelDataReader.GetFieldType(0);
        }
    }

    private static async Task UseExcelDataReadAsync(string filePath)
    {
        var fileStream = File.OpenRead(filePath);
        var excelDataReader = ExcelReaderFactory.CreateReader(fileStream);

        while (await Task.Run(() => excelDataReader.Read()))
        {
            var x = excelDataReader.GetFieldType(0);
        }
    }
}

And the benchmark are run with:

BenchmarkRunner.Run(typeof(ExcelDataReaderTester).Assembly);

I am using BechmarkDotNet for benchmarking.

2

There are 2 best solutions below

4
On

ExcelDataReader loads all data in memory so it makes no sense to provide asynchronous methods. The xlsx format is a ZIP package containing XML files which can't be read line by line. There's no IO involved in ExcelDataReader.Read.

The real IO is performed by CreateReader. Unfortunately there's no CreateReaderAsync and no plan to add this yet

In general, pushing an IO operation to another thread won't make it faster.

In a desktop application, if you want to process the contents of an Excel file without blocking the UI thread, put the entire processing code into a method and call it with Task.Run. For more complex processing you can use eg DataFlow or Channels to execute multiple processing steps in the background.

In a web application though, each request is server by a different ThreadPool thread. There's no UI thread to block. await Task.Run(...) simply moves execution from one ThreadPool thread to another.

4
On

You technically do multithreaded programming by wrapping the synchronous code with Task.Run.

await Task.Run(() => excelDataReader.Read());

But you won't do asynchronous programming with this. It may look like you do, but you actually don't because what's inside is not an I/O operation.

The excelDataReader.Read() will be simply picked up by a threadpool thread, and the thread will synchronously run the action, which means it is going to be blocked. If you receive many requests enough to saturate all your threadpool threads, what you're going to suffer is Thread Starvation. Your server won't be responsive in the end.

And I want to add two comments on this statement. From posts I read I see Task.Run is recommended only in UI scenarios, when we need to unblock the UI thread.

  1. The general purpose of Task.Run is not for unblocking the UI thread. It is recommended for CPU Bound job or something you can fire&forget.
  2. Where I/O job involved, you can avoid blocking the UI thread by introducing async/await.

You need to update your code that outputs the benchmark result. Async alone doesn't increase the performance, instead it increases thread availability.