I'm trying to download approximately 45,000 image files from an API. Each image file is less than 50 KB. With my current code this takes 2-3 hours.
Is there a more efficient way to download them in C#?
private static readonly string baseUrl =
    "http://url.com/Handlers/Image.ashx?imageid={0}&type=image";

internal static void DownloadAllMissingPictures(List<ListObject> ImagesToDownload,
    string imageFolderPath)
{
    Parallel.ForEach(Partitioner.Create(0, ImagesToDownload.Count), range =>
    {
        for (var i = range.Item1; i < range.Item2; i++)
        {
            string ImageID = ImagesToDownload[i].ImageId;
            using (var webClient = new WebClient())
            {
                string url = String.Format(baseUrl, ImageID);
                string file = String.Format(@"{0}\{1}.jpg", imageFolderPath,
                    ImagesToDownload[i].ImageId);
                byte[] data = webClient.DownloadData(url);
                using (MemoryStream mem = new MemoryStream(data))
                {
                    using (var image = Image.FromStream(mem))
                    {
                        image.Save(file, ImageFormat.Jpeg);
                    }
                }
            }
        }
    });
}
Likely not.
One thing to think about, though: stop using WebClient. It was replaced by HttpClient a long time ago; you just missed the memo. I suggest a quick run through the documentation.
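For reference, here is a minimal sketch of what an HttpClient version could look like. It reuses one shared client, caps the number of in-flight requests with a SemaphoreSlim (the limit of 32 and the method name DownloadAllMissingPicturesAsync are my own picks, not anything from your code), and assumes the handler already serves JPEG bytes so the System.Drawing re-encode can be dropped:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

internal static class ImageDownloader
{
    private static readonly string baseUrl =
        "http://url.com/Handlers/Image.ashx?imageid={0}&type=image";

    // One shared HttpClient for the whole run; a new instance per request wastes sockets.
    private static readonly HttpClient httpClient = new HttpClient();

    internal static async Task DownloadAllMissingPicturesAsync(
        List<ListObject> imagesToDownload, string imageFolderPath)
    {
        // Cap the number of requests in flight instead of starting all 45,000 at once.
        using (var throttle = new SemaphoreSlim(32))
        {
            var tasks = imagesToDownload.Select(async item =>
            {
                await throttle.WaitAsync();
                try
                {
                    string url = string.Format(baseUrl, item.ImageId);
                    string file = Path.Combine(imageFolderPath, item.ImageId + ".jpg");

                    // Assumption: the handler already returns JPEG bytes, so they can be
                    // written straight to disk without decoding and re-encoding the image.
                    byte[] data = await httpClient.GetByteArrayAsync(url);
                    File.WriteAllBytes(file, data);
                }
                finally
                {
                    throttle.Release();
                }
            });

            await Task.WhenAll(tasks);
        }
    }
}

Dropping the Image.FromStream/Image.Save round trip alone saves a decode and re-encode per file; if the handler can return something other than JPEG, keep that step.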
Regardless of what you think Parallel.ForEach does for you, you are limited by the parallel connection settings (ServicePointManager, HttpClientHandler).
You should read the documentation for those and experiment with higher limits, because right now they are quite likely capping your parallelism at a low number, and the server can probably handle 3-4 times that limit.
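As a rough sketch of what raising those limits looks like (32 is an arbitrary starting value; measure what the server actually tolerates):

using System.Net;
using System.Net.Http;

// .NET Framework: WebClient/HttpWebRequest default to 2 connections per host,
// which throttles any amount of Parallel.ForEach layered on top of it.
ServicePointManager.DefaultConnectionLimit = 32;

// HttpClient on .NET Core / .NET 5+: the limit lives on the handler instead.
var handler = new HttpClientHandler { MaxConnectionsPerServer = 32 };
var client = new HttpClient(handler);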
The question "Maximum concurrent requests for WebClient, HttpWebRequest, and HttpClient" has a deeper explanation.