Process ZipArchive entries in parallel


I need to process ZipArchive entries as strings. Currently I have code like this:

using (ZipArchive archive = ZipFile.OpenRead(zipFileName))
{
    foreach (ZipArchiveEntry entry in archive.Entries)
    {
        using (StreamReader sr = new StreamReader(entry.Open()))
        {
            string s = sr.ReadToEnd();
            // doing something with s
        }
    }
}

The processing could be much faster if it were spread across several CPU cores with Parallel.ForEach or a similar loop. The problem is that ZipArchive is not thread-safe.

Perhaps we could use the Partitioner class to split ZipArchive.Entries into index ranges and feed them to a Parallel.ForEach loop. To stay thread-safe, the loop body would open the zip file again with a new ZipArchive instance and read its range of entries from that instance. I have no good idea how to implement this, though. Is it possible?

If not, is there another reliable way to process zip archive entries in parallel if we just need to read them?

TecMan On BEST ANSWER

If my assumption is right, the multi-threaded version of my code processing a ZipArchive should look like this:

using (ZipArchive archive = ZipFile.OpenRead(zipFileName))
{
    // Split the entry indices into contiguous ranges, one range per work item.
    var ranges = Partitioner.Create(0, archive.Entries.Count);

    Parallel.ForEach(ranges, range =>
    {
        // Each work item opens its own ZipArchive, so no instance is shared between threads.
        using (ZipArchive archive2 = ZipFile.OpenRead(zipFileName))
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                ZipArchiveEntry entry = archive2.Entries[i];
                using (StreamReader sr = new StreamReader(entry.Open()))
                {
                    string s = sr.ReadToEnd();
                    // doing something with s
                }
            }
        }
    });
}

P.S. For general information: in my time measurements, this version ran 25%-40% slower than the original single-threaded code, so it is an open question whether a zip archive should be processed from multiple threads at all. Measure the multi-threaded code against your own archives to be sure that this approach actually improves performance.
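One way to make such a comparison is a small Stopwatch helper like the sketch below. The Timing class and its Best method are hypothetical names introduced here for illustration; they are not part of the .NET API. It runs the code several times and keeps the fastest run, which reduces the effect of a cold file cache on the first pass.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Hypothetical helper: runs 'action' the given number of times
    // and returns the shortest elapsed time observed.
    public static TimeSpan Best(Action action, int runs = 3)
    {
        TimeSpan best = TimeSpan.MaxValue;
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            if (sw.Elapsed < best)
            {
                best = sw.Elapsed;
            }
        }
        return best;
    }
}
```

Wrapping the single-threaded loop and the Parallel.ForEach version in two methods (for example, hypothetical ProcessSequential and ProcessParallel) and passing each to Timing.Best as a lambda gives two numbers that can be compared for the same archive.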

Mark Adler On

Just have each thread create its own ZipArchive and ZipArchiveEntry objects. It takes very little time to step through the central directory, so give each thread its own index n for the entry it should process; the thread then steps past n entries to reach its entry (in .NET, archive.Entries[n] on the thread's own archive). There should be no problem having multiple ZipArchive objects reading the same zip file.
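A minimal sketch of this idea, assuming the Parallel.For overload with thread-local state is acceptable here: each worker opens its own ZipArchive once, reuses it for every index n it receives, and disposes it when the worker finishes. The ParallelZipReader class and ReadAll method are hypothetical names, and storing the text length stands in for the real processing of s.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

public static class ParallelZipReader
{
    // Returns a map from entry name to the length of its decoded text.
    // Entries with duplicate names overwrite each other in this sketch.
    public static ConcurrentDictionary<string, int> ReadAll(string zipFileName)
    {
        int count;
        using (ZipArchive archive = ZipFile.OpenRead(zipFileName))
        {
            // Only the entry count is taken from this instance.
            count = archive.Entries.Count;
        }

        var results = new ConcurrentDictionary<string, int>();

        Parallel.For(0, count,
            // localInit: every worker opens its own archive instance.
            () => ZipFile.OpenRead(zipFileName),
            (n, state, localArchive) =>
            {
                ZipArchiveEntry entry = localArchive.Entries[n];
                using (var sr = new StreamReader(entry.Open()))
                {
                    string s = sr.ReadToEnd();
                    results[entry.FullName] = s.Length; // placeholder for real work on s
                }
                return localArchive; // hand the same instance to the next iteration
            },
            // localFinally: dispose the worker's archive when it is done.
            localArchive => localArchive.Dispose());

        return results;
    }
}
```

Compared with opening the archive once per range, this opens it once per worker, so the central directory is read fewer times. Whether that is enough to beat the single-threaded loop still depends on the archive and the storage device.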