Why does using ProcessPoolExecutor only yield a 2x speed-up on an 8-core machine?

I'm trying to use concurrent.futures.ProcessPoolExecutor to speed up CPU-bound work in Python, but I'm seeing less than a 2x speed-up on a machine with 8 logical cores (4 physical cores with Hyper-Threading), and I'm struggling to understand why.

I have the following script:

import asyncio
from concurrent.futures import ProcessPoolExecutor
from time import sleep

CALCULATION_COUNT = 1_000_000

def calculate(size: int):
    # Meaningless work to keep the CPU occupied. The multiply and divide
    # cancel out, so n stays small and no large integers are allocated.
    n = 2
    for i in range(1, size):
        n *= (i % 4) + 1
        n /= (i % 4) + 1

    return 2


# Sequential version

def run_calculations():
    return calculate(CALCULATION_COUNT)


# Parallel version

def _chunks(chunk_size):
    # Yield one sentinel per chunk; with chunk_size = CALCULATION_COUNT // 8
    # this produces exactly 8 chunks, one per logical core.
    for _ in range(0, CALCULATION_COUNT, chunk_size):
        yield None

async def run_calculations_mp():
    loop = asyncio.get_running_loop()
    tasks = []
    chunk_size = CALCULATION_COUNT // 8
    with ProcessPoolExecutor() as executor:
        for _ in _chunks(chunk_size):
            tasks.append(loop.run_in_executor(executor, calculate, chunk_size))

    # Leaving the with-block calls executor.shutdown(wait=True), so all the
    # work has already completed by the time the results are gathered.
    await asyncio.gather(*tasks)

and the following code for benchmarking the two functions with the pytest-benchmark plugin:

import asyncio

from my_module import *

def test_benchmark_sequential(benchmark):
    benchmark(run_calculations)

def test_benchmark_parallel(benchmark):
    def run_sync():
        asyncio.run(run_calculations_mp())
    benchmark(run_sync)

I run my benchmark like this:

$ pytest \
    --benchmark-max-time=20 \
    --benchmark-warmup=on \
    --benchmark-warmup-iterations=20

Expectation

I'd expect the run_calculations_mp function to finish significantly faster than the run_calculations function. I'm running on an Intel Core i7-8565U with 4 physical cores and Hyper-Threading, for a total of 8 logical cores. I would expect the parallelized version to have a theoretical upper bound on its speed-up of 8x, and somewhat less when factoring in the overhead of spinning up the worker processes.
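
As a sanity check on those numbers, the core counts can be confirmed like this (a quick sketch; psutil is a third-party package, installed with pip install psutil):

import os

import psutil  # third-party: pip install psutil

print("logical cores: ", os.cpu_count())                   # expect 8
print("physical cores:", psutil.cpu_count(logical=False))  # expect 4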

Reality

When benchmarking the two functions against one another using pytest-benchmark, I get a 1.77x speed-up.

Questions

I'm confused as to why I'm only seeing a speed-up of less than 2x in this completely contrived example. I'm not doing any I/O, and I'm not allocating much memory. As far as I can tell from the program, I'm also not serializing and shipping much data between the worker processes, which might otherwise have been a bottleneck. Any pointers as to why the speed-up isn't significantly bigger would be greatly appreciated. Thank you!
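
In case it helps with reproducing this outside of pytest, here is a minimal standalone timing harness (a sketch; it reuses calculate and CALCULATION_COUNT from the script above, assumed to live in my_module, and times the same total work at different worker counts):

import time
from concurrent.futures import ProcessPoolExecutor

from my_module import CALCULATION_COUNT, calculate

def timed_run(workers: int) -> float:
    # Split the same total amount of work evenly across `workers` processes
    # and return the wall-clock time taken.
    chunk_size = CALCULATION_COUNT // workers
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        # list() forces the map to run to completion before stopping the clock
        list(executor.map(calculate, [chunk_size] * workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        print(f"{workers} worker(s): {timed_run(workers):.3f}s")

If the scaling flattens out around 4 workers, that would suggest the limit is the physical cores rather than the logical ones.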

Things I've tried

  • Using multiprocessing.Pool instead of asyncio and ProcessPoolExecutor (roughly the sketch below), but I still only get a speed-up of about 1.72x.
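
A sketch of what such a Pool-based variant can look like (run_calculations_pool is illustrative, not the exact code that was benchmarked):

from multiprocessing import Pool

from my_module import CALCULATION_COUNT, calculate

def run_calculations_pool(workers: int = 8):
    # Same total work as run_calculations_mp, split evenly across workers.
    chunk_size = CALCULATION_COUNT // workers
    with Pool(processes=workers) as pool:
        pool.map(calculate, [chunk_size] * workers)

if __name__ == "__main__":
    run_calculations_pool()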

Here's a link to the results of my benchmark.
