I'm trying to leverage concurrent.futures.ProcessPoolExecutor to speed up CPU-bound work in Python, but I'm seeing less than a 2x improvement in execution time on a machine with 8 logical cores (4 physical cores with HyperThreading), and I'm struggling to understand why.
I have the following script:
import asyncio
from concurrent.futures import ProcessPoolExecutor

CALCULATION_COUNT = 1000000


def calculate(size: int):
    # Meaningless work to keep the CPU occupied
    n = 2
    for i in range(1, size):
        n *= (i % 4) + 1
        n /= (i % 4) + 1
    return 2


# Sequential version
def run_calculations():
    return calculate(CALCULATION_COUNT)


# Parallel version
def _chunks(chunk_size):
    for _ in range(0, CALCULATION_COUNT, chunk_size):
        yield None


async def run_calculations_mp():
    loop = asyncio.get_running_loop()
    tasks = []
    chunk_size = CALCULATION_COUNT // 8
    with ProcessPoolExecutor() as executor:
        for _ in _chunks(chunk_size):
            tasks.append(loop.run_in_executor(executor, calculate, chunk_size))
        await asyncio.gather(*tasks)
and the following code for benchmarking those functions with the pytest plugin pytest-benchmark:
import asyncio

from my_module import *


def test_benchmark_sequential(benchmark):
    benchmark(run_calculations)


def test_benchmark_parallel(benchmark):
    def run_sync():
        asyncio.run(run_calculations_mp())

    benchmark(run_sync)
I run my benchmark like this:
$ pytest \
--benchmark-max-time=20 \
--benchmark-warmup=on \
--benchmark-warmup-iterations=20
Expectation
I'd expect the run_calculations_mp function to finish significantly faster than the run_calculations function. I'm running on an Intel Core i7-8565U with 4 physical cores and HyperThreading, for a total of 8 logical cores. I would expect the parallelized version to have a theoretical upper bound on its speed-up of 8x, and somewhat less when factoring in the overhead of spinning up subprocesses.
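For reference, here is a quick check of the core counts and of how many workers ProcessPoolExecutor starts by default (a small sketch; psutil is a third-party package used only to report the physical core count, and _max_workers is a private attribute read purely for inspection):

import os
from concurrent.futures import ProcessPoolExecutor

import psutil  # third-party; only used to show the physical core count

print("logical cores (os.cpu_count()):", os.cpu_count())            # 8 on this machine
print("physical cores (psutil):", psutil.cpu_count(logical=False))  # 4 on this machine

# As far as I understand, ProcessPoolExecutor sizes its pool to the logical
# core count by default, so this should report 8 workers.
with ProcessPoolExecutor() as executor:
    print("pool workers:", executor._max_workers)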
Reality
When benchmarking the two functions against one another using pytest-benchmark, I get a 1.77x speed-up.
Questions
I'm confused as to why I'm only seeing a speed-up of less than 2x in this completely contrived example. I'm not doing any I/O, and I'm not allocating much memory. Looking at the program, I don't believe I'm serializing and sending much data back and forth between the worker processes, which might otherwise have been a bottleneck. Any pointers as to why the speed-up isn't significantly bigger would be greatly appreciated. Thank you!
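Not part of the script above, but here is a minimal diagnostic sketch of how I would try to separate pool and serialization overhead from the workers themselves slowing down under load. It reuses calculate and CALCULATION_COUNT from my_module and times each chunk inside the worker process, so pool start-up and result transfer are excluded:

import time
from concurrent.futures import ProcessPoolExecutor

from my_module import CALCULATION_COUNT, calculate


def timed_calculate(size: int) -> float:
    # Time the chunk inside the worker process itself.
    start = time.perf_counter()
    calculate(size)
    return time.perf_counter() - start


if __name__ == "__main__":
    chunk_size = CALCULATION_COUNT // 8

    # Baseline: one chunk with no pool involved.
    solo = timed_calculate(chunk_size)

    # The same chunk size, with 8 chunks running concurrently.
    with ProcessPoolExecutor() as executor:
        concurrent = list(executor.map(timed_calculate, [chunk_size] * 8))

    print(f"one chunk alone:          {solo:.3f}s")
    print(f"slowest concurrent chunk: {max(concurrent):.3f}s")

If the concurrent chunks take noticeably longer each than the solo chunk, that would point at the cores themselves (HyperThreading contention, turbo clocks) rather than at serialization or pool overhead.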
Things I've tried
- Using multiprocessing.Pool instead of asyncio and ProcessPoolExecutor, but I still only get a speed-up of about 1.72x (a rough sketch of that variant is below).
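For completeness, that variant was essentially the following (a sketch; the exact code may have differed slightly, but it ran the same 8 chunks through multiprocessing.Pool):

import multiprocessing

from my_module import CALCULATION_COUNT, calculate


def run_calculations_pool():
    chunk_size = CALCULATION_COUNT // 8
    with multiprocessing.Pool() as pool:
        # Same 8 chunks as run_calculations_mp, just mapped over a Pool.
        pool.map(calculate, [chunk_size] * 8)


if __name__ == "__main__":
    run_calculations_pool()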
Here's a link to the results of my benchmark.