import multiprocessing as mp
import numpy as np

pool = mp.Pool(processes=4)
inp = np.linspace(0.01, 1.99, 100)

result = pool.map_async(func, inp)   # Line1 (func is some Python function which acts on input)
output = result.get()                # Line2
So, I was trying to parallelize some code in Python, using a .map_async() method on a multiprocessing.Pool() instance. I noticed that while Line1 takes around a thousandth of a second, Line2 takes about 0.3 seconds. Is there a better way to do this, or a way to get around the bottleneck caused by Line2, or am I doing something wrong here? (I am rather new to this.)
Do not panic, many users do the very same - paid more than received.

This is a common lesson, not on using some "promising" syntax-constructor, but on paying the actual costs of using it.

The story is long, the effect was straightforward - you expected a low-hanging fruit, but had to pay the immense costs of process-instantiation, of work-package re-distribution and of collection of results, all that circus just for doing but a few rounds of func()-calls. Wow?
Stop!

Parallelisation was brought to me, so that it will SPEEDUP processing?!?

Well, who told you that any such (potential) speedup is for free?

Let's be quantitative and rather measure the actual code-execution times, instead of emotions, right? Benchmarking is always a fair move. It helps us, mortals, to escape from mere expectations and get ourselves into quantitative, records-of-evidence supported knowledge:
AS-IS test:

Before moving forwards, one ought to record this pair. It will set the span of the performance envelopes, from a pure-[SERIAL] [SEQ]-of-calls to an un-optimised joblib.Parallel(), or any other tool, if one wishes to extend the experiment, like the said multiprocessing.Pool() or other.

Test-case A:

Intent: so as to measure the cost of a { process | job }-instantiation, we need a NOP-work-package payload that will spend almost nothing "there", will just return "back", and will not require paying any additional add-on costs (be it for any input parameters' transmission or for returning any value).

So, the setup-overhead add-on costs comparison is here:
Using a strategy of joblib.delayed() on joblib.Parallel() task-processing:

Using a strategy of a lightweight .map_async() method on a multiprocessing.Pool() instance:

So, the first set of pain and surprises comes straight at the actual cost-of-doing-NOTHING in a concurrent pool of joblib.Parallel():

So, this scientifically fair and rigorous test has started from this simplest ever case, already showing the benchmarked costs of all the associated code-execution setup-overheads - the smallest ever joblib.Parallel() penalty sine-qua-non.

This forwards us into the direction where the real-world algorithms do live - best with next adding some larger and larger "payload"-sizes into the testing loop.
Now we know the penalty for going into a "just"-[CONCURRENT] code-execution - and next?

Using this systematic and lightweight approach, we may go forwards in the story, as we will also need to benchmark the add-on costs and other indirect, Amdahl's-Law effects of { remote-job-PAR-XFER(s) | remote-job-MEM.alloc(s) | remote-job-CPU-bound-processing | remote-job-fileIO(s) }.

A function template like this may help in re-testing (as you will see, there will be a lot to re-run, while the O/S noise and some additional artifacts will step into the actual cost-of-use patterns):
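Such a re-testing template might look like this - a hedged sketch (the name timed and the repeat parameter are illustrative, not from the original harness), reporting the best of several runs so as to suppress the O/S noise just mentioned:

```python
import time

def timed(func, *args, repeat=5, **kwargs):
    """Re-run func(*args, **kwargs) several times and return the best
    observed wall-clock time - the minimum is the measurement least
    polluted by O/S noise and other transient artifacts."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    return best
```

A call like timed(sum, range(10**6)) may then be re-run for growing payload sizes, recording one cost-of-use pattern per size.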
Test-case B:

Once we have paid the up-front costs, the next most common mistake is to forget the costs of memory allocations. So, let's test those too:

In case your platform stops being able to allocate the requested memory-blocks, we head-bang into another kind of problem (with a class of hidden glass-ceilings, if trying to go-parallel in a physical-resources agnostic manner). One may edit the SIZE1D scaling, so as to at least fit into the platform RAM addressing / sizing capabilities; yet, the performance envelopes of the real-world problem computing are still of great interest here.

Such a test may yield a cost-to-pay being anything between 0.1 [s] and +9 [s] (!!) just for doing STILL NOTHING, but now also without forgetting about some realistic MEM-allocation add-on costs "there".
Test-case C:
kindly read the tail sections of this post
Test-case D:
kindly read the tail sections of this post
Epilogue:

For each and every "promise", the fairest next step is to first cross-validate the actual code-execution costs, before starting any code re-engineering. The sum of the real-world platform's add-on costs may devastate any expected speedup, even where the original, overhead-naive Amdahl's Law might have promised some speedup-effects.

As Mr. W. Edwards Deming expressed many times, without DATA we leave ourselves with just OPINIONS.
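The difference between the overhead-naive and an overhead-aware reading of Amdahl's Law can be sketched numerically - a minimal illustration (the function names and the way the overheads are folded in are my simplification, not a canonical formulation):

```python
def speedup_naive(p, n):
    """Overhead-naive Amdahl's Law: p is the parallelisable fraction
    of the work, n the number of processing units."""
    return 1.0 / ((1.0 - p) + p / n)

def speedup_overhead_aware(p, n, t_total, setup, collect):
    """The same, but paying the add-on setup and result-collection
    costs (in the same time units as t_total) on top."""
    return t_total / ((1.0 - p) * t_total + p * t_total / n + setup + collect)
```

With p = 0.95 and n = 4, the naive law promises a ~3.5x speedup; adding a mere 0.4 units of overhead to a 1.0-unit job cuts it to ~1.5x - the very devastation described above.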
A bonus part:

Having read as far as here, one might have already found that there is not any kind of "drawback" or "error" in the #Line2 per se; rather, a careful design practice will show any better syntax-constructor that spends less to achieve more (as the actual resources (CPU, MEM, IOs, O/S) permit on the code-execution platform). Anything else is not principally different from just blindly telling fortunes.