Why does joblib parallel execution make runtime much slower?

I want to shuffle values in a 3D numpy-array, but only when they are > 0.

When I run my function with a single core, it is much faster than with even 2 cores; the slowdown is far beyond the overhead of creating new Python processes. What am I missing?

The following code outputs:

random shuffling of markers started
time in serial execution:                          1.0288s
time executing in parallel with num_cores=1:       0.9056s
time executing in parallel with num_cores=2:     273.5253s

import numpy as np
import time
from random import shuffle
from joblib import Parallel, delayed
import multiprocessing

def randomizeVoxels(V,markerLUT):
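    # remap each positive marker value in this 2-D slice through the lookup table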
    V_rand=V.copy()
    # note: ix/iy here index the 2-D slice locally; which volume axes they map to depends on how the array was sliced/permuted
    for ix in range(V.shape[0]):
        for iy in range(V.shape[1]):
            if V[ix,iy]>0:
                V_rand[ix,iy]=markerLUT[V[ix,iy]]

    return V_rand

V_ori=np.arange(1000000,-1000000,-1).reshape(100,100,200)

V_rand=V_ori.copy()

listMarkers=np.unique(V_ori)
listMarkers=[val for val in listMarkers if val>0]

print("random shuffling of markers started\n")

reassignedMarkers=listMarkers.copy()
#random shuffling of original markers
shuffle(reassignedMarkers)

markerLUT={}
for i,iMark in enumerate(listMarkers):
    markerLUT[iMark]=reassignedMarkers[i]

tic=time.perf_counter()

for ix in range(len(V_ori)):
    for iy in range(len(V_ori[0])):
        for iz in range(len(V_ori[0][0])):
            if V_ori[ix,iy,iz]>0:
                V_rand[ix,iy,iz]=markerLUT[V_ori[ix,iy,iz]]

toc=time.perf_counter()

print("time in serial execution: \t\t\t{: >4.4f} s".format(toc-tic))

#######################################################################

num_cores = 1

V_rand=V_ori.copy()

tic=time.perf_counter()
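# note: one task is dispatched per outer slice of V_ori; with n_jobs >= 2 the 2-D slice and the
# markerLUT dict are serialized and sent to a worker process for every task (n_jobs=1 runs in-process)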

results= Parallel(n_jobs=num_cores)\
    (delayed(randomizeVoxels)\
        (V_ori[imSlice,:,:],
        markerLUT
        )for imSlice in range(V_ori.shape[0]))

for i,resTuple in enumerate(results):
    V_rand[i,:,:]=resTuple

toc=time.perf_counter() 

print("time executing in parallel with num_cores={}:\t{: >4.4f} s".format(num_cores,toc-tic))    

num_cores = 2

V_rand=V_ori.copy()

MASK = "time executing in parallel with num_cores={}:\t {: >4.4f}s"

tic=time.perf_counter() #----------------------------- [PERF-me]

results= Parallel(n_jobs=num_cores)\
    (delayed(randomizeVoxels)\
        (V_ori[imSlice,:,:],
        markerLUT
        )for imSlice in range(V_ori.shape[0]))

for i,resTuple in enumerate(results):
    V_rand[i,:,:]=resTuple

toc=time.perf_counter() #----------------------------- [PERF-me]

print( MASK.format(num_cores,toc-tic) )

2 Answers

Q : "What am I missing?"

Most probably the memory-I/O bottlenecks.

While the numpy part of the processing is pretty shallow here (the shuffle does not compute anything; it just moves data between pairs of memory locations), there is not enough useful work per element to let the memory-I/O be masked by re-ordered CPU-core instructions. On contemporary super-scalar CISC architectures, with highly speculative branch prediction (of no use in memory-I/O-bound, non-branching sections) and multi-core / many-core NUMA designs, both local and cross-QPI memory operations carry latency costs that this kind of code cannot hide.

This is most probably why even the first spun-off concurrent process does not pay off. Whether it is forced to camp on the same CPU core (a pair of memory-I/O-bound processes then share one core's time in an interleaved two-step dance, with even worse chances of masking latency on the shared memory-I/O channels) or lands on any other core (adding cross-QPI latency for any non-local memory-I/O, again worsening the chances of memory-I/O latency masking), it competes for the very same memory bandwidth.

CPU-core hopping, enforced by the CPU clock-boost policy colliding with thermal management (the process is moved to camp on the next, colder CPU core), invalidates all the CPU-core cache benefits: the data already pre-cached in the fastest L1d cache of the previous core is not available on the new one and has to be re-fetched (for array objects with larger memory footprints, possibly even via a cross-QPI fetch). Harnessing more cores therefore has no trivial, positive effect on the resulting efficiency.

;o)
The numpy high-performance, smart processing is not the one to be blamed here; quite the opposite, it clearly unmasks the CPU "starvation" state, known for ages as the very performance ceiling of all modern CPUs. That is why we see ever more many-core CPUs trying to circumvent this bottleneck by adding more and more cores.

Last but not least, the code as-is contains plenty of opportunities to improve its performance, a numpy-smart vectorised rewrite that avoids the range() loops being the first one to name (a minimal sketch follows below). More tips could follow, but all of them will eventually run into the very same trouble: the CPU-starvation ceiling.
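
As an illustration only (not part of the original answer), a vectorised remap of the positive voxels could look roughly like this; it reuses the V_ori and markerLUT objects defined in the question, the lut and positive names are purely illustrative, and the markers are assumed to be non-negative integers as in the example above:

import numpy as np

# array-based lookup table: index = original marker, value = shuffled marker
lut = np.arange(V_ori.max() + 1)
for orig_marker, new_marker in markerLUT.items():
    lut[orig_marker] = new_marker

# remap all positive voxels in one vectorised pass; zeros and negatives stay untouched
positive = V_ori > 0
V_rand = V_ori.copy()
V_rand[positive] = lut[V_ori[positive]]

On an array of this size such a remap typically runs orders of magnitude faster than the element-wise loops, with no multiprocessing needed at all.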


The reason using multiple processes can be slower than a single process is that multiprocessing requires the arguments to be serialized before they are sent to the worker processes. This introduces additional overhead that scales with the amount of data to be serialized; in this example, the markerLUT dict holds roughly a million entries and is serialized again for every one of the 100 dispatched slices.

The Python multiprocessing library uses pickle for this, while joblib's default loky backend ships its own serialization layer. The documentation can be found here: https://joblib.readthedocs.io/en/latest/parallel.html#serialization-processes

For more context about why this serialization is needed at all, see the joblib documentation page linked above.
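
As an illustration (not from the original answer), one way to shrink this overhead is to dispatch fewer, larger tasks, so that the lookup table is serialized only once per worker rather than once per slice. The sketch below reuses V_ori and markerLUT from the question and introduces a hypothetical randomize_chunk helper:

import numpy as np
from joblib import Parallel, delayed

def randomize_chunk(chunk, markerLUT):
    # remap every positive voxel in this 3-D chunk through the lookup table
    out = chunk.copy()
    for idx in np.ndindex(chunk.shape):
        if chunk[idx] > 0:
            out[idx] = markerLUT[chunk[idx]]
    return out

num_cores = 2

# one chunk per worker: markerLUT is serialized num_cores times instead of V_ori.shape[0] times
chunks = np.array_split(V_ori, num_cores, axis=0)
results = Parallel(n_jobs=num_cores)(
    delayed(randomize_chunk)(chunk, markerLUT) for chunk in chunks
)
V_rand = np.concatenate(results, axis=0)

Even so, for a workload this light a fully vectorised numpy remap (see the sketch in the first answer) will usually beat any multiprocessing variant.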