What exactly happens internally when we run joblib.Parallel inside a function that we pass to delayed? Is it good coding practice? Why does it increase the computation time? Is it because of the overhead of managing (switching between) more parallel processes?
import time
from joblib import Parallel, delayed

def sqr(i):
    return i*i

def sqr_chunk(chunk):  # processing chunks in parallel
    return Parallel(n_jobs=2)(delayed(sqr)(i) for i in chunk)

def sqr_sub_chunk(sub_chunk):  # processing sub_chunks in parallel
    return Parallel(n_jobs=2)(delayed(sqr_chunk)(chunk) for chunk in sub_chunk)

def avg(l):
    s = 0
    for i in l:
        s += i
    return s/len(l)

l0, l1, l2 = [], [], []
for i in range(20):
    l = list(range(1000))

    t1 = time.time()
    result1 = Parallel(n_jobs=2)(delayed(sqr)(i) for i in l)
    t2 = time.time()
    l0 += [t2-t1]

    chunks = [list(range(i, i+100)) for i in range(0, 1000, 100)]
    t1 = time.time()
    result2 = Parallel(n_jobs=2)(delayed(sqr_chunk)(chunk) for chunk in chunks)
    t2 = time.time()
    l1 += [t2-t1]

    sub_chunks = [[i[:50], i[50:]] for i in chunks]
    t1 = time.time()
    result3 = Parallel(n_jobs=2)(delayed(sqr_sub_chunk)(sub_chunk) for sub_chunk in sub_chunks)
    t2 = time.time()
    l2 += [t2-t1]

print(avg(l0))
print(avg(l1))
print(avg(l2))
"""
output :
0.058841276168823245
0.14938125610351563
0.10537683963775635
"""
The Parallel function of joblib creates multiple jobs, which can be either threads or processes depending on the backend used. The overheads involved are explained in the documentation. In your case, the chunks (including each individual integer) need to be serialized/deserialized (typically using pickle), which is particularly expensive, not to mention the inter-process communication overhead.
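As a rough illustration (not part of your original code), one way to cut this overhead down is to make each task process a whole chunk with a plain sequential loop, so only a handful of lists are pickled instead of one task per integer:

# Sketch: one delayed call per chunk instead of one per integer,
# so far fewer objects cross the process boundary.
from joblib import Parallel, delayed

def sqr_whole_chunk(chunk):
    # the inner loop runs sequentially inside one worker process
    return [i * i for i in chunk]

chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
result = Parallel(n_jobs=2)(delayed(sqr_whole_chunk)(chunk) for chunk in chunks)
flat = [x for chunk in result for x in chunk]  # flatten back to 1000 results

This is the same idea as your sqr_chunk variant, except the work inside each chunk is done sequentially rather than spawning a nested Parallel, which would otherwise add another layer of job-management overhead.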
There are other backends, such as the ones based on threads, but they are limited by the GIL. In your case, the GIL is never released: it is held while executing pure-Python code (a strong limitation of the CPython interpreter).
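For reference, here is a minimal sketch of the thread-based backend (using joblib's prefer="threads" hint). It avoids the pickling/IPC cost, but because sqr is pure Python the threads still execute one at a time under the GIL, so no real speed-up should be expected here:

# Sketch: thread backend avoids serialization, but the GIL serializes
# pure-Python work, so this buys little for a function like sqr.
from joblib import Parallel, delayed

def sqr(i):
    return i * i

result = Parallel(n_jobs=2, prefer="threads")(delayed(sqr)(i) for i in range(1000))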
There is no other practical way to parallelize pure-Python code with CPython. If you want faster code, then I advise you to try NumPy. It should actually be much faster than parallel Python code run with CPython, because CPython is a (slow) interpreter while NumPy functions are mostly native. It can also be combined with Numba or Cython (e.g. to use multiple threads without the GIL). Be aware that the integer types of these modules are native, so they have a limited/fixed size of at most 64 bits.
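For example, the whole benchmark above reduces to a single vectorized NumPy call (a sketch, not a drop-in replacement for your chunked version):

# Sketch: squaring 1000 integers in one native vectorized operation,
# with no interpreter loop and no serialization; note the fixed-size
# int64 dtype, unlike Python's arbitrary-precision integers.
import numpy as np

a = np.arange(1000, dtype=np.int64)
result = a * a  # equivalently np.square(a)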