I want to shuffle values in a 3D numpy array, but only where they are > 0.
When I run my function with a single core, it is much faster than with even 2 cores. The slowdown is way beyond the overhead of creating new Python processes. What am I missing?
The following code outputs:
random shuffling of markers started
time in serial execution: 1.0288s
time executing in parallel with num_cores=1: 0.9056s
time executing in parallel with num_cores=2: 273.5253s
import numpy as np
import time
from random import shuffle
from joblib import Parallel, delayed
import multiprocessing

def randomizeVoxels(V, markerLUT):
    V_rand = V.copy()
    # the xy naming here does not match the outer convention, which will depend on the permutation
    for ix in range(V.shape[0]):
        for iy in range(V.shape[1]):
            if V[ix, iy] > 0:
                V_rand[ix, iy] = markerLUT[V[ix, iy]]
    return V_rand
V_ori = np.arange(1000000, -1000000, -1).reshape(100, 100, 200)
V_rand = V_ori.copy()

listMarkers = np.unique(V_ori)
listMarkers = [val for val in listMarkers if val > 0]

print("random shuffling of markers started\n")

reassignedMarkers = listMarkers.copy()
# random shuffling of the original markers
shuffle(reassignedMarkers)

markerLUT = {}
for i, iMark in enumerate(listMarkers):
    markerLUT[iMark] = reassignedMarkers[i]
tic = time.perf_counter()
for ix in range(len(V_ori)):
    for iy in range(len(V_ori[0])):
        for iz in range(len(V_ori[0][0])):
            if V_ori[ix, iy, iz] > 0:
                V_rand[ix, iy, iz] = markerLUT[V_ori[ix, iy, iz]]
toc = time.perf_counter()

print("time in serial execution: \t\t\t{: >4.4f} s".format(toc - tic))
#######################################################################
num_cores = 1
V_rand = V_ori.copy()

tic = time.perf_counter()
results = Parallel(n_jobs=num_cores)(
    delayed(randomizeVoxels)(V_ori[imSlice, :, :], markerLUT)
    for imSlice in range(V_ori.shape[0])
)
for i, resSlice in enumerate(results):
    V_rand[i, :, :] = resSlice
toc = time.perf_counter()

print("time executing in parallel with num_cores={}:\t{: >4.4f} s".format(num_cores, toc - tic))
num_cores = 2
V_rand = V_ori.copy()
MASK = "time executing in parallel with num_cores={}:\t {: >4.4f}s"

tic = time.perf_counter()  # ----------------------------- [PERF-me]
results = Parallel(n_jobs=num_cores)(
    delayed(randomizeVoxels)(V_ori[imSlice, :, :], markerLUT)
    for imSlice in range(V_ori.shape[0])
)
for i, resSlice in enumerate(results):
    V_rand[i, :, :] = resSlice
toc = time.perf_counter()  # ----------------------------- [PERF-me]

print(MASK.format(num_cores, toc - tic))
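For scale: every delayed() call pickles its arguments for the worker process, so markerLUT rides along with each of the 200 slice tasks. A rough, standalone sketch of how much data that is (the sizes below are back-of-the-envelope, not measured from the run above):

```python
import pickle
from random import shuffle

# every delayed() call pickles its arguments for the worker process,
# and markerLUT is an argument of each of the 200 slice tasks
markers = list(range(1, 1_000_001))   # same marker count as np.unique gives above
reassigned = markers.copy()
shuffle(reassigned)
markerLUT = dict(zip(markers, reassigned))

blob = pickle.dumps(markerLUT)
print(f"pickled markerLUT: {len(blob) / 1e6:.1f} MB per task, "
      f"~{200 * len(blob) / 1e9:.1f} GB shipped for 200 slices")
```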
Most probably the memory-I/O is the bottleneck.
While the numpy part of the processing seems to be pretty shallow here (a shuffle does not compute a thing, it just moves data between pairs of locations, doesn't it?), for most of the time this does not provide "enough work" with which the memory-I/O latencies could be masked by re-ordered CPU-core instructions. Contemporary super-scalar CISC architectures rely on highly speculative branch prediction (useless for a memory-I/O-bound, non-branching, tightly crafted section like this one) and on multi-core and many-core NUMA designs, where every non-local access pays an extra cross-QPI latency cost. This is most probably why even the first spun-off concurrent process makes things worse, no matter whether it camps on the same CPU-core (a shared core, time-sliced between an interleaving pair of two-step-dancing processes, both memory-I/O-bound, with even worse chances for latency masking on the shared memory-I/O channels) or on any other core (adding cross-QPI latency costs for the non-local memory-I/O, again worsening the chances for memory-I/O latency masking).
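Since the shuffle is just a data move, the whole relabelling can also be done as a single vectorized numpy gather, with no per-voxel Python loop and no worker processes at all. A minimal sketch (the dense lut array and the rng name are illustrative choices, not from the question):

```python
import numpy as np

rng = np.random.default_rng()                            # illustrative RNG choice

V_ori = np.arange(1000, -1000, -1).reshape(10, 10, 20)   # small stand-in volume

positive = np.unique(V_ori[V_ori > 0])                   # the positive markers
reassigned = rng.permutation(positive)                   # shuffled replacement markers

# dense lookup table: index = old marker, value = new marker
lut = np.arange(V_ori.max() + 1)
lut[positive] = reassigned

V_rand = V_ori.copy()
mask = V_ori > 0
V_rand[mask] = lut[V_ori[mask]]                          # one vectorized gather / scatter
```

Non-positive voxels stay untouched, and the positive ones get re-labelled through the table in a single pass over the array.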
CPU-core hopping, enforced by the colliding effects of the CPU-clock boost policy (which at some point starts to violate the thermal management limits, so the process gets hopped onto the next, colder CPU-core), invalidates all the CPU-core cache benefits: the pre-cached data are not available on that next, colder core, so everything once pre-cached into the fastest L1d cache has to be re-fetched (for array objects with larger memory footprints, perhaps even via a cross-QPI fetch). Harnessing more cores therefore does not have a trivial, positive effect on the resulting efficiency.
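Where the OS allows it, the core-hopping part can be taken out of the game by pinning the process; a Linux-only sketch using the stdlib os.sched_setaffinity (core number 0 is an arbitrary choice here):

```python
import os

# Linux-only: pin this process to a single CPU core, so the boost-policy
# driven core hopping (and the L1-cache invalidation it causes) cannot
# occur during the timed section
if hasattr(os, "sched_setaffinity"):          # not available on macOS / Windows
    os.sched_setaffinity(0, {0})              # pid 0 means the calling process
    print("now camping on CPU-cores:", os.sched_getaffinity(0))
```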