I am new to the parallel programming paradigm. I already have the algorithm working in its serial form, but I can't manage to parallelize it.
I have asked around, and some people have told me that my program cannot be parallelized the way it is currently written.
Here is my serial code; any help or recommendation will be greatly appreciated.
import numpy as np  # import numpy for its vectorised array functions

if __name__ == "__main__":
    # Number of intervals (must be even for Simpson's rule)
    N = 1_000_000_000  # int(input("Number of iterations/intervals (even number): "))
    # Function to integrate
    f = lambda x: x * 0.5
    # Integration limits
    a = 3  # int(input("Enter the lower limit: "))
    b = 10  # int(input("Enter the upper limit: "))
    # Width of each sub-interval
    dx = (b - a) / N
    # Sample points over the interval
    x = np.linspace(a, b, N + 1)
    y = f(x)
    # Simpson's rule: weighted sum over the sub-intervals
    resultado = dx / 3 * np.sum(y[0:-1:2] + 4 * y[1::2] + y[2::2])
    print("N = " + str(N))
    print("The value of the integral is: ")
    print(resultado)
A more important side of the same coin is not how to parallelise a program (or a part of it), but what add-on cost penalties will actually be paid for doing so. Out of the several options for how to parallelise, the best one should be selected.

Here, due to the large data/memory footprint (an adverse O( n )-scaling in the SpaceDOMAIN, with sizes above

V * 8 * 1E9 [B]

being the sum of the memory footprints of all objects used, incl. the interim storage variables), which next, indirectly, also increases the O( n )-scaling in the TimeDOMAIN (duration), due to the memory-I/O volume and the bottlenecking of all available RAM-I/O channels, a kind of fine-grain parallelism, called vectorisation, seems to fit best. It adds almost zero add-on costs, yet helps reduce the RAM-I/O costs of an extremely low-complexity f(x)-lambda, which is far too cheap to latency-mask the memory pre-fetches of non-contiguous blocks (caused by the index-hopping).
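As a rough back-of-the-envelope sizing (my own sketch, assuming numpy's default float64 and the N used in the question), the principal arrays alone already occupy many gigabytes, before counting the strided temporaries created while evaluating the np.sum( ... ) expression:

N    = 1_000_000_000
ITEM = 8                                   # bytes per numpy float64 element
full_array = ( N + 1 )  * ITEM             # x or y:  ~ 8.0 GB each
half_temp  = ( N // 2 ) * ITEM             # each N/2-sized temporary produced
                                           # while evaluating
                                           # y[0:-1:2] + 4*y[1::2] + y[2::2]
print( f"x + y together    ~ {2 * full_array / 1E9:5.1f} GB" )
print( f"each strided temp ~ {half_temp / 1E9:5.1f} GB" )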
There are at least two places where a code conversion will help, most so on CPU micro-architectures whose native vector instructions can get harnessed, in the most recent numpy versions, for an indeed HPC-grade parallel execution of the numpy-vectorised code.

a )
We may spend less time producing/storing interim objects and use the brave, numpy-smart vectorised trick (a mock-up sketch of one such trick follows right below). If in doubt, feel free to read more about the costs of the various kinds of computing / storage-related access latencies.
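A minimal sketch of such a re-formulation, assuming the same N, a, b and f() as in the question (a mock-up, not a fully general, production-grade implementation): the three N/2-sized temporaries produced while evaluating y[0:-1:2] + 4*y[1::2] + y[2::2] are avoided, the full-size y is never stored, and at most one N/2-sized interim array is alive at any moment.

import numpy as np

N    = 1_000_000_000                 # must stay even
a, b = 3, 10
dx   = ( b - a ) / N
f    = lambda x: x * 0.5

x = np.linspace( a, b, N + 1 )       # the only full-size, RAM-contiguous array kept
#                                      the composite Simpson's rule, re-grouped into
#                                      plain scalar reductions over strided views
resultado = dx / 3 * (       f( x[0] ) + f( x[-1] )        # endpoint samples, weight 1
                       + 4 * np.sum( f( x[1:-1:2] ) )      # odd samples,      weight 4
                       + 2 * np.sum( f( x[2:-1:2] ) )      # even interior,    weight 2
                       )
print( resultado )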
b )
We may reduce at least some parts of the repetitive memory-access patterns which (as expressed above, for sizes of about ~1E9 elements, cannot remain in-cache and will have to be re-executed and re-fetched from physical RAM) still meet the computation; a chunked mock-up follows right below.
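Again only a mock-up sketch under the stated assumptions (an even N, a cheap element-wise f(), and a hand-picked CHUNK size that ought to be tuned to the actual L2/L3 cache sizes): the very same composite Simpson's rule, evaluated block by block, so that the x, y and weight arrays of one block stay small, RAM-contiguous and roughly cache-resident while they get produced, weighted and numpy.sum()-ed, and the full ~8 GB grid never has to exist at once.

import numpy as np

N     = 1_000_000_000               # total number of intervals, must stay even
a, b  = 3, 10
dx    = ( b - a ) / N
f     = lambda x: x * 0.5
CHUNK = 250_000                     # intervals per block ( even ); an assumption,
#                                     to be tuned against the actual cache sizes

acc = 0.0
for i0 in range( 0, N, CHUNK ):
    n = min( CHUNK, N - i0 )                       # intervals in this block
    x = a + dx * np.arange( i0, i0 + n + 1, dtype = np.float64 )
    y = f( x )                                     # one contiguous block of samples
    w = np.ones_like( y )                          # Simpson weights 1, 4, 2, ..., 2, 4, 1
    w[1:-1:2] = 4.0
    w[2:-1:2] = 2.0
    acc += np.sum( w * y )                         # one contiguous, cache-friendly sum
    # the shared block-boundary sample gets weight 1 twice ( once as the last
    # sample of this block, once as the first of the next one ), which adds up
    # to the composite-rule weight of 2, so the partial sums stay exact
resultado = dx / 3 * acc
print( resultado )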
While the mock-up code does not aspire to solve all corner cases, the core benefit comes from using RAM-contiguous memory layouts, which get very efficiently numpy.sum()-ed, and from avoiding replicated visits to memory areas that would otherwise be re-visited only because of the imperatively dictated (non-contiguous) index-jumping (numpy can optimise some of its own indexing so as to maximise the memory-access pattern / cache-hits, yet the "outer", hand-coded index-jumping is almost always beyond the reach of such smart, but hard-wired, numpy-optimisation know-how (the less so of any silicon-based thinking or clairvoyance ;o) ).