I am working on a project that involves some simple element-wise operations on huge arrays. For example:
function singleoperation!(A::Array,B::Array,C::Array)
    @simd for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] +B[k]);
    end
end
I tried to parallelize it to make it faster. To do so, I am using Distributed and SharedArrays, with only a small modification to the function shown above:
@everywhere function paralleloperation!(A::SharedArray,B::SharedArray,C::SharedArray)
    @sync @distributed for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] +B[k]);
    end
end
However, there is no time difference between the two functions, even though I am using 4 worker processes (tested on both an R7-5800X and an i7-9750H CPU). Is there anything I can improve in this code? Thanks a lot! The full testing code is below:
using Distributed
addprocs(4)
@everywhere begin
    using SharedArrays
    using BenchmarkTools
end
@everywhere function paralleloperation!(A::SharedArray,B::SharedArray,C::SharedArray)
    @sync @distributed for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] +B[k]);
    end
end
function singleoperation!(A::Array,B::Array,C::Array)
    @simd for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] +B[k]);
    end
end
N = 128;
A,B,C = fill(0,N,N,N),fill(.2,N,N,N),fill(.3,N,N,N);
AN,BN,CN = SharedArray(fill(0,N,N,N)),SharedArray(fill(.2,N,N,N)),SharedArray(fill(.3,N,N,N));
@benchmark singleoperation!(A,B,C);
BenchmarkTools.Trial: 1612 samples with 1 evaluation.
Range (min … max): 2.582 ms … 9.358 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.796 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.086 ms ± 790.997 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
@benchmark paralleloperation!(AN,BN,CN);
BenchmarkTools.Trial: 1404 samples with 1 evaluation.
Range (min … max): 2.538 ms … 17.651 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.154 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.548 ms ± 1.238 ms ┊ GC (mean ± σ): 0.08% ± 1.65%
As the comments note, this looks like perhaps more of a job for multithreading than multiprocessing. The best approach in detail will generally depend on whether you are CPU-bound or memory-bandwidth-bound. With so simple a calculation as in the example, it may well be the latter, in which case you will reach a point of diminishing returns from adding additional threads, and may want to turn to something featuring explicit memory modelling, and/or to GPUs.
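For reference, here is a minimal sketch of what a multithreaded (rather than multiprocess) version of the question's kernel could look like using only Base Julia. The function name `threadedoperation!` is hypothetical, the arrays are assumed to have identical shapes, and Julia needs to be started with several threads (for example `julia -t 4`) for this to run in parallel at all.

```julia
using Base.Threads

# Hypothetical threaded variant of the question's kernel (illustrative name).
function threadedoperation!(A::Array, B::Array, C::Array)
    # @inbounds below is only safe if all three arrays share the same shape.
    @assert axes(A) == axes(B) == axes(C)
    # @threads splits the index range into chunks, one per available thread.
    @threads for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
    return C
end

N = 128
A, B, C = fill(0.1, N, N, N), fill(0.2, N, N, N), fill(0.3, N, N, N)
threadedoperation!(A, B, C)
println("threads in use: ", nthreads())
```

Unlike the `SharedArray` approach, all threads here live in one process and work directly on ordinary `Array`s, so no data has to be shared across process boundaries.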
However, one very easy general-purpose approach would be to use the multithreading built into LoopVectorization.jl
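A sketch of how that might look, assuming the third-party LoopVectorization.jl package is installed: `@turbo` is its single-threaded SIMD macro and `@tturbo` is the threaded counterpart. The function names below are hypothetical, and this is not necessarily the exact code behind the timings discussed here. Note that these macros do not bounds-check, so the arrays are assumed to have the same size, and Julia should be started with multiple threads for `@tturbo` to help.

```julia
using LoopVectorization  # third-party package; install with `] add LoopVectorization`

# Single-threaded, SIMD-vectorized version (illustrative name).
function singleoperation_lv!(A::Array, B::Array, C::Array)
    @turbo for k in eachindex(A)
        C[k] = A[k] * B[k] / (A[k] + B[k])
    end
    return C
end

# Multithreaded and SIMD-vectorized version (illustrative name).
function threadedoperation_lv!(A::Array, B::Array, C::Array)
    @tturbo for k in eachindex(A)
        C[k] = A[k] * B[k] / (A[k] + B[k])
    end
    return C
end

N = 128
A, B = fill(0.1, N, N, N), fill(0.2, N, N, N)
C1, C2 = similar(A), similar(A)
singleoperation_lv!(A, B, C1)
threadedoperation_lv!(A, B, C2)
```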
which gives us
Now, the fact that the single-threaded LoopVectorization @turbo version is almost perfectly tied with the single-threaded @inbounds @simd version is to me a hint that we are probably memory-bandwidth bound here (usually @turbo is notably faster than @inbounds @simd, so the tie suggests that the actual calculation is not the bottleneck). In that case the multithreaded version is only helping us by getting us access to a bit more memory bandwidth (though with diminishing returns, assuming there is some main memory bus that can only go so fast regardless of how many cores it can talk to).
To get a bit more insight, let's try making the arithmetic a bit harder:
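One hypothetical way to make the arithmetic harder is to wrap the original expression in a few transcendental functions, so that each element costs far more compute while the memory traffic stays the same. The function name and the particular choice of `cos(log(sqrt(...)))` are assumptions for illustration. This sketch uses only Base Julia, and the inputs are kept strictly positive so that `log` and `cos` stay within their domains; the same loop body could be placed inside a threaded or LoopVectorization loop for comparison.

```julia
# Hypothetical heavier kernel: same loads and stores, more arithmetic per element.
function hardoperation!(A::Array, B::Array, C::Array)
    @assert axes(A) == axes(B) == axes(C)  # required for @inbounds to be safe
    @inbounds @simd for k in eachindex(A)
        C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
    end
    return C
end

N = 128
# Strictly positive inputs, so sqrt and log receive valid arguments.
A, B, C = fill(0.1, N, N, N), fill(0.2, N, N, N), fill(0.3, N, N, N)
hardoperation!(A, B, C)
```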
then sure enough
now we're closer to CPU-bound, and now threading and SIMD-vectorization is the difference between 2.6 seconds and 90 ms!
If your real problem is going to be as memory-bound as the example problem, you may consider working on GPU, on a server optimized for memory bandwidth, and/or using a package that puts a lot of effort into memory modelling.
Some other packages you might check out could include Octavian.jl (CPU), Tullio.jl (CPU or GPU), and GemmKernels.jl (GPU).
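To make the memory-bandwidth argument above a little more concrete, here is a rough back-of-envelope estimate based on the single-process timing reported in the question (minimum 2.582 ms for N = 128). It assumes each element of A and B is read once and each element of C is written once, at 8 bytes per element, and it ignores cache effects and any extra traffic from write-allocation, so treat the result as an order-of-magnitude figure only.

```julia
# Effective memory throughput implied by the question's single-process benchmark.
N = 128
nelem = N^3                                # elements per array
bytes_per_elem = 8                         # Int64 and Float64 are both 8 bytes
bytes_moved = 3 * nelem * bytes_per_elem   # read A, read B, write C
t_min = 2.582e-3                           # seconds, minimum time from the question
bandwidth_GBs = bytes_moved / t_min / 1e9
println("approx. effective bandwidth: ", round(bandwidth_GBs, digits = 1), " GB/s")
```

This works out to roughly 19.5 GB/s for a loop that does only a handful of floating-point operations per element, which is at least consistent with the loop being limited by memory traffic rather than by arithmetic.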