I am trying to parallelize some code used in the Gauss-Seidel algorithm to approximate the solution of a Linear Equation System.
In brief, for an NxN
matrix, during one iteration, I am doing sqrt(N)
sessions of parallel computation, one by one. During one session of parallel computation, I distribute the task of calculating sqrt(N)
values from a vector among the available workers.
The code involved in a parallel computation session is this:
future_results(1:num_workers) = parallel.FevalFuture;
for i = 1:num_workers
start_itv = buck_bound+1 + (i - 1) * worker_length;
end_itv = min(buck_bound+1 + i * worker_length - 1, ends_of_buckets(current_bucket));
future_results(i) = parfeval(p, @hybrid_parallel_function, 3, A, b, x, x_last, buck_bound, n, start_itv, end_itv);
end
for i = 1:num_workers
[~, arr, start_itv, end_itv] = fetchNext(future_results(i));
x(start_itv:end_itv) = arr;
end
The function called by parfeval
is this:
function [x_par, start_itv, end_itv] = hybrid_parallel_function (A, b, x, x_last, buck_bound, n, start_itv, end_itv)
x_par = zeros(end_itv - start_itv + 1, 1);
for i = start_itv:end_itv
x_par(i-start_itv+1) = b(i);
x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, 1:buck_bound) * x(1:buck_bound);
x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, buck_bound+1:i-1) * x_last(buck_bound+1:i-1);
x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, i+1:n) * x_last(i+1:n);
x_par(i-start_itv+1) = x_par(i-start_itv+1) / A(i, i);
end
end
The entire code can be found here: https://pastebin.com/hRQ5Ugqz
The matlab profiler for a 1000x1000
matrix. The parallel code is between 20 to 135 times slower than its serial counterpart, depending on the chosen coefficient matrix (and still much faster than spmd
).
The parfeval computation might be lazily split between the lines 50 and 57? Still, I cannot explain to myself why there is this major overhead. It seems to have something to do with the number of times parfeval is called: I did lower the execution time by lowering the parfeval calls.
Is there something that can be further optimized? Do I have to resort to writing the code in C++?
Please help. Thank you very much!
There's a few possibilities here. Most importantly is the simple fact that if you're using the
'local'
cluster type, then the workers are running in single-threaded code. In situations where the "serial" code is actually taking advantage of MATLAB's intrinsic multi-threading, then you're already taking full advantage of the available CPU hardware, and using parallel workers cannot gain you anything. It's not certain that this is the case for you, but I'd strongly suspect it given the code.There are overheads to running in parallel, and as you've observed, running fewer
parfeval
calls lowers these overheads. Your code as written copies the whole of theA
matrix to each worker multiple times. You dont' need to changeA
, so you could useparallel.pool.Constant
to avoid those repeated copies.While
parfeval
is more flexible, it tends to be less efficient thanparfor
in cases whereparfor
can be applied.Yes, you can expect the workers to start working as soon as the first
parfeval
call has completed.(Sorry, this isn't really a proper "answer", so some kind soul will probably come along and delete this soon, but there's much too much to fit into a comment).