I'm new to Erlang, so for training I try to implement standard functions from scratch. I've tried to create parallel implementation of map/2 function from lists module. But my implementation works very slow. Could you point me, if I did any principal mistakes in my implementation:
-module( my_pmap ).
-export([ pmap/2 ]).
-export([ map/4, collect/3 ]).
map( F, Value, Indx, SenderPid ) ->
SenderPid ! { Indx, F( Value ) }.
pmap( F, List ) ->
CollectorPid = spawn_link( my_pmap, collect, [ length( List ), [], self() ] ),
lists:foldl(
fun( X, Indx ) ->
spawn_link( my_pmap, map, [ F, X, Indx, CollectorPid ] ),
Indx + 1
end,
1,
List ),
Mapped =
receive
{ collected, M } ->
M
end,
Sorted = lists:sort(
fun( { Indx1, _ }, { Indx2, _ } ) ->
Indx1 < Indx2
end,
Mapped ),
[ Val || { _Indx, Val } <- Sorted ].
collect( 0, List, SenderPid ) ->
SenderPid ! { collected, List };
collect( N, List, SenderPid ) when N > 0 ->
receive
Mapped ->
collect( N - 1, [ Mapped | List ], SenderPid )
end.
And here is results of testing:
1> c(my_pmap).
{ok,my_pmap}
2> timer:tc( my_pmap, pmap, [ fun(X) -> X*X*X*X end, lists:seq( 1, 10000 ) ] ).
{137804,
[1,16,81,256,625,1296,2401,4096,6561,10000,14641,20736,
28561,38416,50625,65536,83521,104976,130321,160000,194481,
234256,279841,331776,390625,456976,531441|...]}
3> timer:tc( lists, map, [ fun(X) -> X*X*X*X end, lists:seq( 1, 10000 ) ] ).
{44136,
[1,16,81,256,625,1296,2401,4096,6561,10000,14641,20736,
28561,38416,50625,65536,83521,104976,130321,160000,194481,
234256,279841,331776,390625,456976,531441|...]}
As you might have seen 0,137804 sec. vs. 0,044136 sec.
Thanks
The comments are correct. The problem is that spawning processes are cheap but it does have a cost. Multiplying A number three times is very fast and the overhead of spawning a new process kills your performance.
Partitioning the list into fragments and processing each fragment in a separate process will probably be faster. If you know you have 8 cores, you could try to split it in 8 fragments. Things like pmap can be implemented in Erlang, but it is not a strength of Erlang. A system like the Haskell GHC runtime has sparks which is a better tool for fine-grained parallelism like this. Also, multiplying like that is an obvious candidate for either SIMD instructions in SSE or a GPU. Erlang has no solution for this either, but again, GHC has
accelerate
andrepa
which are libraries for handling this situation.On the other hand, you can get a good speedup in Erlang by simply using processes to handle a couple of fragments as hinted. Also note that parallel computation often performs badly at low N (like 10000) because of the communication overhead. You need way larger problems to reap the benefits.