Here is my attempt to benchmark the performance of the Intel TBB flow graph. The setup:

- One broadcast node (`broadcast_node<continue_msg>`) sends a `continue_msg` to `N` successor nodes.
- Each successor node performs a computation that takes `t` seconds.
- The total computation time when performed serially is `Tserial = N * t`.
- The ideal computation time, if all cores are used, is `Tpar(ideal) = N * t / C`, where `C` is the number of cores.
- The speedup is defined as `Tserial / Tpar(actual)`.
- I tested the code with gcc 5 on a 16-core PC.
Here are the results showing the speedup as a function of the processing time of each individual task (i.e. the body):

- t = 100 microseconds, speed-up = 14
- t = 10 microseconds, speed-up = 7
- t = 1 microsecond, speed-up = 1

As can be seen, for lightweight tasks (whose computation takes less than 1 microsecond), the parallel code is actually slower than the serial code.
Here are my questions:

1) Are these results in line with Intel TBB benchmarks?
2) Is there a better paradigm than a flow graph for the case when there are thousands of tasks, each taking less than 1 microsecond?
The Overhead of Parallelism
Your cost model is wrong.
The ideal parallel computation time is:

`Tpar(ideal) = N * t / C + Pstart + Pend`

where `Pstart` is how long it takes to start your parallelism and `Pend` is the time taken to finish it. It's not unusual for `Pstart` to be on the order of tens of milliseconds.

I'm not sure if you're familiar with OpenMP (though it's a good thing to know), but, for our purposes, it's a threading model that divides work between a team of threads. The following chart shows the overhead of some of the commands in relation to the size of the thread team:
The takeaway is that getting your parallelism going (the `parallel for` line) is potentially expensive and grows with the number of threads. Ending parallelism (the `barrier` line) has similar costs.

In fact, if you look at TBB's tutorial, Section 3.2.2 ("Automatic Chunking") says:

> CAUTION: Typically a loop needs to take at least a million clock cycles for parallel_for to improve its performance.
Achieving Better Speed
The best way to speed up code like this is to only perform the operations in parallel if there are a large number of them and/or to adjust the number of threads doing the work so that each thread has plenty to do. In TBB you could achieve similar behaviour like so:
where you'd want to adjust `1000` to a number high enough that you start to see gains from parallelism.

You could also reduce the number of threads, since this reduces the overhead somewhat:
TBB also performs dynamic load balancing (see here). This is great if you expect your loop iterations/tasks to have a broad distribution of run-times, but represents unnecessary overhead if the expected run-times are the same. I'm not sure if TBB has static scheduling, but it may be worth looking into.
In case people end up here without a strong commitment to TBB, in OpenMP you'd do something like: