I am reading the papers that describe the performance of Cilk's work-stealing scheduler.
1) My understanding is that the scheduler does not know which tasks are on the critical path; it just tries to keep making progress along it by stealing tasks that are not 'deep' in the task graph. Is that correct?
2) Also, does Cilk's work-stealing scheduler assume that all tasks are of similar complexity? If tasks are not uniform in complexity, won't the scheduler be less effective at achieving the best performance, i.e., the best load balancing?
Perhaps the best way to answer (1) is that Cilk scheduling is breadth-first when stealing a task, but depth-first otherwise: each worker pushes spawned work onto the bottom of its own deque and keeps executing from the bottom (depth-first), while a thief steals the oldest, shallowest task from the top of a victim's deque.
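To make that concrete, here is a minimal sketch of the per-worker deque discipline. It is plain C++ with a mutex for simplicity (the real Cilk runtime uses a lock-free protocol, the THE protocol), and `Task` is just a hypothetical stand-in for a spawned frame:

```cpp
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical stand-in for a spawned Cilk frame/continuation.
struct Task { /* ... */ };

// One deque per worker. The owner works on the bottom (newest,
// deepest tasks -> depth-first); thieves take from the top
// (oldest, shallowest tasks -> breadth-first).
class WorkDeque {
    std::deque<Task> tasks_;
    std::mutex m_;  // simplification: real Cilk avoids a lock on the fast path
public:
    // Owner: push a freshly spawned task (deepest in the graph).
    void push_bottom(Task t) {
        std::lock_guard<std::mutex> g(m_);
        tasks_.push_back(std::move(t));
    }
    // Owner: pop the newest task -> depth-first execution.
    std::optional<Task> pop_bottom() {
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }
    // Thief: steal the oldest task -> breadth-first stealing.
    std::optional<Task> steal_top() {
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }
};
```

Note that nothing in this structure carries a cost estimate. The only heuristic is positional: the shallowest task usually represents the biggest chunk of remaining work, so each steal tends to grab a lot of work at once and steals stay rare.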
The answer to (2) is that the Cilk task scheduler is oblivious to the complexity of the tasks: the runtime neither measures task cost nor accepts estimates of it.
A key principle to understand is "Brent's Lemma" (discovered earlier by Graham). The lemma shows that (under PRAM assumptions), given a greedy scheduler (one that never lets a worker idle if there is work to do), the absolute best possible schedule is no more than 2x faster than the worst possible greedy schedule, no matter what the complexities of the tasks are.
The intuition behind the 2x limit is that during execution, if no worker is working on a task on the critical path (munching on the critical path), then greediness guarantees that every worker is busy (munching on the total work). Steps of the first kind can happen at most critical-path-length times, and steps of the second kind at most total-work/P times; since the critical path and total-work/P are each lower bounds on any schedule, their sum can't exceed twice the optimal running time.
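In the usual work/span notation from the Cilk papers (T_1 for total work, T_∞ for the critical path, T_P for the running time on P workers), that argument compresses to a few lines:

```latex
% Greedy-scheduling bound (Graham/Brent): all-busy steps account for at
% most T_1/P time, critical-path steps for at most T_inf time, so
\[ T_P \le \frac{T_1}{P} + T_\infty \]
% Any schedule on P workers needs at least T_1/P time (work bound) and
% at least T_inf time (critical-path bound), hence
\[ T_P \le 2\,\max\!\left(\frac{T_1}{P},\, T_\infty\right) \le 2\,T_{\mathrm{opt}} \]
```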
The PRAM assumption is probably the weakest link to real machines, where a cache miss can take on the order of 100x longer than a cache hit. So worrying about the complexity of tasks matters less than worrying about cache behavior. Depending on how Cilk is used, it plays very well with the cache (recursive divide-and-conquer programs) or badly (back-to-back cilk_for loops), as the sketch below illustrates.
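Here is a hedged sketch of that contrast (the function names and the GRAIN block size are made up for illustration). The first version streams the whole array through the cache twice; the recursive version finishes both operations on a block while it is still cache-resident:

```cpp
#include <cilk/cilk.h>
#include <cstddef>

// Plays badly with cache: each cilk_for streams the entire array,
// so x[i] has been evicted before the second loop touches it again
// (assuming n is much larger than the last-level cache).
void two_passes(double* x, std::size_t n) {
    cilk_for (std::size_t i = 0; i < n; ++i) x[i] = x[i] * 2.0;
    cilk_for (std::size_t i = 0; i < n; ++i) x[i] = x[i] + 1.0;
}

// Plays well with cache: recursive divide-and-conquer fuses both
// operations on a block small enough to stay resident in cache.
void fused_recursive(double* x, std::size_t n) {
    const std::size_t GRAIN = 4096;  // hypothetical cache-sized block
    if (n <= GRAIN) {
        for (std::size_t i = 0; i < n; ++i) x[i] = x[i] * 2.0 + 1.0;
        return;
    }
    std::size_t half = n / 2;
    cilk_spawn fused_recursive(x, half);   // left half, possibly stolen
    fused_recursive(x + half, n - half);   // right half on this worker
    cilk_sync;
}
```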