I'm trying to understand the effect of work stealing on recursive tasks: One of the advantages of work stealing is that the current worker/thread is likely to execute its own spawned tasks; increasing data locality. However, what happens in the common case when a worker join on its spawned tasks? For example:
Future<String> a=pool.submit(()->doA());
b=doB();
return a.get()+b;
I think here the current thread will get blocked, thus unable to take the work from its own queue, thus another worker will have to steal those works. This would deny the locality advantage of work stealing. However, according to wikipedia (https://en.wikipedia.org/wiki/Work_stealing) "Work stealing is designed for a "strict" fork–join model of parallel computation" I must have some mistake in my reasoning, but I can not found it.
More in the details, consider the following code:
Future<String> res=pool.submit(()->{
Future<String> a=pool.submit(()->doA());
b=doB();
return a.get()+b;
});
res.get();
This code should start the computation inside a worker. Such worker will spawn a new task. Then he try to get the result for this nested task. How is this nested task executed?
The fork-join pool provides a high performance, parallel, fine-grined task execution framework for Java Programmers.
It solves the problem by divide and conquer. Splitiing the task into sub-tasks. A task creates a sub-task by fork() method.
When task client submits/invokes/executes a fork join task, the task gets into a shared queue, this shared queue is used to feed unshared double ended queue (aka "deque") managed by the WorkerThread.
One or more WorkerThreads are called a Fork-Join pool.
A WorkerThread pulls off the task from the shared queue, and they go a head and process the work (using the unshared queue).
Each WorkerThread (which is actually a Java thread) in Fork-Join-Pool runs in a loop that continually scans for (sub-)task to execute.
The goal is to try to keep the WorkerThreads as busy as possible so we want them to always have something to do.
The goal is maximize processor core utilization.
Each WorkerThread has it's, double ended queue (aka "deque") that serves as its main source of task.
Above and beyond, that other shared queue that used to get, non-fork join task into the fork-join pool, in 1st place.
The "deque" are implemented by WorkQueue (which is a Java class that's nested inside ForkJoinPool). Some important methods in that class are push(), pop() and poll().
At some point, the task can't make any progress, since it's wait for the sub-task to complete by the join() method .
This join is different from a join in a Java thread.
In Java Thread Join if a task doesn't return a result, is blocks, and waits to this other thread to finish.
If in Fork-Join the join() happens to block, then the WorkerThread stops working on the current thread and start executing the sub-task..
Whenever you are calling a fork() inside compute method inside of a RecursiveTask<> or RecursiveAction that is always running in the context of thread in the fork-join-pool.
If a task a RecursiveTask<> or RecursiveAction is running by the WorkerThread calls fork(), that new ForkJoinTask* is pushed onto the head of that workers "deque" thread.
It pushes it, in LIFO order, Last In First Out.
When we call join() for this task, that task would be poped from the head of the "deque" (top of the stack) and is run to completion(keep running till it's done) in the WorkerThread.
Why do we do LIFO? Why do we push at the front and pop at the front ? In order to improve locality of reference, improve cache performance, so that you get the processing as quickly as possible, sometimes called fresh work before stale.
The ForkJoinTask enables fine-grained data parallelism.
A ForkJoinTask is lighter than a Java thread, it doesn't have it's own runtime stack .
A ForkJoinTask associates a chunk of data along with a computation on that data.
A real Java thread has it's own stack, registers, a lot of other resources that allow it to be scheduled independently managed by the thread scheduler, that the operating system has internally.
A large number of ForkJoinTask can run in a much smaller number of WorkerThreads in the Fork-Join-Pool.
The number of WorkerThreads is typically (if not specified) a function of the number of cores. Each WorkerThread is a Java thread object with all it's accoutrements that you expect from a normal thread.
A ForkJoinTask has two important methods that control parrallel processing and merging the results, these are fork() and join().
A fork() arranges to asynchronously execute this task in the appropriate thread pool. A fork() is like a lightversion of Thread.start().
A fork() doesn't create a Java worker thread (at least not directly), But eventually, it will run on a Java Thread.
It doesn't start to run immediately but instead it places the sub-Task on the head of the work queue.
A join() returns of the computation when the sub-Task is finished. Join in a Fork-Join pool is different from the classic Java thread join. Java thread is used as a barrier syncronizer to wait for another thread to finish and then you join with it (You can't proceed until the other one is done).
A join in a regular thread blocks the calling thread.
The join in a Fork-Join pool doesn't simply block the calling thread, instead, the WorkerThread is assigned to run the pending sub-tasks.
When the WorkerThread encounteres a join() it processes any other sub-tasks until it notices the target sub-task is done. The WorkerThreads doesn't return to the caller until this sub-tasks results are finished.
A Join in a fork-join task is not block, it holds the current task, so the computation to be continued only after it's sub-tasks created by the join() finished.
The WorkerThread figures out, that the task is blocked till the sub-task is finished so it start to work on the sub-task.
A WorkerThread processes it's own "deque" in a LIFO order, by popping (sub-)tasks from it's own "deque".
WORK STEALING
When a WorkerThread doesn't have anything else to do - "idle" .If WorkerThread's own queue is empty, it will go and try to "steal" a sub-task from the tail of some other busy thread "deque", that is selected randomlly, inorder to maximaize core utilization.
The tasks are "stolen" in FIFO order, since an older stolen task may provide a large unit of work.
Push() and pop() are only called by the owning worker thread (To the top of the "deque") that is the reason they are most efficient they use wait-free "Compare-And-Swap" CAS operations. CAS is a hardware level of atomically check and set the value of lock in memory – it never blocks . push() and pop() has a very lightweight locking.
Poll() may be called from another thread to "steal" as sub-task. When we call poll() that's because the other thread has been randomly assigned to try to "steal" sub-task from the end of this deque in FIFO order. Poll() is initiated by another thread, as a result it may not always be wait-free, so sometimes it has to "yield" and come back and try later again. "stealing" is fast but it may not be as fast as pushing and popping.