Work-item branch divergence in OpenCL, how does it work?


I'm studying OpenCL and I don't fully understand the concept of "work-item divergence", also called "divergent control flow".

As the picture below shows, a warp or wavefront (the name depends on the GPU model) executes either one instruction or the other.

[Image: Example]

Now, my question is: will the whole warp/wavefront execute the if branch and then the else branch, or only one of them (only the if or only the else), as in the normal control flow of a program?

This may be a very basic question, but I couldn't find anything on the web, and the other material I have didn't make the point clear to me.

pmdj On

The key to understanding the GPU-style SIMD execution model is that all threads (work-items, in OpenCL terms) in a wavefront/SIMD group always execute the exact same instruction at the same time. If a thread doesn't need to run an instruction that at least one other thread in the group must execute, the instruction has no side effects for that thread (its register values won't change, etc.), but it still costs as much in terms of performance as if the thread really had run it.

If the branching condition has the same value (all true or all false) for every thread in a wavefront/SIMD group, then all threads run only the one branch, and the other branch is skipped. So if the condition is the same for almost all threads in your workload, or if you can arrange for the condition to be the same for all threads in a group, then you don't pay the divergence cost (or it becomes negligible).
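To make the uniform case concrete, here is a minimal host-side sketch in plain C (not OpenCL kernel code). The helper `branch_passes` is hypothetical and only models the rule described above: a group steps through one branch body when the condition is uniform across its lanes, and through both when it is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: given the per-lane value of an
 * if/else condition, return how many branch bodies the whole SIMD group has
 * to step through. */
int branch_passes(const bool *cond, size_t lanes)
{
    assert(lanes > 0);

    bool any_true = false;
    bool any_false = false;
    for (size_t i = 0; i < lanes; ++i) {
        if (cond[i]) {
            any_true = true;
        } else {
            any_false = true;
        }
    }

    /* Uniform condition: only one of the flags is set, so one pass.
     * Divergent condition: both flags are set, so two passes. */
    return (any_true ? 1 : 0) + (any_false ? 1 : 0);
}
```

With this model, a group whose lanes all evaluate the condition to true needs 1 pass, while a group with even a single disagreeing lane needs 2.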

If the condition diverges within the group, the whole wavefront needs to execute both branches, one after the other. When this happens, the threads that don't actually need to run a given branch still step through its instructions at exactly the same time as the threads that do; the instructions just have no effect for them. Unlike hardware CPU threads, a GPU thread can't run different code from the other threads in the same SIMD group. It can only run the same code on different data, or wait until the other threads have finished the code it doesn't need to run.
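The divergent case can be sketched the same way. The following plain C code is a rough simulation under stated assumptions, not how any particular GPU is implemented: `LANES` and `run_if_else` are hypothetical names, the group width of 8 is chosen only to keep the example small, and the per-lane mask stands in for whatever mechanism the hardware uses to disable inactive threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LANES 8 /* hypothetical group width; real hardware typically uses 32 or 64 */

/* Simulates one SIMD group running
 *     if (x[i] % 2 == 0) x[i] *= 2; else x[i] += 1;
 * in lockstep. Returns the number of lane slots spent, counting masked-off
 * lanes too, because they occupy the hardware just like active ones. */
int run_if_else(int x[LANES])
{
    bool mask[LANES];
    bool any_then = false;
    bool any_else = false;
    int slots = 0;

    /* Every lane evaluates the condition. */
    for (size_t i = 0; i < LANES; ++i) {
        mask[i] = (x[i] % 2 == 0);
        if (mask[i]) {
            any_then = true;
        } else {
            any_else = true;
        }
    }

    /* "then" pass: skipped entirely only if no lane wants it. */
    if (any_then) {
        for (size_t i = 0; i < LANES; ++i) {
            if (mask[i]) {
                x[i] *= 2; /* active lane: result is written */
            }
            ++slots;       /* inactive lane: no effect, same cost */
        }
    }

    /* "else" pass: same rule with the inverted mask. */
    if (any_else) {
        for (size_t i = 0; i < LANES; ++i) {
            if (!mask[i]) {
                x[i] += 1;
            }
            ++slots;
        }
    }

    return slots;
}
```

In this model, an input where every element is even costs 8 slots, while an input mixing even and odd values costs 16, even though each lane only gets the result of one branch.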