Which GPU execution dependencies have fixed latency (causing 'Wait' stalls)?

With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states. One of these is:

Wait : Warp was stalled waiting on a fixed latency execution dependency.

As @GregSmith explains, fixed-latency instructions are: "Math, bitwise [and] register movement". But what are fixed-latency "execution dependencies"? Are these just "waiting for somebody else's fixed-latency instruction to conclude before we can issue it ourselves"?

There are 1 best solutions below

Execution dependencies are dependencies that must be resolved before the next instruction can be issued. These include register operands and predicates. The Wait stall reason is reported between instructions that have fixed latency. The compiler can also add waits between instructions issued to the same pipeline if that pipeline's issue rate is lower than 1 warp per cycle (e.g. the FMA and ALU pipes can issue every other cycle on GV100 through GA100).

EXAMPLE 1 - No dependencies - compiler added waits

IADD  R0, R1, R2;  // R0 = R1 + R2
// stall = wait for 1 additional cycle
IADD  R4, R5, R6;  // R4 = R5 + R6
// stall = wait for 1 additional cycle
IADD  R8, R9, R10; // R8 = R9 + R10

If the compiler did not add wait cycles, the stall reason would be math_throttle. The math_throttle reason can also show up when the warp is ready to issue the instruction (all dependencies resolved) but another warp is issuing an instruction to the target pipeline.

EXAMPLE 2 - Wait stalls due to read after write dependency

IADD  R0, R1, R2;  // R0 = R1 + R2
// stall - wait for fixed number of cycles to clear read after write
IADD  R0, R0, R3;  // R0 += R3
// stall - wait for fixed number of cycles to clear read after write
IADD  R0, R0, R4;  // R0 += R4
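
To make the two examples concrete, here is a rough toy model in Python of how many extra cycles a single warp would wait between instructions. It is only a sketch and not how the hardware or the profiler actually computes anything: the `Instr` class, the `wait_cycles` function, and the 4-cycle result latency are all assumptions made up for illustration. Only the issue interval of 2 ("every other cycle") comes from the answer above. The model also assumes every instruction goes to the same pipeline and that no other warp competes for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """A hypothetical fixed-latency instruction: one destination, some sources."""
    dst: str
    srcs: tuple


def wait_cycles(program, latency, issue_interval):
    """Extra cycles one warp waits before each instruction after the first.

    latency        -- assumed cycles until a result register can be read
    issue_interval -- assumed minimum cycles between issues to the same pipe
    """
    ready_at = {}      # register name -> cycle at which its value is readable
    waits = []
    prev_issue = None
    for ins in program:
        if prev_issue is None:
            issue = 0
        else:
            earliest = prev_issue + 1                 # ideal back-to-back issue
            pipe_free = prev_issue + issue_interval   # pipeline issue-rate limit
            operands_ready = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
            issue = max(earliest, pipe_free, operands_ready)
            waits.append(issue - earliest)
        ready_at[ins.dst] = issue + latency
        prev_issue = issue
    return waits


# EXAMPLE 1: three independent adds, limited only by the pipe's issue rate.
independent = [
    Instr("R0", ("R1", "R2")),
    Instr("R4", ("R5", "R6")),
    Instr("R8", ("R9", "R10")),
]

# EXAMPLE 2: each add reads the R0 written by the previous one.
dependent = [
    Instr("R0", ("R1", "R2")),
    Instr("R0", ("R0", "R3")),
    Instr("R0", ("R0", "R4")),
]

# latency=4 is an assumed value, chosen only for illustration.
print(wait_cycles(independent, latency=4, issue_interval=2))  # [1, 1]
print(wait_cycles(dependent, latency=4, issue_interval=2))    # [3, 3]
```

With these assumed numbers, the independent sequence waits 1 additional cycle per instruction, matching Example 1, while the read-after-write chain waits a fixed, larger number of cycles set by the result latency, as in Example 2. In both cases the wait is known at compile time, which is what distinguishes it from stalls on variable-latency operations.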