I'm trying to update some older CUDA code (pre CUDA 9.0), and I'm having some difficulty updating usage of warp shuffles (e.g., __shfl).

Basically the relevant part of the kernel might be something like this:

int f = d[threadIdx.x];
int warpLeader = <something in [0,32)>;

// Point being: some threads in the warp get removed by i < stop.
for (int i = k; i < stop; i += skip)
{
    // Point being: potentially more threads don't see the shuffle below.
    if (mem[threadIdx.x + i/2] == foo)
    {
        // Pre CUDA 9.0.
        f = __shfl(f, warpLeader);
    }
}

Maybe that's not the best example (the real code is too complex to post), but the two things accomplished easily with the old intrinsics were:

  • Shuffle/broadcast to whatever threads happen to be here at this time.
  • Still get to use the warp-relative thread index.

I'm not sure how to do the above post CUDA 9.0.

This question is almost/partially answered here: How can I synchronize threads within warp in conditional while statement in CUDA?, but I think that post has a few unresolved questions.

I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.

The linked question says to use coalesced_group, but my understanding is that this type of cooperative_group re-ranks the threads, so if you had a warpLeader (on [0, 32)) in mind as above, I'm not sure there's a way to "figure out" its new rank in the coalesced_group.

(Also, based on the truncated comment conversation in the linked question, it seems unclear whether coalesced_group is just a nice wrapper around __activemask() or not anyway ...)

It is possible to iteratively build up a mask using __ballot_sync, as described in the linked question, but for code similar to the above, that can become pretty tedious. Is this our only way forward for CUDA 9.0 and later?
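To make the tedium concrete, here is roughly what I believe the __ballot_sync rewrite of my loop would look like (a sketch only; `d`, `mem`, `foo`, `k`, `stop`, `skip`, and `warpLeader` are as in the snippet above, and I'm assuming all 32 threads of the warp arrive at the loop together):

```cuda
__global__ void sketch(const int *d, const int *mem, int foo,
                       int k, int stop, int skip, int warpLeader)
{
    int f = d[threadIdx.x];

    // Assumed: the full warp enters the loop together.
    unsigned inLoop = 0xffffffffu;

    for (int i = k; ; i += skip)
    {
        // Re-vote every trip: which threads are still in the loop?
        // Exiting threads must participate in this ballot before leaving.
        inLoop = __ballot_sync(inLoop, i < stop);
        if (i >= stop)
            break;

        // Of the remaining threads, which ones take the branch this trip?
        bool took = (mem[threadIdx.x + i/2] == foo);
        unsigned branchMask = __ballot_sync(inLoop, took);

        if (took)
        {
            // Only well-defined if warpLeader is set in branchMask.
            f = __shfl_sync(branchMask, f, warpLeader);
        }
    }
}
```

That's two ballots per trip just to keep the masks honest, which is the tedium I mean.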

There is 1 answer below.

I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.

The linked question doesn't show any such usage. Furthermore, the canonical NVIDIA blog post on warp-level primitives specifically presents that usage as the one that satisfies this:

Shuffle/broadcast to whatever threads happen to be here at this time.

The blog states that this is incorrect usage:

//
// Incorrect use of __activemask()
//
if (threadIdx.x < NUM_ELEMENTS) {
    unsigned mask = __activemask();
    val = input[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset);
    // ...
}
(which is conceptually similar to the usage given in your linked question.)

But for "opportunistic" usage, as defined in that blog, it actually gives an example (listing 9) that is similar to the usage you state "won't work". That usage certainly does work, following exactly the definition you gave:

Shuffle/broadcast to whatever threads happen to be here at this time.

If your algorithm intent is exactly that, it should work fine. However, for many cases, that isn't really a correct description of the algorithm intent. In those cases, the blog recommends a stepwise process to arrive at a correct mask:

  1. Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.

  2. Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.

  3. Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.

  4. If your program does opportunistic warp-synchronous programming, use “detective” functions such as __activemask() and __match_all_sync() to find the right mask.

  5. Use __syncwarp() to separate operations with intra-warp dependences. Do not assume lock-step execution.
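As a sketch of step 3 applied to the loop quoted earlier (this paraphrases the blog's corrected version; `input` and `NUM_ELEMENTS` are illustrative names):

```cuda
__global__ void reduceSketch(const int *input, int NUM_ELEMENTS)
{
    int val = 0;

    // Step 3: derive the membership mask from the same condition that
    // gates the branch, *before* diverging. Every thread in the warp
    // executes this ballot, so the mask reflects program logic, not
    // whatever convergence state the hardware happens to be in.
    unsigned mask = __ballot_sync(0xffffffffu, threadIdx.x < NUM_ELEMENTS);

    if (threadIdx.x < NUM_ELEMENTS)
    {
        val = input[threadIdx.x];
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(mask, val, offset);
    }
}
```

The point is that the mask is computed ahead of the divergent branch, from the branch's own predicate, rather than sampled inside it with __activemask().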

Note that steps 1 and 2 do not contradict the rest of this answer. If you know for certain that you intend the entire warp to participate (not typically known in an "opportunistic" setting), then it is perfectly fine to use a hardcoded full mask.

If you really do intend the opportunistic definition you gave, there is nothing wrong with using __activemask() to supply the mask. In fact, the blog gives a usage example of exactly that, and step 4 above confirms that usage for that case.
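For completeness, here is a sketch of the opportunistic pattern in the spirit of the blog's listing 9: the threads that happen to be active elect a leader (the lowest active lane, found with __ffs) and broadcast from it. The function name is illustrative:

```cuda
__device__ int broadcastFromActiveLeader(int f)
{
    // "Detective" step: which threads happen to be here right now?
    unsigned active = __activemask();

    // Elect the lowest active lane as leader (__ffs is 1-based).
    int leader = __ffs(active) - 1;

    // Every thread in `active` executes this shuffle, so the mask
    // matches the participating set by construction.
    return __shfl_sync(active, f, leader);
}
```

Note that this elects a leader from among the active threads rather than using a predetermined warp-relative warpLeader; if the intended leader might not be active at the shuffle, the mask has to be computed from program logic instead (steps 1-3 above).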