Pre 8.x equivalent of __reduce_max_sync() in CUDA

250 Views Asked by At

cuda-memcheck has detected a race condition in the code that does the following:

condition = /*different in each thread*/;
shared int owner[nWarps];
/* ... owner[i] is initialized to blockDim.x+1 */
if(condition) {
    owner[threadIdx.x/32] = threadIdx.x;
}

So basically this code computes the owner thread for each warp based on some condition. For some warp there could be no owner, but for some the number of owners can be more than 1, and then a race condition happens because multiple threads assign a value to the same shared memory region.

After trying the docs, I think what I need can be done with:

const uint32_t mask = __ballot_sync(0xffffffff, condition);
if(mask != 0) {
    const unsigned max_owner = __reduce_max_sync(mask, threadIdx.x);
    if(threadIdx.x == max_owner) {
        // at most 1 thread assigns here per warp
        owner[threadIdx.x/32] = max_owner;
    }
}

However, my attempt has 2 issues:

  1. I don't really need to find the max thread - it's enough to select any 1 thread for each warp if there is a thread with condition==true
  2. It requires CUDA compute capability 8.x, while I need to support devices of 5.2 compute capability

Could you please help me solve the above issues?

1

There are 1 best solutions below

0
On

The following function seems to solve the problem:

void SetOwnerThread(int* dest, const bool condition) {
  const uint32_t mask = __ballot_sync(0xffffffff, condition);
  if(!mask) {
    return;
  }
  const uint32_t lowest_bit = mask & -mask;
  const uint32_t my_bit = (1 << (threadIdx.x & 31));
  if(lowest_bit == my_bit) {
    dest = threadIdx.x;
  }
}