In the CUDA examples, e.g. here, __match_any_sync is used.
Here is an example where a warp is split into one or more groups that each keep track of their own atomic counter.
// increment the value at ptr by 1 and return the old value
__device__ int atomicAggInc(int* ptr) {
    //int pred; // only needed by the erroneous __match_all_sync call below
    //const auto mask = __match_all_sync(__activemask(), (unsigned long long)ptr, &pred); // error, should be any_sync, not all_sync
    const auto mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
    const auto leader = __ffs(mask) - 1;               // select a leader
    int res;
    const auto lane_id = threadIdx.x % warpSize;
    if (lane_id == leader) {                           // leader does the update
        res = atomicAdd(ptr, __popc(mask));
    }
    res = __shfl_sync(mask, res, leader);              // get leader's old value
    return res + __popc(mask & ((1 << lane_id) - 1));  // compute this lane's old value
}
The __match_any_sync here splits up the threads in the warp into groups that have the same ptr value, so that each group can update its own ptr atomically without getting in the way of other threads.
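As a concrete (hypothetical) illustration of that grouping, the harness below launches a single warp in which even lanes increment counters[0] and odd lanes increment counters[1]; the demo kernel name, the counters buffer, and the launch configuration are all made up, and it needs a cc 7.0+ device because it calls the atomicAggInc above. Each of the two groups performs a single atomicAdd of 16 on its own counter, and each lane gets back a distinct old value within its group.

#include <cstdio>

// atomicAggInc as defined above

__global__ void demo(int* counters) {
    // even lanes target counters[0], odd lanes target counters[1], so
    // __match_any_sync forms two groups of 16 lanes each
    int old = atomicAggInc(&counters[threadIdx.x % 2]);
    printf("lane %2u got old value %d\n", threadIdx.x, old);
}

int main() {
    int* counters;
    cudaMallocManaged(&counters, 2 * sizeof(int));
    counters[0] = counters[1] = 0;
    demo<<<1, 32>>>(counters);                              // one full warp
    cudaDeviceSynchronize();
    printf("counters: %d %d\n", counters[0], counters[1]);  // expect 16 16
    cudaFree(counters);
    return 0;
}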
I know the nvcc compiler (since CUDA 9) does this sort of optimization under the hood automatically, but this question is just about the mechanics of __match_any_sync.
Is there a way to do this on devices before compute capability 7.0?
EDIT: The blog article has now been modified to use __match_any_sync() rather than __match_all_sync(), so any commentary to that effect below should be disregarded. The answer below is edited to reflect this.

Based on your statement that this is just about the mechanics of __match_any_sync, we will focus on a replacement for __match_any_sync itself, not on any other form of rewriting the atomicAggInc function. Therefore, we must provide a mask that has the same value as would be returned by __match_any_sync() on cc 7.0 or higher architectures.

I believe this will require a loop which broadcasts the ptr value, in the worst case one iteration for each thread in the warp (since each thread could have a unique ptr value), and tests which threads have the same value. There are various ways we could "optimize" this loop for this function, so as to possibly reduce the trip count from 32 to some lesser value, based on the actual ptr values in each thread, but such optimization in my view introduces considerable complexity, which makes the worst-case processing time longer (as is typical of early-exit optimizations). So I will demonstrate a fairly simple method without this optimization.

The other consideration is what to do in the case where the warp is not converged. For that, we can employ __activemask() to identify that case.

Here is a worked example:
(CentOS 7, CUDA 10.1.243, with device 0 being Tesla V100, device 1 being a cc3.5 device).
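The full test program is not reproduced here; the following is only a minimal sketch of the core replacement function along the lines just described (the name match_any_sync_sm60 and the exact signature are assumptions, not necessarily the code used in the worked example):

// possible drop-in replacement for __match_any_sync() on pre-cc7.0 devices
// tmask: mask of participating lanes, e.g. from __activemask()
// value: this lane's value (here, the ptr cast to an integer type)
__device__ unsigned match_any_sync_sm60(unsigned tmask, unsigned long long value) {
    unsigned result = 0;
    // broadcast each lane's value in turn and record which lanes hold the
    // same value as this lane; worst case 32 iterations (every lane unique)
    for (int lane = 0; lane < warpSize; lane++) {
        const unsigned long long lane_val = __shfl_sync(tmask, value, lane);
        if (lane_val == value)
            result |= (1u << lane);
    }
    // values shuffled from lanes outside tmask are undefined, so clear any
    // bits that may have been set spuriously for non-participating lanes
    return result & tmask;
}

In atomicAggInc, the __match_any_sync call would then be replaced by match_any_sync_sm60(__activemask(), (unsigned long long)ptr), with the rest of the function left unchanged.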
I've added an optional optimization for the case where the warp is diverged (i.e. tmask is not 0xFFFFFFFF). This can be selected by defining USE_OPT.
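One way such a USE_OPT path could look (again an assumption, not necessarily the original optimization) is to visit only the lanes that are set in tmask, so that a diverged warp needs only __popc(tmask) iterations instead of 32; the loop below would replace the plain for loop in the sketch above:

#ifdef USE_OPT
    // visit only the participating lanes: __ffs finds the lowest set bit of
    // the remaining mask, so a diverged warp needs only __popc(tmask) iterations
    unsigned remaining = tmask;
    while (remaining) {
        const int lane = __ffs(remaining) - 1;
        remaining &= ~(1u << lane);
        const unsigned long long lane_val = __shfl_sync(tmask, value, lane);
        if (lane_val == value)
            result |= (1u << lane);
    }
#endif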