Some intrinsics named with `_sync()` appended in CUDA 9; semantics same?

Question

Some intrinsics named with `_sync()` appended in CUDA 9; semantics same?

626 Views Asked by einpoklum At 31 July 2025 at 03:36

In CUDA 9, nVIDIA seems to have this new notion of "cooperative groups"; and for some reason not entirely clear to me, __ballot() is now (= CUDA 9) deprecated in favor of __ballot_sync(). Is that an alias or have the semantics changed?

... similar question for other builtins which now have __sync() added to their names.

Original Q&A

There are 1 best solutions below

**Robert Crovella** · Accepted Answer

No the semantics are not the same. The function calls themselves are different, one is not an alias for another, new functionality has been exposed, and the implementation behavior is now different between Volta architecture and previous architectures.

First of all, to set the ground work, it's necessary to be cognizant of the fact that Volta introduced the possibility for independent thread scheduling, by introducing a per-thread program counter and other changes. As a result of this, it's possible for Volta to behave in a non-warp-synchronous behavior for extended periods of time, and during periods of execution when previous architectures might still be warp-synchronous.

Most of the warp intrinsics work by only delivering expected results for threads that are actually participating (i.e. are actually active for the issue of that instruction, in that cycle). The programmer can now be explicit about which threads are expected to participate, via the new mask parameter. However there are some requirements, in particular on Pascal and previous architectures. From the programming guide:

Note, however, that for Pascal and earlier architectures, all threads in mask must execute the same warp intrinsic instruction in convergence, and the union of all values in mask must be equal to the warp's active mask.

On Volta, however, the warp execution engine will bring about the necessary synchronization/participation amongst the indicated threads in the mask, in order to make the desired/indicated operation valid (assuming the appropriate _sync version of the instrinsic is used). To be clear, the warp execution engine will reconverge threads that are diverged on volta in order to match the mask, however it will not overcome programmer induced errors such as preventing a thread from participating in a _sync() intrinsic via conditional statements.

This related question discusses the mask parameter. This answer is not intended to address all possible questions that may arise from independent thread scheduling and the impact on warp level intrinsics. For that, I encourage reading of the programming guide.

Some intrinsics named with `_sync()` appended in CUDA 9; semantics same?

There are 1 best solutions below

Related Questions in CUDA

Related Questions in PTX

Related Questions in GPU-WARP

Trending Questions

Popular # Hahtags

Popular Questions