What is the correct way to support `__shfl()` and `__shfl_sync()` instructions?

1.1k Views Asked by At

From my understanding, CUDA 10.1 removed the shfl instructions:

PTX ISA version 6.4 removes the following features:

Support for shfl and vote instructions without the .sync qualifier has been removed for .targetsm_70 and higher. This support was deprecated since PTX ISA version 6.0 as documented in PTX ISA version 6.2.

What is the correct way to support shfl future and past CUDA versions?

My current methods (shared below) result in the error using CUDA 10.1:

ptxas ... line 466727; error   : Instruction 'shfl' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
template <typename T>
__device__ static __forceinline__
T _shfl_up(T var, unsigned int delta, int width=WARPSIZE, unsigned mask=MEMBERMASK)
{
#if (__CUDACC_VER_MAJOR__ >= 9)
  var = __shfl_up_sync(mask, var, delta, width);
#else
  var = __shfl_up(var, delta, width);
#endif
  return var;
}

Also, I would like to add that one of the dependencies of my project is CUB and I believe they utilize the same method to split up _sync() and older shfl instructions. I am not sure what I am doing wrong.

1

There are 1 best solutions below

0
On BEST ANSWER

I was doing the right thing, turns out another dependency didn't have support for sync, created a pull request for it: https://github.com/moderngpu/moderngpu/pull/32

template <typename T>
__device__ static __forceinline__
T _shfl_up(T var, unsigned int delta, int width=WARPSIZE, unsigned mask=MEMBERMASK)
{
#if ( __CUDA_ARCH__ >= 300)
#if (__CUDACC_VER_MAJOR__ >= 9)
  var = __shfl_up_sync(mask, var, delta, width);
#else
  var = __shfl_up(var, delta, width);
#endif
#endif
  return var;
}