AMD's HIP C++ is very similar to CUDA C++. AMD also created Hipify, a tool that converts CUDA C++ to HIP C++ (portable C++ code) that can run on both NVIDIA and AMD GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP
- There are requirements for using shfl operations on NVIDIA GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/tree/master/samples/2_Cookbook/4_shfl#requirement-for-nvidia
requirement for nvidia
please make sure you have a 3.0 or higher compute capable device in order to use warp shfl operations and add -gencode arch=compute=30, code=sm_30 nvcc flag in the Makefile while using this application.
- It is also noted that HIP supports shfl with a wavefront size of 64 (the AMD counterpart of the warp size): https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/docs/markdown/hip_faq.md#why-use-hip-rather-than-supporting-cuda-directly
In addition, HIP defines portable mechanisms to query architectural features, and supports a larger 64-bit wavesize which expands the return type for cross-lane functions like ballot and shuffle from 32-bit ints to 64-bit ints.
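To make the quoted point about the wider return type concrete, here is a minimal host-side sketch in plain C++ (not device code) of what a ballot over a 64-lane wavefront has to return. The name ballot64 and the constant kWaveSize are hypothetical and exist only for this illustration; they are not HIP API names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWaveSize = 64;  // wavefront width assumed for this sketch

// Host-side model of a ballot: bit i of the result is set when lane i's
// predicate is true. With 64 lanes the mask no longer fits in 32 bits.
std::uint64_t ballot64(const std::vector<bool>& predicate) {
    assert(predicate.size() <= kWaveSize);
    std::uint64_t mask = 0;
    for (std::size_t lane = 0; lane < predicate.size(); ++lane) {
        if (predicate[lane]) {
            mask |= (std::uint64_t{1} << lane);
        }
    }
    return mask;
}
```

A lane index of 32 or above sets a bit that a 32-bit integer cannot hold, which is why the result type has to grow to 64 bits for a 64-wide wavefront.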
But which AMD GPUs support the shfl functions? Or does every AMD GPU support shfl, because on AMD GPUs it is implemented through local memory rather than with a hardware register-to-register instruction?
NVIDIA GPUs require compute capability (CUDA CC) 3.0 or higher, but what are the requirements for using shfl operations on AMD GPUs with HIP C++?
Yes, GCN3 GPUs have new instructions such as ds_bpermute and ds_permute, which can provide the functionality of __shfl() and even more.
The ds_bpermute and ds_permute instructions use only the routing hardware of local memory (LDS, 8.6 TB/s) but do not actually access local memory, which accelerates data exchange between threads: 8.6 TB/s < speed < 51.6 TB/s: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
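The difference between the two instructions is easier to see in code. The following is a host-side model in plain C++, not GCN assembly or HIP device code: bpermute reads (pulls) from a chosen lane and permute writes (pushes) to a chosen lane. The function names, the use of lane indices instead of byte addresses, the wrap-around of out-of-range indices, and the handling of write conflicts and unwritten lanes are all assumptions of this sketch rather than a statement of the hardware's exact behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Lanes = std::vector<std::uint32_t>;

// Pull ("backward") model, in the spirit of ds_bpermute:
// lane i receives the value held by lane index[i].
Lanes bpermute(const Lanes& src, const std::vector<std::size_t>& index) {
    assert(src.size() == index.size());
    const std::size_t n = src.size();
    Lanes dst(n);
    for (std::size_t lane = 0; lane < n; ++lane) {
        dst[lane] = src[index[lane] % n];  // wrap out-of-range indices (assumption)
    }
    return dst;
}

// Push ("forward") model, in the spirit of ds_permute:
// lane i sends its value to lane index[i].
Lanes permute(const Lanes& src, const std::vector<std::size_t>& index) {
    assert(src.size() == index.size());
    const std::size_t n = src.size();
    Lanes dst(n, 0);  // lanes nobody writes to stay 0 (assumption)
    for (std::size_t lane = 0; lane < n; ++lane) {
        dst[index[lane] % n] = src[lane];  // a later lane overwrites an earlier one (assumption)
    }
    return dst;
}

// A shuffle-down built on the pull model: lane i reads lane i + delta,
// or keeps its own value when that lane does not exist.
Lanes shfl_down(const Lanes& src, std::size_t delta) {
    std::vector<std::size_t> index(src.size());
    for (std::size_t lane = 0; lane < src.size(); ++lane) {
        index[lane] = (lane + delta < src.size()) ? lane + delta : lane;
    }
    return bpermute(src, index);
}
```

With src = {10, 20, 30, 40} and index = {1, 2, 3, 0}, the pull model gives {20, 30, 40, 10} while the push model gives {40, 10, 20, 30}. A shuffle with a per-lane source index is exactly the pull case, which is why a bpermute-style instruction is enough to express __shfl()-like operations.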
For example, the wave_shr instruction (wavefront shift right) can be used for the scan algorithm.
More about GCN3: https://github.com/olvaffe/gpu-docs/raw/master/amd-open-gpu-docs/AMD_GCN3_Instruction_Set_Architecture.pdf
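As a rough illustration of how a shift-right across lanes supports a scan, here is a host-side sketch in plain C++ of a Hillis-Steele style inclusive prefix sum. It is a generic model, not the actual GCN3 instruction sequence: shift_right and inclusive_scan_by_shifts are hypothetical helper names, and shifting by distances larger than one lane is an assumption made to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of a cross-lane shift right: lane i receives the value of lane
// i - distance; lanes without a source lane receive `fill`.
std::vector<std::uint32_t> shift_right(const std::vector<std::uint32_t>& v,
                                       std::size_t distance,
                                       std::uint32_t fill) {
    std::vector<std::uint32_t> out(v.size(), fill);
    for (std::size_t lane = distance; lane < v.size(); ++lane) {
        out[lane] = v[lane - distance];
    }
    return out;
}

// Inclusive prefix sum in log2(n) steps: at each step every lane adds the
// value shifted in from `distance` lanes to its left, then distance doubles.
std::vector<std::uint32_t> inclusive_scan_by_shifts(std::vector<std::uint32_t> v) {
    for (std::size_t distance = 1; distance < v.size(); distance *= 2) {
        const std::vector<std::uint32_t> shifted = shift_right(v, distance, 0);
        for (std::size_t lane = 0; lane < v.size(); ++lane) {
            v[lane] += shifted[lane];
        }
    }
    return v;
}
```

For the input {1, 2, 3, 4, 5} the steps with distances 1, 2 and 4 produce {1, 3, 5, 7, 9}, then {1, 3, 6, 10, 14}, then {1, 3, 6, 10, 15}. On the GPU each step would be a single cross-lane move plus an add, with no round trip through local memory.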