Is there a way to explicitly map a thread to a specific warp in CUDA?

Say dynamic analysis of a CUDA program showed that certain threads would be better off in the same warp.

For example, suppose we have 1024 CUDA threads and a warp size of 32. After dynamic analysis we find that threads 989, 243, 819, ..., 42 (32 threads in total) should be in the same warp, because they show little to no divergence in code execution (they were not necessarily in the same warp when the dynamic analysis was performed).

Is there a way to control thread-to-warp scheduling in CUDA? If not, is there another GPU programming language that offers this kind of explicit warp scheduling? If not, what could be done to solve this problem, possibly even with a very low-level approach? I am hoping there is at least an answer to this last question, as that is probably how CUDA was implemented -- unless warp scheduling is done at the hardware level, which would be unfortunate. Thanks!

Best answer

No, you don't get to pick the assignment of threads to warps. The support for this statement is covered here.
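To make the fixed assignment concrete, here is a minimal sketch of how a block is partitioned into warps: threads with consecutive, increasing thread IDs are grouped together, starting from thread 0. The helper names `warp_in_block` and `lane_in_warp` and the kernel `record_warp_layout` are made up for illustration (they are not CUDA APIs), and the sketch assumes a 1D block launched as a single block.

```cuda
#include <cuda_runtime.h>

// Illustrative helper (not a CUDA API): index of the warp, within its block,
// that holds the thread with the given linear thread ID.
__host__ __device__ inline int warp_in_block(int linear_tid, int warp_size)
{
    return linear_tid / warp_size;
}

// Illustrative helper (not a CUDA API): position of the thread inside its warp.
__host__ __device__ inline int lane_in_warp(int linear_tid, int warp_size)
{
    return linear_tid % warp_size;
}

// Records the warp and lane of every thread in a single 1D block.
// Assumption: warp_of and lane_of each hold at least blockDim.x ints.
__global__ void record_warp_layout(int *warp_of, int *lane_of)
{
    int t = threadIdx.x;  // linear thread ID for a 1D block
    warp_of[t] = warp_in_block(t, warpSize);
    lane_of[t] = lane_in_warp(t, warpSize);
}
```

With a warp size of 32, threads 0 to 31 land in warp 0, threads 32 to 63 in warp 1, and so on. Nothing in the kernel can change that grouping.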

However, what gives a thread its particular behavior is the thread code you write, not anything locked into hardware. Whether that behavior is a data access pattern or a particular path through control flow, it is all controlled by the programmer.

One of the primary ways a CUDA thread acquires that behavior is through a globally unique thread ID, generated by boilerplate that appears in most CUDA code, e.g.:

int idx = threadIdx.x + blockDim.x * blockIdx.x;

This creates a canonical, globally unique 1D thread index for each thread.

But there's no particular reason it has to be this way. I could just as easily do:

int private_idx = threadIdx.x+blockDim.x*blockIdx.x;
int idx = desired_idx[private_idx];

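and then the threads could be numbered in any order you want. If your desired_idx array held the grouping you suggest in 32 consecutive entries, starting at an offset that is a multiple of 32:

989, 243, 819, ..., 42

then the 32 adjacent threads that read those entries, which share a warp as long as the block size is a multiple of 32, would take on the behavior of threads 989, 243, 819, ..., 42.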
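A fuller sketch of this remapping might look like the following. Everything beyond the two index lines above is an assumption made for illustration: the host helper `build_desired_idx`, the kernel name `remapped_kernel`, the `data` array, and the placeholder work of doubling one element per logical thread.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical host helper: builds a permutation of 0..n-1 in which the logical
// IDs listed in `group` fill the first group.size() slots, in the given order,
// and all remaining IDs follow in ascending order. With a 32-entry group, the
// group occupies slots 0..31, i.e. the first warp of the first block.
std::vector<int> build_desired_idx(int n, const std::vector<int> &group)
{
    std::vector<char> used(n, 0);
    std::vector<int> desired;
    desired.reserve(n);
    for (int id : group) {
        assert(id >= 0 && id < n && !used[id]);  // IDs must be valid and distinct
        used[id] = 1;
        desired.push_back(id);
    }
    for (int id = 0; id < n; ++id) {
        if (!used[id]) desired.push_back(id);
    }
    return desired;
}

// Each hardware thread looks up the logical index it should act as.
// Assumption: desired_idx is a device array holding a permutation of 0..n-1.
__global__ void remapped_kernel(const int *desired_idx, float *data, int n)
{
    int private_idx = threadIdx.x + blockDim.x * blockIdx.x;  // hardware-assigned ID
    if (private_idx >= n) return;                             // guard partial blocks
    int idx = desired_idx[private_idx];                       // logical ID to emulate
    data[idx] = 2.0f * data[idx];                             // placeholder for real work
}
```

On the host, the permutation would be copied to the device with cudaMemcpy and the kernel launched with a block size that is a multiple of 32. One trade-off to keep in mind: because idx is no longer contiguous across a warp, accesses such as data[idx] may no longer coalesce, so any gain from reduced divergence has to be weighed against the extra memory traffic and the lookup itself.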