I am wondering whether the warp scheduling order of a CUDA application is deterministic.
Specifically, will the ordering of warp execution stay the same across multiple runs of the same kernel with the same input data on the same device? If not, is there anything that could force the ordering of warp execution (say, when debugging an order-dependent algorithm)?
The precise behavior of CUDA warp scheduling is not defined. Therefore you cannot depend on it being deterministic. In particular, if multiple warps are ready to be executed in a given issue slot, there is no description of which warp will be selected by the warp scheduler(s).
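One way to observe this empirically is to have the first lane of each warp record a global ticket when the warp begins executing. This is just an illustrative probe (the names `record_order`, `order_counter`, and `first_seen` are my own); running it repeatedly will generally show that the recorded order is not fixed:

```cuda
#include <cstdio>

// Global ticket counter; incremented once per warp as it starts.
__device__ int order_counter = 0;

// For each warp, store the position at which it first began executing.
// (Assumes a 1-D grid and that first_seen has one slot per warp.)
__global__ void record_order(int *first_seen)
{
    int gtid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = gtid / warpSize;

    // Only lane 0 of each warp takes a ticket.
    if (threadIdx.x % warpSize == 0)
        first_seen[warp_id] = atomicAdd(&order_counter, 1);
}
```

Comparing the contents of `first_seen` across runs shows whether the warp start order varied; on most devices and launch configurations it will.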
There is no external method to precisely control the order of warp execution.
It's certainly possible to build code that determines the warp ID and forces warps to execute in a particular order. Something like this:
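Here is a minimal sketch of that idea (the names `ordered_kernel` and `turn` are illustrative, not from any library): each warp spin-waits on a global "turn" counter until the counter matches its warp ID, does its work, then hands the turn to the next warp.

```cuda
#include <cstdio>

// Which warp is currently allowed to run. volatile so the spin loop
// re-reads it from memory on every iteration.
__device__ volatile int turn = 0;

__global__ void ordered_kernel(int *data)
{
    // Global warp ID across the grid (assumes a 1-D launch).
    int gtid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = gtid / warpSize;

    // Spin until it is this warp's turn.
    while (turn != warp_id) { /* busy-wait */ }

    // ... order-dependent work on data goes here ...

    __threadfence();               // make this warp's writes visible
    if (threadIdx.x % warpSize == 0)
        turn = warp_id + 1;        // hand off to the next warp
}
```

Note the usual caveat with this kind of spin-wait: it is only safe if every warp in the grid can be simultaneously resident on the device. If a warp whose turn comes up has not yet been scheduled onto an SM, the resident warps will spin forever and the kernel deadlocks.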
Allowing only one warp to execute at a time will be very inefficient, of course.
In general, the best parallelizable algorithms have little or no order dependence.