I'm using CUDA 7.0 with an NVIDIA GTX 980 for some image processing. In a given iteration, multiple tiles are processed independently via 15-20 kernel launches and multiple cuFFT FFT/IFFT API calls.
Because of this, I've placed each tile in its own CUDA stream so that each tile executes its string of operations asynchronously with respect to the host. The tiles are all the same size within an iteration, so they share a cuFFT plan. The host thread moves through the commands quickly in an attempt to keep the GPU loaded with work. However, I'm experiencing a periodic race condition while these operations are processed in parallel, and I have a question about cuFFT in particular. Suppose I set the shared cuFFT plan to stream 0 using cufftSetStream() and issue tile 0's FFT, and then, before that FFT has actually executed on the GPU, the host sets the plan's stream to stream 1 and issues tile 1's work. What is the behavior of cufftExec() for this plan?

More succinctly, does a call to cufftExec() execute in the stream the plan was set to at the time of the cufftExec() call, regardless of whether cufftSetStream() is used to change the stream for subsequent tiles before the previous FFT calls have actually begun/completed?
I apologize for not posting code, but I'm not able to post my actual source.
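That said, here is a stripped-down sketch of the call pattern in question. The tile count, tile size, and buffer names are placeholders for illustration, not my real code:

    // Sketch of the pattern: one shared plan, one stream per tile, host loops quickly.
    // Error checking omitted for brevity.
    #include <cufft.h>
    #include <cuda_runtime.h>

    #define NUM_TILES 4
    #define TILE_DIM  512

    int main() {
        cudaStream_t streams[NUM_TILES];
        cufftComplex *d_tiles[NUM_TILES];
        for (int i = 0; i < NUM_TILES; i++) {
            cudaStreamCreate(&streams[i]);
            cudaMalloc(&d_tiles[i], TILE_DIM * TILE_DIM * sizeof(cufftComplex));
            cudaMemset(d_tiles[i], 0, TILE_DIM * TILE_DIM * sizeof(cufftComplex));
        }

        // All tiles are the same size, so they share a single plan.
        cufftHandle plan;
        cufftPlan2d(&plan, TILE_DIM, TILE_DIM, CUFFT_C2C);

        // The host races through this loop; the question is whether the
        // cufftSetStream() of iteration i+1 can affect the exec issued in iteration i.
        for (int i = 0; i < NUM_TILES; i++) {
            cufftSetStream(plan, streams[i]);
            cufftExecC2C(plan, d_tiles[i], d_tiles[i], CUFFT_FORWARD);
            // ... other per-tile kernels launched into streams[i] ...
        }

        cudaDeviceSynchronize();
        cufftDestroy(plan);
        for (int i = 0; i < NUM_TILES; i++) {
            cudaFree(d_tiles[i]);
            cudaStreamDestroy(streams[i]);
        }
        return 0;
    }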
EDIT: As pointed out in the comments, if the same plan (same created handle) is used for simultaneous FFT execution on the same device via streams, then the user is responsible for managing separate work areas for each such usage of the plan. The question seemed to focus on the stream behavior itself, and the rest of my answer focuses on that as well, but this is an important point.
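One way such work-area management might look is sketched below. This is only a sketch, not code from the question or from the CUFFT documentation; it assumes equally sized tiles, pre-created streams and buffers, and that the work-area pointer, like the stream, is picked up at cufftExec time. Creating one plan per stream is the simpler alternative if memory permits.

    // Sketch only: give each concurrent usage of a shared plan its own work area.
    #include <cufft.h>
    #include <cuda_runtime.h>

    void fft_tiles_shared_plan(cufftComplex **d_tiles, cudaStream_t *streams,
                               int num_tiles, int tile_dim) {
        cufftHandle plan;
        size_t workSize = 0;
        cufftCreate(&plan);
        cufftSetAutoAllocation(plan, 0);          // we supply the scratch space ourselves
        cufftMakePlan2d(plan, tile_dim, tile_dim, CUFFT_C2C, &workSize);

        // One work area per concurrent usage, so simultaneous FFTs don't share scratch.
        void **d_work = new void *[num_tiles];
        for (int i = 0; i < num_tiles; i++)
            cudaMalloc(&d_work[i], workSize);

        for (int i = 0; i < num_tiles; i++) {
            cufftSetStream(plan, streams[i]);     // this tile's stream
            cufftSetWorkArea(plan, d_work[i]);    // this tile's scratch space
            cufftExecC2C(plan, d_tiles[i], d_tiles[i], CUFFT_FORWARD);
        }

        cudaDeviceSynchronize();
        for (int i = 0; i < num_tiles; i++)
            cudaFree(d_work[i]);
        delete[] d_work;
        cufftDestroy(plan);
    }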
Let me pretend you said stream 1 and stream 2, just so we can avoid any possible confusion around the NULL stream.
CUFFT should respect the stream that was defined for the plan at the time the plan was passed to CUFFT via cufftExecXXX(). Subsequent changes to the plan via cufftSetStream() should have no effect on the stream used for previously issued cufftExecXXX() calls.

We can verify this with a fairly simple test, using the profiler. Consider the following test code:
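A minimal version of such a test might look like this (the signal size, names, and build line here are illustrative; error handling is trimmed):

    // Build with something like: nvcc -o t_stream t_stream.cu -lcufft -lnvToolsExt
    #include <cufft.h>
    #include <cuda_runtime.h>
    #include <nvToolsExt.h>

    int main() {
        const int nx = 1 << 23;                  // large enough for each FFT to show up in the profiler
        cufftComplex *d_data;
        cudaMalloc(&d_data, nx * sizeof(cufftComplex));
        cudaMemset(d_data, 0, nx * sizeof(cufftComplex));

        cudaStream_t stream1, stream2;
        cudaStreamCreate(&stream1);
        cudaStreamCreate(&stream2);

        cufftHandle plan;
        cufftPlan1d(&plan, nx, CUFFT_C2C, 1);

        // First FFT: the plan is associated with stream1 when exec is called.
        cufftSetStream(plan, stream1);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

        // Marker so the stream-change request is visible in the profiler timeline.
        nvtxMarkA("plan stream change");

        // Second FFT: the plan is now associated with stream2.
        cufftSetStream(plan, stream2);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

        cudaDeviceSynchronize();                 // absorbs essentially all of the kernel time
        cufftDestroy(plan);
        cudaStreamDestroy(stream1);
        cudaStreamDestroy(stream2);
        cudaFree(d_data);
        return 0;
    }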
We're just doing two forward FFTs in a row, switching streams in between the two. We use an nvtx marker to clearly identify the point at which the plan stream association change request occurs.

Now let's look at the nvprof --print-api-trace output (removing the lengthy start-up preamble). We see that each FFT operation requires 3 kernel calls. In between, we see our nvtx marker indicating when the request for a plan stream change was made, and it's no surprise that this takes place after the first 3 kernel launches, but before the last 3. Finally, we note that essentially all of the execution time is absorbed in the final cudaDeviceSynchronize() call. All of the preceding calls are asynchronous and so execute more or less "immediately" in the first millisecond of execution. The final synchronize absorbs the processing time of all 6 kernels, amounting to about 150 milliseconds.

So if the cufftSetStream() were to have an effect on the first cufftExecC2C() call, we would expect to see some or all of the first 3 kernels launched into the same stream as that used for the last 3 kernels. But when we look at the nvprof --print-gpu-trace output, we see that in fact the first 3 kernels are issued into the first stream and the last 3 kernels are issued into the second stream, just as requested. (And the sum total execution time of all kernels is approximately 150 ms, just as suggested by the api trace output.)

Since the underlying kernel launches are asynchronous and are issued prior to the return of the cufftExecC2C() call, if you think about this carefully you'll come to the conclusion that it has to be this way. The stream to launch a kernel into is specified at kernel launch time. (And of course I think this is considered "preferred" behavior.)
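As a plain-CUDA analogy for that last point (not CUFFT code, just an illustration of how a launch binds to a stream):

    #include <cuda_runtime.h>

    __global__ void dummy(float *d) { d[threadIdx.x] += 1.0f; }

    void launch_demo(float *d, cudaStream_t stream1, cudaStream_t stream2) {
        dummy<<<1, 32, 0, stream1>>>(d);  // this launch is bound to stream1 right here
        // Nothing done afterwards (e.g. reassigning a stream variable) can
        // retroactively move the launch above into a different stream.
        dummy<<<1, 32, 0, stream2>>>(d);  // a separate launch, bound to stream2
    }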