I have a working app that uses CUDA / C++, but sometimes, because of memory leaks, it throws an exception. I need to be able to reset the GPU live: my app is a server, so it has to stay available.
I tried something like this, but it doesn't seem to work:
    try
    {
        // do process using GPU
    }
    catch (std::exception &e)
    {
        // catching exception from cuda only
        cudaSetDevice(0);
        CUDA_RETURN_(cudaDeviceReset());
    }
My idea is to reset the device each time I get an exception from the GPU, but I cannot get it to work. :( By the way, for various reasons I cannot fix every problem in my CUDA code, so I need a temporary solution. Thanks!
The only method to restore proper device functionality after a non-recoverable ("sticky") CUDA error is to terminate the host process that initiated (i.e. issued the CUDA runtime API calls that led to) the error.
Therefore, for a single-process application, the only method is to terminate the application.
It should be possible to design a multi-process application in which the initial ("parent") process makes no use of CUDA whatsoever and spawns a child process that uses the GPU. When the child process encounters an unrecoverable CUDA error, it must terminate.
The parent process can, optionally, monitor the child process. If it determines that the child process has terminated, it can re-spawn the process and restore CUDA functional behavior.
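The parent/child arrangement described above could be sketched roughly as follows (Linux/POSIX only). This is a minimal illustration, not the simpleIPC code: `supervise`, `gpu_work`, and `max_restarts` are hypothetical names, and the callback stands in for whatever CUDA work the child would do. The sketch assumes the child signals an unrecoverable error by exiting with a nonzero status, and that the parent never makes a CUDA call itself.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <cstdio>
#include <functional>

// Hypothetical supervisor: runs gpu_work in a forked child and re-spawns it
// whenever the child terminates abnormally, at most max_restarts times.
// Returns the number of child processes that were launched.
// Assumption: the parent (this function's caller) makes no CUDA calls at all.
int supervise(const std::function<int(int)>& gpu_work, int max_restarts) {
    assert(max_restarts >= 0);
    int launches = 0;
    for (int attempt = 0; attempt <= max_restarts; ++attempt) {
        std::fflush(nullptr);  // keep buffered output from being duplicated in the child
        pid_t pid = fork();
        if (pid < 0) {
            std::perror("fork");
            break;
        }
        if (pid == 0) {
            // Child: the only process that would touch the GPU.
            // A nonzero return value stands for an unrecoverable CUDA error.
            _exit(gpu_work(attempt));
        }
        ++launches;
        int status = 0;
        if (waitpid(pid, &status, 0) < 0) {
            std::perror("waitpid");
            break;
        }
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            break;  // child finished cleanly, nothing to restart
        }
        // Child exited with an error or was killed by a signal:
        // loop around and spawn a fresh process with a fresh CUDA context.
    }
    return launches;
}
```

In a real server, the callback would contain the CUDA code and would return nonzero (or simply abort) as soon as a CUDA call reports a sticky error, while the parent keeps the listening socket or request queue.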
Sticky vs. non-sticky errors are covered elsewhere, such as here.
An example of a proper multi-process app that uses e.g. fork() to spawn a child process that uses CUDA is available in the CUDA sample code simpleIPC. A rough example can be assembled from the simpleIPC example (for Linux). For Windows, the only change needed should be to use a Windows IPC mechanism for host interprocess communication.
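The simpleIPC code itself is not reproduced here. As a stand-in for the host interprocess communication part, here is a minimal sketch (Linux/POSIX) in which the child sends a single result back to the parent through a pipe. `run_in_child` and `work` are hypothetical names, the result is assumed to be one `int`, and the actual CUDA computation is left as a placeholder callback.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <cstdio>

// Hypothetical helper: runs work(input) in a forked child and passes the
// result back to the parent through a pipe. Returns true and fills *output
// only if the child exited cleanly and delivered a complete result.
bool run_in_child(int (*work)(int), int input, int* output) {
    assert(work != nullptr && output != nullptr);
    int fds[2];
    if (pipe(fds) != 0) return false;
    std::fflush(nullptr);  // keep buffered output from being duplicated in the child
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (pid == 0) {
        // Child: would run the CUDA computation here (placeholder callback).
        close(fds[0]);
        int result = work(input);
        ssize_t n = write(fds[1], &result, sizeof result);
        _exit(n == static_cast<ssize_t>(sizeof result) ? 0 : 1);
    }
    // Parent: never touches CUDA, only waits for the child's answer.
    close(fds[1]);
    int result = 0;
    ssize_t n = read(fds[0], &result, sizeof result);  // returns 0 if the child died first
    close(fds[0]);
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return false;
    bool ok = n == static_cast<ssize_t>(sizeof result) &&
              WIFEXITED(status) && WEXITSTATUS(status) == 0;
    if (ok) *output = result;
    return ok;
}
```

If the child dies before writing (for example after an unrecoverable CUDA error), the parent's read sees end-of-file and the function returns false, at which point the request could be retried in a new child.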