Initializing CUDA in a newly-created process can take quite some time: half a second or more on many of today's server-grade machines. As @RobertCrovella explains, CUDA initialization usually includes establishing the Unified Memory model, which involves harmonizing the device and host memory maps. This can take quite a long time on machines with a lot of memory, and there may be other factors contributing to the delay.
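For reference, one way to observe this delay yourself is to time the first CUDA runtime call in a fresh process, since context creation happens lazily on that call. A minimal sketch (not from the original question, just a generic measurement harness):

```cpp
// Times CUDA context initialization in a fresh process.
// cudaFree(0) is a common idiom to force the lazy context creation.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);  // triggers context initialization
    auto end = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    std::printf("CUDA context initialization took %.1f ms\n", ms);
    return 0;
}
```

Running this back-to-back in separate processes makes the repeated per-process cost described above directly visible.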
This effect becomes quite annoying when you want to run a sequence of CUDA-utilizing processes which do not need complicated virtual memory mappings: each of them has to sit through the same long wait, despite the fact that, essentially, they could just reuse whatever initialization CUDA performed the last time (perhaps with a bit of cleanup code).
Now, obviously, if you somehow rewrote the code for all those processes to execute within a single process - that would save you those long initialization costs. But isn't there a simpler approach? What about:
- Passing the same state information / CUDA context between processes?
- Telling CUDA to ignore most host memory altogether?
- Making the Unified Memory harmonization lazier than it is now, so that it only happens to the extent that it's actually necessary?
- Starting CUDA with Unified Memory disabled?
- Keeping some daemon process on the side and latching on to its already-initialized CUDA state?
What you are asking about already exists. It is called MPS (Multi-Process Service), and it basically keeps a single GPU context alive at all times, using a daemon process that emulates the driver API. The original target application is MPI, but it does basically what you envisage.
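Client processes need no code changes; you start the control daemon once and subsequent CUDA processes attach to the shared server. A typical invocation, assuming a Linux setup with the documented default pipe/log directories:

```
# Point clients and daemon at a common rendezvous directory
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log

# Start the MPS control daemon
nvidia-cuda-mps-control -d

# ... run your CUDA processes as usual; they attach to the MPS server ...

# Shut the daemon down when finished
echo quit | nvidia-cuda-mps-control
```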
Read more here:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
http://on-demand.gputechconf.com/gtc/2015/presentation/S5584-Priyanka-Sah.pdf