I am having an interesting and weird issue.
When I start a Docker container with GPU support, it works fine and I can see all the GPUs inside the container. However, a few hours or a few days later, I can no longer use the GPUs in Docker.
When I run nvidia-smi inside the container, I see this message:
"Failed to initialize NVML: Unknown Error"
However, on the host machine, nvidia-smi shows all the GPUs. Also, when I restart the container, it works fine again and shows all the GPUs.
My inference container needs to stay up all the time and run inference depending on server requests. Does anyone have the same issue or a solution for this problem?
There is a workaround that I tried and found to work. Please check this link in case you need the full details: https://github.com/NVIDIA/nvidia-docker/issues/1730
I summarize the cause of the problem and elaborate on a solution here for your convenience.
Cause:
The host performs a daemon-reload (or a similar activity). If the container uses systemd to manage cgroups, the daemon-reload "triggers reloading any Unit files that have references to NVIDIA GPUs," and your container loses access to the reloaded GPU references.
How to check if your problem is caused by the issue:
When your container still has GPU access, open a terminal on the host and manually trigger a daemon reload.
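On a systemd-based host, the command for this is:

    sudo systemctl daemon-reload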
Then, go back to your container. If nvidia-smi in the container fails right away, you can proceed with the workarounds below.
Workarounds:
Although I saw in one discussion that NVIDIA planned to release a formal fix in mid-June, as of July 8, 2023, I have not seen it yet. So this should still be useful for you, especially if you just can't update your container stack.
The easiest way is to disable cgroups for your containers through Docker's daemon.json. If disabling cgroups does not hurt you, here are the steps; everything is done on the host system. First, open Docker's daemon configuration file for editing.
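On most installations this file is /etc/docker/daemon.json (create it if it does not exist yet); for example, with any editor:

    sudo nano /etc/docker/daemon.json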
Then, within the file, add this parameter setting.
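Per the linked issue, the setting that switches Docker away from the systemd cgroup driver is the following, added at the top level of the JSON object:

    "exec-opts": ["native.cgroupdriver=cgroupfs"]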
Do not forget to add a comma after the previous entry if this is not the first setting in the file. That is basic JSON syntax, but some may not be familiar with it. Here is an example of the edited file.
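(The exact contents vary by setup; this sketch assumes the NVIDIA Container Toolkit has already added its runtime entry, so only the exec-opts line is new.)

    {
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "exec-opts": ["native.cgroupdriver=cgroupfs"]
    }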
As the last step, restart the Docker service on the host.
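On a systemd-based host:

    sudo systemctl restart docker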
Note: if your container runs its own NVIDIA driver, the above steps will not work, but the reference link has more detail on how to deal with that case. I elaborate only on the simple solution that I expect many people will find useful.