I am running into a strange issue.
When I start a Docker container with GPU access, it works fine and I can see all the GPUs inside the container. However, a few hours or a few days later, the GPUs are no longer usable from the container.
When I run nvidia-smi inside the container, I see this message:
"Failed to initialize NVML: Unknown Error"
On the host machine, however, nvidia-smi still shows all the GPUs. Also, when I restart the container, it works fine again and shows all the GPUs.
My inference container has to stay up all the time and run inference in response to server requests. Has anyone had the same issue, or does anyone know a solution?
I had the same issue. I just ran
screen watch -n 1 nvidia-smi
in the container, and now it works continuously.
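If you would rather not depend on screen and watch being installed in the image, roughly the same polling can be done from a small Python script. This is only a sketch of the workaround above, not a confirmed fix: the function names (is_nvml_failure, poll_once, keep_alive) are made up for illustration, and the one-second interval just mirrors the watch -n 1 setting. It also prints a line when the NVML error from the question shows up, so you can tell from the logs when the container lost the GPUs.

```python
import subprocess
import time

# Error text quoted in the question; used to recognise the failure state.
NVML_ERROR = "Failed to initialize NVML"


def is_nvml_failure(output: str) -> bool:
    """Return True if the output contains the NVML initialization error."""
    return NVML_ERROR in output


def poll_once(cmd=("nvidia-smi",)) -> bool:
    """Run the command once and return True if the GPUs look reachable."""
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=30
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Command is missing or hung: treat it as a failed poll.
        return False
    combined = result.stdout + result.stderr
    return result.returncode == 0 and not is_nvml_failure(combined)


def keep_alive(interval: float = 1.0, max_polls=None) -> None:
    """Poll nvidia-smi every `interval` seconds, like `watch -n 1 nvidia-smi`.

    max_polls=None loops forever; pass a number to stop after that many polls.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if not poll_once():
            print("nvidia-smi failed; GPUs may be unavailable in this container",
                  flush=True)
        time.sleep(interval)
        polls += 1
```

To use it, call keep_alive() from a background process started by the container's entrypoint. Since restarting the container brought the GPUs back in the original report, the printed message could also serve as a signal for whatever restarts the container.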