I have a Kubernetes + DinD (Docker-in-Docker) issue:
- Launch a Kubernetes cluster
- Start a main Docker image and a DinD image inside this cluster
- When running a job that requests a GPU, I get this error:
could not select device driver "nvidia" with capabilities: [[gpu]]
Full error:
http://localhost:2375/v1.40/containers/long-hash-string/start: Internal Server Error ("could not select device driver "nvidia" with capabilities: [[gpu]]")
When I exec into the DinD container inside the K8s pod, nvidia-smi is not available.
After some debugging, it seems this is because the DinD image is missing the NVIDIA Docker toolkit. I got the same error when I ran the same job directly with Docker on my local laptop, and I fixed it there by installing nvidia-docker2: sudo apt-get install -y nvidia-docker2
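A quick way to confirm this diagnosis is to check, from inside the DinD container, which NVIDIA binaries are on PATH. Below is a minimal Python sketch, assuming Python 3 is available in whatever container you run it in (the stock docker:19.03-dind image may not ship it). As far as I understand, Docker 19.03 only registers the "nvidia" device driver used by --gpus when it can find nvidia-container-runtime-hook on PATH, and nvidia-docker2 pulls that binary in through its dependencies. The helper name check_gpu_tooling is hypothetical.

```python
import shutil
from typing import Dict, Iterable, Optional

# Binaries to look for. nvidia-smi comes from the host driver (made visible in the
# container by the GPU runtime); nvidia-container-runtime-hook is, as far as we know,
# what Docker 19.03 looks for before enabling `--gpus`.
DEFAULT_BINARIES = ("nvidia-smi", "nvidia-container-runtime-hook")


def check_gpu_tooling(
    binaries: Iterable[str] = DEFAULT_BINARIES,
    path: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Map each binary name to its resolved path, or None if it is not found.

    `path` overrides the PATH that is searched; None means the current $PATH.
    """
    return {name: shutil.which(name, path=path) for name in binaries}


if __name__ == "__main__":
    for name, location in check_gpu_tooling().items():
        print(f"{name}: {location if location else 'NOT FOUND'}")
```

If either entry prints NOT FOUND inside the DinD container, GPU jobs started through that daemon are likely to fail with the error above.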
I'm thinking of installing nvidia-docker2 into the DinD 19.03 image (docker:19.03-dind), but I'm not sure how to do it. With a multi-stage Docker build?
Thank you very much!
Update:
Pod spec:
spec:
  containers:
  - name: dind-daemon
    image: docker:19.03-dind
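The spec above is cut off. For the DinD container to see the node's GPU at all, it typically also has to run privileged and request a GPU through the resource name exposed by the NVIDIA device plugin (nvidia.com/gpu). A minimal sketch, assuming the device plugin is installed on the cluster; the pod name, the DOCKER_TLS_CERTDIR setting, and the helper build_dind_gpu_pod are illustrative, not taken from the original spec. Note the stock image still lacks the NVIDIA tooling, which is the problem this question is about. The sketch prints JSON, which kubectl apply -f - accepts just like YAML.

```python
import json
from typing import Any, Dict


def build_dind_gpu_pod(name: str = "dind-gpu", gpus: int = 1) -> Dict[str, Any]:
    """Return a Pod manifest (as a dict) for a privileged DinD container with GPUs.

    Hypothetical helper; adjust names and image to your setup.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "dind-daemon",
                    "image": "docker:19.03-dind",
                    # DinD needs a privileged container to run its own dockerd.
                    "securityContext": {"privileged": True},
                    # Empty value disables TLS so dockerd listens on plain tcp :2375,
                    # matching the http://localhost:2375 URL in the error message.
                    "env": [{"name": "DOCKER_TLS_CERTDIR", "value": ""}],
                    # Resource name exposed by the NVIDIA device plugin (assumed installed).
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ]
        },
    }


if __name__ == "__main__":
    # Pipe the output into `kubectl apply -f -`.
    print(json.dumps(build_dind_gpu_pod(), indent=2))
```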
I got it working myself.
Referring to
But since that post is three years old now, I spent quite some time matching up dependency versions, dealing with repo migrations over those three years, etc.
My modified version of the Dockerfile used to build it:
When I exec into the Docker-in-Docker container now, I can successfully run nvidia-smi (which previously returned a "not found" error, so I could not run any docker run that requested GPU resources). You are welcome to pull my image: brandsight/dind:nvidia-docker
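To check the fix end to end from the main container, one option is to point the docker CLI at the DinD daemon and run nvidia-smi in a throwaway container with --gpus all. A minimal Python sketch, assuming the main container has the docker CLI installed and reaches DinD at tcp://localhost:2375 (the address in the error message). The image is passed in as an argument because no particular CUDA image tag is implied here; the function names are hypothetical.

```python
import os
import subprocess
import sys
from typing import Dict, List

# The DinD daemon address seen in the original error message.
DIND_HOST = "tcp://localhost:2375"


def gpu_smoke_test_cmd(image: str) -> List[str]:
    """Build a `docker run` command that requests all GPUs and runs nvidia-smi."""
    return ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"]


def run_gpu_smoke_test(image: str, docker_host: str = DIND_HOST) -> int:
    """Run the smoke test against the DinD daemon and return docker's exit code."""
    # DOCKER_HOST tells the docker CLI which daemon to talk to.
    env: Dict[str, str] = dict(os.environ, DOCKER_HOST=docker_host)
    completed = subprocess.run(gpu_smoke_test_cmd(image), env=env)
    return completed.returncode


if __name__ == "__main__":
    # Usage: python smoke_test.py <cuda-image-with-nvidia-smi>
    sys.exit(run_gpu_smoke_test(sys.argv[1]))
```

A zero exit code with the usual nvidia-smi table printed suggests the nested daemon can now select the "nvidia" device driver.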