I want to share a GPU across multiple pods the way CPU can be shared: just as 1 CPU can be divided into fractions (0.x, i.e. millicores), I want the same for a GPU. As it stands, if I run with 1 GPU, one pod takes the whole GPU and the other pods can't use it. When I tried to install the NVIDIA GPU Operator instead, I ran into the above issue, where the pods get stuck in the Init phase.
1. Issue or feature description
I have installed k3s with all the related configs to use the NVIDIA GPU, and it is working. But in the pod YAML I can't use a fractional value for the GPU, like:
limits:
  nvidia.com/gpu: 0.5
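For context on why this fails: cpu is a native Kubernetes resource and accepts fractional (millicore) quantities, but nvidia.com/gpu is an extended resource, and extended resources must be requested in whole integers. A minimal sketch of what the API server accepts versus rejects:

resources:
  limits:
    cpu: 500m            # fine: native resource, fractional values allowed
    nvidia.com/gpu: 1    # fine: extended resource, whole integer
    # nvidia.com/gpu: 0.5 would be rejected at admission: not an integer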
2. Steps to reproduce the issue
- Install k3s
- Install the NVIDIA headless server driver, then verify it with nvidia-smi
- Install the NVIDIA Container Toolkit
- Install the GPU device plugin (you should then see the plugin pod running):
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.2/nvidia-device-plugin.yml
- Check with ctr that containers can use the GPU as well, again verified by nvidia-smi (see the verification sketch after this list).
- Integrate NVIDIA with k3s and deploy a test CUDA image.
- This is where I hit the issue.
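For the device plugin and ctr checks above, here is a sketch of the verification commands, assuming a standard install; the node name is from my setup and the CUDA image tag is illustrative:

# Device plugin pod should be running in kube-system
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds
# The node should now advertise the GPU as an allocatable resource
kubectl get node devops-s2h -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
# Direct containerd test with GPU passthrough via k3s's bundled ctr
sudo k3s ctr images pull docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04
sudo k3s ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 cuda-test nvidia-smi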
I used the config.toml.tmpl file available at https://k3d.io/v4.4.8/usage/guides/cuda/config.toml.tmpl
However, with that default file I can't run my workload, because the GPU pod gives this error:
Warning FailedCreatePodSandBox 2s (x4 over 41s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to start shim: exec: "containerd-shim": executable file not found in $PATH: unknown
As a workaround, I made the following changes to the configuration (k3s's bundled containerd ships only the v2 shim, containerd-shim-runc-v2, which is presumably why the legacy v1 runtime type fails to find containerd-shim):
[plugins.cri.containerd.runtimes.runc]
  # ---- changed from 'io.containerd.runtime.v1.linux' for GPU support
  runtime_type = "io.containerd.runc.v2"
  # runtime_type = "io.containerd.runtime.v1.linux"

[plugins.cri.containerd.runtimes.runc.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

# ---- added for GPU support
[plugins.linux]
  runtime = "nvidia-container-runtime"
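For anyone reproducing this: in k3s, the template must be placed at containerd's config path under the k3s data directory (default path shown, assuming a standard install):

sudo cp config.toml.tmpl /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
# k3s renders this template into containerd's config.toml on the next start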
With the above configuration, after restarting k3s (stopping it and starting it again with sudo k3s server), I can run my pod with 1 GPU.
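For reference, the kind of test pod that now works looks like this; a minimal sketch, where the pod name and the CUDA sample image tag are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1   # whole GPU only; fractional values are rejected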
Below are some of the relevant configurations and outputs.
Device plugin container logs:
I1118 07:55:45.920375 1 main.go:256] Retreiving plugins.
I1118 07:55:45.920407 1 factory.go:107] Detected NVML platform: found NVML library
I1118 07:55:45.920424 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1118 07:55:45.920851 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I1118 07:55:45.921242 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1118 07:55:45.923008 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Node details:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
devops-s2h Ready control-plane,master 4d3h v1.27.7+k3s2 172.16.11.243 <none> Ubuntu 22.04.3 LTS 6.2.0-36-generic containerd://1.7.7-k3s1.27
uname -a
Linux devops-S2H 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro M4000 Off | 00000000:07:00.0 On | N/A |
| 47% 42C P8 14W / 120W | 336MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1784 G /usr/bin/gnome-shell 116MiB |
| 0 N/A N/A 3292 C+G ...29837382,8897893935064066579,262144 211MiB |
+---------------------------------------------------------------------------------------+
In the end, I just want to share one GPU with many pods.
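From what I have read, the device plugin I am already running (v0.14.2) supports sharing a GPU across pods via time-slicing, which looks like the intended way to get this without the GPU Operator. A sketch of the sharing config based on the plugin's documentation; the replica count of 4 is an assumption to adapt:

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # the node then advertises nvidia.com/gpu: 4

With this, each pod still requests an integer (nvidia.com/gpu: 1), but four pods can share the single physical GPU. Note that time-slicing gives no memory or fault isolation between pods, unlike MIG (which the Quadro M4000 does not support anyway).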