Using a GPU in fractions in k3s


I want to share a GPU between multiple pods the way a CPU can be shared. Just as 1 CPU can be divided into fractions (0.x, or millicores), I want the same for the GPU. As it stands, with 1 GPU, one pod takes the whole GPU and the other pods can't use it. When I tried to install the NVIDIA GPU Operator I ran into the issue below, where the pods are stuck in the init phase.

1. Issue or feature description

I have installed k3s with all the related configuration to use the NVIDIA GPU, and it is working. But in the YAML file I can't use a fractional value for the GPU, like:

            limits:
              nvidia.com/gpu: 0.5
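From what I can tell, Kubernetes treats nvidia.com/gpu as an extended resource, and the API server only accepts whole integers for extended resources, which is why 0.5 is rejected at validation time. For reference, a minimal pod spec with the only form that is accepted (the pod name and image are placeholders, not from my actual setup):

```yaml
# Minimal test pod; extended resources such as nvidia.com/gpu only
# accept whole-number values, so "0.5" fails API validation.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                                 # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04 # example image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                      # must be an integer
```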

2. Steps to reproduce the issue

  1. Install k3s.
  2. Install the NVIDIA headless server driver, then verify it with nvidia-smi.
  3. Install the NVIDIA Container Toolkit.
  4. Install the GPU device plugin; you should see its pod after kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.2/nvidia-device-plugin.yml
  5. Check NVIDIA with ctr so containers can use the GPU as well; again verified with nvidia-smi.
  6. Integrate NVIDIA with k3s and deploy a test CUDA image.
  7. This is where I hit the issue. I used the config.toml.tmpl file available at https://k3d.io/v4.4.8/usage/guides/cuda/config.toml.tmpl, but with that default file my runtime can't start, and the GPU pod gives me this error: Warning FailedCreatePodSandBox 2s (x4 over 41s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to start shim: exec: "containerd-shim": executable file not found in $PATH: unknown

As a workaround, I made these changes to the configuration:

[plugins.cri.containerd.runtimes.runc]
  # ---- changed from 'io.containerd.runtime.v1.linux' for GPU support
  runtime_type = "io.containerd.runc.v2"
#  runtime_type = "io.containerd.runtime.v1.linux"

[plugins.cri.containerd.runtimes.runc.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

# ---- added for GPU support
[plugins.linux]
  runtime = "nvidia-container-runtime"
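For what it's worth, newer k3s releases document a different route that avoids patching config.toml.tmpl: when k3s detects the NVIDIA container runtime on the node, it registers an nvidia runtime handler in its embedded containerd, and a pod opts into it through a RuntimeClass. A sketch based on my reading of the k3s docs (untested on this setup):

```yaml
# RuntimeClass selecting the "nvidia" containerd runtime handler
# that k3s registers when it detects nvidia-container-runtime.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```

Pods would then set runtimeClassName: nvidia in their spec instead of relying on a modified default runc runtime.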

With the above configuration, after stopping and starting k3s again with sudo k3s server, I can run my pod with 1 GPU.

Below are some of the relevant configuration and outputs. Device plugin container logs:

I1118 07:55:45.920375       1 main.go:256] Retreiving plugins.
I1118 07:55:45.920407       1 factory.go:107] Detected NVML platform: found NVML library
I1118 07:55:45.920424       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1118 07:55:45.920851       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I1118 07:55:45.921242       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1118 07:55:45.923008       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet

Node details :

NAME         STATUS   ROLES                  AGE    VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
devops-s2h   Ready    control-plane,master   4d3h   v1.27.7+k3s2   172.16.11.243   <none>        Ubuntu 22.04.3 LTS   6.2.0-36-generic   containerd://1.7.7-k3s1.27

uname -a:

Linux devops-S2H 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

nvidia-smi:


+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro M4000                   Off | 00000000:07:00.0  On |                  N/A |
| 47%   42C    P8              14W / 120W |    336MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1784      G   /usr/bin/gnome-shell                        116MiB |
|    0   N/A  N/A      3292    C+G   ...29837382,8897893935064066579,262144      211MiB |
+---------------------------------------------------------------------------------------+

In the end, I just want to use the GPU from many pods at once.
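From the device plugin docs, the closest thing to fractional GPUs seems to be time-slicing, which is supported since plugin v0.12 (so the v0.14.2 used above should qualify): the plugin can advertise one physical GPU as several schedulable nvidia.com/gpu replicas. A sketch of the plugin config based on my reading of those docs (the replica count is just an example; I have not verified this on my cluster):

```yaml
# Device-plugin config enabling time-slicing (example values, untested here).
# One physical GPU is advertised as 4 schedulable nvidia.com/gpu devices;
# each pod still requests a whole unit (nvidia.com/gpu: 1) and they
# time-share the card with no memory isolation between them.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```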
