Jetson Orin NX nvidia-container-runtime not utilizing GPU

I'm trying to get an NVIDIA Jetson Orin NX 16GB working with Kubernetes, using CRI-O as the container runtime.

I've installed the latest stable NVIDIA Container Toolkit on the Jetson:

NVIDIA Container Runtime version 1.13.5
commit: 6b8589dcb4dead72ab64f14a5912886e6165c079
spec: 1.1.0-rc.2

runc version 1.1.7-0ubuntu1~20.04.1
spec: 1.0.2-dev
go: go1.18.1
libseccomp: 2.5.1

CRI-O is configured with the following:

[crio.runtime]
  default_runtime = "nvidia"

[crio.runtime.runtimes.nvidia]
  runtime_path = "/usr/bin/nvidia-container-runtime"
  runtime_type = "oci"
  runtime_root = "/run/nvidia-container-runtime"

The nvidia-device-plugin is installed on the cluster and has labeled the node accordingly with nvidia.com/gpu: 1.

The node shows up as follows:

NAME      STATUS   ROLES           AGE     VERSION       INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
jetson1   Ready    control-plane   7d16h   v1.27.4+k0s   192.168.4.53   <none>        Ubuntu 20.04.6 LTS   5.10.104-tegra   cri-o://1.27.1

I've applied a RuntimeClass (though I thought I could do without it, since CRI-O already defaults to the nvidia runtime):

---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gpu-enabled-class
handler: nvidia

And this is the Pod that I'm testing:

---
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-query
spec:
  runtimeClassName: gpu-enabled-class
  restartPolicy: OnFailure
  containers:
    - name: nvidia-query
      image: dudo/test_cuda
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

This pod is scheduled properly and runs this script as intended, but jtop shows no GPU utilization while it runs. If I run the script directly on the Jetson, jtop shows the GPU being utilized as expected, but from the container, nothing.

Any ideas on what might be misconfigured? Any recommendations on how to debug this further?
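
One check that might narrow this down is whether any GPU device nodes are visible inside the container at all. Below is a minimal sketch of such a check, not a confirmed diagnosis. The device-node names (`nvmap`, `nvhost-*`) are an assumption about how the Tegra GPU is exposed on this JetPack release and may differ on other versions; `find_gpu_device_nodes` is a hypothetical helper name.

```python
import glob
import os

# ASSUMPTION: device-node name patterns for the Tegra integrated GPU.
# These are not taken from the setup above and may differ between JetPack releases.
ASSUMED_PATTERNS = ("nvmap", "nvhost-*")


def find_gpu_device_nodes(dev_root="/dev", patterns=ASSUMED_PATTERNS):
    """Return the sorted paths under dev_root that match any of the patterns."""
    found = set()
    for pattern in patterns:
        found.update(glob.glob(os.path.join(dev_root, pattern)))
    return sorted(found)


if __name__ == "__main__":
    nodes = find_gpu_device_nodes()
    if nodes:
        print("Visible device nodes:")
        for node in nodes:
            print("  " + node)
    else:
        print("No matching device nodes are visible from this environment.")
```

Running it once on the host and once inside the pod and comparing the two lists would show whether the runtime is passing the devices through to the container.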
