Is it possible to run multiple TensorFlow Serving containers on a single GPU node in Kubernetes?


I am running a TensorFlow model on an AKS cluster with GPU nodes. The model currently runs in a single TF Serving container (https://hub.docker.com/r/tensorflow/serving) in a single pod on a single GPU node.

By default the TF Serving container claims all available RAM in the pod, but I can lower the container's memory request in my deployment.yaml and still get the same results in acceptable processing time. I was wondering whether it is possible to run two TF models in parallel on the same GPU node. Memory-wise it should work, but when I scale my deployment to two replicas, it tries to create two pods and the second one hangs in status Pending.

$ kubectl get po -n myproject -w
NAME                                 READY   STATUS    RESTARTS   AGE
myproject-deployment-cb7769df4-ljcfc   1/1     Running   0          2m
myproject-deployment-cb7769df4-np9qd   0/1     Pending   0          26s

If I describe the pod, I get the following error:

$ kubectl describe po -n myproject myproject-deployment-cb7769df4-np9qd
Name:           myproject-deployment-cb7769df4-np9qd
Namespace:      myproject
<...>
Events:
  Type     Reason            Age   From                Message
  ----     ------            ----  ----                -------
  Warning  FailedScheduling  105s  default-scheduler   0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

Since the first pod 'claims' the GPU, the second one cannot use it anymore and remains in status Pending. I see two different possibilities:

  1. Run two TF serving containers in one pod on one GPU node
  2. Run two pods, each with one TF serving container on one GPU node

Is either of the options above feasible?

My deployment can be found below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myproject-deployment
  labels:
    app: myproject-server
  namespace: myproject
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myproject-server
  template:
    metadata:
      labels:
        app: myproject-server
    spec:
      containers:
      - name: server
        image: tensorflow/serving:2.3.0-gpu
        ports:
        - containerPort: 8500
        volumeMounts:
          - name: azurestorage
            mountPath: /models
        resources:
          requests:
            memory: "10Gi"
            cpu: "1"
          limits:
            memory: "12Gi"
            cpu: "2"
            nvidia.com/gpu: 1
        args: ["--model_config_file=/models/models.config", "--monitoring_config_file=/models/monitoring.config"]
      volumes:
      - name: azurestorage
        persistentVolumeClaim:
          claimName: pvcmodels

1 Answer

Interesting question. As far as I know, this is not possible out of the box, not even for two containers running in the same pod, since resources are configured at the container level (see https://github.com/kubernetes/kubernetes/issues/52757).
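
To make the container-level constraint concrete: nvidia.com/gpu is an extended resource, so it must be requested in whole integers and each unit is assigned exclusively to one container. A two-container pod like the hypothetical sketch below would therefore need two physical GPUs on the node rather than sharing one (container names are made up):

spec:
  containers:
  - name: server-a                    # hypothetical first model server
    image: tensorflow/serving:2.3.0-gpu
    resources:
      limits:
        nvidia.com/gpu: 1             # whole GPU, exclusive to this container
  - name: server-b                    # hypothetical second model server
    image: tensorflow/serving:2.3.0-gpu
    resources:
      limits:
        nvidia.com/gpu: 1             # needs a second GPU; fractions are rejected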

I found this while searching for an answer: https://blog.ml6.eu/a-guide-to-gpu-sharing-on-top-of-kubernetes-6097935ababf, but that involves tinkering with Kubernetes itself.
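
Separately, if the underlying concern is TensorFlow grabbing all GPU memory by default, the model server binary accepts a --per_process_gpu_memory_fraction flag to cap its share. A minimal sketch extending the args from the question (the 0.4 value is an arbitrary example):

args:
  - "--model_config_file=/models/models.config"
  - "--monitoring_config_file=/models/monitoring.config"
  - "--per_process_gpu_memory_fraction=0.4"   # cap this process's GPU memory allocation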

You could run multiple processes in the same container to achieve sharing; however, this goes a bit against the idea of Kubernetes and containers, and of course it won't work for two completely different workloads or services.
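
Along the same lines, a single TF Serving process can serve several models from the models.config that the question's --model_config_file flag already points to, which shares the one GPU without any Kubernetes changes, as long as all workloads are TF models. A minimal sketch (model names and paths are made up):

model_config_list {
  config {
    name: "model_a"                   # hypothetical first model
    base_path: "/models/model_a"      # subdirectory on the mounted volume
    model_platform: "tensorflow"
  }
  config {
    name: "model_b"                   # hypothetical second model
    base_path: "/models/model_b"
    model_platform: "tensorflow"
  }
}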