I am running a TensorFlow model on an AKS cluster with GPU nodes. The model currently runs in a single TF Serving container (https://hub.docker.com/r/tensorflow/serving) in a single pod on a single GPU node.
By default the TF Serving container claims all RAM available to the pod, but I can lower the container's memory request in my deployment.yaml file and still get the same results in acceptable processing time. I was wondering whether it is possible to run two TF models in parallel on the same GPU node. Memory-wise it should work, but when I scale my deployment to two replicas, the ReplicaSet tries to deploy two pods and the second one hangs in status Pending.
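For reference, the scale-up itself is just the replica count; the equivalent imperative command would be:

$ kubectl scale deployment myproject-deployment -n myproject --replicas=2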
$ kubectl get po -n myproject -w
NAME                                   READY   STATUS    RESTARTS   AGE
myproject-deployment-cb7769df4-ljcfc   1/1     Running   0          2m
myproject-deployment-cb7769df4-np9qd   0/1     Pending   0          26s
If I describe the pod, I get the following error:
$ kubectl describe po -n myproject myproject-deployment-cb7769df4-np9qd
Name:       myproject-deployment-cb7769df4-np9qd
Namespace:  myproject
<...>
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  105s  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
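To confirm that the node really only exposes a single GPU, I can check its capacity and allocatable resources (node name is a placeholder):

$ kubectl describe node <gpu-node-name> | grep -i nvidia.com/gpu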
Since the first pod 'claims' the GPU, the second one cannot use it anymore and remains in status Pending. I see two different possibilities:
- Run two TF Serving containers in one pod on one GPU node
- Run two pods, each with one TF Serving container, on one GPU node
Is any of the options above feasible?
My deployment can be found below.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myproject-deployment
  labels:
    app: myproject-server
  namespace: myproject
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myproject-server
  template:
    metadata:
      labels:
        app: myproject-server
    spec:
      containers:
      - name: server
        image: tensorflow/serving:2.3.0-gpu
        ports:
        - containerPort: 8500
        volumeMounts:
        - name: azurestorage
          mountPath: /models
        resources:
          requests:
            memory: "10Gi"
            cpu: "1"
          limits:
            memory: "12Gi"
            cpu: "2"
            nvidia.com/gpu: 1
        args: ["--model_config_file=/models/models.config", "--monitoring_config_file=/models/monitoring.config"]
      volumes:
      - name: azurestorage
        persistentVolumeClaim:
          claimName: pvcmodels
Interesting question - as far as I know, this is not possible out of the box, not even for two containers running in the same pod: resources are configured at the container level, and nvidia.com/gpu can only be requested in whole units, so a single GPU cannot be shared between containers (see https://github.com/kubernetes/kubernetes/issues/52757).
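To illustrate the container-level limitation: even if you packed two TF Serving containers into one pod, each container would need its own whole-number nvidia.com/gpu limit, so the pod would end up requiring two physical GPUs. A minimal sketch (container names are made up):

      containers:
      - name: server-a                  # hypothetical first model server
        image: tensorflow/serving:2.3.0-gpu
        resources:
          limits:
            nvidia.com/gpu: 1           # whole GPUs only, fractions are rejected
      - name: server-b                  # hypothetical second model server
        image: tensorflow/serving:2.3.0-gpu
        resources:
          limits:
            nvidia.com/gpu: 1           # pod now needs a node with 2 GPUs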
I found this while searching for an answer: https://blog.ml6.eu/a-guide-to-gpu-sharing-on-top-of-kubernetes-6097935ababf, but that approach involves tinkering with Kubernetes itself.
You could run multiple processes in the same container to achieve sharing (a sketch follows below); however, this goes a bit against the one-process-per-container idea of Kubernetes, and of course it won't work for two completely different workloads/services.
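A rough sketch of the multi-process route, based on your manifest above: override the container command and start two tensorflow_model_server processes on different gRPC ports inside the single GPU container. The second config file path and the memory fractions are assumptions; --per_process_gpu_memory_fraction keeps each TF process from pre-allocating the whole GPU:

      containers:
      - name: server
        image: tensorflow/serving:2.3.0-gpu
        command: ["/bin/sh", "-c"]
        args:
          - |
            # first server in the background, second in the foreground
            tensorflow_model_server --port=8500 \
              --model_config_file=/models/models.config \
              --per_process_gpu_memory_fraction=0.4 &
            tensorflow_model_server --port=8600 \
              --model_config_file=/models/models2.config \
              --per_process_gpu_memory_fraction=0.4
            # note: no supervision - if one server dies, nothing restarts it

Both servers then share the single nvidia.com/gpu the pod already requests; clients just need to target the extra port (8600 here) for the second model.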