Goal: Restart Docker daemon on GKE
Issue: Cannot connect to bus
Background
While on Google Kubernetes Engine (GKE), I am attempting to restart the host node's Docker daemon in order to enable Nvidia GPU Telemetry for Kubernetes on nodes that have a GPU. I have isolated just the GPU nodes, and I am able to run every command on the host node by having a DaemonSet run an initContainer, following the Automatically bootstrapping Kubernetes Engine nodes with DaemonSets guide.
At runtime, however, the following pod does not let me connect to the Docker daemon:
apiVersion: v1
kind: Pod
metadata:
  name: debug
  namespace: gpu-monitoring
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: Exists
  containers:
  - command:
    - sleep
    - "86400"
    env:
    - name: ROOT_MOUNT_DIR
      value: /root
    image: docker.io/ubuntu:18.04
    imagePullPolicy: IfNotPresent
    name: node-initializer
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /root
      name: root
    - mountPath: /scripts
      name: entrypoint
    - mountPath: /run
      name: run
  volumes:
  - hostPath:
      path: /
      type: ""
    name: root
  - configMap:
      defaultMode: 484
      name: nvidia-container-toolkit-installer-entrypoint
    name: entrypoint
  - hostPath:
      path: /run
      type: ""
    name: run
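For completeness, the pod can be applied and entered with standard kubectl commands (assuming the manifest is saved as debug.yaml; the filename is my own):

kubectl apply -f debug.yaml
kubectl exec -it debug -n gpu-monitoring -- /bin/bash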
The user inside the pod is 0, while the users present in /run/user on the host are 1003 and 1002.
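These observations correspond to checks along the lines of the following (a reconstruction of the commands; the outputs are the values reported above):

root@debug:/# id -u
0
root@debug:/# ls /run/user
1002  1003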
To verify connectivity and interaction with the underlying Kubernetes (k8s) node, the following is run:
root@debug:/# chroot "${ROOT_MOUNT_DIR}" ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 226124 9816 ? Ss Oct13 0:27 /sbin/init
The Issues
Both images
When attempting to interact with the underlying k8s node to restart the Docker daemon, I get the following:
root@debug:/# ls /run/dbus
system_bus_socket
root@debug:/# ROOT_MOUNT_DIR="${ROOT_MOUNT_DIR:-/root}"
root@debug:/# chroot "${ROOT_MOUNT_DIR}" systemctl status docker
Failed to connect to bus: No data available
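For context, systemctl first tries to reach PID 1 through systemd's private socket at /run/systemd/private. A quick way to check whether the chroot can see that socket at all (a diagnostic sketch, not part of the original transcript):

# systemctl talks to PID 1 through this socket; confirm the chroot can see it
chroot "${ROOT_MOUNT_DIR}" ls -l /run/systemd/private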
When attempting to start dbus on the host node:
root@debug:/# export XDG_RUNTIME_DIR=/run/user/`id -u`
root@debug:/# export DBUS_SESSION_BUS_ADDRESS="unix:path=${XDG_RUNTIME_DIR}/bus"
root@debug:/# chroot "${ROOT_MOUNT_DIR}" /etc/init.d/dbus start
Failed to connect to bus: No data available
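Note that the exported address points at /run/user/0/bus, which does not exist on this host (only 1002 and 1003 are present under /run/user), and systemctl needs the system bus rather than a session bus. Pointing explicitly at the system bus socket that does exist, as shown earlier under /run/dbus, would look like this (a sketch of a variation, not one of the original attempts):

export DBUS_SYSTEM_BUS_ADDRESS="unix:path=/run/dbus/system_bus_socket"
chroot "${ROOT_MOUNT_DIR}" systemctl status docker

Even with a reachable socket, though, systemd tooling detects the chroot and refuses, as the next section shows.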
Image: solita/ubuntu-systemd
When running the same commands with the same k8s pod config, but inside the solita/ubuntu-systemd image, the results are the following:
root@debug:/# /etc/init.d/dbus start
[....] Starting dbus (via systemctl): dbus.serviceRunning in chroot, ignoring request: start
. ok
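That "Running in chroot, ignoring request" message is systemd's own chroot detection at work. The same check is exposed as a standalone utility (not part of the original transcript):

# exits 0 inside a chroot, non-zero otherwise
systemd-detect-virt --chroot && echo "running in a chroot"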
Configuration Variations Attempted
I have tried to change the following, in pretty much every combination, to no avail:
- Image to docker.io/solita/ubuntu-systemd:18.04
- Add shareProcessNamespace: true
- Add the following mounts: /dev, /proc, /sys
- Restrict /run to /run/dbus and /run/systemd
The answer turned out to be an unexpected workaround. To restart the Docker daemon, first punch a firewall hole so that connections to the host node are allowed. Then use gcloud compute ssh to ssh into the node and restart Docker via a remote ssh command:
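Roughly like this (the rule name, network, source range, node name, and zone are placeholders to adapt; scope --source-ranges to wherever the ssh connection originates, e.g. the cluster's pod CIDR if connecting from inside a pod):

# one-time: allow SSH (tcp/22) to the nodes
gcloud compute firewall-rules create allow-ssh-to-gke-nodes \
    --network=default \
    --allow=tcp:22 \
    --source-ranges=10.0.0.0/8

# restart Docker on the GPU node via a remote ssh command
gcloud compute ssh NODE_NAME --zone=ZONE \
    --command="sudo systemctl restart docker"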