Restarting Docker daemon on host node from within Kubernetes pod

Goal: Restart Docker daemon on GKE

Issue: Cannot connect to bus

Background

On Google Kubernetes Engine (GKE), I am attempting to restart the host node's Docker daemon in order to enable the Nvidia GPU Telemetry for Kubernetes on nodes that have a GPU. I have isolated just the GPU nodes, and I am able to run every command on the host node by having a DaemonSet run an initContainer, following the "Automatically bootstrapping Kubernetes Engine nodes with DaemonSets" guide.

During runtime, however, the following pod does not allow me to connect to the Docker daemon:

apiVersion: v1
kind: Pod
metadata:
  name: debug
  namespace: gpu-monitoring
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: Exists
  containers:
  - command:
    - sleep
    - "86400"
    env:
    - name: ROOT_MOUNT_DIR
      value: /root
    image: docker.io/ubuntu:18.04
    imagePullPolicy: IfNotPresent
    name: node-initializer
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /root
      name: root
    - mountPath: /scripts
      name: entrypoint
    - mountPath: /run
      name: run
  volumes:
  - hostPath:
      path: /
      type: ""
    name: root
  - configMap:
      defaultMode: 484
      name: nvidia-container-toolkit-installer-entrypoint
    name: entrypoint
  - hostPath:
      path: /run
      type: ""
    name: run
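
A side note on `defaultMode: 484` in the manifest above: the value is written in decimal, and it corresponds to the octal file mode 0744 (rwxr--r--). A quick way to confirm the conversion in a shell:

```shell
# Convert the decimal defaultMode from the manifest to its octal form.
mode_octal="$(printf '%o' 484)"
echo "defaultMode 484 (decimal) = 0${mode_octal} (octal)"
```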

The container runs as UID 0, while the UIDs present under /run/user are 1003 and 1002.

To verify that the pod can interact with the underlying Kubernetes (k8s) node, I run the following:

root@debug:/# chroot "${ROOT_MOUNT_DIR}" ps aux

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0 226124  9816 ?        Ss   Oct13   0:27 /sbin/init

The Issues

Both images

When attempting to interact with the underlying Kubernetes (k8s) node to restart the Docker daemon, I get the following:

root@debug:/# ls /run/dbus

system_bus_socket

root@debug:/# ROOT_MOUNT_DIR="${ROOT_MOUNT_DIR:-/root}"
root@debug:/# chroot "${ROOT_MOUNT_DIR}" systemctl status docker

Failed to connect to bus: No data available

When attempting to start dbus on the host node:

root@debug:/# export XDG_RUNTIME_DIR=/run/user/`id -u`
root@debug:/# export DBUS_SESSION_BUS_ADDRESS="unix:path=${XDG_RUNTIME_DIR}/bus"
root@debug:/# chroot "${ROOT_MOUNT_DIR}" /etc/init.d/dbus start

Failed to connect to bus: No data available
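
As a diagnostic sketch for the errors above: on a systemd host, systemctl run as root normally reaches systemd through the private socket at /run/systemd/private, and otherwise through the system bus socket at /run/dbus/system_bus_socket. It can therefore help to confirm that both are visible as sockets under the chroot target. The helper name check_systemd_sockets is hypothetical, and the socket paths are the usual defaults rather than something confirmed for this node image.

```shell
#!/bin/sh
# Report whether the sockets systemctl normally uses exist under a given root.
# check_systemd_sockets is a hypothetical helper; the paths are common defaults.
check_systemd_sockets() {
  root="${1:-/root}"
  status=0
  for sock in run/systemd/private run/dbus/system_bus_socket; do
    if [ -S "${root}/${sock}" ]; then
      echo "ok: ${root}/${sock}"
    else
      echo "missing: ${root}/${sock}"
      status=1
    fi
  done
  # Non-zero when at least one socket is not visible.
  return "$status"
}
```

It would be called as check_systemd_sockets "${ROOT_MOUNT_DIR}". This only narrows things down: as the listing above shows, a socket can be present and the connection can still fail.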

Image: solita/ubuntu-systemd

Running the commands with the same k8s pod config, but with the solita/ubuntu-systemd image instead, gives the following:

root@debug:/# /etc/init.d/dbus start
[....] Starting dbus (via systemctl): dbus.serviceRunning in chroot, ignoring request: start
. ok 

Configuration Variations Attempted

I have tried to change the following, in pretty much every combination, to no avail:

  • Image to docker.io/solita/ubuntu-systemd:18.04
  • Add shareProcessNamespace: true
  • Add the following mounts: /dev, /proc, /sys
  • Restrict /run to /run/dbus, and /run/systemd
Answer

The answer turned out to be an unexpected workaround. To restart the Docker daemon, first open a firewall rule that lets pods connect to the host node. Then, from inside the pod, install the Google Cloud SDK and use gcloud compute ssh to run the restart on the node as a remote command:

apt-get update
apt-get install -y \
  apt-transport-https \
  curl \
  gnupg \
  lsb-release \
  ssh

export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
echo "deb https://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
apt-get update
apt-get install -y google-cloud-sdk

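# Look up this node's identity from the GCE metadata server.
# (CLUSTER_NAME and MAIN_ZONE are not used by the ssh command below.)
CLUSTER_NAME="$(curl -sS http://metadata.google.internal/computeMetadata/v1/instance/attributes/cluster-name -H 'Metadata-Flavor: Google')"
NODE_NAME="$(curl -sS http://metadata.google.internal/computeMetadata/v1/instance/name -H 'Metadata-Flavor: Google')"
# The zone is returned as projects/<project-number>/zones/<zone>; keep the last field.
FULL_ZONE="$(curl -sS http://metadata.google.internal/computeMetadata/v1/instance/zone -H 'Metadata-Flavor: Google' | awk -F "/" '{print $4}')"
# Strip the trailing zone suffix to get the region (e.g. us-central1-a -> us-central1).
MAIN_ZONE="$(echo "$FULL_ZONE" | sed 's/\(.*\)-.*/\1/')"

# SSH to the node over its internal IP and restart Docker there.
gcloud compute ssh \
  --internal-ip "$NODE_NAME" \
  --zone="$FULL_ZONE" \
  -- "sudo systemctl restart docker"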
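
The firewall step mentioned above is not shown in the commands. A minimal sketch of what it might look like follows, run here as a dry run that only prints the command. The rule name, network, and pod CIDR are placeholder assumptions, not values from the original setup; substitute the cluster's actual network and pod address range.

```shell
#!/bin/sh
# Sketch: allow SSH (tcp:22) from the pod address range to the nodes.
# All three values below are hypothetical placeholders.
RULE_NAME="allow-pod-to-node-ssh"
NETWORK="default"
POD_CIDR="10.0.0.0/14"

# Any arguments are used as a command prefix, so "create_rule echo"
# prints the gcloud invocation instead of executing it.
create_rule() {
  "$@" gcloud compute firewall-rules create "$RULE_NAME" \
    --network="$NETWORK" \
    --direction=INGRESS \
    --allow=tcp:22 \
    --source-ranges="$POD_CIDR"
}

# Dry run: print the command for review.
create_rule echo
```

Calling create_rule with no arguments would execute the command for real, which requires gcloud to be installed and authorized to manage firewall rules in the project.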