On a k8s node, how does one manage the pod disk IO land-rush at node power-on?


The Problem

When one of our locally hosted bare-metal k8s (1.18) nodes is powered on, pods are scheduled but struggle to reach 'Ready' status - almost entirely due to a land-rush of disk IO from the 30-40 pods being scheduled simultaneously on the node.

This often results in a cascade of Deployment failures:

  • IO requests on the node stack up in the IOWait state as pods deploy.
  • Pod startup times skyrocket from (normal) 10-20 seconds to minutes.
  • livenessProbes fail.
  • Pods are re-scheduled, compounding the problem as more IO requests stack up.
  • Repeat.

FWIW, memory and CPU are vastly over-provisioned on the nodes, even during power-on (<10% usage).

Although we do have application NFS volume mounts (which would normally be the prime suspect for IO issues), the disk activity and contention at pod startup is almost entirely in the local docker container filesystem.

Attempted Solutions

As disk IO is not a schedulable or limitable resource in Kubernetes, we are struggling to find a solution for this. We have tuned our docker images to write to disk as little as possible at startup, and this has helped somewhat.

One basic solution involves lowering the number of pods scheduled per node by increasing the number of nodes in the cluster. This isn't ideal for us, as they are physical machines, and once the nodes DO start up, the cluster is significantly over-resourced.

As we are bare-metal/local, we do not have an automated way to provision extra nodes during startup and remove them as the cluster stabilizes.

Applying priorityClasses at first glance seemed to be a solution. We have created priorityClasses and applied them accordingly; however, as noted in the documentation:

Pods can have priority. Priority indicates the importance of a Pod relative to other Pods. If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the pending Pod possible.

tldr: Pods will still all be "schedulable" simultaneously at power-on, as no configurable resource limits are being exceeded.
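
For reference, this is roughly the kind of PriorityClass we created (the name, value, and description here are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: startup-critical        # illustrative name
value: 1000000                  # higher value = higher scheduling/preemption priority
globalDefault: false
description: "Pods that should come up first after a node power-on."

It is then referenced from a pod spec via priorityClassName: startup-critical - but, as described above, this only helps when scheduling is actually blocked.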

Question(s)

  • Is there a method to limit scheduling pods on a node based on its current number of non-Ready pods? This would allow priority classes to evict non-priority pods and schedule the higher-priority ones first.
  • Aside from increasing the number of cluster nodes, is there a method we have not thought of to otherwise manage this disk IO land-rush?

There are 2 answers below

Answer 1

While I am also interested to see smart people answer the question, here is my probably "just OK" idea:

  1. Configure the new node with a Taint that will prevent your "normal" pods from being scheduled to it (example commands and a manifest sketch follow this list).
  2. Create a deployment of do-nothing pods with:
    • A "reasonably large" memory request, eg: 1GB.
    • A number of replicas high enough to "fill" the node.
    • A toleration for the above Taint.
  3. Remove the Taint from the now-"full" node.
  4. Scale down the do-nothing deployment at whatever rate you feel is appropriate so as to avoid the "land rush".
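
For steps 1 and 3, something like the following should work - the taint key/value are illustrative and <node-name> is your node:

kubectl taint nodes <node-name> startup-hold=true:NoSchedule
# ...later, once the do-nothing pods are running, remove the taint:
kubectl taint nodes <node-name> startup-hold=true:NoSchedule-

If the taint needs to be present the moment the node registers, the kubelet's --register-with-taints flag is another option.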

Here's a Dockerfile for the do-nothing "noop" image I use for testing/troubleshooting:

FROM alpine:3.9

CMD sh -c 'while true; do sleep 5; done'
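
And a sketch of the do-nothing Deployment for steps 2 and 4, assuming the image above is pushed somewhere reachable (the image name, taint key, and sizing are illustrative - tune the memory request and replica count so they actually fill your nodes):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-filler                    # illustrative name
spec:
  replicas: 30                         # enough to "fill" the node
  selector:
    matchLabels:
      app: node-filler
  template:
    metadata:
      labels:
        app: node-filler
    spec:
      tolerations:
      - key: "startup-hold"            # matches the illustrative taint above
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      containers:
      - name: noop
        image: registry.example.com/noop:latest   # built from the Dockerfile above
        resources:
          requests:
            memory: "1Gi"              # the "reasonably large" request from step 2

Scaling it down afterwards is then e.g. kubectl scale deployment node-filler --replicas=20, repeated at whatever pace your disks can tolerate.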
Answer 2

Kubernetes Startup Probes might mitigate the problem of Pods being killed due to livenessProbe timeouts: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes

If you configure them appropriately, the I/O "land-rush" will still happen, but the pods have enough time to settle instead of being killed.
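
A minimal sketch of what that could look like on one of the affected containers (the endpoint and timings are illustrative; the main knob is failureThreshold * periodSeconds, i.e. the worst-case time a pod gets to finish starting):

startupProbe:
  httpGet:
    path: /healthz            # illustrative endpoint
    port: 8080
  failureThreshold: 30        # up to 30 * 10s = 5 minutes to start
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10           # only takes effect once the startup probe succeeds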