The Problem
When one of our locally hosted bare-metal k8s (1.18) nodes is powered on, pods are scheduled but struggle to reach 'Ready' status, almost entirely due to a land rush of disk IO from 30-40 pods starting simultaneously on the node.
This often results in a cascade of Deployment failures:
- IO requests queue up on the node, leaving processes stuck in IOWait, as pods deploy.
- Pod startup times skyrocket from the normal 10-20 seconds to minutes.
- livenessProbes fail.
- Pods are re-scheduled, compounding the problem as more IO requests stack up.
- Repeat.
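For reference, the window a slow-starting pod gets before the kubelet kills its container is governed entirely by the probe timings. A hedged example (the endpoint, port, and values below are illustrative, not our actual settings):

```yaml
# Illustrative livenessProbe only; path, port, and timings are placeholders.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15   # grace period after the container starts
  periodSeconds: 10         # probe interval
  failureThreshold: 3       # ~30s of consecutive failures => container is restarted
```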
FWIW, memory and CPU are vastly over-provisioned on the nodes, even during power-on (<10% usage).
Although we do have application NFS volume mounts (which would normally be the prime suspect for IO issues), the disk activity and contention at pod startup is almost entirely in the local docker container filesystem.
Attempted Solutions
As disk IO is not a limitable resource in Kubernetes, we are struggling to find a solution. We have tuned our docker images to write as little as possible to disk at startup, and this has helped somewhat.
One basic solution is to lower the number of pods scheduled per node by increasing the number of nodes in the cluster. This isn't ideal for us: the nodes are physical machines, and once they DO start up, the cluster is significantly over-resourced.
Because we are bare-metal/local, we have no automated way to provision extra nodes for startup situations and then scale them back down as the cluster stabilizes.
At first glance, applying priorityClasses seemed to be a solution. We have created priorityClasses and applied them accordingly; however, as stated in the documentation:
Pods can have priority. Priority indicates the importance of a Pod relative to other Pods. If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the pending Pod possible.
tldr: All pods will still be "schedulable" simultaneously at power-on, since no configurable resource limits are being exceeded, so priority never triggers preemption.
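For concreteness, this is the shape of what we applied (the class name, value, and pod spec below are illustrative placeholders, not our actual manifests):

```yaml
# Illustrative only; names and values are placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: startup-critical
value: 1000000
globalDefault: false
description: "Pods that should come up first after a node power-on."
---
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  priorityClassName: startup-critical
  containers:
    - name: app
      image: example.registry.local/app:latest
```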
Question(s)
- Is there a method to limit scheduling of pods on a node based on its current number of non-Ready pods? This would allow priority classes to evict non-priority pods and schedule the higher-priority pods first.
- Aside from increasing the number of cluster nodes, is there a method we have not thought of to manage this disk IO land rush?
While I am also interested to see smart people answer the question, here is my probably "just OK" idea:
Here's a Dockerfile for the do-nothing "noop" image I use for testing/troubleshooting:
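(A minimal sketch; the base image and idle command are placeholders for anything that keeps the container alive.)

```dockerfile
# Do-nothing image: just keep a container alive indefinitely.
# Base image and command are placeholders; anything that idles works.
FROM alpine:3.18
CMD ["tail", "-f", "/dev/null"]
```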