How can I correctly set Kubernetes pod eviction limits to avoid the system OOM killer?


I've spent over a full day trying to make sense of Kubernetes' resource management. Specifically, I'm trying to set up eviction thresholds and resource reservations in such a way that there is always at least 1GiB of memory available.

Going by the documentation on resource reservations and out-of-resource handling, I figured setting the following eviction policy would suffice:

--eviction-hard=memory.available<1Gi
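To double-check that a running kubelet has actually picked the threshold up, one option is to read its live configuration back through the API server's node proxy. This is only a sketch: the node name is a placeholder, and the exact JSON layout varies between kubelet versions:

NODE=my-node-name   # placeholder: substitute one of your node names
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" | grep -o '"evictionHard":{[^}]*}'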

However, in practice this does not work at all, because the computation the kubelet does seems to differ from the computation the kernel does when deciding whether the OOM killer needs to be invoked. For example, when I load up my system with a bunch of pods running an artificial memory hog, I get the following report from free -m:

              total        used        free      shared  buff/cache   available
Mem:          15866       14628         161          53        1077         859
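For what it's worth, the artificial memory hog can be something as simple as a pod running the stress tool; the image name below is an assumption, and any image that ships stress will do:

# Run a pod that allocates and holds roughly 2 GiB (image name is an assumption)
kubectl run memhog --image=polinux/stress --restart=Never --command -- \
  stress --vm 1 --vm-bytes 2G --vm-hang 0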

According to the kernel, there's 859 MiB of memory available. Yet the kubelet does not invoke its eviction policy. In fact, I've been able to trigger the system OOM killer before the kubelet eviction policy kicked in, even when ramping up memory usage incredibly slowly (to give the kubelet housekeeping control loop, which runs every 10 seconds by default, a chance to act).

I've found this script, which used to be in the Kubernetes documentation and is supposed to calculate the available memory the same way the kubelet does. I ran it in parallel with free -m above and got the following result:

memory.available_in_mb 1833

That's almost a 1000 MiB difference!
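For reference, since the script has been removed from the docs, here is a minimal sketch of the same calculation, assuming cgroup v1 paths: the kubelet derives memory.available as capacity minus the root cgroup's working set, where the working set is memory usage minus inactive file pages.

#!/usr/bin/env bash
# Sketch of the kubelet's memory.available calculation (cgroup v1 paths assumed)
memory_capacity_in_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(grep total_inactive_file /sys/fs/cgroup/memory/memory.stat | awk '{print $2}')

# Working set = usage minus inactive file pages (page cache the kubelet treats as reclaimable)
memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
if [ "$memory_working_set" -lt 0 ]; then memory_working_set=0; fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
echo "memory.available_in_mb $((memory_available_in_bytes / 1024 / 1024))"

The gap between this number and the kernel's "available" column comes from the two formulas counting reclaimable memory (page cache, slab) differently.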

Now, I understand this difference in calculation is by design, but it leaves me with the obvious question: how can I reliably manage system resource usage so that the system OOM killer does not get invoked? What eviction policy can I set so the kubelet will start evicting pods when there's less than a gigabyte of memory available?

1 Answer


According to the documentation at https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/, you should add the kubelet flag --system-reserved=memory=1024Mi.
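That reservation is typically combined with the eviction threshold from the question, along the lines of the sketch below; the values are illustrative, and whether you pass them as command-line flags or through the kubelet config file depends on how your nodes are provisioned:

# Illustrative kubelet flags (values are examples; keep your other flags as-is)
kubelet \
  --system-reserved=memory=1024Mi \
  --eviction-hard='memory.available<1Gi' \
  --enforce-node-allocatable=pods
# The eviction threshold is quoted so the shell does not treat '<' as a redirection.

With --enforce-node-allocatable=pods (the default), the reserved memory is carved out of the node's allocatable capacity that the scheduler sees, which is what keeps pod memory usage away from the point where the system OOM killer would fire.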