Are Kubernetes liveness probe failures voluntary or involuntary disruptions?


I have an application deployed to Kubernetes that depends on an outside application. Sometimes the connection between these 2 goes to an invalid state, and that can only be fixed by restarting my application.

To do automatic restarts, I have configured a liveness probe that will verify the connection.
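For reference, such a probe might look like the following sketch (the `/healthz/connection` path and port are hypothetical placeholders for whatever endpoint reports the connection state in your application):

```yaml
# Hypothetical liveness probe that checks the connection to the outside application.
livenessProbe:
  httpGet:
    path: /healthz/connection   # placeholder: endpoint reporting the connection state
    port: 8080
  initialDelaySeconds: 15       # give the app time to establish the connection first
  periodSeconds: 20
  failureThreshold: 3           # restart only after 3 consecutive failures
```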

This has been working great, however, I'm afraid that if that outside application goes down (such that the connection error isn't just due to an invalid pod state), all of my pods will immediately restart, and my application will become completely unavailable. I want it to remain running so that functionality not depending on the bad service can continue.

I'm wondering if a pod disruption budget would prevent this scenario, as it limits the number of pods down due to a "voluntary" disruption. However, the K8s docs don't state whether liveness probe failures are a voluntary disruption. Are they?

There are 3 answers below.

BEST ANSWER

I would say, according to the documentation:

Voluntary and involuntary disruptions

Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.

We call these unavoidable cases involuntary disruptions to an application. Examples are:

  • a hardware failure of the physical machine backing the node
  • cluster administrator deletes VM (instance) by mistake
  • cloud provider or hypervisor failure makes VM disappear
  • a kernel panic
  • the node disappears from the cluster due to cluster network partition
  • eviction of a pod due to the node being out-of-resources.

Except for the out-of-resources condition, all these conditions should be familiar to most users; they are not specific to Kubernetes.

We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:

  • deleting the deployment or other controller that manages the pod
  • updating a deployment's pod template causing a restart
  • directly deleting a pod (e.g. by accident)

Cluster administrator actions include:

  • Draining a node for repair or upgrade.
  • Draining a node from a cluster to scale the cluster down (learn about Cluster Autoscaling ).
  • Removing a pod from a node to permit something else to fit on that node.

-- Kubernetes.io: Docs: Concepts: Workloads: Pods: Disruptions

So your example is quite different; to my knowledge it is neither a voluntary nor an involuntary disruption.


Also taking a look on another Kubernetes documentation:

Pod lifetime

Like individual application containers, Pods are considered to be relatively ephemeral (rather than durable) entities. Pods are created, assigned a unique ID (UID), and scheduled to nodes where they remain until termination (according to restart policy) or deletion. If a Node dies, the Pods scheduled to that node are scheduled for deletion after a timeout period.

Pods do not, by themselves, self-heal. If a Pod is scheduled to a node that then fails, the Pod is deleted; likewise, a Pod won't survive an eviction due to a lack of resources or Node maintenance. Kubernetes uses a higher-level abstraction, called a controller, that handles the work of managing the relatively disposable Pod instances.

-- Kubernetes.io: Docs: Concepts: Workloads: Pods: Pod lifecycle: Pod lifetime

Container probes

The kubelet can optionally perform and react to three kinds of probes on running containers (focusing on a livenessProbe):

  • livenessProbe: Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a Container does not provide a liveness probe, the default state is Success.

-- Kubernetes.io: Docs: Concepts: Workloads: Pods: Pod lifecycle: Container probes

When should you use a liveness probe?

If the process in your container is able to crash on its own whenever it encounters an issue or becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will automatically perform the correct action in accordance with the Pod's restartPolicy.

If you'd like your container to be killed and restarted if a probe fails, then specify a liveness probe, and specify a restartPolicy of Always or OnFailure.

-- Kubernetes.io: Docs: Concepts: Workloads: Pods: Pod lifecycle: When should you use a liveness probe

Based on this information, it would be better to create a custom liveness probe that distinguishes internal process health checks from external dependency (liveness) health checks. In the first case your container should stop/terminate the process; in the second case, with an external dependency, it should not.
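One common way to implement that separation (a sketch, not the only approach; the `/healthz` and `/ready` paths and the port are hypothetical placeholders): keep the liveness probe limited to internal process health, and surface the external dependency through a readiness probe instead, so a dependency outage stops traffic to the pods without restarting them:

```yaml
# Hypothetical probes; /healthz and /ready are placeholder endpoints.
livenessProbe:            # internal process health only -> failure restarts the container
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
readinessProbe:           # includes the external dependency check -> failure only removes
  httpGet:                # the pod from Service endpoints, without restarting it
    path: /ready
    port: 8080
  periodSeconds: 10
```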

Answering the question:

I'm wondering if a pod disruption budget would prevent this scenario.

In this particular scenario a PDB will not help. A PodDisruptionBudget limits only voluntary disruptions that go through the Eviction API (for example, kubectl drain); restarts triggered by the kubelet after a failed liveness probe bypass it entirely.

ANSWER 2

I'm wondering if a pod disruption budget would prevent this scenario.

Yes, it will help prevent this.

As you stated, when a pod goes down (or a node fails), nothing can prevent pods from becoming unavailable. However, certain services require that a minimum number of pods always keep running.

There could be other ways (e.g. a StatefulSet), but a PodDisruptionBudget is one of the simplest Kubernetes resources available.

Note: You can also use a percentage instead of an absolute number in the minAvailable field. For example, you could state that 60% of all pods with the app=run-always label need to be running at all times.
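A sketch of that note as a manifest (the app=run-always label comes from the example above; the resource name is a placeholder):

```yaml
# PodDisruptionBudget using a percentage instead of an absolute number.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: run-always-pdb        # placeholder name
spec:
  minAvailable: 60%           # 60% of matching pods must stay running
  selector:
    matchLabels:
      app: run-always
```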

ANSWER 3

I tested this with a PodDisruptionBudget: the pods will still restart at the same time.

For an example, see:

https://github.com/AlphaWong/PodDisruptionBudgetAndPodProbe

So yes, as @Dawid Kruk suggested, you should create a customized probe like the following:

# something like this
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # sleep a random 2-40 seconds so all pods do not probe (and restart) at the same moment
    - 'SLEEP_TIME=$(shuf -i 2-40 -n 1); sleep $SLEEP_TIME; curl -L --max-time 5 -f nginx2.default.svc.cluster.local'
  initialDelaySeconds: 10
  # think about the gap between each call
  periodSeconds: 30
  # must cover the worst-case sleep (40s) plus the curl timeout; required after k8s v1.12
  timeoutSeconds: 90