involuntary disruptions / SIGKILL handling in microservice following saga pattern


Should I engineer my microservice to handle involuntary disruptions such as hardware failure? Are these disruptions frequent enough to be worth handling in a service running on an AWS-managed EKS cluster?
Should I consider a design change in the service to handle an unexpected SIGKILL, for example by persisting the data at each step, or would that be considered over-engineering?

What standard approach would you suggest for handling these involuntary disruptions if the service is:
a) a RESTful service that typically responds in 1 s (and follows the saga pattern);
b) a service that processes a large 1 GB file in 1 hour?

There is 1 best solution below


There are a couple of ways to handle those disruptions. As mentioned in the Kubernetes documentation on disruptions:

Here are some ways to mitigate involuntary disruptions:

  • Ensure your pod requests the resources it needs.
  • Replicate your application if you need higher availability. (Learn about running replicated stateless and stateful applications.)
  • For even higher availability when running replicated applications, spread applications across racks (using anti-affinity) or across zones (if using a multi-zone cluster.)

The frequency of voluntary disruptions varies.

So:

  • if your budget allows it, spread your app across zones or racks; you can use node affinity to schedule Pods on certain nodes,
  • make sure to configure replicas, so that when one Pod receives SIGKILL the load is automatically directed to another Pod. You can read more about this here.
  • consider using DaemonSets, which ensure each Node runs a copy of a Pod.
  • use Deployments for stateless apps and StatefulSets for stateful ones.
  • the last thing you can do is write your app to be disruption tolerant.
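
For the last point, and for the 1 GB file case in particular, one possible approach is to checkpoint progress so that a restarted Pod resumes instead of starting over. The following is only a minimal sketch under some assumptions: the function names, the chunk size and the JSON checkpoint file are made up for illustration, the checkpoint is assumed to live on storage that survives the Pod (a persistent volume or an external store), and `handle_chunk` is assumed to be idempotent, because a SIGKILL between handling a chunk and saving the offset means that chunk is handled again after restart.

```python
import json
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size (1 MiB), tune for your workload


def load_offset(checkpoint_path):
    """Return the last durably recorded byte offset, or 0 if no checkpoint exists."""
    try:
        with open(checkpoint_path) as f:
            return json.load(f)["offset"]
    except FileNotFoundError:
        return 0


def save_offset(checkpoint_path, offset):
    """Write the checkpoint via temp file + fsync + rename, so a kill never leaves it half-written."""
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, checkpoint_path)  # atomic rename on POSIX filesystems


def process_file(input_path, checkpoint_path, handle_chunk, chunk_size=CHUNK_SIZE):
    """Process input_path chunk by chunk, resuming from the last saved offset."""
    offset = load_offset(checkpoint_path)
    with open(input_path, "rb") as f:
        f.seek(offset)
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            handle_chunk(offset, chunk)  # must be idempotent: may repeat after a crash
            offset += len(chunk)
            save_offset(checkpoint_path, offset)
    return offset
```

On startup the service would simply call `process_file(...)` again with the same checkpoint path; if the previous Pod was killed halfway, at most one chunk is repeated. The same idea (record each completed step durably, make steps safe to retry) could be applied to saga steps, although whether it is worth the extra complexity for a 1 s request depends on how costly a retry from scratch is.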

I hope this clears things up a little; feel free to ask more questions.