Run a large number of tasks on a cluster


I'm looking for a way to run a large number of tasks on a cluster and monitor their status.

In detail: each task consists of 3-4 processes, each running in its own Docker container (each process is a docker run command). All of the processes of a task have to run on the same server.

The tasks arrive in bursts of several hundred at a time.

I've looked into several solutions, all of them based on Mesos:

  • Chronos - Seems like it would falter under high load, and in any case it is geared towards recurring (cron) jobs, while I need one-time (heavy) jobs.
  • Custom Mesos FW - Seems too low-level for my needs, as it would require me to write the scheduling and retry mechanisms myself. I'd save this as a last resort.
  • Aurora - This seems promising, as each task runs on a single node and is composed of several processes. I am missing a couple of things here, though: Aurora does not seem to be able to run several different tasks as part of a single job. Since my tasks are all similar and differ only in their input, I could use a single job with many (say 400) instances, and the first process of each task (whose role is to download the input from S3) could download a different set based on the instance ID. Which brings me to another problem: I can't find a working example of using {{ mesos.instance }} in .aurora files. Can anyone give me an example?

Thanks for all the fish, people.


There are 2 best solutions below

2
js84 On

You could also have a look at Kubernetes (which can also be run as a framework on Mesos). Kubernetes has the concept of Pods, which are basically sets of co-located containers. So in your case a pod would consist of your 3-4 processes/containers, and these pods can then be scaled up or down.
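To make the pod idea concrete, here is a minimal sketch in Python that builds one manifest per task. It swaps in a Kubernetes Job (the batch/v1 resource for run-to-completion work) rather than a scaled, long-running pod, since the tasks are one-time. The function name build_job_manifest, the TASK_ID environment variable, and the image names are all hypothetical, chosen only for illustration. Note that containers in a pod start concurrently, so a download step that must finish first would need something like an init container, which this sketch leaves out.

```python
import json


def build_job_manifest(task_id, images):
    """Return a batch/v1 Job manifest (as a dict) whose pod runs one container per image.

    All containers of a pod are scheduled on the same node, which matches the
    "same server" requirement. Names and the TASK_ID variable are illustrative.
    """
    containers = [
        {
            "name": f"step-{i}",
            "image": image,
            # Each container learns which input set to work on from TASK_ID.
            "env": [{"name": "TASK_ID", "value": str(task_id)}],
        }
        for i, image in enumerate(images)
    ]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"task-{task_id}"},
        "spec": {
            "backoffLimit": 3,  # retry a failed pod up to 3 times
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": containers,
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image names standing in for the 3-4 processes of a task.
    images = ["example/download", "example/process", "example/upload"]
    manifests = [build_job_manifest(i, images) for i in range(400)]
    print(json.dumps(manifests[0], indent=2))
```

Each printed manifest could be saved to a file and submitted with kubectl apply -f, and the status of each Job can then be queried per task.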

Short comments regarding the other solutions you mentioned:

  • Chronos: Not really targeting your use case.
  • Custom FW: Actually not that difficult, but good call to save this as a last resort.
  • Aurora: A very powerful but also complex framework.
  • Marathon (which you didn't mention): Targeted at long-running applications which can be easily scaled up and down.
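On the {{ mesos.instance }} question, here is a minimal sketch of an .aurora file, not a tested configuration. It assumes the standard Aurora configuration DSL (Process, SequentialTask, Resources, Job), where {{mesos.instance}} is replaced with the instance number (0 to 399 here) and is written without inner spaces, as in the Aurora documentation. The bucket name, commands, cluster, role, and resource sizes are hypothetical placeholders. The file is only meaningful when loaded by the Aurora client, which provides these names; it is not a standalone Python script.

```python
# batch.aurora -- sketch only; every name and value below is a placeholder.

# First process: fetch the input set that belongs to this instance.
download = Process(
  name = 'download',
  cmdline = 'aws s3 cp s3://example-bucket/input-{{mesos.instance}}.tar.gz input.tar.gz')

# Second process: run the actual work on the downloaded input.
work = Process(
  name = 'work',
  cmdline = 'docker run example/worker --input input.tar.gz')

# SequentialTask runs the processes one after another on the same node.
task = SequentialTask(
  name = 'batch_task',
  processes = [download, work],
  resources = Resources(cpu = 1.0, ram = 1*GB, disk = 2*GB))

jobs = [
  Job(
    cluster = 'example-cluster',
    role = 'example-role',
    environment = 'prod',
    name = 'batch',
    instances = 400,   # one instance per input set
    task = task)
]
```

With this layout, instance 17 would download input-17.tar.gz, so a single job with many instances covers all input sets.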
1
Michael Hausenblas On

In addition to the excellent other answer, you could check out Two Sigma's Cook, which they have only recently open-sourced but have been using in production at scale for a while.