Run a large number of tasks on a cluster


I'm looking for a way to run a large number of tasks on a cluster and monitor their status.

In detail: each task consists of 3-4 processes, each of which runs in a Docker container (each process is a docker run command). All of a task's processes have to run on the same server.

The tasks arrive in bursts of several hundred at a time.

I've looked into several solutions, all of them based on Mesos:

  • Chronos - Seems like it would falter under high load, and in any case it is geared towards recurring (cron) jobs, whereas I need one-time (heavy) jobs.
  • Custom Mesos framework - Seems too low-level for my needs; it would require me to write the scheduling and retry mechanisms myself. I'd save this as a last resort.
  • Aurora - This seems promising, as each task runs on a single node and is composed of several processes. I am missing a couple of things here, though. Aurora doesn't seem to be able to run several different tasks as part of a single job. Since my tasks are all similar and differ only in their input, I could use a single job with many (say 400) instances, and the first process of each task (whose role is to download the input from S3) could download a different set based on the instance ID. Which brings me to another problem: I can't find a working example of using {{ mesos.instance }} in .aurora files. Can anyone give me an example?
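To make the "different input set per instance" idea concrete, here is a minimal sketch of how the download process could pick its share of the inputs once it knows its instance ID. It assumes the ID reaches the process as a plain integer (for example by substituting {{ mesos.instance }} into the process command line) and that the full list of S3 keys is known up front. The function name and key names are hypothetical, and the actual S3 download is left out.

```python
def keys_for_instance(keys, instance_id, total_instances):
    """Return the subset of input keys that one instance should download.

    Keys are sorted first so that every instance sees the same ordering,
    then dealt out round-robin: instance i takes items i, i + N, i + 2N, ...
    """
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    if not 0 <= instance_id < total_instances:
        raise ValueError("instance_id must be in [0, total_instances)")
    return sorted(keys)[instance_id::total_instances]


if __name__ == "__main__":
    # Hypothetical key names; in practice these would come from an S3 listing.
    all_keys = [f"input-{i:03d}" for i in range(10)]
    for instance in range(4):
        print(instance, keys_for_instance(all_keys, instance, 4))
```

Every key lands on exactly one instance, so 400 instances of the same job would cover the whole input set without any coordination between them.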

Thanks for all the fish, people.


There are 2 best solutions below


You could also have a look at Kubernetes (which can also run as a framework on Mesos). Kubernetes has the concept of Pods, which are basically sets of co-located containers. So in your case a pod would consist of your 3-4 processes/containers, and these pods can then be scaled up or down.
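As a rough illustration of the pod idea, the sketch below builds a Pod manifest with one container per process and prints it as JSON, which Kubernetes accepts as well as YAML. The image names, the TASK_ID environment variable, and the build_pod helper are all made up for the example; restartPolicy is set to Never on the assumption that these are one-time jobs. Note that the containers of a pod are started together rather than one after another, so any ordering between your processes would have to be handled separately.

```python
import json


def build_pod(task_id, images):
    """Build a Pod manifest (as a dict) with one container per process image."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"task-{task_id}",
            "labels": {"app": "batch-task"},
        },
        "spec": {
            # One-time job: do not restart containers when they exit.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": f"proc-{index}",
                    "image": image,
                    # Hypothetical variable telling each process which input to use.
                    "env": [{"name": "TASK_ID", "value": str(task_id)}],
                }
                for index, image in enumerate(images)
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical image names standing in for the 3-4 processes of one task.
    images = ["example/download:latest", "example/process:latest", "example/upload:latest"]
    print(json.dumps(build_pod(0, images), indent=2))
```

All containers in the pod are scheduled onto the same node, which matches the requirement that a task's processes share one server.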

Short comments regarding the other solutions you mentioned:

  • Chronos: Not really targeting your use case
  • Custom FW: Actually not so difficult, but good call to save this as a last resort.
  • Aurora: Very powerful, but also a complex framework
  • Marathon (which you didn't mention): targeted at long-running applications that can easily be scaled up and down.

In addition to the excellent other answer, you could check out Two Sigma's Cook, which they only recently open-sourced but have been using in production at scale for a while.