Submitting Slurm job to head node via compute node?


I've set up a Slurm cluster on AWS ParallelCluster for a customer who needs to be able to launch nested Slurm jobs. For example, from a login node, we need to be able to launch a single job on a compute node that can launch hundreds/thousands of jobs on separate nodes in the cluster.
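To make the pattern concrete, here is a minimal sketch of the kind of workflow involved; the script names launcher.sh and worker.sh are placeholders for illustration, not the client's actual code:

    # Outer job: runs on a single compute node and fans out the real work.
    sbatch --partition all launcher.sh

    # launcher.sh (sketch)
    #!/bin/bash
    for i in $(seq 1 1000); do
        # Each iteration is intended to become its own job elsewhere in the cluster.
        srun --partition all --ntasks 1 ./worker.sh "$i" &
    done
    wait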

Even if this is considered to be against best practice for Slurm job architecture, we can't simply ask our client to rewrite all of their jobs; we need to get to a working state with their existing jobs written the way they are.

When running:

    srun --partition all srun --partition all echo hi

the initial job gets instantiated, but from there the compute node that runs the root-level job seems to be unable to submit jobs to the cluster.

Error message:

    srun: error: Unable to create step for job 2: Job/step already completing or completed

What I think might be happening is that the first job is allocating all of the resources on the compute node it runs on, and that the compute node is trying to run the second Slurm job on itself instead of sending it back to the head node so it can be scheduled on another node/partition. What I don't know is how to reconfigure the cluster so that compute nodes can resubmit jobs into the queue.
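For clarity, the behavior we are after looks roughly like the sketch below. Both variants (submitting the inner work with sbatch, or clearing the inherited SLURM_* environment, SLURM_JOB_ID in particular, before the inner srun) are guesses at possible workarounds rather than anything we have confirmed works on ParallelCluster:

    # Variant 1: use sbatch for the inner submission, which should create a new
    # job in the queue rather than a step inside the outer job's allocation.
    srun --partition all sbatch --partition all --wrap "echo hi"

    # Variant 2: unset the SLURM_* variables inherited from the outer job so the
    # inner srun asks the controller for a fresh allocation instead of a job step.
    srun --partition all bash -c 'unset $(env | grep -o "^SLURM_[A-Za-z0-9_]*"); srun --partition all echo hi'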
