Automatically cancel Slurm jobs if there are insufficient instances on AWS ParallelCluster


I recently started playing around with AWS ParallelCluster. I noticed that when I submit a job that requires more instances than are currently available in my region/AZ, the instances that are available are brought up and sit idle until all the remaining instances become available. This can sometimes take a very long time. Slurm reports the following in /var/log/parallelcluster/slurm_resume.log:

ERROR - Error in CreateFleet request (...): InsufficientInstanceCapacity - We currently do not have sufficient c6i.metal capacity in the Availability Zone you requested (us-east-1a)

The problem is that I still pay for the nodes that are up and waiting. Is there a way to cancel the job after a certain timeout instead, so that I can try again later?

Best answer

There might be a better solution than canceling the job in the face of limited capacity. ParallelCluster has a hidden capability called "all or nothing instance launching" that you can turn on by editing your cluster configuration.

With this enabled, ParallelCluster launches new instances for a job only if it can get all of the requested instances. If it cannot, no instances are launched, the job does not proceed to a running state, and you do not accrue charges for unused instances. This should prevent the situation you describe.

Here's a link to an AWS HPC blog article that will tell you all about it and show you how to use it: https://aws.amazon.com/blogs/hpc/minimize-hpc-compute-costs-with-all-or-nothing-instance-launching/
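
If you would still rather cancel after a timeout, as the question asks, one option is a small watchdog run on the head node. The following is only a minimal sketch under some assumptions, not a ParallelCluster feature: it assumes `squeue` and `scancel` are on the PATH, that the job is a single (non-array) job, and that a job waiting for instances shows up as PENDING or CONFIGURING. The names `should_cancel`, `job_state`, and `watch` are made up for illustration, and the waiting time is measured from when the watchdog starts, not from when the job was submitted.

```python
import subprocess
import sys
import time

# Assumption: a job still waiting for nodes reports one of these states.
WAITING_STATES = {"PENDING", "CONFIGURING"}


def should_cancel(state, waited_s, timeout_s):
    """Return True if the job is still waiting and the timeout has elapsed."""
    return state in WAITING_STATES and waited_s >= timeout_s


def job_state(job_id):
    """Ask squeue for the job's state; return "" if the job is no longer listed."""
    result = subprocess.run(
        ["squeue", "-h", "-j", str(job_id), "-o", "%T"],
        capture_output=True,
        text=True,
    )
    lines = result.stdout.split()
    if result.returncode != 0 or not lines:
        return ""
    return lines[0]


def watch(job_id, timeout_s, poll_s=30):
    """Poll the job; cancel it if it is still waiting after timeout_s seconds.

    Returns True if the job was cancelled, False if it started or disappeared.
    """
    start = time.monotonic()
    while True:
        state = job_state(job_id)
        if state not in WAITING_STATES:
            return False  # running, finished, or unknown: leave it alone
        if should_cancel(state, time.monotonic() - start, timeout_s):
            subprocess.run(["scancel", str(job_id)], check=True)
            return True
        time.sleep(poll_s)


if __name__ == "__main__":
    # Hypothetical usage: python watchdog.py <job_id> <timeout_seconds>
    cancelled = watch(sys.argv[1], float(sys.argv[2]))
    print("cancelled" if cancelled else "not cancelled")
```

Note that cancelling the job does not by itself stop the instances that were already launched; they are only released once ParallelCluster scales the idle nodes down, so the all-or-nothing setting above is still the cleaner way to avoid paying for partial capacity.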