I'm using Snakemake on a cluster, and I don't know how best to handle the fact that some jobs can be preempted.
To get more computing power on the cluster I use, I can borrow resources from other teams, at the risk of being preempted: the running job is stopped and requeued, then launched again as soon as a resource becomes available. This is especially advantageous when you have a lot of quick jobs to run. Unfortunately, Snakemake does not seem to support this properly.
In the Slurm example given in the help for the cluster-status feature, PREEMPTED is missing from the running_status list (running_status=["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED"]), so a preempted job may be reported as failed. Not a big deal: I've added PREEMPTED to this list, but it leads me to believe that Snakemake did not consider this scenario.
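For reference, here is a minimal sketch of a status script with PREEMPTED added, modeled on the structure of the documentation example. The names classify_state, query_state, and RUNNING_STATES are illustrative and not part of Snakemake. It assumes that the cluster requeues preempted jobs and that sacct is available on the submit host.

```python
#!/usr/bin/env python3
"""Sketch of a Snakemake --cluster-status script for Slurm that tolerates preemption."""
import subprocess
import sys

# States in which the job should still be considered alive by Snakemake.
# PREEMPTED is included on the assumption that Slurm requeues preempted jobs.
RUNNING_STATES = (
    "PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED", "PREEMPTED",
)


def classify_state(sacct_state: str) -> str:
    """Map a Slurm job state string to 'success', 'running', or 'failed'."""
    state = sacct_state.strip().upper()
    if "COMPLETED" in state:
        return "success"
    if any(s in state for s in RUNNING_STATES):
        return "running"
    return "failed"


def query_state(jobid: str) -> str:
    """Return the state sacct reports on its first line for the given job ID."""
    out = subprocess.check_output(
        ["sacct", "-j", jobid, "--format", "State", "--noheader", "--parsable2"],
        text=True,
    )
    lines = [line for line in out.splitlines() if line.strip()]
    return lines[0] if lines else ""


if __name__ == "__main__":
    # Snakemake calls the script with the job ID as the first argument
    # and expects exactly one of: success, running, failed.
    print(classify_state(query_state(sys.argv[1])))
```

Note that if the partition is configured to cancel preempted jobs rather than requeue them, a job would stay in the PREEMPTED state and Snakemake would wait for it indefinitely, so this mapping only makes sense with requeueing.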
More annoyingly, even when I run Snakemake with the --rerun-incomplete option, a job that is interrupted by preemption and then restarted produces the following error:
IncompleteFilesException:
The files below seem to be incomplete. If you are sure that certain files are not incomplete, mark them as complete with
snakemake --cleanup-metadata <filenames>
To re-generate the files rerun your command with the --rerun-incomplete flag.
I would expect the interrupted job to restart from scratch.
For now, the only solution I have found is to stop using other teams' resources to avoid having my jobs preempted, but I am losing computing power.
How do you use Snakemake in a context where your jobs can be preempted? Does anyone see a way to avoid the IncompleteFilesException?
Thanks in advance
Snakemake has a restart feature (the --restart-times option), which can be used to have failed jobs resubmitted automatically. However, you are right that there is currently no special handling for preemption; I was not even aware that something like that exists on Slurm. A PR in that direction would of course be welcome. Basically, one would need to extend the status script handling to recognize preemption and restart the job in that case.