We're fairly new to Slurm and have run into some tricky issues with the cgroups plugin. We adopted Slurm to get better at managing our computing resources, especially for more complex jobs, and it has mostly been a good experience. But the cgroups plugin, which is central to our resource management, is now causing trouble: whenever we run jobs, we hit a series of errors we can't make sense of.
Here are the logs where the errors start to appear:
[2023-10-12T14:50:29.479] [36.batch] error: unable to open '/sys/fs/cgroup/cpuset//tasks' for reading : No such file or directory
[2023-10-12T14:50:29.511] [36.batch] error: unable to mount cpuset cgroup namespace: Device or resource busy
[2023-10-12T14:50:29.511] [36.batch] error: unable to create cpuset cgroup namespace
[2023-10-12T14:50:29.511] [36.batch] error: unable to open '/sys/fs/cgroup/devices//tasks' for reading : No such file or directory
[2023-10-12T14:50:29.512] [36.batch] cgroup/v1: xcgroup_ns_create: cgroup namespace 'devices' is now mounted
[2023-10-12T14:50:29.514] [36.batch] error: common_cgroup_lock error
[2023-10-12T14:50:29.514] [36.batch] error: task_g_pre_setuid: task/cgroup: Unspecified error
[2023-10-12T14:50:29.514] [36.batch] error: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2023-10-12T14:50:29.515] [36.batch] error: called without a previous init. This shouldn't happen!
[2023-10-12T14:50:29.515] [36.batch] error: job_manager: exiting abnormally: Slurmd could not execve job
Our setup includes two nodes on Ubuntu 22.04 and one on Ubuntu 18.04. Initially, we tried using the cgroups V2 plugin, which did not work, so we switched to the cgroups V1 plugin. This change allowed us to run jobs on the Ubuntu 18.04 node, but we encountered the aforementioned errors on the Ubuntu 22.04 nodes.
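For context, this is how we've been checking which cgroup hierarchy each node actually boots with (a minimal sketch; `cgroup2fs` means the unified v2-only hierarchy, while `tmpfs` means a legacy/hybrid layout with per-controller v1 mounts):

```shell
#!/bin/sh
# Report which cgroup hierarchy this node mounts.
# `stat -f` prints the filesystem type mounted at /sys/fs/cgroup.
fstype=$(stat -fc %T /sys/fs/cgroup)

case "$fstype" in
  cgroup2fs)
    echo "unified: cgroup v2 only (the cgroup/v1 plugin will not find cpuset/devices here)"
    ;;
  tmpfs)
    echo "legacy/hybrid: per-controller v1 mounts under /sys/fs/cgroup"
    ;;
  *)
    echo "unexpected filesystem type: $fstype"
    ;;
esac
```

On stock Ubuntu 22.04, systemd mounts only the unified v2 hierarchy by default, while Ubuntu 18.04 uses the hybrid layout, which would match the split behavior we're seeing between the nodes.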
Additional Symptoms and Troubleshooting Steps:
Nodes go into 'idle' and then 'drain' states after attempting to execute a job.
We altered kernel parameters, but this did not resolve the issue.
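For what it's worth, the kernel-parameter route we understand to be the usual one on Ubuntu 22.04 is switching systemd back to the legacy hierarchy at boot (illustrative sketch; we may not have applied it correctly, and we can post our exact GRUB line if useful):

```
# /etc/default/grub -- boot the node with the legacy (v1) cgroup hierarchy.
# systemd.unified_cgroup_hierarchy=false is the documented systemd switch;
# after editing, run `sudo update-grub` and reboot.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=false"
```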
What puzzles us most is that the errors seem intermittent: each one appears under different conditions, and we can't reproduce them reliably enough to narrow down the cause. If anyone has been through this kind of chase with Slurm's cgroup plugin or knows a thing or two about these oddball issues, we'd love to hear your thoughts!
If anyone has faced similar issues or has expertise in Slurm and cgroups on Ubuntu systems, your help would be greatly appreciated. We are looking for insights into:
Potential causes for these specific errors.
Diagnostic tools or methods that could help us further investigate.
Any known compatibility issues or best practices for using Slurm with Ubuntu 22.04 and the cgroups V1 plugin.
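For completeness, the relevant parts of our configuration are essentially the stock cgroup setup (a sketch; exact values may differ slightly from what we're running):

```
# cgroup.conf (v1-style settings)
CgroupPlugin=cgroup/v1      # explicit plugin selection (available in Slurm 22.05+)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes

# slurm.conf (relevant lines)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
```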
Thank you in advance for your assistance!
What We Tried:
We initially set up our system with Slurm and the cgroups V2 plugin on two nodes running Ubuntu 22.04 and one on Ubuntu 18.04.
Encountering issues with the V2 plugin, we switched to the cgroups V1 plugin. We attempted to run jobs on this new configuration.
Additionally, we altered kernel parameters in an effort to resolve the issues.
What We Expected:
Our expectation was that switching to the cgroups V1 plugin would enable smooth execution of jobs across all nodes, including those on Ubuntu 22.04, similar to the success we had on the Ubuntu 18.04 node.
We also hoped that altering the kernel parameters might resolve any compatibility or configuration issues causing the errors.
What Actually Happened:
While the switch to the cgroups V1 plugin allowed job execution on the Ubuntu 18.04 node, it did not resolve the issues on the Ubuntu 22.04 nodes.
We encountered a series of errors on the Ubuntu 22.04 nodes, such as failures to open and mount cgroup hierarchies (cpuset, devices), along with various other cgroup-related errors.
The nodes would go into 'idle' and then 'drain' states after attempting to execute a job. The changes in kernel parameters did not lead to any improvement in the situation.
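In case it helps anyone answering: this is how we've been pulling the drain reason and returning the nodes to service between tests (guarded so it's a no-op on machines without the Slurm client tools; the node name is illustrative):

```shell
#!/bin/sh
# Show why Slurm drained nodes, then return one to service.
if command -v sinfo >/dev/null 2>&1; then
  sinfo -R   # lists each drained/down node with slurmd's Reason string
  # "node01" is an illustrative node name; substitute your own.
  scontrol update NodeName=node01 State=RESUME
else
  echo "Slurm client tools not installed on this machine"
fi
```

The Reason column from `sinfo -R` usually quotes the first slurmd error, which is often more direct than digging through the slurmd logs.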