I'm running a PyTorch Lightning training script on a SLURM-managed cluster using Distributed Data Parallel (DDP). My setup involves 1 node with 4 GPUs. However, I'm encountering issues with the configuration of GPUs in my training script.
Environment:
- PyTorch Lightning
- SLURM cluster with 1 node and 4 GPUs
Trainer Configuration in YAML:
```yaml
trainer:
  _target_: lightning.pytorch.trainer.Trainer
  default_root_dir: ${paths.run_dir}
  # ... other settings ...
  accelerator: gpu
  devices: 4
  strategy: ddp
```
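For reference, this is roughly what that YAML resolves to, assuming it is instantiated via Hydra (the `_target_` key is Hydra's instantiation marker); the root dir is just a placeholder here:

```python
# Rough equivalent of the Hydra-instantiated config above (illustrative only;
# "<run_dir>" stands in for the ${paths.run_dir} interpolation).
from lightning.pytorch import Trainer

trainer = Trainer(
    default_root_dir="<run_dir>",
    accelerator="gpu",
    devices=4,        # the value that triggers the MisconfigurationException below
    strategy="ddp",
)
```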
SBATCH Script:
```bash
#!/bin/bash -l
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --output=/project/home/<project>/output/output.txt
#SBATCH --error=/project/home/<project>/output/error.txt

# set python path to poetry environment in home
cd repos/myrepo
srun python -m myrepo.main
```
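For debugging, something like the following could be dropped at the top of `myrepo/main.py` (the module launched above) to show what each SLURM task sees before Lightning initializes DDP; this is just a sketch:

```python
# Diagnostic sketch: print the SLURM task info and GPU visibility per process.
import os
import torch

print(
    f"SLURM_PROCID={os.environ.get('SLURM_PROCID')} "
    f"SLURM_NTASKS={os.environ.get('SLURM_NTASKS')} "
    f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')} "
    f"visible GPUs={torch.cuda.device_count()}",
    flush=True,
)
```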
Issue:
When I set `devices: 4` in the Trainer configuration, I receive the following error four times:

```
Error in call to target 'lightning.pytorch.trainer.trainer.Trainer': MisconfigurationException('You requested gpu: [0, 1, 2, 3]\n But your machine only has: [0]')
```
If I change `devices` to `-1`, I get this error instead:

```
ValueError: You set `devices=1` in Lightning, but the number of tasks per node configured in SLURM `--ntasks-per-node=4` does not match. HINT: Set `devices=4`.
```
If I replace `srun python` with plain `python`, training runs. However, Lightning clearly recommends using `srun` in its docs, and I get this warning:

```
The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python .../repos/myrepo
```
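My understanding (which may be off) is that when launched with `srun`, Lightning reads the SLURM task variables to assign ranks. A rough sketch of that mapping, based on Lightning's `SLURMEnvironment` plugin (exact behaviour may differ between versions):

```python
# Illustrative only: how Lightning's SLURM plugin appears to map SLURM task
# variables to DDP ranks (lightning.pytorch.plugins.environments.SLURMEnvironment).
from lightning.pytorch.plugins.environments import SLURMEnvironment

if SLURMEnvironment.detect():              # true when SLURM variables are present
    env = SLURMEnvironment()
    print("world size :", env.world_size())   # from SLURM_NTASKS
    print("global rank:", env.global_rank())  # from SLURM_PROCID
    print("local rank :", env.local_rank())   # from SLURM_LOCALID
```

I may be misreading how this interacts with a bare `python` launch, which is part of what I'm asking below.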
It seems like there is a mismatch between how the GPUs are requested from SLURM and how PyTorch Lightning recognizes them.
Questions:
- How should I correctly configure the GPU devices in PyTorch Lightning's Trainer to work with SLURM's DDP setup?
- What is the difference between `srun python` and plain `python` inside the sbatch script?
Any insights or suggestions on resolving these errors would be greatly appreciated.