PyTorch Lightning DDP Error with SLURM: GPU MisconfigurationException and Devices Mismatch


I'm running a PyTorch Lightning training script on a SLURM-managed cluster using Distributed Data Parallel (DDP). My setup involves 1 node with 4 GPUs. However, I'm encountering issues with the configuration of GPUs in my training script.

Environment:

  • PyTorch Lightning
  • SLURM cluster with 1 node and 4 GPUs

Trainer Configuration in YAML:

trainer:
  _target_: lightning.pytorch.trainer.Trainer
  default_root_dir: ${paths.run_dir}
  # ... other settings ...
  accelerator: gpu
  devices: 4
  strategy: ddp

SBATCH Script:

#!/bin/bash -l
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --output=/project/home/<project>/output/output.txt
#SBATCH --error=/project/home/<project>/output/error.txt

# set python path to poetry environment in home

cd repos/myrepo
srun python -m myrepo.main

Issue:

  1. When I set devices: 4 in the Trainer configuration, I receive the following error four times:

    Error in call to target 'lightning.pytorch.trainer.trainer.Trainer':
    MisconfigurationException('You requested gpu: [0, 1, 2, 3]\n But your machine only has: [0]')
    
  2. If I change devices to -1, I get this error:

    ValueError: You set `devices=1` in Lightning, but the number of tasks per node configured in SLURM `--ntasks-per-node=4` does not match. HINT: Set `devices=4`.
    
  3. If I replace srun python with plain python, then training works. However, Lightning clearly recommends using srun in their docs, and I get this warning:

    The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python .../repos/myrepo
    
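To narrow down whether this is a SLURM-side or Lightning-side problem, I wrote a small diagnostic script (hypothetical name check.py, launched with `srun python check.py`) that prints, per task, the standard SLURM environment variables Lightning's SLURM detection relies on, plus `CUDA_VISIBLE_DEVICES`. If each of the 4 tasks reports only one visible GPU, that would explain the "your machine only has: [0]" message:

```python
# check.py -- hypothetical diagnostic helper for the devices/ntasks mismatch.
# Prints the SLURM and CUDA environment variables each task sees, so you can
# compare them across the 4 tasks launched by `srun`.
import os


def slurm_gpu_env():
    """Return the env vars relevant to SLURM task/GPU assignment."""
    keys = [
        "SLURM_NTASKS",            # total tasks in the job
        "SLURM_NTASKS_PER_NODE",   # should match --ntasks-per-node=4
        "SLURM_PROCID",            # global rank of this task
        "SLURM_LOCALID",           # local rank on this node
        "CUDA_VISIBLE_DEVICES",    # GPUs this task is allowed to see
    ]
    return {k: os.environ.get(k, "<unset>") for k in keys}


if __name__ == "__main__":
    env = slurm_gpu_env()
    # One line per task, prefixed by rank, so interleaved output stays readable.
    rank = env["SLURM_PROCID"]
    print(f"[rank {rank}] " + " ".join(f"{k}={v}" for k, v in env.items()))
```

With `srun python check.py` this prints one line per task; without `srun` it prints a single line with the rank unset, which mirrors the behavior difference I see in issue 3.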

It seems there is a mismatch between how SLURM allocates GPUs to tasks and how PyTorch Lightning detects them: Lightning expects the devices setting to agree with --ntasks-per-node, yet each task apparently sees only a single GPU.

Questions:

  1. How should I correctly configure the GPU devices in PyTorch Lightning's Trainer to work with SLURM's DDP setup?
  2. What is the difference between srun python and plain python inside an sbatch script, and why does Lightning behave differently under each?

Any insights or suggestions on resolving these errors would be greatly appreciated.
