Profiling Torch Distributed with Scalene (torchrun + scalene)


I am using torch.distributed in my code, which I run with the torchrun command from my terminal, and I want to profile it with the scalene profiler.

Sample torchrun command:

torchrun --nnodes 1 --nproc_per_node 6 --standalone main.py --train

Sample scalene profiling command:

scalene --no-browser --reduced-profile --cpu --outfile profile_rnd00_pong_5hr_teslaT4_test00.html --profile-interval 120 main.py --train

I tried combining the two as follows, but it does not work:

scalene --no-browser --reduced-profile --cpu --outfile profile_rnd00_pong_5hr_teslaT4_test00.html --profile-interval 120 torchrun --nnodes 1 --nproc_per_node 6 --standalone main.py --train

Is there a way to use scalene while still relying on torchrun to run my distributed PyTorch code?

I have also tried the following:

python -m scalene --- -m torch.distributed.run --nnodes 1 --nproc_per_node 6 --standalone main.py --train

and it raised the following error:

Scalene: Program did not run for long enough to profile.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 136598) of binary: /tmp/scalenelcqus7e6/python
Error in program being profiled:

Note: main.py uses argparse and accepts --train as one of its options, so the exit code 2 above may simply mean its argument parsing failed (argparse exits with code 2 on a parsing error).
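One direction I have been considering, though I have not verified it, is torchrun's --no-python flag, which tells torchrun to execute the worker command directly instead of prepending python. In principle, each rank could then launch scalene itself. A minimal sketch, assuming a wrapper script I would write myself (the name profile_rank.sh and the per-rank output naming are my own; torchrun exports LOCAL_RANK to every worker it spawns):

    #!/usr/bin/env bash
    # profile_rank.sh -- hypothetical per-rank wrapper.
    # torchrun exports LOCAL_RANK to each worker; use it to give every rank
    # its own scalene output file so the six ranks do not overwrite each other.
    exec scalene --no-browser --reduced-profile --cpu \
        --outfile "profile_rank${LOCAL_RANK}.html" \
        --profile-interval 120 \
        main.py --train

which would be launched with:

    chmod +x profile_rank.sh
    torchrun --nnodes 1 --nproc_per_node 6 --standalone --no-python ./profile_rank.sh

I do not know whether scalene behaves well with six concurrent instances, so I would welcome corrections on this approach.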

I have also looked at the PyTorch profiler, but it does not seem to help: it only covers PyTorch operations and does not profile the rest of my code. I need to profile those other parts as well, such as my use of Python arrays and conversions between object types, in order to optimize them.
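As a fallback, I could profile a single rank by exporting the rendezvous variables that torchrun would normally set and letting scalene launch the process directly. This is an untested sketch that assumes main.py calls torch.distributed.init_process_group with the default env:// rendezvous; the address and port are arbitrary local values:

    # Run one rank without torchrun so scalene itself is the launcher.
    RANK=0 LOCAL_RANK=0 WORLD_SIZE=1 \
    MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 \
    scalene --no-browser --reduced-profile --cpu \
        --outfile profile_rank0.html --profile-interval 120 \
        main.py --train

That would only exercise one process, though, which is why I would still prefer an approach that keeps torchrun in the loop.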

I really appreciate your help. Thank you.
