Having "torch.distributed.elastic.multiprocessing.errors.ChildFailedError:" error when using accelerator


I'm trying to use accelerate to fine-tune an LLM with FSDP and I keep hitting this error. Suspecting the problem was in my own code, I ran the basic `accelerate test`, and it fails as well.

This is my `accelerate env` output:

  • Accelerate version: 0.27.2
  • Platform: Linux-5.4.0-96-generic-x86_64-with-glibc2.31
  • Python version: 3.9.5
  • Numpy version: 1.26.2
  • PyTorch version (GPU?): 2.2.0+cu121 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 1007.76 GB
  • GPU type: NVIDIA A40
  • Accelerate default config:
      - compute_environment: LOCAL_MACHINE
      - distributed_type: FSDP
      - mixed_precision: bf16
      - use_cpu: False
      - debug: False
      - num_processes: 3
      - machine_rank: 0
      - num_machines: 1
      - rdzv_backend: static
      - same_network: True
      - main_training_function: main
      - fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': 'LlamaDecoderLayer', 'fsdp_use_orig_params': True}
      - downcast_bf16: no
      - tpu_use_cluster: False
      - tpu_use_sudo: False
      - tpu_env: []
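
For reference, the basic distributed state that this config should produce can be printed with a small script like the one below (a minimal sketch; the filename check_env.py is just an example, and the attribute names are the public Accelerator properties in accelerate 0.27.x), launched with `accelerate launch check_env.py`:

# check_env.py -- minimal sketch for confirming the launch picks up the config above.
from accelerate import Accelerator

def main():
    accelerator = Accelerator()
    # Each of the 3 processes prints its own view of the distributed setup.
    print(
        f"distributed_type={accelerator.distributed_type}, "
        f"num_processes={accelerator.num_processes}, "
        f"process_index={accelerator.process_index}, "
        f"device={accelerator.device}, "
        f"mixed_precision={accelerator.mixed_precision}"
    )

if __name__ == "__main__":
    main()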

Then I ran the built-in test:

accelerate test


Running:  accelerate-launch /usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py
stderr: Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: FSDP  Backend: nccl
stdout: Num processes: 3
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout:
stdout: Mixed precision type: bf16
stdout:
stdout: Distributed environment: FSDP  Backend: nccl
stdout: Num processes: 3
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout:
stdout: Mixed precision type: bf16
stdout:
stdout: Distributed environment: FSDP  Backend: nccl
stdout: Num processes: 3
stdout: Process index: 2
stdout: Local process index: 2
stdout: Device: cuda:2
stdout:
stdout: Mixed precision type: bf16
stdout:
stdout:
stdout: **Test process execution**
stderr: Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
stdout:
stdout: **Test split between processes as a list**
stdout:
stdout: **Test split between processes as a dict**
stdout:
stdout: **Test split between processes as a tensor**
stdout:
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout:
stdout: **DataLoader integration test**
stdout: 1 0 2 tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout:         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout:         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout:         54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
stdout:         72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
stdout:         90, 91, 92, 93, 94, 95], device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'>
stdout: tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout:         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout:         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout:         54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
stdout:         72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
stdout:         90, 91, 92, 93, 94, 95], device='cuda:2') <class 'accelerate.data_loader.DataLoaderShard'>
stdout: tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout:         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout:         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout:         54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
stdout:         72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
stdout:         90, 91, 92, 93, 94, 95], device='cuda:1') <class 'accelerate.data_loader.DataLoaderShard'>
stderr: Traceback (most recent call last):
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 708, in <module>
stderr: Traceback (most recent call last):
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 708, in <module>
stderr: Traceback (most recent call last):
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 708, in <module>
stderr:     main()
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 687, in main
stderr:     main()
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 687, in main
stderr:     dl_preparation_check()
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 196, in dl_preparation_check
stderr:     dl = prepare_data_loader(
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/data_loader.py", line 858, in prepare_data_loader
stderr:     dl_preparation_check()
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 196, in dl_preparation_check
stderr:     main()
stderr:     dl = prepare_data_loader(  File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 687, in main
stderr:
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/data_loader.py", line 858, in prepare_data_loader
stderr:     raise ValueError(
stderr: ValueError: To use a `DataLoader` in `split_batches` mode, the batch size (8) needs to be a round multiple of the number of processes (3).
stderr:     dl_preparation_check()
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 196, in dl_preparation_check
stderr:     raise ValueError(
stderr: ValueError: To use a `DataLoader` in `split_batches` mode, the batch size (8) needs to be a round multiple of the number of processes (3).
stderr:     dl = prepare_data_loader(
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/data_loader.py", line 858, in prepare_data_loader
stderr:     raise ValueError(
stderr: ValueError: To use a `DataLoader` in `split_batches` mode, the batch size (8) needs to be a round multiple of the number of processes (3).
stderr: [2024-03-14 13:39:21,081] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 27324) of binary: /usr/bin/python3.9
stderr: Traceback (most recent call last):
stderr:   File "/usr/local/bin/accelerate-launch", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 1029, in main
stderr:     launch_command(args)
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 1010, in launch_command
stderr:     multi_gpu_launcher(args)
stderr:   File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
stderr:     distrib_run.run(args)
stderr:   File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 803, in run
stderr:     elastic_launch(
stderr:   File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
stderr:     return launch_agent(self._config, self._entrypoint, list(args))
stderr:   File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
stderr:     raise ChildFailedError(
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr: [1]:
stderr:   time      : 2024-03-14_13:39:21
stderr:   host      : 919c8ff8c821
stderr:   rank      : 1 (local_rank: 1)
stderr:   exitcode  : 1 (pid: 27325)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [2]:
stderr:   time      : 2024-03-14_13:39:21
stderr:   host      : 919c8ff8c821
stderr:   rank      : 2 (local_rank: 2)
stderr:   exitcode  : 1 (pid: 27326)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr:   time      : 2024-03-14_13:39:21
stderr:   host      : 919c8ff8c821
stderr:   rank      : 0 (local_rank: 0)
stderr:   exitcode  : 1 (pid: 27324)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I can't figure out what the problem is.
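
From the traceback, the failure comes from the `split_batches` check in `prepare_data_loader`: the batch size (8) has to be a round multiple of the number of processes (3). In my own script that constraint would look roughly like this (a minimal sketch, not the bundled test script; the dataset, batch size choice, and loop are made up for illustration):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# split_batches is an Accelerator kwarg in 0.27.x (assumption); with it enabled,
# the *global* batch is divided across processes, so the DataLoader batch size
# must be a round multiple of num_processes.
accelerator = Accelerator(split_batches=True)

dataset = TensorDataset(torch.arange(96))
global_batch_size = 3 * accelerator.num_processes  # 9 for 3 GPUs; 8 would raise the ValueError above
dataloader = DataLoader(dataset, batch_size=global_batch_size, shuffle=True)

# prepare() calls prepare_data_loader() internally, which performs the
# "round multiple of the number of processes" check seen in the traceback.
dataloader = accelerator.prepare(dataloader)

for (batch,) in dataloader:
    pass  # each process receives global_batch_size // num_processes samples per step

But since the batch size in the bundled test script appears to be fixed at 8, I don't see what I am supposed to change in my setup to make `accelerate test` pass with 3 GPUs.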
