The example provided in the Memory Requirements page of the DeepSpeed 0.10.1 documentation is as follows:

python -c 'from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_cold; \
estimate_zero2_model_states_mem_needs_all_cold(total_params=2851e6, num_gpus_per_node=8, num_nodes=1)'
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 2851M total params.
  per CPU  |  per GPU |   Options
  127.45GB |   5.31GB | offload_optimizer=cpu
  127.45GB |  15.93GB | offload_optimizer=none

I noticed that the per-CPU value is the same in both rows. However, offload_optimizer takes the values cpu and none respectively, and with the cpu option the optimizer states and gradients would theoretically be stored in CPU memory, so I would expect that row to require more CPU memory than the none row; the values the API reports don't match that expectation. Am I misunderstanding the implementation of ZeRO-2, or of DeepSpeed's offload_optimizer parameter?
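
For clarity, by offload_optimizer I mean the ZeRO config knob, i.e. a config along these lines (a minimal excerpt written as the Python dict you can pass to deepspeed.initialize via its config argument; everything outside zero_optimization is elided):

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        }
    }
}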

I reproduced the example on my own machine.
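
For what it's worth, the following sketch reproduces both rows of the table above. The formulas and the 1.5x additional_buffer_factor are my reconstruction from reading estimate_zero2_model_states_mem_needs in deepspeed/runtime/zero/stage_1_and_2.py, not an authoritative copy of the implementation, so the per-term interpretations in the comments may be off:

GiB = 2**30

def zero2_estimate(total_params, num_gpus_per_node=1, num_nodes=1,
                   cpu_offload=True, additional_buffer_factor=1.5):
    total_gpus = num_nodes * num_gpus_per_node
    if cpu_offload:
        # fp16 params stay on each GPU (2 bytes/param)
        gpu_mem = 2 * total_params
        # optimizer/gradient state in CPU memory; I read this as
        # 4 bytes/param per rank, floored at 16 bytes/param
        cpu_mem = total_params * max(4 * total_gpus, 16) * additional_buffer_factor
    else:
        # fp16 params + fp16 grads (4 bytes/param), plus this rank's
        # 1/total_gpus shard of the 16-bytes/param optimizer states
        gpu_mem = 4 * total_params + 16 * total_params / total_gpus
        # an fp32 copy per local rank in CPU memory (my reading: this
        # covers initialization/checkpointing, not steady-state training)
        cpu_mem = total_params * 4 * num_gpus_per_node * additional_buffer_factor
    return cpu_mem / GiB, gpu_mem / GiB

for offload in (True, False):
    cpu, gpu = zero2_estimate(2851e6, num_gpus_per_node=8, num_nodes=1,
                              cpu_offload=offload)
    print(f"offload_optimizer={'cpu' if offload else 'none'}: "
          f"per CPU {cpu:.2f}GB, per GPU {gpu:.2f}GB")

If that reconstruction is right, the two per-CPU figures come from two different formulas that happen to coincide at 127.45GB for this particular setup (1 node, 8 GPUs per node), but I would like confirmation that I am reading it correctly.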
