The example provided in the "Memory Requirements" page of the DeepSpeed 0.10.1 documentation is as follows:
python -c 'from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_cold; \
estimate_zero2_model_states_mem_needs_all_cold(total_params=2851e6, num_gpus_per_node=8, num_nodes=1)'
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 2851M total params.
per CPU | per GPU | Options
127.45GB | 5.31GB | offload_optimizer=cpu
127.45GB | 15.93GB | offload_optimizer=none
I noticed that the per CPU value is the same for both lines. However, offload_optimizer takes the values cpu and none respectively, and the cpu option should, in theory, store the optimizer states and gradients in CPU memory and therefore need more CPU memory than the none option, which doesn't match what the API reports. Am I misunderstanding the implementation of ZeRO-2 or of the DeepSpeed offload_optimizer parameter?
I reproduced the example on my own machine.
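For reference, here is a back-of-the-envelope reconstruction of where both numbers come from. This is only a sketch based on my reading of the ZeRO-2 estimator heuristic in deepspeed/runtime/zero/stage_1_and_2.py; the per-parameter byte counts and the 1.5x additional_buffer_factor below are my assumptions from that source file, not documented API behavior.

# Sketch: recompute the two table rows with the heuristic that (as far as
# I can tell) estimate_zero2_model_states_mem_needs uses internally in
# deepspeed/runtime/zero/stage_1_and_2.py. The byte counts and the 1.5x
# buffer factor are assumptions from reading that file, not documented API.
total_params = 2851e6
num_gpus_per_node = 8
num_nodes = 1
total_gpus = num_nodes * num_gpus_per_node
buffer_factor = 1.5          # assumed default additional_buffer_factor
GiB = 2**30

# offload_optimizer=cpu: each GPU keeps only the fp16 params (2 bytes/param);
# gradients and fp32 optimizer states are assumed to live in CPU memory.
gpu_cpu_offload = 2 * total_params
cpu_cpu_offload = total_params * max(4 * total_gpus, 16) * buffer_factor

# offload_optimizer=none: fp16 params + fp16 grads (4 bytes/param) per GPU,
# plus ~16 bytes/param of fp32 optimizer states sharded across all GPUs.
gpu_no_offload = 4 * total_params + 16 * total_params / total_gpus
cpu_no_offload = total_params * 4 * num_gpus_per_node * buffer_factor

print(f"{cpu_cpu_offload / GiB:.2f}GB | {gpu_cpu_offload / GiB:.2f}GB | offload_optimizer=cpu")
print(f"{cpu_no_offload / GiB:.2f}GB | {gpu_no_offload / GiB:.2f}GB | offload_optimizer=none")
# Prints 127.45GB | 5.31GB and 127.45GB | 15.93GB, matching the table above.

If this reading is right, the two per CPU figures coincide by construction whenever num_nodes=1 and 4 * total_gpus >= 16, since max(4 * total_gpus, 16) then equals 4 * num_gpus_per_node, so the estimator budgets the same CPU memory in both cases. But I may be missing the intended semantics here.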