What causes low CPU utilization in RLlib PPO? What does 'cpu_util_percent' measure?

I implemented multi-agent PPO in RLlib with a custom environment. It learns and works well, except for speed. I suspect an underutilized CPU may be the cause, so I want to know what ray/tune/perf/cpu_util_percent measures. Does it measure only the rollout workers, or is it averaged over the learner as well? And what might cause the low utilization? (All my runs report an average of 13% CPU usage.)

Running on GCP:
Ray 2.0
Python 3.9
torch 1.12
head: n1-standard-8 with 1 V100 GPU
2 worker machines: c2-standard-60

num_workers: 120  # rollout workers, not machines (num_workers = num_rollout_workers)
num_envs_per_worker: 1
num_cpus_for_driver: 8
num_gpus: 1
num_cpus_per_worker: 1
num_gpus_per_worker: 0
train_batch_size: 12000
sgd_minibatch_size: 3000
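
For reference, the arithmetic implied by this config can be sketched as below. This is only a rough sanity check, not RLlib's actual scheduling: the helper `rollout_budget` is hypothetical, the vCPU counts (60 and 8) are read off the GCP machine type names, and the even split of the batch across workers is an assumption, since the real per-worker sample size depends on `rollout_fragment_length`, which is not shown here.

```python
import math

# Hypothetical helper (not part of RLlib): back-of-the-envelope numbers
# derived from the config values listed above.
def rollout_budget(train_batch_size, num_workers, num_envs_per_worker,
                   sgd_minibatch_size, num_cpus_per_worker, num_cpus_for_driver):
    envs = num_workers * num_envs_per_worker
    return {
        # Assumes the train batch is split evenly across all environments.
        "steps_per_env": math.ceil(train_batch_size / envs),
        # Number of minibatches in one SGD pass over the train batch.
        "minibatches_per_epoch": math.ceil(train_batch_size / sgd_minibatch_size),
        # CPUs the config asks Ray to reserve.
        "cpus_requested": num_workers * num_cpus_per_worker + num_cpus_for_driver,
    }

budget = rollout_budget(
    train_batch_size=12000,
    num_workers=120,
    num_envs_per_worker=1,
    sgd_minibatch_size=3000,
    num_cpus_per_worker=1,
    num_cpus_for_driver=8,
)

# Assumption: 2 x c2-standard-60 (60 vCPUs each) plus 1 x n1-standard-8 (8 vCPUs).
cpus_available = 2 * 60 + 8

print(budget)
print("cpus_available:", cpus_available)
```

Under these assumptions each rollout worker collects only about 100 timesteps per iteration, and the requested CPUs match the available ones exactly. Since PPO samples synchronously, workers would be idle while the learner runs SGD, which is one possible contributor to a low average; this is a hypothesis to check, not a diagnosis.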

I also tried a smaller batch size (4096) with fewer workers (10), and a larger batch size (480000). All of them resulted in 10-20% CPU usage.

I cannot share the code.
