I am looking at the example code for RLlib from

https://docs.ray.io/en/latest/rllib/rllib-training.html#rllib-config-framework

with the rollout line modified to .rollouts(num_rollout_workers=10, horizon=50000), so that it uses as many workers as I have CPU cores. I want to test whether this implementation can match the performance of my best cart-pole bot (https://github.com/sebtac/MLxE), which learns on 8 cores in 6 minutes to keep the pole up for 50K steps and, when fully trained, does so for 1M+ steps.

Analyzing the output generated by print(pretty_print(result)), it seems that training "freezes": the reported numbers do not change from iteration to iteration (see details below). But after a couple of iterations the training "restarts", generally with better performance indicators than in the last iteration. Why might this be the case? I assume it has something to do with how the training algorithm is implemented in RLlib. Maybe asking it to run episodes as long as 50K steps means that not all values in the results get updated before the next iteration's results are printed?

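If I understand the sampling setup correctly, each call to algo.train() collects roughly train_batch_size env steps (num_env_steps_sampled_this_iter: 4000 in the output below suggests the default of 4000 is in effect), so a single 50K-step episode would have to span many iterations before it can finish and show up in the episode metrics. A back-of-envelope sketch of that assumption:

# Back-of-envelope check (assumption: ~4000 env steps are sampled per algo.train()
# call, as suggested by num_env_steps_sampled_this_iter: 4000 in the results below).
steps_per_iteration = 4000
target_episode_len = 50_000   # the horizon I set
print(target_episode_len / steps_per_iteration)  # -> 12.5 iterations before one such episode can complete
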
Code:

import time

from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.logger import pretty_print

start_time = time.time()

print("LOADED")
algo = (
    PPOConfig()
    .rollouts(num_rollout_workers=10, horizon=50000)  # one worker per CPU core; cap episodes at 50K steps
    .resources(num_gpus=0)
    .environment(env="CartPole-v1")
    .build()
)
print("ALGO DONE")

for i in range(100):
    result = algo.train()
    print(pretty_print(result))

    if i % 5 == 0:
        checkpoint_dir = algo.save()
        print(f"Checkpoint saved in directory {checkpoint_dir}")
        
print("DONE!")
end_time = time.time()
print(end_time-start_time)
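
To make it easier to see when new episodes actually finish, I was also considering replacing the full pretty_print dump with a one-line summary per iteration, roughly like this (a sketch; the key names are the ones visible in the result dump below):

for i in range(100):
    result = algo.train()
    # One-line summary instead of the full dump; key names taken from the result printout below.
    print(
        f"iter={i:3d} "
        f"episodes_this_iter={result['episodes_this_iter']} "
        f"episodes_total={result['episodes_total']} "
        f"steps_this_iter={result['num_env_steps_sampled_this_iter']} "
        f"episode_reward_mean={result['episode_reward_mean']:.1f}"
    )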

Results:

agent_timesteps_total: 400000
counters:
  num_agent_steps_sampled: 400000
  num_agent_steps_trained: 400000
  num_env_steps_sampled: 400000
  num_env_steps_trained: 400000
custom_metrics: {}
date: 2022-12-29_22-13-37
done: false
episode_len_mean: 2430.73
episode_media: {}
episode_reward_max: 13961.0
episode_reward_mean: 2430.73
episode_reward_min: 65.0
episodes_this_iter: 0
episodes_total: 382
experiment_id: d2a7bd4c67654d7790cfb8655fa78a8f
hostname: Sebastians-MacBook-Pro-2.local
info:
  learner:
    default_policy:
      custom_metrics: {}
      diff_num_grad_updates_vs_sampler_policy: 464.5
      learner_stats:
        cur_kl_coeff: 1.6653346031121838e-17
        cur_lr: 4.999999873689376e-05
        entropy: 0.42701029777526855
        entropy_coeff: 0.0
        kl: -8.940913009958251e-10
        model: {}
        policy_loss: 0.0
        total_loss: -1.4889612459834557e-26
        vf_explained_var: -1.0
        vf_loss: 0.0
      num_agent_steps_trained: 128.0
      num_grad_updates_lifetime: 92535.5
  num_agent_steps_sampled: 400000
  num_agent_steps_trained: 400000
  num_env_steps_sampled: 400000
  num_env_steps_trained: 400000
iterations_since_restore: 100
node_ip: 127.0.0.1
num_agent_steps_sampled: 400000
num_agent_steps_trained: 400000
num_env_steps_sampled: 400000
num_env_steps_sampled_this_iter: 4000
num_env_steps_trained: 400000
num_env_steps_trained_this_iter: 4000
num_faulty_episodes: 0
num_healthy_workers: 10
num_in_flight_async_reqs: 0
num_remote_worker_restarts: 0
num_steps_trained_this_iter: 4000
perf:
  cpu_util_percent: 43.44375
  ram_util_percent: 46.568749999999994
pid: 21391
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
  mean_action_processing_ms: 0.07180588214674852
  mean_env_render_ms: 0.0
  mean_env_wait_ms: 0.051468155449639424
  mean_inference_ms: 8.535723368819935
  mean_raw_obs_processing_ms: 0.19257008707362022
sampler_results:
  custom_metrics: {}
  episode_len_mean: 2430.73
  episode_media: {}
  episode_reward_max: 13961.0
  episode_reward_mean: 2430.73
  episode_reward_min: 65.0
  episodes_this_iter: 0
  hist_stats:
    episode_lengths: [243, 199, 228, 82, 213, 338, 174, 485, 65, 84, 241, 347, 87,
      211, 271, 123, 379, 430, 315, 366, 305, 294, 275, 372, 264, 409, 239, 386, 235,
      321, 371, 151, 129, 491, 319, 859, 375, 318, 625, 594, 714, 622, 360, 2613,
      1473, 2520, 971, 2727, 2675, 3172, 3138, 2559, 2219, 5523, 2478, 2276, 3264,
      2491, 2418, 5696, 4909, 322, 744, 2535, 2508, 2422, 4157, 1797, 4350, 4334,
      4229, 3523, 4170, 4415, 2028, 2006, 1252, 2496, 2620, 5257, 5763, 4102, 5463,
      6000, 9156, 9189, 6292, 5792, 6237, 11669, 11629, 6299, 1704, 1856, 1887, 2206,
      7746, 2047, 1879, 13961]
    episode_reward: [243.0, 199.0, 228.0, 82.0, 213.0, 338.0, 174.0, 485.0, 65.0,
      84.0, 241.0, 347.0, 87.0, 211.0, 271.0, 123.0, 379.0, 430.0, 315.0, 366.0, 305.0,
      294.0, 275.0, 372.0, 264.0, 409.0, 239.0, 386.0, 235.0, 321.0, 371.0, 151.0,
      129.0, 491.0, 319.0, 859.0, 375.0, 318.0, 625.0, 594.0, 714.0, 622.0, 360.0,
      2613.0, 1473.0, 2520.0, 971.0, 2727.0, 2675.0, 3172.0, 3138.0, 2559.0, 2219.0,
      5523.0, 2478.0, 2276.0, 3264.0, 2491.0, 2418.0, 5696.0, 4909.0, 322.0, 744.0,
      2535.0, 2508.0, 2422.0, 4157.0, 1797.0, 4350.0, 4334.0, 4229.0, 3523.0, 4170.0,
      4415.0, 2028.0, 2006.0, 1252.0, 2496.0, 2620.0, 5257.0, 5763.0, 4102.0, 5463.0,
      6000.0, 9156.0, 9189.0, 6292.0, 5792.0, 6237.0, 11669.0, 11629.0, 6299.0, 1704.0,
      1856.0, 1887.0, 2206.0, 7746.0, 2047.0, 1879.0, 13961.0]
  num_faulty_episodes: 0
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.07180588214674852
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.051468155449639424
    mean_inference_ms: 8.535723368819935
    mean_raw_obs_processing_ms: 0.19257008707362022
time_since_restore: 1137.2171549797058
time_this_iter_s: 11.495001077651978
time_total_s: 1137.2171549797058
timers:
  learn_throughput: 514.515
  learn_time_ms: 7774.316
  load_throughput: 38164731.574
  load_time_ms: 0.105
  synch_weights_time_ms: 8.647
  training_iteration_time_ms: 11505.094
timestamp: 1672373617
timesteps_since_restore: 0
timesteps_total: 400000
training_iteration: 100
trial_id: default
warmup_time: 9.37411093711853

While entries like timesteps_total and learner_stats do change from iteration to iteration, elements like hist_stats do not. I assume this means that learning (parameter updates) continues from iteration to iteration, but the episodes played in subsequent iterations produce the same results as in the previous iteration. Does that mean the policy becomes deterministic at some point? And/or that the initial states are the same across multiple iterations? Or that generating an episode simply takes longer than the interval between two result reports?
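
For reference, to check the "policy becomes deterministic" idea I was planning to evaluate the trained policy outside of RLlib with a plain environment loop, something like the sketch below (assumptions: a gym version with the 5-tuple step API and the max_episode_steps kwarg; explore=False should return the greedy action):

import gym  # or gymnasium, depending on what the installed RLlib version expects

# Override CartPole-v1's default 500-step time limit so long episodes are possible
# (assumption: the installed gym version accepts the max_episode_steps kwarg).
eval_env = gym.make("CartPole-v1", max_episode_steps=1_000_000)

obs, info = eval_env.reset()
steps, done = 0, False
while not done:
    # explore=False -> greedy action, as opposed to the stochastic sampling used during training.
    action = algo.compute_single_action(obs, explore=False)
    obs, reward, terminated, truncated, info = eval_env.step(action)
    done = terminated or truncated
    steps += 1
print("greedy-policy episode length:", steps)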
