Problem with PettingZoo and Stable-Baselines3: agents trained in a ParallelEnv freeze when evaluated after training completes


I am in the process of creating a custom environment in which multiple agents will be trained in parallel to minimize some environment-related metric.

However, I am having trouble at the very first stage of learning how to do any of this: I can't get any of the environments provided by PettingZoo (PZ) to even run as expected. I have followed the tutorials from PZ themselves for Waterworld and Knights-Archers-Zombies, and I have also been using those two tutorials as templates when attempting to train/evaluate the environments described in the Butterfly and SISL sections.

All in all, the environments I have gotten closest to working are the following: Waterworld, Multiwalker, Knights-Archers-Zombies, and Cooperative Pong.

I have also watched this video by Jordan Terry (CEO of the Farama Foundation) in which the concept of PZ is explained.

The goal of this initial push of getting PZ-provided environments to run is to have examples which I can use as guidelines when I go to add PZ multi-agent, parallel support for my existing Gym environment.

What I have done that does work
The only environment which I have been able to (seemingly) implement correctly is Waterworld; below is the code for it:

"""
Source: https://pettingzoo.farama.org/environments/sisl/waterworld/
"""

from __future__ import annotations

import glob
import os

import supersuit as ss
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy

from pettingzoo.sisl import waterworld_v4

out_dir     = "./Source/PettingZoo/output_data"
log_dir     = out_dir + "/" + "logs"
model_dir   = out_dir + "/" + "models"
model_file_name = "Waterworld_SB3_PPO_" + "03-23-24"

def train(env_fn, steps: int = 10_000, seed: int | None = 0, **env_kwargs):
    # Train a single model to play as each agent in a cooperative Parallel environment
    env = env_fn.parallel_env(**env_kwargs)
    env.reset(seed=seed)

    print(f"Starting training on {str(env.metadata['name'])}.")

    #Vectorize..!
    env = ss.pettingzoo_env_to_vec_env_v1(env)
    env = ss.concat_vec_envs_v1(env, 8, num_cpus=2, base_class="stable_baselines3")

    # Note: Waterworld's observation is a flat 1-D vector (shape (242,) with these settings), so we use an MLP policy rather than a CNN
    model = PPO(
        MlpPolicy,
        env,
        verbose=3,
        learning_rate=1e-3,
        batch_size=256,
    )

    model.learn(total_timesteps=steps) #Train the model...
    model.save(f"{model_dir}/{model_file_name}") #Save the model...

    print("Model has been saved.")
    print(f"Finished training on {str(env.unwrapped.metadata['name'])}.")

    env.close() #Close env when you're done...


def evaluate(env_fn, num_games: int = 10, **env_kwargs):
    # Evaluate a trained agent vs a random agent
    env = env_fn.env(**env_kwargs)

    print(f"\nStarting evaluation on {model_dir}/{model_file_name} (num_games={num_games})")

    try:
        latest_policy = max(glob.glob(f"{model_dir}/{model_file_name}.zip"), key=os.path.getctime)
    except ValueError:
        print("Policy not found.")
        exit(0)

    model = PPO.load(latest_policy)

    rewards = {agent: 0 for agent in env.possible_agents}

    # Note: We train using the Parallel API but evaluate using the AEC API
    # SB3 models are designed for single-agent settings; we get around this by using the same model for every agent
    for i in range(num_games):
        env.reset(seed=i)

        for agent in env.agent_iter():
            obs, reward, termination, truncation, info = env.last()

            for a in env.agents:
                rewards[a] += env.rewards[a]
            if termination or truncation:
                break
            else:
                act = model.predict(obs, deterministic=True)[0]
                print(act)

            env.step(act)
    env.close()

    avg_reward = sum(rewards.values()) / len(rewards.values())
    print("Rewards: ", rewards)
    print(f"Avg reward: {avg_reward}")
    return avg_reward

if __name__ == "__main__":
    env_fn = waterworld_v4
    env_kwargs = { 
        "n_pursuers"        : 2,            #number of pursuing archea (agents)
        "n_evaders"         : 5,            #number of food objects
        "n_poisons"         : 10,           #number of poison objects
        "n_coop"            : 1,            #number of pursuing archea (agents) that must be touching food at the same time to consume it
        "n_sensors"         : 30,           #number of sensors on all pursuing archea (agents)
        "sensor_range"      : 0.3,          #length of sensor dendrite on all pursuing archea (agents)
        "radius"            : 0.015,        #archea base radius. Pursuer: radius, food: 2 x radius, poison: 3/4 x radius
        "obstacle_radius"   : 0.1,          #radius of obstacle object
        "n_obstacles"       : 1,                
        #"obstacle_coord"    : [(0.5, 0.5)], #coordinate of obstacle object. Can be set to None to use a random location
        "obstacle_coord"    : None,         #coordinate of obstacle object. Can be set to None to use a random location
        "pursuer_max_accel" : 0.075,        #pursuer archea maximum acceleration (maximum action size)
        "evader_speed"      : 0.1,          #food speed
        "poison_speed"      : 0.1,          #poison speed
        "poison_reward"     : -2.0,         #reward for pursuer consuming a poison object (typically negative)
        "food_reward"       : 15,           #reward for pursuers consuming a food object
        "encounter_reward"  : 0.02,         #reward for a pursuer colliding with a food object
        "thrust_penalty"    : -0.2,         #scaling factor for the negative reward used to penalize large actions
        "local_ratio"       : 1.0,          #Proportion of reward allocated locally vs distributed globally among all agents
        "speed_features"    : True,         #toggles whether pursuing archea (agent) sensors detect speed of other objects and archea
        "max_cycles"        : 1000,         #After max_cycles steps all agents will return done
        "render_mode"       : "human"
    }

    # Train a model
    #train(env_fn, steps=2000000, seed=0, **env_kwargs)

    # Evaluate 10 games (average reward should be positive but can vary significantly)
    #evaluate(env_fn, num_games=10, **env_kwargs)

    # Watch 2 games
    evaluate(env_fn, num_games=2, **env_kwargs)

For this environment I double-checked that all of my arguments DO get passed into the environment on init and are actually set. I also printed the actions during training and the output makes sense (the agent really is selecting some actions). Then, during evaluation, printing the actions looks like this:

[-0.04170908 -0.26907   ]
...
[-0.77480114 -0.0071846 ]
[0.47875726 0.17024541]
[-0.29022884 -0.17299792]
[0.5283084 0.2298947]
[-0.43450794 -0.20316385]
...
[0.58812404 0.294622  ]

This again makes sense. Of course, I can also see the agent behaving as expected (in Waterworld the agent is represented by a blue circle, and it chases around green circles while avoiding red ones).
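For context, here is a rough sketch of how actions can be printed during training with a Stable-Baselines3 callback (the ActionLogger class name is my own, and this is only a sketch of the idea, not necessarily the exact code I run):

from stable_baselines3.common.callbacks import BaseCallback

class ActionLogger(BaseCallback):
    """Sketch: print a sample of the actions collected during rollouts."""

    def __init__(self, print_freq: int = 1000, verbose: int = 0):
        super().__init__(verbose)
        self.print_freq = print_freq

    def _on_step(self) -> bool:
        # self.locals is filled from collect_rollouts() and contains "actions"
        if self.n_calls % self.print_freq == 0:
            print(self.locals["actions"])
        return True

# Used as: model.learn(total_timesteps=steps, callback=ActionLogger())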

Now here is the problem
I tried to use this approach for all the other environments I have been working with, and those do not work AFTER training. For Multiwalker I have essentially the same script as above, with only the modifications needed to run Multiwalker instead.

Below is the script for that:

"""
Source: https://pettingzoo.farama.org/environments/sisl/multiwalker/
"""

from __future__ import annotations

import glob
import os

import supersuit as ss
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy

from pettingzoo.sisl import multiwalker_v9

out_dir     = "./Source/PettingZoo/output_data"
log_dir     = out_dir + "/" + "logs"
model_dir   = out_dir + "/" + "models"
model_file_name = "Multiwalker_SB3_PPO_" + "03-23-24"

def train(env_fn, steps: int = 10_000, seed: int | None = 0, **env_kwargs):
    # Train a single model to play as each agent in a cooperative Parallel environment
    env = env_fn.parallel_env(**env_kwargs)
    env.reset(seed=seed)

    print(f"Starting training on {str(env.metadata['name'])}.")

    #Vectorize..!
    env = ss.pettingzoo_env_to_vec_env_v1(env)
    env = ss.concat_vec_envs_v1(env, 4, num_cpus=2, base_class="stable_baselines3")

    model = PPO(
        MlpPolicy,
        env             = env,
        verbose         = 3,
        gamma           = 0.95,     #Determines the importance of future rewards in the agent's decision making process.
        n_steps         = 512,      #The number of steps before performing a gradient update.
        ent_coef        = 0.02,     #Encourages exploration by penalizing low-entropy policies.
        learning_rate   = 0.001,    #Controls step size during optimization.
        vf_coef         = 0.045,    #Coefficient for value function loss in the total loss function.   
        gae_lambda      = 0.95,     #Controls the trade-off between bias and variance in estimating the advantage function. 1 is Monte-Carlo estimates, 0 is one-step estimates.
        n_epochs        = 5,        #Optimization epochs per batch of data.    
        clip_range      = 0.25,     #Limits the deviation of the policy update from the old policy.
        batch_size      = 64,       #The number of samples used in each gradient update.
    )

    model.learn(total_timesteps=steps) #Train the model...
    model.save(f"{model_dir}/{model_file_name}") #Save the model...

    print("Model has been saved.")
    print(f"Finished training on {str(env.unwrapped.metadata['name'])}.")

    env.close() #Close env when you're done...

def evaluate(env_fn, num_games: int = 10, **env_kwargs):
    env = env_fn.env(**env_kwargs)

    print(f"\nStarting evaluation on {model_dir}/{model_file_name} (num_games={num_games})")

    try:
        latest_policy = max(glob.glob(f"{model_dir}/{model_file_name}.zip"), key=os.path.getctime)
    except ValueError:
        print("Policy not found.")
        exit(0)

    model = PPO.load(latest_policy)

    rewards = {agent: 0 for agent in env.possible_agents}

    # Note: We train using the Parallel API but evaluate using the AEC API
    # SB3 models are designed for single-agent settings; we get around this by using the same model for every agent
    for i in range(num_games):
        env.reset(seed=i)

        for agent in env.agent_iter():
            obs, reward, termination, truncation, info = env.last()

            for a in env.agents:
                rewards[a] += env.rewards[a]
            if termination or truncation:
                break
            else:
                act = model.predict(obs, deterministic=True)[0]
                print(act)

            env.step(act)
    env.close()

    avg_reward = sum(rewards.values()) / len(rewards.values())
    print("Rewards: ", rewards)
    print(f"Avg reward: {avg_reward}")
    return avg_reward

if __name__ == "__main__":
    env_fn = multiwalker_v9
    env_kwargs = { 
        "n_walkers"         : 3,        # number of bipedal walker agents in environment
        "position_noise"    : 1e-3,     # noise applied to neighbors and package positional observations
        "angle_noise"       : 1e-3,     # noise applied to neighbors and package rotational observations
        "forward_reward"    : 2.0,      # reward received is forward_reward * change in position of the packagee
        "fall_reward"       : -10,      # reward applied when an agent falls
        "shared_reward"     : False,    # whether reward is distributed among all agents or allocated individually
        "terminate_reward"  : -100.0,   # reward applied to each walker if they fail to carry the package to the right edge of the terrain
        "terminate_on_fall" : True,     # If True (default), a single walker falling causes all agents to be done, and they all receive an additional terminate_reward. If False, then only the fallen agent(s) receive fall_reward, and the rest of the agents are not done i.e. the environment continues.
        "remove_on_fall"    : True,     # Remove a walker when it falls (only works when terminate_on_fall is False)
        "terrain_length"    : 200,      # length of terrain in number of steps
        "max_cycles"        : 1000,     #  after max_cycles steps all agents will return done
        #"render_mode"       : None     # None, "human", "rgb_array"
        "render_mode"       : "human"    
    }

    # Train a model
    #train(env_fn, steps=2000000, seed=0, **env_kwargs)

    # Evaluate 10 games (average reward should be positive but can vary significantly)
    #evaluate(env_fn, num_games=10, **env_kwargs)

    # Watch 2 games
    evaluate(env_fn, num_games=2, **env_kwargs)

For this environment I also double-checked that all of my arguments DO get passed into the environment on init and are actually set. I also printed the actions during training and the output makes sense (the agents really are selecting some actions). That looks like this:

[-0.01879434 -0.7358413   0.64253783 -1.        ]
...
[-1.          0.10215812 -1.          0.46984252]
[-1.         -0.6451508   0.33084095  0.11673849]
[-0.45901364  0.82280475  1.         -0.22843474]
[-0.99101275 -0.23140995  0.54993254  0.7401654 ]
[-0.01305871 -0.7434979   1.          0.3282415 ]
...
[-0.95464736  1.          1.         -1.        ]

The agents can be seen doing something as well. The problem is that AFTER training, during evaluation, the output actions now always look like this:

[-1.  1.  1.  1.]
...
[-1.  1.  1.  1.]
[-1.  1.  1.  1.]
[-1.  1.  1.  1.]
[-1.  1.  1.  1.]
[-1.  1.  1.  1.]
...
[-1.  1.  1.  1.]

The agents then proceed to collapse and just sit there frozen.

If I TRAIN using "human" rendering and wait until close to the 2 million time steps, the agents are NOT just sitting there frozen; they're still moving around.

This SAME phenomenon of agents freezing up occurs in all of the other environments I have attempted to implement.

For Knights-Archers-Zombies (KAZ), I followed the same procedure as above, except I modified the script to accommodate the KAZ implementation; the changes are mainly in the env creation:

env = env_fn.parallel_env(**env_kwargs)
env = ss.black_death_v3(env)

visual_observation = not env.unwrapped.vector_state
if visual_observation:
    env = ss.color_reduction_v0(env, mode="B")
    env = ss.resize_v1(env, x_size=84, y_size=84)
    env = ss.frame_stack_v1(env, 3)

env.reset(seed=seed)

print(f"Starting training on {str(env.metadata['name'])}.")

env = ss.pettingzoo_env_to_vec_env_v1(env)
env = ss.concat_vec_envs_v1(env, 4, num_cpus=2, base_class="stable_baselines3")

and the environment config:

env_fn = knights_archers_zombies_v10
env_kwargs = {
    "spawn_rate"        : 20,       # how many cycles before a new zombie is spawned. A lower number means zombies are spawned at a higher rate.
    "num_archers"       : 2,        # how many archer agents initially spawn.
    "num_knights"       : 2,        # how many knight agents initially spawn.
    "max_zombies"       : 10,       # maximum number of zombies that can exist at a time
    "max_arrows"        : 10,       # maximum number of arrows that can exist at a time
    "killable_knights"  : True,     # if set to False, knight agents cannot be killed by zombies.
    "killable_archers"  : True,     # if set to False, archer agents cannot be killed by zombies.
    "pad_observation"   : True,     # if agents are near the edge of the environment, their observation cannot form a 40x40 grid. If this is set to True, the observation is padded with black.
    "line_death"        : False,    # if set to False, agents do not die when they touch the top or bottom border. If True, agents die as soon as they touch the top or bottom border.
    "vector_state"      : True,     # whether to use vectorized state; if set to False, an image-based observation is provided instead.
    "use_typemasks"     : False,    # only relevant when vector_state=True is set, adds typemasks to the vectors.
    "sequence_space"    : False,    # experimental, only relevant when vector_state=True is set, removes non-existent entities in the vector state.
    "render_mode"       : None      # None, "human", "rgb_array"
    #"render_mode"       : "human"
}
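On the evaluation side for KAZ, my assumption is that the AEC env needs the same image preprocessing as the training env whenever vector_state is False; a minimal sketch of what I mean, mirroring the training wrappers above:

env = env_fn.env(**env_kwargs)

# Mirror the training-side preprocessing so observations match what the model saw
visual_observation = not env.unwrapped.vector_state
if visual_observation:
    env = ss.color_reduction_v0(env, mode="B")
    env = ss.resize_v1(env, x_size=84, y_size=84)
    env = ss.frame_stack_v1(env, 3)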

Similarly, for Cooperative Pong I followed the same procedure as above, except I modified the script to accommodate the Cooperative Pong implementation; the changes are mainly in the env creation:

env = env_fn.parallel_env(**env_kwargs)
env = ss.color_reduction_v0(env, mode='B')
env = ss.resize_v1(env, x_size=84, y_size=84)
env = ss.frame_stack_v1(env, 3)
    
env.reset(seed=seed)

print(f"Starting training on {str(env.metadata['name'])}.")

env = ss.pettingzoo_env_to_vec_env_v1(env)
#WARNING: More than 2 envs uses 100% RAM
env = ss.concat_vec_envs_v1(env, 2, num_cpus=2, base_class="stable_baselines3")

and the env configuration:

env_fn = cooperative_pong_v5
env_kwargs = {
    "ball_speed"        : 9,            # speed of the ball (in pixels)
    "left_paddle_speed" : 12,           # speed of the left paddle (in pixels)
    "right_paddle_speed": 12,           # speed of the right paddle (in pixels)
    "cake_paddle"       : True,         # if True, the right paddle takes the shape of a 4-tiered wedding cake
    "max_cycles"        : 900,          # after max_cycles steps all agents will return done
    "bounce_randomness" : False,        # if True, each collision of the ball with the paddles adds a small random angle to the direction of the ball, with the speed of the ball remaining unchanged.
    "max_reward"        : 100,          # total reward given to each agent over max_cycles timesteps
    "off_screen_penalty": -10,          # negative reward penalty for each agent if the ball goes off the screen
    "render_mode"       : "human"
}
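As with KAZ, I assume the Cooperative Pong evaluation env needs the same image preprocessing that the training env gets; a minimal sketch:

env = env_fn.env(**env_kwargs)

# Mirror the training-side preprocessing on the evaluation (AEC) env
env = ss.color_reduction_v0(env, mode="B")
env = ss.resize_v1(env, x_size=84, y_size=84)
env = ss.frame_stack_v1(env, 3)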

Pistonball is actually where I started, but I stopped working with it shortly after, since the associated tutorial script itself states that it is broken and does not work.

As far as I can tell, the implementations work fine during training, but once I load the model and try to evaluate it, the actions are stuck at some extreme or constant value, which causes the agents to freeze up.

Here is the code I use for evaluation; it is the same across ALL my implementations, and I am assuming THIS is probably where the problem lies, though I may be wrong...

def evaluate(env_fn, num_games: int = 10, **env_kwargs):
    env = env_fn.env(**env_kwargs)

    print(f"\nStarting evaluation on {model_dir}/{model_file_name} (num_games={num_games})")

    try:
        latest_policy = max(glob.glob(f"{model_dir}/{model_file_name}.zip"), key=os.path.getctime)
    except ValueError:
        print("Policy not found.")
        exit(0)

    model = PPO.load(latest_policy)

    rewards = {agent: 0 for agent in env.possible_agents}

    # Note: We train using the Parallel API but evaluate using the AEC API
    # SB3 models are designed for single-agent settings; we get around this by using the same model for every agent
    for i in range(num_games):
        env.reset(seed=i)

        for agent in env.agent_iter():
            obs, reward, termination, truncation, info = env.last()

            for a in env.agents:
                rewards[a] += env.rewards[a]
            if termination or truncation:
                break
            else:
                act = model.predict(obs, deterministic=True)[0]
                print(act)

            env.step(act)
    env.close()

    avg_reward = sum(rewards.values()) / len(rewards.values())
    print("Rewards: ", rewards)
    print(f"Avg reward: {avg_reward}")
    return avg_reward
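One cross-check I am considering (only a sketch, not something I have verified) is to evaluate through the same Parallel API and SuperSuit vectorization used for training, to see whether the constant actions also appear outside of the AEC loop; the evaluate_vectorized name below is my own:

def evaluate_vectorized(env_fn, steps: int = 1000, **env_kwargs):
    # Build the env exactly like in train(), but with a single copy
    env = env_fn.parallel_env(**env_kwargs)
    env = ss.pettingzoo_env_to_vec_env_v1(env)
    env = ss.concat_vec_envs_v1(env, 1, num_cpus=1, base_class="stable_baselines3")

    model = PPO.load(f"{model_dir}/{model_file_name}")

    obs = env.reset()
    for _ in range(steps):
        act, _ = model.predict(obs, deterministic=True)
        print(act)
        obs, rewards, dones, infos = env.step(act)
    env.close()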

So that's what I need help with: what am I doing wrong that, when I evaluate any environment other than Waterworld, the agents' actions are degenerate and cause the agents to effectively freeze?

Does anyone know how to properly evaluate any of my remaining target environments? Or have any advice? Or can you point me to any other resources for setting up these multi-agent, parallel environments?
