Using Stable Baselines3 on PettingZoo MPE Simple Spread


So I created a Stable Baselines3 model using A2C to train the Simple Spread environment from PettingZoo (https://pettingzoo.farama.org/environments/mpe/simple_spread/). I followed the SB3 tutorial provided by PettingZoo, but I never get a reward value higher than 0, and even after training the average reward stays below -300 over about 10 episodes. I want to ask why this is happening, because not getting a single positive reward seems very strange; even with random behaviour in other environments you can get better rewards. Anyway, here is my implementation of the model:

    import time

    import supersuit as ss
    from stable_baselines3 import A2C
    from stable_baselines3.a2c import MlpPolicy
    from supersuit.vector import MarkovVectorEnv


    def train_model(env_fn, steps: int = 10_000, seed: int = 0, **env_kwargs):
        env = env_fn.parallel_env(**env_kwargs)
        env.reset(seed=seed)

        print(f"Starting training on {env.metadata['name']}.")

        # Convert the parallel PettingZoo env into a vector env SB3 can consume
        # (this is what the commented-out ss.pettingzoo_env_to_vec_env_v1 wrapper does).
        # env = ss.pettingzoo_env_to_vec_env_v1(env)
        env = MarkovVectorEnv(env)
        env = ss.concat_vec_envs_v1(env, 2, num_cpus=2, base_class="stable_baselines3")

        policy_kwargs = {"net_arch": [128, 128]}
        model = A2C(
            MlpPolicy,
            env,
            verbose=1,
            learning_rate=0.002,
            gamma=0.99,
            ent_coef=0.03,
            policy_kwargs=policy_kwargs,
        )

        model.learn(total_timesteps=steps)

        model.save(f"{env.unwrapped.metadata.get('name')}_{time.strftime('%Y%m%d-%H%M%S')}")
        print("Model has been saved.")
        print(f"Finished training on {env.unwrapped.metadata['name']}.")

        env.close()
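
For context, a call to this function looks roughly like the sketch below; the environment kwargs are just the documented defaults of simple_spread_v3, nothing I tuned specifically:

    from pettingzoo.mpe import simple_spread_v3

    # Roughly how the training function above is invoked; the env kwargs
    # are simply the simple_spread_v3 defaults, shown here for clarity.
    train_model(
        simple_spread_v3,
        steps=100_000,
        seed=0,
        N=3,
        local_ratio=0.5,
        max_cycles=25,
        continuous_actions=False,
    )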

Any help or guidance would be appreciated.

I tried customizing the policy network and even the feature extractor (https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html), but the reward still didn't improve. As for hyperparameters, I have tried pretty much everything: different step counts, different model settings, even switching to PPO, and nothing worked. I also looked at the Simple Spread reward function, shown below. As far as I can tell it only ever subtracts from the reward and never adds anything, which doesn't feel right to me, but I am fairly new to this so I am not sure:

    def reward(self, agent, world):
        # Agents are rewarded based on minimum agent distance to each landmark, penalized for collisions
        rew = 0
        for l in world.landmarks:
            dists = [np.sqrt(np.sum(np.square(a.state.p_pos - l.state.p_pos))) for a in world.agents]
            rew -= min(dists)
        if agent.collide:
            for a in world.agents:
                if self.is_collision(a, agent):
                    rew -= 1
        return rew
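
From that function, the per-step reward is at best 0 (every landmark covered exactly, no collisions) and otherwise negative, so the episode return can never be positive; if training works, it should just move closer to 0. For reference, this is roughly how I measure the average reward over 10 episodes, adapted from the same PettingZoo SB3 tutorial (the eval_model name and the glob pattern for finding the latest saved model are just my own choices):

    import glob
    import os

    from stable_baselines3 import A2C


    def eval_model(env_fn, num_games: int = 10, **env_kwargs):
        # Evaluate with the AEC API, as in the PettingZoo SB3 tutorial.
        env = env_fn.env(**env_kwargs)

        # Pick up the most recently saved model; assumes the naming scheme
        # used in train_model above (SB3 appends ".zip" on save).
        latest_policy = max(
            glob.glob(f"{env.metadata['name']}*.zip"), key=os.path.getctime
        )
        model = A2C.load(latest_policy)

        rewards = {agent: 0.0 for agent in env.possible_agents}
        for i in range(num_games):
            env.reset(seed=i)
            for agent in env.agent_iter():
                obs, reward, termination, truncation, info = env.last()
                rewards[agent] += reward
                if termination or truncation:
                    action = None
                else:
                    action, _ = model.predict(obs, deterministic=True)
                env.step(action)
        env.close()

        avg_reward = sum(rewards.values()) / (num_games * len(rewards))
        print(f"Average per-agent reward over {num_games} games: {avg_reward}")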