StableBaselines3: Significant Difference between Evaluation and Testing Reward


I am using stable_baselines3 to train a model to play the Atari Breakout game. For training efficiency, I am using a vectorized environment with 4 parallel games. Here is the training code:

from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import SubprocVecEnv, VecFrameStack
from stable_baselines3 import A2C

# train model
train_env = make_atari_env("Breakout-v4", n_envs=4, seed=0, vec_env_cls=SubprocVecEnv)
train_env = VecFrameStack(train_env, n_stack=4)

model = A2C('CnnPolicy', train_env, verbose=1)
model.learn(total_timesteps=200_000)

When evaluating the resulting model with evaluate_policy, I get acceptable results of 10.7 reward points on average. For evaluation I use only one environment, wrapped with the same frame stacking so that it is compatible with the model:

from stable_baselines3.common.evaluation import evaluate_policy

eval_env = make_atari_env("Breakout-v4", n_envs=1, seed=0, vec_env_cls=SubprocVecEnv)
eval_env = VecFrameStack(eval_env, n_stack=4)

mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, render=False)
print("Mean reward: {} +/- {}".format(mean_reward, std_reward))
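
As a sanity check that the frame stacking really makes the two environments compatible, the observation spaces can be compared directly; the shape in the comment below is what I would expect assuming the default Atari preprocessing of make_atari_env (84x84 grayscale frames):

# both environments should expose the same stacked observation space
print(train_env.observation_space)  # e.g. Box(0, 255, (84, 84, 4), uint8) with default preprocessing
print(eval_env.observation_space)   # should match the training observation space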

My problem is that when I run the model on the same evaluation environment without evaluate_policy, to simulate an actual application of the model to the game, I get really bad results, mostly 0 reward points:

# manually roll out 10 episodes and accumulate the reward per episode
for i in range(10):
  obs = eval_env.reset()
  done = False
  score = 0
  while not done:
    action, _ = model.predict(obs)
    obs, reward, done, info = eval_env.step(action)
    score += reward
  print("Episode:{} Score:{}".format(i, score))

Any idea what I am doing wrong here? It seems like the model is not handling the environment correctly in the latter case.
