I am using stable_baselines3 to train a model to play the Atari Breakout game. For training efficiency, I am using a vectorized environment of 4 parallel games. Here is the training code:
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import SubprocVecEnv, VecFrameStack
from stable_baselines3 import A2C
# train model
train_env = make_atari_env("Breakout-v4", n_envs=4, seed=0, vec_env_cls=SubprocVecEnv)
train_env = VecFrameStack(train_env, n_stack=4)
model = A2C('CnnPolicy', train_env, verbose=1)
model.learn(total_timesteps=200_000)
When evaluating the resulting model with evaluate_policy, I get acceptable results, averaging 10.7 reward points. For evaluation I use only one environment, frame-stacked in the same way so its observations are compatible with the model:
from stable_baselines3.common.evaluation import evaluate_policy
eval_env = make_atari_env("Breakout-v4", n_envs=1, seed=0, vec_env_cls=SubprocVecEnv)
eval_env = VecFrameStack(eval_env, n_stack=4)
evaluate_policy(model, eval_env, n_eval_episodes=10, render=False)
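For reference, this is a minimal variant of the call above that prints the summary statistics (with its default arguments, evaluate_policy returns the mean and standard deviation of the episode rewards):
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, render=False)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")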
My problem is that when I apply the model to the same evaluation environment without evaluate_policy, to simulate an actual application of the model to the game, I get really bad results, mostly 0 reward points.
for i in range(0, 10):
    obs = eval_env.reset()
    done = False
    score = 0
    while not done:
        action, _ = model.predict(obs)
        obs, reward, done, info = eval_env.step(action)
        score += reward
    print("Episode:{} Score:{}".format(i, score))
Any idea what I am doing wrong here? It seems like the model is not interacting with the environment correctly in the latter approach.