So I have created a custom environment using OpenAI Gym. I'm closely following the keras-rl DQNAgent example for CartPole, which leads to the following implementation:
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

# env (my custom Gym environment) and ENV_NAME are defined earlier.
nb_actions = env.action_space.n
# Option 1 : Simple model
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
# Option 2 : Deeper model (from the keras-rl CartPole example)
#model = Sequential()
#model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
#model.add(Dense(16))
#model.add(Activation('relu'))
#model.add(Dense(16))
#model.add(Activation('relu'))
#model.add(Dense(16))
#model.add(Activation('relu'))
#model.add(Dense(nb_actions))
#model.add(Activation('linear'))
# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and even the metrics!
memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10, target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
# Okay, now it's time to learn something! Visualizing the training slows it down quite a lot, so it is disabled here. You can always safely abort the training prematurely using Ctrl + C.
dqn.fit(env, nb_steps=2500, visualize=False, verbose=2)
# After training is done, we save the final weights.
dqn.save_weights('dqn_{}_weights.h5f'.format(ENV_NAME), overwrite=True)
# Finally, evaluate our algorithm for 10 episodes.
dqn.test(env, nb_episodes=10, visualize=False)
So everything looks as I would expect up until the dqn.test function call. Sample output from the dqn.fit is as follows:
... 1912/2500: episode: 8, duration: 1.713s, episode steps: 239, steps per second: 139, episode reward: -78.774, mean reward: -0.330 [-27928.576, 18038.443], mean action: 0.657 [0.000, 2.000], mean observation: 8825.907 [5947.400, 17211.920], loss: 7792970.500000, mean_absolute_error: 653.732361, mean_q: 1.000000
2151/2500: episode: 9, duration: 1.790s, episode steps: 239, steps per second: 134, episode reward: -23335.055, mean reward: -97.636 [-17918.534, 17819.400], mean action: 0.636 [0.000, 2.000], mean observation: 8825.907 [5947.400, 17211.920], loss: 8051206.500000, mean_absolute_error: 676.335266, mean_q: 1.000000
2390/2500: episode: 10, duration: 1.775s, episode steps: 239, steps per second: 135, episode reward: 16940.150, mean reward: 70.879 [-25552.948, 17819.400], mean action: 0.611 [0.000, 2.000], mean observation: 8825.907 [5947.400, 17211.920], loss: 8520963.000000, mean_absolute_error: 690.176819, mean_q: 1.000000
Since the episode rewards vary, it appears to me that the fitting is working as expected. But when dqn.test is run, it keeps generating the same output for every episode. With the data I'm using, negative rewards are bad and positive rewards are good.
Here is the result of the test method being run:
Testing for 10 episodes
- Episode 1: reward: -62996.100, steps: 239
- Episode 2: reward: -62996.100, steps: 239
- Episode 3: reward: -62996.100, steps: 239
- Episode 4: reward: -62996.100, steps: 239
- Episode 5: reward: -62996.100, steps: 239
- Episode 6: reward: -62996.100, steps: 239
- Episode 7: reward: -62996.100, steps: 239
- Episode 8: reward: -62996.100, steps: 239
- Episode 9: reward: -62996.100, steps: 239
- Episode 10: reward: -62996.100, steps: 239
This leads me to two questions:
1) Why is the reward identical for every test episode?
2) Why might the model be recommending a set of actions that lead to terrible rewards?
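Regarding 1), one thing I can think to check is whether the environment always resets to the same initial state: as far as I understand, dqn.test picks actions greedily by default, so a deterministic environment plus a greedy policy would replay the exact same trajectory every time. A rough sketch of that check (assuming the classic Gym step API that returns obs, reward, done, info; this is not the code I actually ran):

import numpy as np

# Compare two resets and two fixed-action rollouts; identical results would
# mean the env is deterministic, in which case a greedy test policy will
# reproduce the same episode over and over.
obs_a = env.reset()
obs_b = env.reset()
print('same initial observation:', np.allclose(obs_a, obs_b))

totals = []
for _ in range(2):
    env.reset()
    done = False
    total = 0.0
    while not done:
        obs, reward, done, info = env.step(0)  # same fixed action every step
        total += reward
    totals.append(total)
print('fixed-action episode rewards:', totals)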
I would check the env object and verify that its reward computation behaves the way you expect.
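For example, something like this (a quick sketch, assuming the classic Gym step API that returns obs, reward, done, info) would show whether the reward actually responds to different behaviour:

for episode in range(3):
    obs = env.reset()
    done = False
    total = 0.0
    while not done:
        # Take random actions purely to probe the reward signal.
        obs, reward, done, info = env.step(env.action_space.sample())
        total += reward
    print('random-policy episode reward:', total)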
I am wondering if the .fit call is failing to explore the state space for some reason.
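One way to rule that out would be to swap the Boltzmann policy for keras-rl's annealed epsilon-greedy policy during fit, so the agent starts out almost fully random. The values below are only a sketch, not something tuned for this environment:

from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

# Anneal eps from 1.0 down to 0.1 over training so early steps are mostly random.
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1.0,
                              value_min=0.1, value_test=0.05, nb_steps=2500)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=10, target_model_update=1e-2, policy=policy)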
I recently did an RL project (lunar lander) with OpenAI Gym and Keras, although I didn't use the DQNAgent or the other built-in keras-rl components; I simply built a simple feedforward network. Check this GitHub link and see if it's helpful: https://github.com/tianchuliang/techblog/tree/master/OpenAIGym