I'm currently working on implementing Q-learning for the FrozenLake-v1 environment in Gymnasium (the maintained fork of OpenAI Gym). However, my Q-table doesn't seem to update during training; it remains filled with zeros. I've reviewed my code multiple times, but I can't pinpoint the issue.
Here's the code I'm using:
import gymnasium as gym
import numpy as np
import random
def run():
    env = gym.make("FrozenLake-v1")  # set up the environment
    Q = np.zeros((env.observation_space.n, env.action_space.n))  # empty Q-table
    alpha = 0.7
    gamma = 0.95
    epsilon = 0.9
    epsilon_decay = 0.005
    epsilon_min = 0.01
    episode = 0
    episodes = 10000
    state, info = env.reset()
    print("Before training")
    print(Q)
    while episode < episodes:
        if epsilon > epsilon_min:
            epsilon -= epsilon_decay
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        new_state, reward, terminated, truncated, info = env.step(action)
        Q[state, action] = Q[state, action] + alpha * (float(reward) + gamma * np.max(Q[new_state]) - Q[state, action])
        state = new_state
        if terminated or truncated:
            episode += 1
            state, info = env.reset()  # reset the environment
    print("After training")
    print(Q)
    env.close()
run()
I suspect the issue might be related to how I'm updating the Q-table or handling the environment states. Any help in identifying and resolving the problem would be greatly appreciated.
I added print statements to display intermediate values, including the selected actions, rewards, and the Q-table itself during training, to check whether the values were updating as expected. I also tried training the agent with a smaller number of episodes to simplify the problem and observe whether the Q-table started updating; even with a reduced number of episodes, it remains filled with zeros. Finally, I revisited the Q-table update formula to make sure it aligns with the Q-learning algorithm. The formula seems correct, but the issue persists.
I expected the Q-table to gradually update during training, reflecting the agent's learned values for state-action pairs. However, the Q-table remains unchanged, filled with zeros even after running the training loop for the specified number of episodes.
The issue is due to a combination of two problems.

First, if there are multiple maximum values in an array, np.argmax returns the index of the first occurrence. Initially, all values in the Q-table are 0, so whenever you take an exploitation step, you pick action 0, which in FrozenLake is 'move left'.

Second, all rewards are zero except for reaching the goal state, so the Q-table only starts to contain non-zero values after you first reach the goal (and receive a reward of 1). It is very unlikely that your agent will find the goal state in the first few hundred episodes, and since epsilon decays to 0.01 quickly, you are taking exploitation steps (i.e. moving left) most of the time, receiving rewards of 0, and never making a meaningful update to the Q-table.
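For illustration (this snippet is not part of the original code), an all-zero Q-table row always maps to action 0:

import numpy as np
# np.argmax breaks ties by returning the first index of the maximum
print(np.argmax(np.zeros(4)))  # -> 0, which is the 'move left' action in FrozenLake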
Instead of np.argmax, I suggest using a helper function that returns a random index among those where the maximum value occurs, so ties are broken randomly. Also, a slower epsilon decay is more sensible: epsilon should reach its minimum value only around halfway through training, so the agent keeps exploring long enough to find the goal in the first place.
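A minimal sketch of such a tie-breaking function (the name random_argmax is mine):

import numpy as np

def random_argmax(values):
    # all indices where the maximum value occurs
    best = np.flatnonzero(values == np.max(values))
    # break ties randomly instead of always taking the first index
    return np.random.choice(best)

In the training loop, the exploitation step then becomes action = random_argmax(Q[state]) instead of action = np.argmax(Q[state]).

For the epsilon schedule, one possible choice (illustrative values, and assuming epsilon is decayed once per episode, e.g. inside the if terminated or truncated: branch, rather than on every environment step) is:

episodes = 10000
epsilon = 1.0
epsilon_min = 0.01
# linear decay that reaches epsilon_min after roughly episodes / 2 episodes
epsilon_decay = (epsilon - epsilon_min) / (episodes / 2)  # ~0.0002 per episode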