Q values overshoot in Double Deep Q Learning


I am trying to teach an agent to play the ATARI Space Invaders video game, but my Q values overshoot. I have clipped positive rewards to 1 (the agent also receives -1 for losing a life), so the maximum expected return should be around 36 (maybe I am wrong about this). I have also implemented the Huber loss. I have noticed that once the Q values start overshooting, the agent stops improving (the reward stops increasing).
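
For reference, this is roughly what I mean by the reward clipping and the Huber loss (a minimal numpy sketch for illustration, not my exact code; the function names are placeholders):

    import numpy as np

    def clip_positive_reward(reward):
        # positive game rewards are clipped to +1; the -1 for losing a life
        # is added separately (in getCustomReward)
        return min(float(reward), 1.0)

    def huber_loss(td_error, delta=1.0):
        # quadratic for |td_error| <= delta, linear beyond that,
        # so large TD errors do not produce huge gradients
        abs_err = np.abs(td_error)
        quadratic = np.minimum(abs_err, delta)
        linear = abs_err - quadratic
        return 0.5 * quadratic ** 2 + delta * linear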

Code can be found here

Plots can be found here

Note: I have binarized the frames so that I can use a bigger replay buffer (my replay buffer size is 300 000, which is 3 times smaller than in the original paper).
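
The binarized frames are packed with numpy.packbits before they go into the replay buffer, so each pixel takes 1 bit instead of 1 byte. A simplified sketch of the idea (packState is the helper in my code; unpackState is just an illustrative name for the inverse):

    import numpy as np

    STATE_SHAPE = (84, 84, 4)  # RESOLUTION x RESOLUTION x NUM_OF_FRAMES

    def packState(state):
        # the state contains only 0/1 values, so 8 pixels fit into one byte
        return np.packbits(state.astype(np.uint8))

    def unpackState(packed):
        # inverse operation, used when sampling from the replay buffer
        bits = np.unpackbits(packed)[:np.prod(STATE_SHAPE)]
        return bits.reshape(STATE_SHAPE).astype(np.float32)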

EDIT: I have binarized the frames so that I can use 1 bit (instead of 8 bits) to store each pixel of the image in the replay buffer, using the numpy.packbits function (roughly as sketched above). That way I can use an 8 times bigger replay buffer. I have checked whether the image is distorted after packing it with packbits, and it is NOT, so sampling from the replay buffer works fine. This is the main loop of the code (maybe the problem is in there):

    frame_count = 0
    LIFE_CHECKPOINT = 3
    for episode in range(EPISODE,EPISODES):
        # reset the environment and init variables
        frames, _, _ = space_invaders.resetEnv(NUM_OF_FRAMES)
        state = stackFrames(frames)
        done = False
        episode_reward = 0
        episode_reward_clipped = 0
        frames_buffer = frames # contains preprocessed frames (not stacked)
        while not done:
            if (episode % REPORT_EPISODE_FREQ == 0):
                space_invaders.render()
            # select an action from behaviour policy
            action, Q_value, is_greedy_action = self.EGreedyPolicy(Q, state, epsilon, len(ACTIONS))
            # perform action in the environment
            observation, reward, done, info = space_invaders.step(action)
            episode_reward += reward # update episode reward
            reward, LIFE_CHECKPOINT = self.getCustomReward(reward, info, LIFE_CHECKPOINT)
            episode_reward_clipped += reward
            frame = preprocessFrame(observation, RESOLUTION)
            # pop first frame from the buffer, and add new at the end (s1=[f1,f2,f3,f4], s2=[f2,f3,f4,f5])
            frames_buffer.append(frame) 
            frames_buffer.pop(0)
            new_state = stackFrames(frames_buffer)
            # add (s,a,r,s') tuple to the replay buffer
            replay_buffer.add(packState(state), action, reward, packState(new_state), done)

            state = new_state # new state becomes current state
            frame_count += 1
            if (replay_buffer.size() > MIN_OBSERVATIONS): # if there is enough data in replay buffer
                Q_values.append(Q_value)
                if (frame_count % TRAINING_FREQUENCY == 0):
                    batch = replay_buffer.sample(BATCH_SIZE)
                    loss = Q.train_network(batch, BATCH_SIZE, GAMMA, len(ACTIONS))
                    losses.append(loss)
                    num_of_weight_updates += 1
                if (epsilon > EPSILON_END):
                    epsilon = self.decayEpsilon(epsilon, EPSILON_START, EPSILON_END, FINAL_EXPLORATION_STATE)
            if (num_of_weight_updates % TARGET_NETWORK_UPDATE_FREQ == 0) and (num_of_weight_updates != 0): # update weights of target network
                Q.update_target_network() 
                print("Target_network is updated!")
        episode_rewards.append(episode_reward)

I have also checked the Q.train_network and Q.update_target_network functions and they work fine.
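
For completeness, the target that Q.train_network is meant to compute is the standard Double DQN target: the online network selects the greedy action in the next state, and the target network evaluates it. Schematically (a simplified sketch with illustrative names, not my exact implementation):

    import numpy as np

    def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma):
        # the online network chooses the greedy action for each next state ...
        greedy_actions = np.argmax(online_net.predict(next_states), axis=1)
        # ... and the target network provides the value estimate for that action
        next_q = target_net.predict(next_states)
        next_values = next_q[np.arange(len(greedy_actions)), greedy_actions]
        # no bootstrapping for terminal transitions
        return rewards + gamma * (1.0 - dones.astype(np.float32)) * next_values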

I was wondering if the problem could be in the hyperparameters:

ACTIONS = {"NOOP":0,"FIRE":1,"RIGHT":2,"LEFT":3,"RIGHTFIRE":4,"LEFTFIRE":5}
NUM_OF_FRAMES = 4 # number of frames that make 1 state
EPISODES = 10000 # number of episodes
BUFFER_SIZE = 300000 # size of the replay buffer(can not put bigger size, RAM)
MIN_OBSERVATIONS = 30000
RESOLUTION = 84 # resolution of frames
BATCH_SIZE = 32
EPSILON_START = 1 # starting value for the exploration probability
EPSILON_END = 0.1 
FINAL_EXPLORATION_STATE = 300000 # final frame for which epsilon is decayed
GAMMA = 0.99 # discount factor
TARGET_NETWORK_UPDATE_FREQ = 10000 
REPORT_EPISODE_FREQ = 100
TRAINING_FREQUENCY = 4
OPTIMIZER = RMSprop(lr=0.00025,rho=0.95,epsilon=0.01)
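
For context, the EPSILON_* values describe a decay of the exploration probability from 1 to 0.1 over the first 300 000 steps; a sketch of a linear schedule matching these numbers (illustrative only):

    def linear_epsilon(step, epsilon_start=1.0, epsilon_end=0.1, final_exploration_step=300000):
        # decay epsilon linearly from epsilon_start to epsilon_end over the first
        # final_exploration_step steps, then keep it at epsilon_end
        fraction = min(step / final_exploration_step, 1.0)
        return epsilon_start + fraction * (epsilon_end - epsilon_start)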