RAM Usage keeps growing while training reinforcement learning agent


The other day I started training my Atari Breakout reinforcement learning agent, but after an hour and a half or so my screen started freezing and it became very difficult to interact with the computer via the mouse.

So I decided I'd rerun the program, but monitor the system components this time. One thing I noticed was that RAM usage would continue to grow the longer the program ran. My first suspect was the replay buffer, so I spent a considerable amount of time reducing its memory requirements, but I got the same result. To investigate further, I cut off any additions to the replay buffer after 50,000 experiences to see if RAM usage continued to grow; it did. I eventually narrowed it down to this section of code:

def get_gradients(self, target_q_values, importance, states, actions):
    with tf.GradientTape() as tape:
        q_values_current_state_dqn = self.dqn_architecture(states)
        one_hot_actions = tf.keras.utils.to_categorical(actions, self.num_legal_actions, dtype=np.float32) # e.g. [[0,0,1,0],[1,0,0,0],...]
        Q = tf.reduce_sum(tf.multiply(q_values_current_state_dqn, one_hot_actions), axis=1)
        error = Q - tf.cast(target_q_values, tf.float32)
        loss = tf.keras.losses.Huber()(target_q_values, Q)

        if self.use_prioritized_experience_replay:
            loss = tf.reduce_mean(loss * importance) # Gradient is scaled -> loss is lower at the beginning -> reduces bias against situations that are sampled more frequently

    dqn_architecture_gradients = tape.gradient(loss, self.dqn_architecture.trainable_variables) # Computes the gradient using operations recorded in the context of this tape
    self.dqn_architecture.optimizer.apply_gradients(zip(dqn_architecture_gradients, self.dqn_architecture.trainable_variables))
    return loss, error

It should be noted that I also saw the following show up in the logs:

2023-02-16 22:48:32,045 5 out of the last 5 calls to <function Agent.get_gradients at 0x7fb3ec66e830> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
2023-02-16 22:48:32,217 6 out of the last 6 calls to <function Agent.get_gradients at 0x7fb3ec66e830> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
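From what I understand of that warning, every call made with new Python values (rather than tensors), or with tensors of new shapes, builds and caches a brand new graph. A toy example (not my actual code) shows the behaviour:

import tensorflow as tf

@tf.function
def square(x):
    print("tracing")  # only runs while a new graph is being traced
    return x * x

square(2)               # traces
square(3)               # traces again: each new Python scalar builds a new graph
square(tf.constant(2))  # traces once for this dtype/shape
square(tf.constant(3))  # no retrace: the existing graph is reused

If get_gradients is retracing on every call like the log suggests, each trace presumably leaves another graph in memory, which would line up with the steady RAM growth.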

The get_gradients function outlined above is called from this function:

def train_network(self, batch_size, gamma, frame_number, priority_scale):
    importance = 0
    if self.use_prioritized_experience_replay:
        (states, actions, rewards, new_states, terminal_flags), importance, indices = self.replay_buffer.sample_buffer(self.batch_size, priority_scale)
        importance = importance ** (1-self.calculate_epsilon(frame_number)) # early in training: low frame number = high epsilon = low power = largely decreased importance; later in training importance is only slightly decreased. Increases the importance of newer frames
    else:
        states, actions, rewards, new_states, terminal_flags = self.replay_buffer.sample_buffer(self.batch_size, priority_scale)
    
    best_action_in_next_state_dqn = self.dqn_architecture.predict(new_states, verbose=0).argmax(axis=1)
    target_q_network_q_values = self.target_dqn_architecture.predict(new_states, verbose=0)
    optimal_q_value_in_next_state_target_dqn = target_q_network_q_values[range(batch_size), best_action_in_next_state_dqn]
    target_q_values = rewards + (gamma*optimal_q_value_in_next_state_target_dqn * (1-terminal_flags)) # makes 0 if terminal flag set
    # Calculate loss and perform gradient descent
    # TensorFlow "records" relevant operations executed inside the context of a tf.GradientTape onto a "tape". TensorFlow then uses that tape to compute the gradients of a "recorded" computation using reverse-mode differentiation.
    loss, error = self.get_gradients(target_q_values, importance, states, actions)
    
    if self.use_prioritized_experience_replay:
        self.replay_buffer.set_priorities(indices, error)
        
    return float(loss.numpy()), error
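As an aside on the tf.GradientTape comment above, the basic pattern it describes looks like this (a standalone toy example, not my training code):

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x * x                     # operations on watched variables are recorded on the tape
dy_dx = tape.gradient(y, x)       # reverse-mode differentiation: dy/dx = 2x = 6.0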

And that train_network function is called from the following loop:

while frame_number < NUM_FRAMES_AGENT_TRAINED_OVER:
    breakout_environment.reset_env()
    episode_reward_sum = 0
    for _ in range(MAX_EPISODE_LENGTH):
        # Get action
        action = breakout_agent.take_action(frame_number, breakout_environment.state)

        # Take step
        frame, reward, terminal, life_lost = breakout_environment.step(action)
        frame_number += 1
        episode_reward_sum += reward

        # Add experience to replay memory (action, frame, reward, terminal, clip_reward)
        breakout_agent.add_experience_to_replay_buffer(action, frame[:, :, 0], reward, life_lost, CLIP_REWARD)

        # Train the network every 4 additions to the replay buffer
        if frame_number % UPDATE_FREQUENCY == 0 and breakout_agent.replay_buffer.total_indexes_written_to > REPLAY_BUFFER_START_SIZE:
            loss, _ = breakout_agent.train_network(BATCH_SIZE, DISCOUNT_FACTOR, frame_number, PRIORITY_SCALE) # batch_size, gamma, frame_number, priority_scale
            loss_list.append(loss)

        # Update target network
        if frame_number % TARGET_UPDATE_FREQ == 0 and frame_number > REPLAY_BUFFER_START_SIZE:
            breakout_agent.update_target_network()

        # Break the loop when the game is over
        if terminal:
            break
    rewards_list.append(episode_reward_sum)

Any help would be greatly appreciated.

EDIT: On further investigation, I found a question on Stack Overflow that stated: 'Passing python scalars or lists as arguments to tf.function will always build a new graph. To avoid this, pass numeric arguments as Tensors whenever possible.' So I need to convert the optimizer.apply_gradients arguments from Python lists to TensorFlow tensors or another TensorFlow data type. As the lists are of varying dimensionality with varying nesting depths, I can't use tf.convert_to_tensor or tf.ragged.constant.
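For reference, the kind of change the tf.function guide seems to point at is pinning the traced signature and passing tensors into get_gradients, roughly like this. This is only a sketch; the shapes and dtypes are assumptions based on my setup (batches of 84x84x4 stacked Atari frames), and I haven't confirmed yet that it stops the leak:

@tf.function(input_signature=[
    tf.TensorSpec(shape=[None], dtype=tf.float32),             # target_q_values
    tf.TensorSpec(shape=[None], dtype=tf.float32),             # importance (assumed to be per-sample weights)
    tf.TensorSpec(shape=[None, 84, 84, 4], dtype=tf.float32),  # states (assumed frame-stack shape)
    tf.TensorSpec(shape=[None], dtype=tf.int32),               # actions
])
def get_gradients(self, target_q_values, importance, states, actions):
    ...  # body unchanged

and at the call site in train_network:

loss, error = self.get_gradients(
    tf.convert_to_tensor(target_q_values, dtype=tf.float32),
    tf.convert_to_tensor(importance, dtype=tf.float32),
    tf.convert_to_tensor(states, dtype=tf.float32),
    tf.convert_to_tensor(actions, dtype=tf.int32),
)

With tensor inputs, tf.keras.utils.to_categorical would presumably also have to be swapped for tf.one_hot(actions, self.num_legal_actions) inside the traced function, since the former works on NumPy arrays. The warning above also mentions @tf.function(reduce_retracing=True) as a lighter-weight alternative.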
