How to sample actions for a multi-dimensional continuous action space for REINFORCE algorithm


So, the problem that I am working on can be summarised like this:

  1. The observation space is an 8x1 vector of continuous values. Some of them are unbounded ([-inf, inf]) and some are in the range [-360, 360].

  2. The action space is a 4x1 vector. All the values are in the [-1, 1] range.

  3. Currently, I am trying to solve it with policy gradient methods, specifically the REINFORCE algorithm.

  4. Since the action space is continuous, the output layer gives me 4 mu and 4 sigma values, which I use as the parameters of a normal distribution to sample the actions from (see the first sketch after this list).

  5. I am using a neural network as the function approximator. The architecture is:

    • Input layer: 8 neurons
    • Hidden layer 1: 256 neurons with ReLU activation
    • Hidden layer 2: 256 neurons with ReLU activation
    • Output layer: 8 neurons:
      • 4 neurons for mu with no activation function, so the means can take any value in the range [-inf, inf]; after sampling the actions I clip their values to [-1, 1].
      • 4 neurons for sigma with an ELU + 0.001 activation to keep the standard deviation in the range [0.001, inf).
  6. My reward function (also sketched in code after this list) is such that, during an episode,

    • at each timestep when the agent is within a certain target zone it receives +6000 reward
    • each time step that it is not in the zone it gets -20
    • at the end of the episode if it is not inside the target zone it gets -20000
    • if it goes to a BAD state during the episode it receives -100000 reward and the episode ends immediately.
  7. The loss function is (a self-contained version is sketched after this list):

    • loss = - log_prob(action) * R
  8. The solution doesn't seem to converge: the mean values keep increasing and the sigma values get stuck at 0.001 (the minimum possible value for them). The questions that I want to ask are:

    • The way I am sampling the actions, is that correct?
    • Does the loss function look right?
    • Should I use a ReLU activation at the input layer as well? That does not sound right to me, but in some implementations of the PPO algorithm that I have seen, people use ReLU at the input as well.
  9. I can share the code as well, if that's what's needed to point out a problem.

  10. Any other suggestions are also welcome.
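
For concreteness, here is roughly what items 4 and 5 look like in code. This is a simplified PyTorch sketch rather than my exact implementation; the sigma head is written as a shifted ELU so that its output actually stays above 0.001, which is what the ELU + 0.001 activation is intended to achieve:

import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_SIZE, ACT_SIZE = 8, 4

class GaussianPolicy(nn.Module):
    # 8 -> 256 -> 256 -> 8; the first 4 outputs are the means, the last 4 parameterise the sigmas
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(OBS_SIZE, 256)
        self.fc2 = nn.Linear(256, 256)
        self.out = nn.Linear(256, 2 * ACT_SIZE)

    def forward(self, obs):
        x = F.relu(self.fc1(obs))
        x = F.relu(self.fc2(x))
        out = self.out(x)
        mu = out[..., :ACT_SIZE]                          # unbounded means
        sigma = F.elu(out[..., ACT_SIZE:]) + 1.0 + 0.001  # shifted ELU keeps sigma above 0.001
        return mu, sigma

# item 4: sample a 4-dimensional action from N(mu, sigma), then clip it to the action bounds
policy = GaussianPolicy()
obs = torch.randn(OBS_SIZE)                 # stand-in for a real observation
mu, sigma = policy(obs)
dist = torch.distributions.Normal(mu, sigma)
action = dist.sample().clamp(-1.0, 1.0)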
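
The reward scheme in item 6, written out as code, looks roughly like this; in_target_zone and is_bad_state are placeholder names for the actual checks in my environment:

def compute_reward(state, done):
    # rough sketch of the reward logic in item 6
    if is_bad_state(state):              # bad state: large penalty and the episode ends immediately
        return -100000, True
    reward = 6000 if in_target_zone(state) else -20
    if done and not in_target_zone(state):
        reward = -20000                  # terminal penalty for ending the episode outside the zone
    return reward, done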
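
And the loss term in item 7, in self-contained form (mu and sigma here stand in for the policy network outputs):

import torch

mu = torch.zeros(4, requires_grad=True)       # stand-ins for the policy network outputs
sigma = torch.ones(4, requires_grad=True)
dist = torch.distributions.Normal(mu, sigma)
action = dist.sample().clamp(-1.0, 1.0)       # the action that was actually taken
R = 1.0                                       # placeholder for the discounted (normalized) return

loss = -(dist.log_prob(action) * R).mean()    # REINFORCE: -log pi(a|s) * R, averaged over the 4 dims
loss.backward()                               # gradients flow back into mu and sigma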

Edit: By the way, I wrote my own gym environment for this problem, which is why you may notice some inconsistencies between my code and typical RL code for solving OpenAI gym problems. The main loop, which runs for TOTAL_EPISODES episodes, is:

from time import sleep

import gym
import torch

# Policy, TOTAL_EPISODES, ACTION_SPACE_SIZE, step_arr, etc. are defined elsewhere in the script.

def main():
    global num_episodes, normalized_reward_history, reward_history
    env = gym.make('gymXplane-v2')   # my custom environment

    pi = Policy()

    for n_epi in range(1, TOTAL_EPISODES + 1):
        env.reset()
        sleep(2)                     # give the simulator time to reset
        done = False
        initial_obs = env.getObservationSpace()
        obs = torch.from_numpy(initial_obs).float().squeeze()
        steps = 0

        while not done:
            steps += 1
            step_arr.append(steps)   # step counter, kept for logging

            # forward pass, then split the network output into means and sigmas
            output = pi.forward(obs, softmax_dim=1)
            means = output[:int(ACTION_SPACE_SIZE / 2)]
            sigs = output[int(ACTION_SPACE_SIZE / 2):]

            # sample an action from N(means, sigs) and clip it to the action bounds
            dists = torch.distributions.Normal(means, sigs)
            action_samples = dists.sample().clamp(-1.0, 1.0)

            obs_prime, reward, done, info = env.step(action_samples)
            pi.put_data(reward, (action_samples, dists))   # store reward, action and distribution for the update

            obs = torch.t(torch.from_numpy(obs_prime).float())[0]   # flatten the 8x1 observation

        pi.train_net()               # one REINFORCE update per episode
    env.close()

and pi.train_net() is as follows:

    def train_net(self):
        # GAMMA and np (numpy) are defined at module level
        global normalized_reward_history, reward_history
        self.optimizer.zero_grad()

        # compute the discounted return for every timestep of the episode
        cumulative = 0
        discounted_rewards = np.zeros(len(self.reward_arr))
        for t in reversed(range(len(self.reward_arr))):
            cumulative = cumulative * GAMMA + self.reward_arr[t]
            discounted_rewards[t] = cumulative

        # normalize the returns (zero mean, unit variance) to reduce gradient variance
        normalized_rewards = discounted_rewards - np.mean(discounted_rewards)
        std = np.std(discounted_rewards)
        if std != 0:
            normalized_rewards /= std

        data = list(zip(normalized_rewards, self.action_arr))

        # accumulate the REINFORCE gradient over all timesteps, then take a single optimizer step
        for R, action in data[::-1]:
            samples = action[0]   # the clipped action that was taken
            dists = action[1]     # the Normal distribution it was sampled from

            loss = -dists.log_prob(samples) * R
            loss = loss.mean()    # average over the 4 action dimensions
            loss.backward()       # gradients accumulate across iterations

        self.optimizer.step()