DQN: understanding the input and output layers


I have a question about the input and output layers of a DQN.

For example:

Two points: P1(x1, y1) and P2(x2, y2)

P1 has to walk towards P2

I have the following information:

  • Current position P1 (x/y)
  • Current position P2 (x/y)
  • Distance from P1 to P2 (x/y)
  • Direction from P1 to P2 (x/y)

P1 has 4 possible actions:

  • Up
  • Down
  • Left
  • Right

How do I have to set up the input and output layers?

  • 4 input nodes
  • 4 output nodes

Is that correct? What do I have to do with the output? I get 4 arrays with 4 values each as output. Is doing argmax on the output correct?

Edit:

Input / State:

import math
import numpy as np

# Current position P1
state_pos = [x_POS, y_POS]
state_pos = np.asarray(state_pos, dtype=np.float32)
# Current position P2
state_wp = [wp_x, wp_y]
state_wp = np.asarray(state_wp, dtype=np.float32)
# Distance P1 - P2
state_dist_wp = [wp_x - x_POS, wp_y - y_POS]
state_dist_wp = np.asarray(state_dist_wp, dtype=np.float32)
# Direction P1 - P2 (normalized distance vector)
distance = [wp_x - x_POS, wp_y - y_POS]
norm = math.sqrt(distance[0] ** 2 + distance[1] ** 2)
state_direction_wp = [distance[0] / norm, distance[1] / norm]
state_direction_wp = np.asarray(state_direction_wp, dtype=np.float32)
state = [state_pos, state_wp, state_dist_wp, state_direction_wp]
state = np.array(state)  # shape (4, 2)

Network:

def __init__(self):
    self.q_net = self._build_dqn_model()
    self.epsilon = 1 

def _build_dqn_model(self):
    q_net = Sequential()
    q_net.add(Dense(4, input_shape=(4,2), activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(4, activation='linear', kernel_initializer='he_uniform'))
    rms = tf.optimizers.RMSprop(learning_rate=1e-4)
    q_net.compile(optimizer=rms, loss='mse')
    return q_net

def random_policy(self, state):
    return np.random.randint(0, 4)

def collect_policy(self, state):
    if np.random.random() < self.epsilon:
        return self.random_policy(state)
    return self.policy(state)

def policy(self, state):
    # Here I get 4 arrays with 4 values each as output
    action_q = self.q_net(state)

2 Answers

Accepted Answer

Adding input_shape=(4,2) in the first Dense layer causes the output shape to be (None, 4, 4). Defining q_net the following way solves it:

# Reshape and Dense come from tensorflow.keras.layers
q_net = Sequential()
q_net.add(Reshape(target_shape=(8,), input_shape=(4,2)))
q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
q_net.add(Dense(4, activation='linear', kernel_initializer='he_uniform'))
rms = tf.optimizers.RMSprop(learning_rate=1e-4)
q_net.compile(optimizer=rms, loss='mse')
return q_net

Here, q_net.add(Reshape(target_shape=(8,), input_shape=(4,2))) reshapes the (None, 4, 2) input to (None, 8) [Here, None represents the batch shape].

To verify, print q_net.output_shape and it should be (None, 4) [Whereas in the previous case it was (None, 4, 4)].

You also need to do one more thing. Recall that input_shape does not take the batch dimension into account: input_shape=(4,2) expects inputs of shape (batch_shape, 4, 2). Verify this by printing q_net.input_shape; it should output (None, 4, 2). Now, what you have to do is add a batch dimension to your input. You can simply do the following:

state_with_batch_dim = np.expand_dims(state,0)

And pass state_with_batch_dim to q_net as input. For example, you can call the policy method you wrote like policy(np.expand_dims(state,0)) and get an output of dimension (batch_shape, 4) [in this case (1,4)].

And here are the answers to your initial questions:

  1. Your output layer should have 4 nodes (units).
  2. Your first dense layer does not necessarily have to have 4 nodes (units). If you consider the Reshape layer, the notion of nodes or units does not fit there. You can think of the Reshape layer as a placeholder that takes a tensor of shape (None, 4, 2) and outputs a reshaped tensor of shape (None, 8).
  3. Now you should get outputs of shape (None, 4); the 4 values are the Q-values of the 4 corresponding actions. The outputs themselves are already the Q-values, so no argmax is needed to obtain them; argmax only comes in when you pick the greedy action (see the sketch below).
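For completeness, here is a minimal sketch of how the question's policy method could select an action with the fixed network. This is my own illustration, not code from the answer; it assumes the q_net built with the Reshape layer above and reuses the (4, 2) state array from the question:

def policy(self, state):
    # state has shape (4, 2); add the batch dimension -> (1, 4, 2)
    state_input = np.expand_dims(state, 0)
    # Forward pass: one Q-value per action, output shape (1, 4)
    action_q = self.q_net(state_input)
    # Greedy action: index of the largest Q-value (0..3, in whatever action order you chose)
    return int(np.argmax(action_q.numpy()[0]))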
Second Answer

It could make sense to feed the DQN some information on the direction it's currently facing too. You could set it up as (Current Pos X, Current Pos Y, X From Goal, Y From Goal, Direction).

The output layer should just be (Up, Left, Down, Right) in an order you determine. Taking the argmax over those outputs is suitable for this problem. The exact code depends on whether you are using TF or PyTorch.
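As a rough illustration in TensorFlow/Keras (the 5-element state layout follows this answer's suggestion; the hidden-layer size and the example state values are assumptions, not part of the answer):

import numpy as np
import tensorflow as tf

# Assumed state layout: (pos_x, pos_y, x_from_goal, y_from_goal, facing_direction)
state = np.array([1.0, 2.0, 3.0, -1.0, 0.0], dtype=np.float32)

q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(5,)),
    tf.keras.layers.Dense(4, activation='linear'),  # Q-values for Up, Down, Left, Right
])

q_values = q_net(state[np.newaxis, :])        # shape (1, 4)
action = int(tf.argmax(q_values, axis=1)[0])  # greedy action index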