Deep Reinforcement Learning 1-step TD not converging


Is there some trick to getting 1-step TD (temporal difference) prediction to converge with a neural net? The network is a simple feed-forward network with ReLU activations. I've got the network working for Q-learning in the following way:

  import numpy as np

  gamma = 0.9
  # Predicted Q-values of the successor state under each of the three actions
  q0 = model.predict(X0[times+1])
  q1 = model.predict(X1[times+1])
  q2 = model.predict(X2[times+1])
  # Rewards are negative, so negate them into costs and take the min over actions
  q_Opt = np.min(np.concatenate((q0, q1, q2), axis=1), axis=1)
  target = -np.array(rewards)[times] + gamma * q_Opt

Here X0, X1, and X2 are the MNIST image features with actions 0, 1, and 2 concatenated onto them, respectively. This method converges; a fuller, self-contained sketch of this setup follows for reference.
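
A minimal self-contained version of the working setup looks roughly like this (the layer sizes, synthetic data, and the choice of X here are illustrative stand-ins rather than my exact code):

  import numpy as np
  from tensorflow import keras

  # Synthetic stand-ins for the real data: flattened MNIST pixels + 1 appended action value
  n_samples, n_pixels = 1000, 784
  features = np.random.rand(n_samples, n_pixels).astype("float32")
  rewards = -np.random.rand(n_samples)   # rewards are negative
  times = np.arange(n_samples - 1)       # step indices that have a successor state

  # The same state features with each of the three candidate actions appended
  X0 = np.concatenate([features, np.full((n_samples, 1), 0.0, dtype="float32")], axis=1)
  X1 = np.concatenate([features, np.full((n_samples, 1), 1.0, dtype="float32")], axis=1)
  X2 = np.concatenate([features, np.full((n_samples, 1), 2.0, dtype="float32")], axis=1)
  X = X0  # features of the actions actually taken (placeholder in this sketch)

  # Simple feed-forward ReLU network producing a scalar Q-value
  model = keras.Sequential([
      keras.layers.Dense(128, activation="relu", input_shape=(n_pixels + 1,)),
      keras.layers.Dense(128, activation="relu"),
      keras.layers.Dense(1),
  ])
  model.compile(optimizer="adam", loss="mse")

  gamma = 0.9
  # Bootstrapped Q-values of the successor states under each action
  q0 = model.predict(X0[times + 1])
  q1 = model.predict(X1[times + 1])
  q2 = model.predict(X2[times + 1])
  # Rewards are negative, so negate them into costs and take the min over actions
  q_Opt = np.min(np.concatenate((q0, q1, q2), axis=1), axis=1).reshape(-1, 1)
  target = -np.array(rewards)[times].reshape(-1, 1) + gamma * q_Opt

  model.fit(X[times], target, batch_size=128, epochs=10, verbose=1)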

What I'm trying instead, which doesn't converge:

  # 1-step TD(0) prediction: bootstrap from the predicted value of the next state
  v_hat_next = model.predict(X[times+1])
  target = -np.array(rewards)[times] + gamma * v_hat_next

  history = model.fit(X[times], target, batch_size=128, epochs=10, verbose=1)

This method doesn't converge at all and in fact gives identical state values for every state. Any idea what I'm doing wrong? Is there some trick to setting up the target? The target is supposed to be $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$, and I thought that's what I've done here.
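
To make the question concrete, here is how I understand the target is supposed to be built, continuing from the sketch above. The dones mask for episode boundaries is an assumption on my part (my data may not even have terminal states), and the reshapes are only there to keep the target the same shape as in the Q-learning case:

  # \hat{v}(S_{t+1}) from the current network, flattened from (n, 1) to (n,)
  v_hat_next = model.predict(X[times + 1]).reshape(-1)

  # Hypothetical terminal mask: 1.0 where S_{t+1} ends an episode, else 0.0.
  # The bootstrap term is dropped at episode boundaries.
  dones = np.zeros_like(v_hat_next)

  # 1-step TD(0) target: R_{t+1} + gamma * \hat{v}(S_{t+1}, w)
  target = (-np.array(rewards)[times] + gamma * (1.0 - dones) * v_hat_next).reshape(-1, 1)

  history = model.fit(X[times], target, batch_size=128, epochs=10, verbose=1)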
