To my knowledge, deep Q-learning with gradient descent follows this process:

1. Initialise random weights and biases.
2. Take an action from the starting state.
3. Determine the reward.
4. Update the weights and biases for the current time step through gradient descent.
5. Update the weights and biases for previous time steps through gradient descent, but use reward * discount factor ^ (steps away) instead of just the reward (see the sketch after this list).
6. Repeat from step 2.
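To make this concrete, here is a rough sketch of the update pattern I have in mind, using a toy tabular stand-in for the network so the growth in per-step work is easy to see; all the names here are illustrative, not from any library:

```python
gamma = 0.99      # discount factor
alpha = 0.1       # learning rate
q_table = {}      # (state, action) -> value estimate, stand-in for the network
history = []      # every (state, action) visited so far this episode

def update_toward(s, a, target):
    """One 'gradient' step nudging the estimate toward the target."""
    old = q_table.get((s, a), 0.0)
    q_table[(s, a)] = old + alpha * (target - old)

def on_step(state, action, reward):
    history.append((state, action))
    # Steps 4-5 above: update the current time step toward the reward, and
    # every previous time step toward reward * gamma^(steps away).
    # At time step t this loop runs t + 1 times, so the work per step grows
    # without bound if the episode never ends.
    for k, (s, a) in enumerate(reversed(history)):
        update_toward(s, a, reward * gamma ** k)
```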
Over an infinite time horizon, this should mean that each step's update pushes the weights and biases toward a target of current reward + expected future return * discount factor, matching the Bellman equation. However, with this approach, at each step we would need to make a number of updates equal to the number of steps up to and including the current one. In a non-terminating case of deep Q-learning, this should (to my knowledge) lead to an unbounded amount of required processing time.
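Written out, the target I mean for each state-action pair is the Bellman form (with $\gamma$ the discount factor):

$$Q(s_t, a_t) \;\approx\; r_t + \gamma \max_{a'} Q(s_{t+1}, a').$$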
In my current case, I am trying to run deep Q-learning on the dinosaur game, where it is hypothetically possible for the dinosaur to never die, so it could run into the problem mentioned above.
Potential solutions could be simply rounding the contribution down to 0 once discount factor ^ steps falls below a certain threshold, or arbitrarily terminating the episode at a certain point and starting again, but neither of these solutions seems completely correct.
In the Atari paper, it appears they used a finite-size memory to hold past transitions, and then sampled from it at random to perform gradient descent on. Is this a correct interpretation, and is this potentially a solution to the problem I'm facing? Are there any other potential solutions?
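For reference, here is a minimal sketch of that interpretation, assuming PyTorch; the network shape, the state size (4 numbers for a dinosaur-game-style observation), the hyperparameters, and the helper names are my own placeholders rather than anything from the paper (which samples a small random minibatch rather than a single transition):

```python
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA = 0.99
MEMORY_SIZE = 10_000      # finite replay memory; oldest transitions are dropped
BATCH_SIZE = 32

# Placeholder network: 4 input features, 2 actions (e.g. jump / do nothing).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
memory = deque(maxlen=MEMORY_SIZE)

def store(state, action, reward, next_state, done):
    """States are assumed to be fixed-length lists of floats."""
    memory.append((state, action, reward, next_state, done))

def train_step():
    if len(memory) < BATCH_SIZE:
        return
    batch = random.sample(memory, BATCH_SIZE)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Bootstrapped one-step target: r + gamma * max_a' Q(s', a'),
    # with the future term zeroed out on terminal transitions.
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
    targets = rewards + GAMMA * (1.0 - dones) * next_q

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the target only needs (state, action, reward, next state), each sampled transition costs a constant amount of work no matter how long the episode has been running, which seems to avoid the unbounded per-step cost described above.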
Edit:
It seems that our label for gradient descent is immediate reward + discounted future return, but instead of finding the future return by continuing to play out the episode, we use the current Q-function estimate of it. This still seems a bit counterintuitive, since we are partly using our own function's output as the target of our gradient descent, but the knowledge of the immediate reward seemingly makes the function converge on a solution.
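So, if I understand correctly, in symbols the label and loss for each transition would be (with $y_t$ treated as a fixed number when taking the gradient with respect to $\theta$):

$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a';\, \theta), \qquad L(\theta) = \big(y_t - Q(s_t, a_t;\, \theta)\big)^2.$$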