Sarsa and Q-Learning (reinforcement learning) don't converge to the optimal policy


I have a question about my own project for testing reinforcement learning techniques. First, let me explain the setup. I have an agent that can take 4 actions over 8 steps. At the end of these eight steps, the agent can end up in one of 5 possible victory states, each with a different cost (50, 50, 0, 40, 60), and the goal is to find the minimum-cost path. To reach each of these 5 victory states, the agent takes a different path (like a graph). The blue states are fail states, where the episode is stopped.

[Figure: state-transition graph of the environment; blue states are fail states]

The true optimal path is: DCCBBAD

Now my question: I don't understand why, with SARSA and Q-learning (mainly Q-learning), the agent finds a path but not the optimal one after 100,000 iterations (always DACBBAD or DACBBCD). Sometimes, when I run it again, the agent does find the good path (DCCBBAD). So I would like to understand why it sometimes finds it and sometimes does not. Is there anything I can look at in order to stabilize my agent?

Thanks a lot,

Tanguy


Answer by mimoralea:

TL;DR:

  1. Set your epsilon so that you explore a lot for a large number of episodes, e.g. by linearly decaying it from 1.0 to 0.1.
  2. Set your learning rate to a small constant value, such as 0.1.
  3. Don't stop your algorithm based on the number of episodes, but on changes to the action-value function (a minimal sketch follows below).
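
Here is a minimal sketch of what those three points can look like in code. Everything in it is an assumption rather than something from the original question: the `env` object with `reset()` and `step(action) -> (next_state, reward, done)`, the table sizes, and the tolerance are placeholders for your own environment.

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=1.0, alpha=0.1,
               eps_start=1.0, eps_end=0.1, decay_episodes=50_000,
               tol=1e-6, max_episodes=200_000):
    """Tabular Q-learning with a slow, linear epsilon decay and a stopping
    rule based on changes to Q rather than on a fixed episode count.
    `env.reset()` and `env.step(action)` are assumed interfaces."""
    Q = np.zeros((n_states, n_actions))
    for episode in range(max_episodes):
        # 1. Explore a lot: decay epsilon linearly from eps_start to eps_end.
        frac = min(1.0, episode / decay_episodes)
        epsilon = eps_start + frac * (eps_end - eps_start)

        state = env.reset()
        done = False
        max_delta = 0.0
        while not done:
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, done = env.step(action)

            # 2. Small constant learning rate for the Q-learning update.
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            delta = target - Q[state, action]
            Q[state, action] += alpha * delta
            max_delta = max(max_delta, abs(alpha * delta))

            state = next_state

        # 3. Stop on (approximate) convergence of the action-value function,
        #    not on the episode count; only check once epsilon has bottomed out.
        if frac >= 1.0 and max_delta < tol:
            break
    return Q
```

In practice you would check the convergence criterion over a window of episodes rather than a single one, but the point is the same: stop when Q stops moving, not after a fixed number of episodes.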

More detailed version:

Q-learning is only guaranteed to converge under the following conditions:

  1. You must visit all state-action pairs infinitely often.
  2. The sum of the learning rates over all timesteps must be infinite: $\sum_t \alpha_t = \infty$.
  3. The sum of the squares of the learning rates over all timesteps must be finite: $\sum_t \alpha_t^2 < \infty$ (see the worked example below).
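
As a concrete illustration of conditions 2 and 3 (this worked example is mine, not part of the original answer), the schedule $\alpha_t = 1/t$ satisfies both:

```latex
% Learning-rate schedule \alpha_t = 1/t:
\sum_{t=1}^{\infty} \alpha_t   = \sum_{t=1}^{\infty} \frac{1}{t}   = \infty
\qquad\text{(condition 2 holds)}
\\
\sum_{t=1}^{\infty} \alpha_t^2 = \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty
\qquad\text{(condition 3 holds)}
```

A constant learning rate such as 0.1 technically violates condition 3 (its squares sum to infinity), but as the TL;DR says, it usually works well enough in practice.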

To hit condition 1, just make sure your epsilon doesn't decay to a low value too early. Make it decay very, very slowly, and perhaps never all the way to 0. You can experiment with different epsilon schedules, too.

To hit conditions 2 and 3, you must first take care of condition 1, so that the learning rates keep accumulating without bound, but you must also pick your learning rate so that the sum of its squares stays finite. That basically means a learning rate of at most 1. If your environment is deterministic, you should try 1. Deterministic here means that taking an action a in a state s always transitions to the same state s', for all states and actions in your environment. If your environment is stochastic, you can try a low value, such as 0.05-0.3.
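If you want a learning rate that satisfies conditions 2 and 3 exactly, rather than the constant value suggested above, one common choice is $\alpha = 1/N(s,a)$, the inverse of the visit count of that state-action pair. A minimal sketch, with all names assumed for illustration:

```python
import numpy as np

def count_based_update(Q, N, s, a, r, s_next, done, gamma=1.0):
    """One tabular Q-learning update with alpha = 1 / N(s, a).

    Per state-action pair the learning rates form a harmonic series,
    so their sum diverges (condition 2) while the sum of their squares
    converges (condition 3). Q and N are (n_states, n_actions) arrays;
    these names are illustrative, not from the question."""
    N[s, a] += 1
    alpha = 1.0 / N[s, a]
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
    return Q, N
```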

Maybe check out https://youtu.be/wZyJ66_u4TI?t=2790 for more info.