MDP & Reinforcement Learning - Convergence Comparison of VI, PI and QLearning Algorithms


I have implemented the VI (Value Iteration), PI (Policy Iteration), and Q-Learning algorithms in Python. After comparing the results, I noticed something: VI and PI converge to the same utilities and policies. With the same parameters, Q-Learning converges to different utilities, but to the same policies as VI and PI. Is this normal? I have read a lot of papers and books about MDPs and RL, but could not find anything that says whether the utilities of VI/PI should converge to the same values as those of Q-Learning.

The following information describes my grid world and results.

MY GRID WORLD

grid_world.png

  • States => {s0, s1, ... , s10}
  • Actions => {a0, a1, a2, a3} where: a0 = Up, a1 = Right, a2 = Down, a3 = Left for all states
  • There are 4 terminal states, which have +1, +1, -10, +10 rewards.
  • Initial state is s6
  • The transition probability of the intended action is P, and the agent slips to each side perpendicular to that action with probability (1 - P) / 2. (For example: if P = 0.8, when the agent tries to go UP, it goes UP with 80% chance, RIGHT with 10% chance, and LEFT with 10% chance. A small sketch of how this can be sampled is shown after this list.)
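
For reference, here is a minimal sketch of how I understand this transition noise can be sampled (the function and parameter names are illustrative, not taken from my actual code):

```python
import random

# Actions are indexed cyclically: 0 = Up, 1 = Right, 2 = Down, 3 = Left,
# so adding/subtracting 1 (mod 4) rotates the action by 90 degrees.
def sample_executed_action(intended_action, p=0.8):
    r = random.random()
    if r < p:
        return intended_action                # intended direction
    elif r < p + (1 - p) / 2:
        return (intended_action + 1) % 4      # slip 90 degrees clockwise
    else:
        return (intended_action - 1) % 4      # slip 90 degrees counter-clockwise
```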

RESULTS

  • VI and PI algorithm results with Reward = -0.02, Discount Factor = 0.8, Probability = 0.8
  • VI converges after 50 iterations, PI converges after 3 iterations

vi_pi_results.png

  • QLearning algorithm results with Reward = -0.02, Discount Factor = 0.8, Learning Rate = 0.1, Epsilon (For exploration) = 0.1
  • The resulting utilities shown in the Q-Learning image are the maximum Q(s, a) values of each state (see the short snippet after this list).
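
The utilities in the image were read off the Q-table roughly like this (a sketch, assuming Q is a dict mapping each state to a dict of action values; the names are illustrative):

```python
# Utility of a state = value of its best action; greedy policy = that best action.
utilities = {s: max(action_values.values()) for s, action_values in Q.items()}
policy = {s: max(action_values, key=action_values.get) for s, action_values in Q.items()}
```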

qLearning_1million_10million_iterations_results.png

In addition, I noticed that when Q-Learning runs for 1 million iterations, states that are equally far away from the +10 terminal end up with the same utilities. The agent does not seem to care whether its path to the reward passes near the -10 terminal, whereas it does care in VI and PI. Is this because, in Q-Learning, we don't know the transition probabilities of the environment?

BEST ANSWER

If the state and action spaces are finite, as in your problem, the Q-learning algorithm converges asymptotically to the optimal utility (i.e., the optimal Q-function) as the number of transitions approaches infinity, provided the following conditions hold:

$$\sum_{n=1}^{\infty} \alpha_n = \infty, \qquad \sum_{n=1}^{\infty} \alpha_n^2 < \infty,$$

where n indexes the transitions and α_n is the learning rate at the n-th update. These conditions require that you decay the learning rate as learning progresses. A typical choice is α_n = 1/n. In practice, however, the learning-rate schedule may need some tuning depending on the problem.
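
As a minimal sketch (assuming a tabular Q stored as a dict of dicts and a per-state-action visit counter; the names are illustrative), the update with a 1/n schedule could look like this:

```python
from collections import defaultdict

visit_count = defaultdict(int)  # number of updates applied to each (state, action) pair

def q_update(Q, s, a, r, s_next, gamma=0.8):
    """One tabular Q-learning update with alpha_n = 1 / n(s, a)."""
    visit_count[(s, a)] += 1
    alpha = 1.0 / visit_count[(s, a)]              # decays, satisfying both sums above
    target = r + gamma * max(Q[s_next].values())   # assumes s_next is non-terminal
    Q[s][a] += alpha * (target - Q[s][a])
```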

On the other hand, another convergence condition is that every state-action pair is updated infinitely often (in an asymptotic sense). This can be achieved simply by keeping the exploration rate greater than zero.
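
For example, a standard epsilon-greedy rule with a fixed epsilon > 0 keeps every action being tried in every state the agent visits (a sketch; the names are illustrative):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[s][a])
```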

So, in your case, you need to decay the learning rate over time rather than keeping it fixed at 0.1.