MDP & Reinforcement Learning - Convergence Comparison of VI, PI and QLearning Algorithms


I have implemented the VI (Value Iteration), PI (Policy Iteration), and Q-Learning algorithms in Python. After comparing the results, I noticed something: VI and PI converge to the same utilities and policies. With the same parameters, Q-Learning converges to different utilities, but to the same policies as VI and PI. Is this normal? I have read a lot of papers and books about MDPs and RL, but could not find anything that says whether the utilities of VI/PI should converge to the same values as those of Q-Learning.

The following information describes my grid world and results.

MY GRID WORLD

grid_world.png

  • States => {s0, s1, ... , s10}
  • Actions => {a0, a1, a2, a3} where: a0 = Up, a1 = Right, a2 = Down, a3 = Left for all states
  • There are 4 terminal states, which have +1, +1, -10, +10 rewards.
  • Initial state is s6
  • The transition probability of the intended action is P, and the agent slips to each side perpendicular to that action with probability (1 - P) / 2. (For example: if P = 0.8, when the agent tries to go UP, it goes UP with 80% chance, RIGHT with 10% chance, and LEFT with 10% chance. A small sketch of how this can be sampled is shown after this list.)
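
For reference, here is a minimal sketch of how I understand this transition noise can be sampled (the function and parameter names are illustrative, not taken from my actual code):

```python
import random

# Actions are indexed cyclically: 0 = Up, 1 = Right, 2 = Down, 3 = Left,
# so adding/subtracting 1 (mod 4) rotates the action by 90 degrees.
def sample_executed_action(intended_action, p=0.8):
    r = random.random()
    if r < p:
        return intended_action                # intended direction
    elif r < p + (1 - p) / 2:
        return (intended_action + 1) % 4      # slip 90 degrees clockwise
    else:
        return (intended_action - 1) % 4      # slip 90 degrees counter-clockwise
```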

RESULTS

  • VI and PI algorithm results with Reward = -0.02, Discount Factor = 0.8, Probability = 0.8
  • VI converges after 50 iterations, PI converges after 3 iterations

vi_pi_results.png

  • QLearning algorithm results with Reward = -0.02, Discount Factor = 0.8, Learning Rate = 0.1, Epsilon (For exploration) = 0.1
  • The resulting utilities shown in the Q-Learning image are the maximum Q(s, a) values of each state (see the short snippet after this list).
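
The utilities in the image were read off the Q-table roughly like this (a sketch, assuming Q is a dict mapping each state to a dict of action values; the names are illustrative):

```python
# Utility of a state = value of its best action; greedy policy = that best action.
utilities = {s: max(action_values.values()) for s, action_values in Q.items()}
policy = {s: max(action_values, key=action_values.get) for s, action_values in Q.items()}
```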

qLearning_1million_10million_iterations_results.png

In addition, I noticed that when Q-Learning runs for 1 million iterations, states that are equally far away from the +10 terminal end up with the same utilities. The agent does not seem to care whether its path to the reward passes near the -10 terminal, whereas it does care in VI and PI. Is this because, in Q-Learning, we don't know the transition probabilities of the environment?

BEST ANSWER

If the state and action spaces are finite, as in your problem, the Q-learning algorithm converges asymptotically to the optimal utility (i.e., the optimal Q-function) as the number of transitions approaches infinity, provided the following conditions hold:

$$\sum_{n=1}^{\infty} \alpha_n = \infty, \qquad \sum_{n=1}^{\infty} \alpha_n^2 < \infty,$$

where n indexes the transitions and α_n is the learning rate at the n-th update. These conditions require that you decay the learning rate as learning progresses. A typical choice is α_n = 1/n. In practice, however, the learning-rate schedule may need some tuning depending on the problem.
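
As a minimal sketch (assuming a tabular Q stored as a dict of dicts and a per-state-action visit counter; the names are illustrative), the update with a 1/n schedule could look like this:

```python
from collections import defaultdict

visit_count = defaultdict(int)  # number of updates applied to each (state, action) pair

def q_update(Q, s, a, r, s_next, gamma=0.8):
    """One tabular Q-learning update with alpha_n = 1 / n(s, a)."""
    visit_count[(s, a)] += 1
    alpha = 1.0 / visit_count[(s, a)]              # decays, satisfying both sums above
    target = r + gamma * max(Q[s_next].values())   # assumes s_next is non-terminal
    Q[s][a] += alpha * (target - Q[s][a])
```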

On the other hand, another convergence condition is that every state-action pair is updated infinitely often (in an asymptotic sense). This can be achieved simply by keeping the exploration rate greater than zero.
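
For example, a standard epsilon-greedy rule with a fixed epsilon > 0 keeps every action being tried in every state the agent visits (a sketch; the names are illustrative):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[s][a])
```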

So, in your case, you need to decay the learning rate over time rather than keeping it fixed at 0.1.