MDP Policy Iteration example calculations

I am new to RL and am following the lectures from UWaterloo. In lecture 3a on Policy Iteration, the professor gives an example of an MDP in which a company must decide between the Advertise (A) and Save (S) actions in the states Poor & Unknown (PU), Poor & Famous (PF), Rich & Famous (RF), and Rich & Unknown (RU), as shown in the MDP transition diagram from the lecture (image not reproduced here).

For the second iteration, n=1, the state value of "Rich & Famous" is shown as 54.2. I am not able to reproduce this value by following the Policy Iteration algorithm.

My calculation goes as follows:

V_2(RF) = V_1(RF) + gamma * Sum_{s'} [ p(s'|s,a) * V_1(s') ]

For the Save action,

V_2(RF) = 10 + 0.9 * [0.5*10 + 0.5 * 10] = 19
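
A quick Octave check of this one-step backup (a sketch; I am taking the reward 10 and V_1 = 10 for both successor states, as in the line above) confirms it gives 19, nowhere near 54.2:

gamma = 0.9;
% one-step Bellman backup for RF under Save:
% both successor states have value 10 after the first iteration
V2_RF = 10 + gamma * (0.5*10 + 0.5*10)   % prints V2_RF = 19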

What am I missing here?


I think I found the answer. Here V is not a one-step value update per iteration (as in value iteration) but the value of the current policy itself: under a fixed policy it must satisfy V = R + gamma*P*V simultaneously for all states. Hence we need to solve this linear system, e.g. via the matrix inverse:

V = (I - gamma*P)^-1 * R
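
Expanding the third row of this system (the RF state under Save, using row 3 of the P below) shows why a single backup cannot reproduce 54.2: the unknowns appear on both sides of the equation,

V(RF) = 10 + 0.9 * (0.5*V(RF) + 0.5*V(RU))

so V(RF) has to be solved for jointly with the other state values rather than computed from the previous iterate.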

In Octave, for the second iteration, where the current policy's actions are "ASSS" (states ordered PU, PF, RF, RU), the values are:

octave:32> A=eye(4) - 0.9*[0.5 0.5 0 0; 0.5 0 0.5 0;0 0 0.5 0.5;0.5 0 0 0.5]
A =

   0.5500  -0.4500        0        0
  -0.4500   1.0000  -0.4500        0
        0        0   0.5500  -0.4500
  -0.4500        0        0   0.5500

octave:35> B=[0;0;10;10]
B =

    0
    0
   10
   10

octave:36> A\B
ans =

   31.585
   38.604
   54.202
   44.024
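
These entries are V(PU), V(PF), V(RF), V(RU) respectively, so V(RF) = 54.202 matches the 54.2 on the slide. As a cross-check, iterating the backup V <- R + gamma*P*V from V = 0 converges to the same fixed point; a minimal Octave sketch using only the P and R above:

gamma = 0.9;
P = [0.5 0.5 0 0; 0.5 0 0.5 0; 0 0 0.5 0.5; 0.5 0 0 0.5];  % transitions under policy "ASSS"
R = [0; 0; 10; 10];
V = zeros(4, 1);
for k = 1:200        % repeated Bellman backups under the fixed policy
  V = R + gamma * P * V;
end
disp(V')             % approx 31.585  38.604  54.202  44.024

Note that the second iterate of this sequence is exactly the 19 computed in the question; 54.2 is the limit of the sequence (equivalently, the solution of the linear system), not an intermediate iterate.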