MDP Policy Iteration example calculations

I am new to RL and am following the lectures from UWaterloo. In lecture 3a on Policy Iteration, the professor gives an example of an MDP in which a company must decide between the Advertise (A) and Save (S) actions in the states Poor & Unknown (PU), Poor & Famous (PF), Rich & Famous (RF), and Rich & Unknown (RU), as shown in the MDP transition diagram from the lecture (image not reproduced here).

For the second iteration, n=1, the state value of "Rich & Famous" is shown as 54.2. I am not able to reproduce this value by following the Policy Iteration algorithm.

My calculation goes as follows:

V_2(RF) = V_1(RF) + gamma * Sum_{s'} [ p(s'|s,a) * V_1(s') ]

For the Save action,

V_2(RF) = 10 + 0.9 * [0.5*10 + 0.5 * 10] = 19
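
A quick Octave check of this one-step backup (a sketch; I am taking the reward 10 and V_1 = 10 for both successor states, as in the line above) confirms it gives 19, nowhere near 54.2:

gamma = 0.9;
% one-step Bellman backup for RF under Save:
% both successor states have value 10 after the first iteration
V2_RF = 10 + gamma * (0.5*10 + 0.5*10)   % prints V2_RF = 19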

What am I missing here?


I think I found the answer. Here V is not a one-step value update per iteration (as in value iteration) but the value of the current policy itself: under a fixed policy it must satisfy V = R + gamma*P*V simultaneously for all states. Hence we need to solve this linear system, e.g. via the matrix inverse:

V = (I - gamma*P)^-1 * R
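
Expanding the third row of this system (the RF state under Save, using row 3 of the P below) shows why a single backup cannot reproduce 54.2: the unknowns appear on both sides of the equation,

V(RF) = 10 + 0.9 * (0.5*V(RF) + 0.5*V(RU))

so V(RF) has to be solved for jointly with the other state values rather than computed from the previous iterate.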

In Octave, for the second iteration, where the current policy's actions are "ASSS" (states ordered PU, PF, RF, RU), the values are:

octave:32> A=eye(4) - 0.9*[0.5 0.5 0 0; 0.5 0 0.5 0;0 0 0.5 0.5;0.5 0 0 0.5]
A =

   0.5500  -0.4500        0        0
  -0.4500   1.0000  -0.4500        0
        0        0   0.5500  -0.4500
  -0.4500        0        0   0.5500

octave:35> B=[0;0;10;10]
B =

    0
    0
   10
   10

octave:36> A\B
ans =

   31.585
   38.604
   54.202
   44.024
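
These entries are V(PU), V(PF), V(RF), V(RU) respectively, so V(RF) = 54.202 matches the 54.2 on the slide. As a cross-check, iterating the backup V <- R + gamma*P*V from V = 0 converges to the same fixed point; a minimal Octave sketch using only the P and R above:

gamma = 0.9;
P = [0.5 0.5 0 0; 0.5 0 0.5 0; 0 0 0.5 0.5; 0.5 0 0 0.5];  % transitions under policy "ASSS"
R = [0; 0; 10; 10];
V = zeros(4, 1);
for k = 1:200        % repeated Bellman backups under the fixed policy
  V = R + gamma * P * V;
end
disp(V')             % approx 31.585  38.604  54.202  44.024

Note that the second iterate of this sequence is exactly the 19 computed in the question; 54.2 is the limit of the sequence (equivalently, the solution of the linear system), not an intermediate iterate.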