I am new to RL and following the lectures from UWaterloo. In lecture 3a on Policy Iteration, the professor gives an example of an MDP in which a company must choose between an Advertise (A) and a Save (S) action in each of four states - Poor Unknown (PU), Poor Famous (PF), Rich Famous (RF) and Rich Unknown (RU) - as shown in the MDP transition diagram below.
For the second iteration (n = 1), the state value of "Rich and Famous" is shown as 54.2. I am not able to reproduce this value with the policy iteration algorithm.
My calculation goes as follows:
V_2(RF) = R(RF) + gamma * Sum_s' [ p(s'|RF,a) * V_1(s') ]
For the Save action, with R(RF) = 10 and both successor states having V_1 = 10,
V_2(RF) = 10 + 0.9 * [0.5*10 + 0.5*10] = 19
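That single one-step backup can be checked numerically. This is just a sketch of my own calculation, assuming V_1 is initialized to the immediate rewards (so V_1 = 10 for both rich states) and that Save in RF moves to RU or RF with probability 0.5 each:

```python
gamma = 0.9
R_RF = 10.0                    # immediate reward in Rich Famous
V1 = {"RU": 10.0, "RF": 10.0}  # V_1 initialized to the immediate rewards

# One Bellman backup for RF under Save (successors assumed from the diagram):
V2_RF = R_RF + gamma * (0.5 * V1["RU"] + 0.5 * V1["RF"])
print(V2_RF)  # -> 19.0
```

So a single backup indeed gives 19, not the 54.2 on the slide.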
What am I missing here?
I think I found the answer. Here V is not a one-step value update per iteration (as it would be in value iteration) but the value of the current policy. Hence, in the policy evaluation step we need to solve the system of linear equations V^pi(s) = R(s) + gamma * Sum_s' [ p(s'|s,pi(s)) * V^pi(s') ] for all states simultaneously.
In Octave, for the second iteration with the policy actions "ASSS" (Advertise in PU, Save in PF, RU and RF), the values will be:
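The same policy evaluation can be sketched in Python without any linear algebra library. The transition probabilities and rewards below are my reading of the lecture's diagram (treat them as assumptions); repeatedly applying the Bellman backup under the fixed policy "ASSS" converges to the solution of the linear system V^pi = R + gamma * P^pi * V^pi:

```python
# Policy evaluation for the fixed policy "ASSS" (Advertise in PU, Save elsewhere).
# Transitions and rewards assumed from the lecture's diagram:
#   PU (Advertise): 0.5 -> PU, 0.5 -> PF   reward 0
#   PF (Save):      0.5 -> PU, 0.5 -> RF   reward 0
#   RU (Save):      0.5 -> PU, 0.5 -> RU   reward 10
#   RF (Save):      0.5 -> RU, 0.5 -> RF   reward 10
states = ["PU", "PF", "RU", "RF"]
R = {"PU": 0.0, "PF": 0.0, "RU": 10.0, "RF": 10.0}
P = {  # P[s] = list of (next_state, probability) under policy ASSS
    "PU": [("PU", 0.5), ("PF", 0.5)],
    "PF": [("PU", 0.5), ("RF", 0.5)],
    "RU": [("PU", 0.5), ("RU", 0.5)],
    "RF": [("RU", 0.5), ("RF", 0.5)],
}
gamma = 0.9

# Iterate V <- R + gamma * P V to a fixed point; the fixed point is exactly
# the solution of the linear system V^pi = R + gamma * P^pi * V^pi.
V = {s: 0.0 for s in states}
for _ in range(1000):
    V_new = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s]) for s in states}
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-10:
        V = V_new
        break
    V = V_new

print({s: round(V[s], 1) for s in states})
# -> {'PU': 31.6, 'PF': 38.6, 'RU': 44.0, 'RF': 54.2}
```

V(RF) comes out as 54.2, matching the slide. Equivalently, you could solve the 4x4 system directly (e.g. with the backslash operator in Octave or `numpy.linalg.solve` on (I - gamma*P)V = R); iteration is used here only to keep the sketch dependency-free.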