I am currently learning about Markov decision processes and have been using the MDPtoolbox package in R.

I am practicing by coding a "discrete cake-eating problem", in which we analyze how a consumer should eat a cake made up of 3 slices. The problem is simple because it has no stochastic elements.
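For reference, the model this code is presumably meant to encode can be written as a Bellman equation. This is a sketch of the standard deterministic cake-eating formulation, not something taken from the MDPtoolbox documentation: the symbols $s$ (slices remaining), $c$ (slices eaten this period) and $\beta$ (discount factor) are notation assumed here, with the square-root utility and the discount of 0.8 taken from the code below.

```latex
V(s) = \max_{c \in \{0, 1, \dots, s\}} \left\{ \sqrt{c} + \beta \, V(s - c) \right\},
\qquad s \in \{0, 1, 2, 3\}, \quad \beta = 0.8
```

In this formulation the constraint $c \le s$ is what keeps total consumption at or below 3 slices.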

The R code I have is the following:

#Cake Eating with the MDPtoolbox Package
require(MDPtoolbox)

#Transition Matrices from eating 0,1,2 and 3 pieces of cake

C_0<-matrix(c(1,0,0,0,
              1,0,0,0,
              1,0,0,0,
              1,0,0,0),
            ncol=4, nrow=4,
            byrow=TRUE)

C_1<-matrix(c(0,1,0,0,
              0,1,0,0,
              0,1,0,0,
              0,1,0,0),
            ncol=4, nrow=4,
            byrow=TRUE)


C_2<-matrix(c(0,0,1,0,
              0,0,1,0,
              0,0,1,0,
              0,0,1,0),
            ncol=4, nrow=4,
            byrow=TRUE)


C_3<-matrix(c(0,0,0,1,
              0,0,0,1,
              0,0,0,1,
              0,0,0,1),
            ncol=4, nrow=4,
            byrow=TRUE)

T<-list(C_0=C_0,C_1=C_1,C_2=C_2,C_3=C_3)

#Utilities from eating 0, 1,2 and 3 pieces of cake.

U_func<-function(x){x^0.5}

U_C0<-U_func(0)
U_C1<-U_func(1)
U_C2<-U_func(2)
U_C3<-U_func(3)

#Reward Matrix
W<-matrix(c(U_C0,U_C0,U_C0,U_C0,
            U_C1,U_C1,U_C1,U_C0,
            U_C2,U_C2,U_C0,U_C0,
            U_C3,U_C0,U_C0,U_C0),
          ncol=4, nrow=4,byrow=TRUE)

#MDP Check
mdp_check(T,W)

#MDP policy iteration
m1<-mdp_policy_iteration(T,W,0.8)

m1
names(T)[m1$policy]

The output from this code is:

> m1
$V
[1] 4.920475 5.920475 6.150593 5.668430

$policy
[1] 3 3 2 1

$iter
[1] 4

$time
Time difference of 0.03619385 secs

> names(T)[m1$policy]
[1] "C_2" "C_2" "C_1" "C_0"

This is peculiar because the optimal policy suggests eating two pieces of cake in each of the first two periods, one piece in the third period, and none in the fourth.

The problem is that the cake is intended to consist of only 3 pieces.
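One way to probe the result is to trace the returned policy forward through the transition matrices instead of reading the policy vector as a time sequence. The sketch below assumes that `m1$policy` is indexed by state (entry s is the action chosen in state s), and it relies on the fact that every row of each `C_k` above puts probability 1 on column k + 1, so the next state equals the action index. The helper names `next_state` and `trace_policy` and the starting state 4 are hypothetical, chosen only for illustration.

```r
# Policy copied from the output above; assumed to be indexed by state,
# i.e. policy[s] is the index of the action chosen in state s.
policy <- c(3, 3, 2, 1)

# In the transition matrices as written, action C_k sends every state to
# column k + 1, so the next state depends only on the action index.
next_state <- function(state, action) action

# Follow the policy for a given number of periods from a starting state
# (hypothetical helper, not part of MDPtoolbox).
trace_policy <- function(start, periods) {
  states <- numeric(periods + 1)
  states[1] <- start
  for (t in seq_len(periods)) {
    states[t + 1] <- next_state(states[t], policy[states[t]])
  }
  states
}

# Sequence of states visited when starting from state 4.
print(trace_policy(start = 4, periods = 5))
```

Under these assumptions the path from state 4 is 4, 1, 3, 2, 3, 2: the process ends up alternating between states 3 and 2 and collects a positive reward in every period, which suggests that nothing in the matrices ties the available actions to how much cake remains.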

My question: Why does the optimal policy suggest eating 5 pieces of cake when there are only 3 pieces in this problem? Where is my code broken?
