I have a problem implementing an MDP (Markov decision process) in Python.
I have these matrices: states (1 x n) and actions (1 x m).
The transition matrix is calculated by this code:
import numpy as np

p = np.zeros((n, n))
for t in range(l - 1):                       # my data is a 1x100 matrix
    p[states[t] - 1, states[t + 1] - 1] += 1   # count observed transitions
for i in range(n):
    p[i, :] = p[i, :] / np.sum(p[i, :])      # normalize each row into probabilities
and the Reward matrix by this code:
Reward = np.zeros(l - 1)
for i in range(l - 1):
    Reward[i] = (states[i + 1] - states[i]) / states[i] * 100   # percentage change between consecutive states
To have the optimal value, "quantecon package" in python is defined by:
ddp = quantecon.markov.DiscreteDP(R, Q, beta)
where Q : transition matrix should be m x n x m
.
Can anyone help me understand how Q can be a (m,n,m) matirx?! Thank you in advance.
If you have n states and m actions, Q will be an array of shape (n, m, n) (not (m, n, m)), where Q[s, a, t] stores the probability that the state in the next period is the t-th state when the current state is the s-th state and the action taken is the a-th action.
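For concreteness, here is a minimal sketch of building such a Q and passing it to DiscreteDP. The values of n, m, beta, R, and p below are placeholders, not taken from your data; it also assumes the transitions do not depend on the action, so the same (n, n) matrix p is reused for every action (if they do depend on the action, estimate a separate p per action from your data):

import numpy as np
import quantecon

n, m = 4, 2        # hypothetical numbers of states and actions
beta = 0.95        # discount factor

# placeholder (n, n) stochastic matrix; in practice use the p estimated above
p = np.full((n, n), 1.0 / n)

# Q[s, a, t] = probability of moving from state s to state t under action a
Q = np.empty((n, m, n))
for a in range(m):
    Q[:, a, :] = p                 # same transitions for every action (assumption)

# R[s, a] = expected reward of taking action a in state s (placeholder values)
R = np.random.rand(n, m)

ddp = quantecon.markov.DiscreteDP(R, Q, beta)
res = ddp.solve(method='policy_iteration')
print(res.v)       # optimal value of each state
print(res.sigma)   # optimal action index for each state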