MDP implementation using python - dimensions


I have a problem implementing an MDP (Markov decision process) in Python.

I have these matrices: states (1 x n) and actions (1 x m). The transition matrix is computed by this code:

import numpy as np

p = np.zeros((n, n))
for t in range(l - 1):  # my data is a 1x100 matrix; the loop looks at states[t+1], so it must stop at l-1
    p[states[t] - 1, states[t + 1] - 1] += 1
for i in range(n):
    p[i, :] = p[i, :] / np.sum(p[i, :])  # normalize each row so it sums to 1
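As a sanity check, here is a minimal runnable sketch of that counting estimator on a made-up state sequence (the labels and data below are hypothetical, not from the question):

```python
import numpy as np

# Hypothetical toy data: a short sequence of states labeled 1..n
states = np.array([1, 2, 2, 3, 1, 2, 3, 3, 1, 2])
n = 3
l = len(states)

# Count observed transitions (state labels are 1-based, indices 0-based)
p = np.zeros((n, n))
for t in range(l - 1):
    p[states[t] - 1, states[t + 1] - 1] += 1

# Normalize each row into a probability distribution
p = p / p.sum(axis=1, keepdims=True)

# Every row of a transition matrix must sum to 1
assert np.allclose(p.sum(axis=1), 1.0)
```

Note that if some state never occurs in the data, its row of counts is all zeros and the normalization divides by zero, so in practice you may want to guard against empty rows.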

and the reward matrix by this code:

Reward = np.zeros(l - 1)
for i in range(l - 1):  # store one value per step instead of overwriting a scalar
    Reward[i] = (states[i + 1] - states[i]) / states[i] * 100

To compute the optimal value, the quantecon package in Python is called like this:

ddp = quantecon.markov.DiscreteDP(R, Q, beta)

where Q, the transition matrix, should be m x n x m.

Can anyone help me understand how Q can be an (m, n, m) matrix? Thank you in advance.

There is 1 answer below.

If you have n states and m actions, Q will be an array of shape (n, m, n), not (m, n, m). The entry Q[s, a, t] stores the probability that the next period's state is the t-th state, given that the current state is the s-th state and the action taken is the a-th action.
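One way to build such a Q from the 2D matrix p estimated above, assuming (as a simplification) that the transition probabilities in your data do not depend on the action, is to repeat p along a new action axis. The sizes and the matrix below are hypothetical placeholders:

```python
import numpy as np

n, m = 3, 2  # hypothetical numbers of states and actions

# A toy row-stochastic (n, n) transition matrix, standing in for
# the p estimated from the data
p = np.array([[0.1, 0.9, 0.0],
              [0.5, 0.2, 0.3],
              [0.0, 0.4, 0.6]])

# Replicate p for every action: Q[s, a, t] = p[s, t]
Q = np.tile(p[:, np.newaxis, :], (1, m, 1))  # shape (n, m, n)

assert Q.shape == (n, m, n)
assert np.allclose(Q.sum(axis=2), 1.0)  # each (s, a) row is a distribution

# R must then have shape (n, m); with R, Q, and a discount beta in hand:
#   ddp = quantecon.markov.DiscreteDP(R, Q, beta)
```

If actions do matter in your problem, you would instead estimate a separate (n, n) transition matrix from the transitions observed under each action and stack those m matrices along axis 1.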