Must a transition matrix from a Markov Decision Process be stochastic?


I'm trying to find the optimal policy for a Markov Decision Process problem specified in this diagram, using Value Iteration (via pymdptoolbox) and NumPy. But pymdptoolbox says my transition matrix "is not stochastic".

Is it because of the rows of [0, 0, 0, 0]? Some transitions are impossible, like from state 1 to state 3. How do I represent these impossible transitions, if not with zeroes?

My code:

import mdptoolbox 
import numpy as np

transitions = np.array([
    # action 1
    [
        [0.2, 0.8, 0, 0],    # s1
        [0, 0, 0, 0],        # s2
        [0, 0, 0, 0],        # s3
        [0, 0, 0.9, 0.1]     # s4
    ],
    # action 2
    [
        [0.2, 0, 0, 0.8],    # s1
        [0, 0.2, 0.8, 0],    # s2
        [0, 0, 0, 0],        # s3
        [0, 0, 0, 0]         # s4
    ],
    # action 3
    [
        [0, 0, 0, 0],        # s1
        [0.8, 0.2, 0, 0],    # s2
        [0, 0, 0, 1],        # s3
        [0, 0, 0, 0]         # s4
    ],
    # action 4
    [
        [0.8, 0, 0, 0.2],    # s1
        [0, 0, 0, 0],        # s2
        [0, 1, 0, 0],        # s3
        [0, 0, 0, 0]         # s4
    ]
])

# rows are states, columns are actions
rewards = np.array([
    [0, 0, 0, 0],  # s1
    [0, 0, 0, 0],  # s2
    [1, 1, 1, 1],  # s3
    [0, 0, 0, 0]   # s4
])

vi = mdptoolbox.mdp.ValueIteration(transitions, rewards, 0.4)

2 Answers

Answer 1:

I don't have enough reputation to comment, but I wanted to expand on Prune's answer. I'm currently doing an exercise comparing the mdptoolbox value iteration results with those of our own Python implementation of the algorithm. I'm not exactly sure what Prune means by "sink", so I may be repeating that answer in a way; if so, edit or flag my post for deletion, all good.

I basically ended up following that advice. However, a classmate made a contribution that I think really made it work. Say you have three states: state 1, state 2, and state 3, and for a given action a 3x3 transition matrix with states 1, 2, 3 as rows and the probabilities of transitioning to states 1, 2, 3 in the columns (so cell [1, 2] is the probability of transitioning to state 2 when taking the action from state 1). If you put all 1's on the diagonal, as Prune suggested, you would have a 100% chance of remaining in the current state when you took the action, no matter what state you're in.

To get this to work with mdptoolbox, you do want to force all rows to sum to 1, as Prune said. However, I don't think you can arbitrarily choose which column to drop a "1" into in a row of all zeros. To get the program to run consistently and accurately, you need to make sure that in a row of all zeros (i.e., for state s) you put the 1 in the cell corresponding to a transition back to the same state ([s, s] = 1), which is the diagonal position for that row. Also make sure that the reward for taking this action in state s is 0. This essentially says that if you take the impossible action at state s, you have a 100% chance of staying in the same state and earn no reward.
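As a concrete sketch of the repair described above (my own illustration, not code from the exercise; fix_impossible_rows is a hypothetical name, and it assumes transitions has shape (A, S, S) and rewards has shape (S, A), as in the question):

import numpy as np

def fix_impossible_rows(transitions, rewards):
    # Copy so the caller's arrays are left untouched.
    transitions = transitions.copy()
    rewards = rewards.copy()
    n_actions, n_states, _ = transitions.shape
    for a in range(n_actions):
        for s in range(n_states):
            if transitions[a, s].sum() == 0:  # action a is impossible in state s
                transitions[a, s, s] = 1.0    # stay in the same state...
                rewards[s, a] = 0.0           # ...and collect no reward
    return transitions, rewards

With the question's arrays, this turns every [0, 0, 0, 0] row into a self-loop, which is exactly the [s, s] = 1 placement described above.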

This answer is pretty naive on my part, though I will say I scoured the manual and source code trying to find a solution (here's a link to the manual). I feel confident enough in this answer to post it because I coded an implementation of value iteration that doesn't depend on a perfectly stochastic matrix and got the same optimal policies and values as when I followed the method described above with the mdptoolbox value iteration. Moreover, when I arbitrarily chose columns to force a "1" into to make the matrix stochastic, I did not get consistent results, nor did any of them line up with the manual implementation of the algorithm. For reference, here's the pseudocode I used for value iteration. If I'm doing something wrong, someone call me out!

pseudocode for value iteration
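Since the pseudocode is linked rather than shown, here is a minimal NumPy sketch of the standard value-iteration loop it describes (my own illustration, not the answerer's code; it assumes the same (A, S, S) transitions and (S, A) rewards shapes as above, with gamma and tol as free parameters):

import numpy as np

def value_iteration(transitions, rewards, gamma=0.4, tol=1e-8):
    n_actions, n_states, _ = transitions.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum over s' of P[a, s, s'] * V[s']
        Q = rewards + gamma * np.einsum('ast,t->sa', transitions, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # values and a greedy policy
        V = V_new

Unlike mdptoolbox, this loop doesn't insist on perfectly stochastic rows, which is what makes it useful as a cross-check.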

Answer 2:

The problem is that you've used an all-zero row to represent an unreachable state. You must have a total probability of 1 in each row. Since the state is unreachable, it doesn't matter how you do that: drop a "1" into the first column, distribute the values evenly, whatever suits your fancy. When I've hit this problem, I simply put a "1" on the main diagonal: let the impossible state be a sink.
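To make that concrete with the question's arrays (my own sketch of the diagonal fix, not code from the answer), the repaired slice for action 1 would be:

transitions[0] = np.array([
    [0.2, 0.8, 0, 0],   # s1
    [0, 1, 0, 0],       # s2: action 1 is impossible here, so it becomes a self-loop
    [0, 0, 1, 0],       # s3: likewise
    [0, 0, 0.9, 0.1]    # s4
])

Repair the all-zero rows of the other three actions the same way, and ValueIteration(transitions, rewards, 0.4) should construct without the "is not stochastic" error; calling vi.run() then computes the policy (vi.policy).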