I'm trying to find the optimal policy for the Markov Decision Process specified in this diagram, using Value Iteration (via pymdptoolbox) and NumPy, but pymdptoolbox complains that my transition matrix "is not stochastic".
Is it because of the rows of [0, 0, 0, 0]? Some transitions are impossible, like going from state 1 to state 3. How do I represent these impossible transitions, if not with zeroes?
My code:
import mdptoolbox
import numpy as np
transitions = np.array([
    # action 1
    [
        [0.2, 0.8, 0, 0],  # s1
        [0, 0, 0, 0],      # s2
        [0, 0, 0, 0],      # s3
        [0, 0, 0.9, 0.1],  # s4
    ],
    # action 2
    [
        [0.2, 0, 0, 0.8],  # s1
        [0, 0.2, 0.8, 0],  # s2
        [0, 0, 0, 0],      # s3
        [0, 0, 0, 0],      # s4
    ],
    # action 3
    [
        [0, 0, 0, 0],      # s1
        [0.8, 0.2, 0, 0],  # s2
        [0, 0, 0, 1],      # s3
        [0, 0, 0, 0],      # s4
    ],
    # action 4
    [
        [0.8, 0, 0, 0.2],  # s1
        [0, 0, 0, 0],      # s2
        [0, 1, 0, 0],      # s3
        [0, 0, 0, 0],      # s4
    ],
])
rewards = np.array([
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
vi = mdptoolbox.mdp.ValueIteration(transitions, rewards, 0.4)

The problem is that you've used all-zero rows to represent state/action combinations that can't occur. Every row of every action's transition matrix must sum to 1, i.e. be a proper probability distribution, even for states where that action doesn't apply. Since such a state is unreachable under that action, it doesn't matter how you fill the row in -- drop a "1" into the first column, distribute the probability evenly, whatever suits your fancy. When I've hit this problem, I simply put a "1" on the main diagonal: let the impossible state be a sink (a self-loop).
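
As a minimal sketch of that fix, reusing the transitions and rewards arrays from your question (the patching loop and the vi.run()/vi.policy calls are my additions, not part of your original snippet):

import mdptoolbox
import numpy as np

# transitions and rewards as defined in the question above

# Give every all-zero row a self-loop so each row sums to 1
# (the "1 on the main diagonal" trick described above).
for a in range(transitions.shape[0]):      # actions
    for s in range(transitions.shape[1]):  # states
        if transitions[a, s].sum() == 0:
            transitions[a, s, s] = 1.0

vi = mdptoolbox.mdp.ValueIteration(transitions, rewards, 0.4)
vi.run()
print(vi.policy)  # a tuple with the chosen action index for each state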