I am trying out PyBrain's maze example.
My setup is:
# Imports as in PyBrain's standard RL/maze tutorial:
from pybrain.rl.environments.mazes import Maze, MDPMazeTask
from pybrain.rl.learners.valuebased import ActionValueTable
from pybrain.rl.learners import Q
from pybrain.rl.agents import LearningAgent
from pybrain.rl.experiments import Experiment

envmatrix = [[...]]
env = Maze(envmatrix, (1, 8))
task = MDPMazeTask(env)
table = ActionValueTable(states_nr, actions_nr)
table.initialize(0.)
learner = Q()
agent = LearningAgent(table, learner)
experiment = Experiment(task, agent)
for i in range(1000):
    experiment.doInteractions(N)
    agent.learn()
    agent.reset()
Now, I am not confident in the results I am getting.
The bottom-right corner (1, 8) is the absorbing state.
I have added a punishment state at (1, 7) in mdp.py:
def getReward(self):
    """ Compute and return the current reward
    (i.e. corresponding to the last action performed). """
    if self.env.goal == self.env.perseus:
        self.env.reset()
        reward = 1
    elif self.env.perseus == (1, 7):
        reward = -1000
    else:
        reward = 0
    return reward
Now, I do not understand how, after 1000 runs with 200 interactions each, the agent can think that my punishment state is a good state (you can see the square is white).
I would like to see the values for every state and the policy after the final run. How do I do that? I have found that this line

table.params.reshape(81, 4).max(1).reshape(9, 9)

returns some values, but I am not sure whether they correspond to the values of the value function.
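To check my understanding, here is a small numpy sketch (the 81-state x 4-action Q-table shape is assumed from the 9x9 maze; the sample values are made up): taking the max over the action axis gives the state-value estimates V(s) = max_a Q(s, a), while argmax over the same axis gives the greedy policy.

```python
import numpy as np

# Hypothetical stand-in for table.params: a flat Q-table
# with 81 states x 4 actions, as in the 9x9 maze example.
params = np.zeros(81 * 4)
params[5 * 4:5 * 4 + 4] = [0.1, 0.9, -0.2, 0.3]  # fake Q-values for state 5

Q_table = params.reshape(81, 4)

# V(s) = max_a Q(s, a): one value per state, arranged as the 9x9 grid
values = Q_table.max(1).reshape(9, 9)

# Greedy policy: index of the best action in each state
policy = Q_table.argmax(1).reshape(9, 9)

print(values[0, 5])   # 0.9 -> value of state 5
print(policy[0, 5])   # 1   -> best action index in state 5
```

So max(1) would give the value function and argmax(1) the corresponding greedy policy, if the flat parameter vector is really laid out state-by-state.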
Then I added another constraint - making the agent always start from the same position, (1, 1) - by adding

self.initPos = [(1, 1)]

in maze.py, and now I get this behaviour after 1000 runs with 200 interactions each: which kind of makes sense now - the robot tries to go around the wall from the other side, avoiding the state (1, 7).
So, I was getting weird results because the agent used to start from random positions, which also included the punishing state.
EDIT:
Another point: if it is desirable to spawn the agent randomly, then make sure it is not spawned in the punishable state.
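A minimal sketch of that idea, assuming the free cells are read off the maze matrix (the names `envmatrix` and `punish_state` and the toy 5x5 layout here are illustrative, not PyBrain API): build the spawn list from all free cells except the punishment state, and assign it to self.initPos.

```python
import numpy as np

# Illustrative 5x5 maze matrix: 1 = wall, 0 = free cell
envmatrix = np.array([
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
])

punish_state = (1, 3)  # cell that carries the big negative reward

# Every free cell except the punishment state is a valid spawn point;
# a list like this could be assigned to self.initPos in maze.py.
init_positions = [(int(r), int(c))
                  for r, c in zip(*np.nonzero(envmatrix == 0))
                  if (int(r), int(c)) != punish_state]

print(punish_state in init_positions)  # False
```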
Also, it then seems that

table.params.reshape(81, 4).max(1).reshape(9, 9)

returns the value of every state under the value function.