How to choose action in TD(0) learning

Question

1.5k Views Asked by zimmerrol At 21 July 2017 at 07:23

I am currently reading Sutton's book Reinforcement Learning: An Introduction. After reading Section 6.1, I wanted to implement a TD(0) RL algorithm for this setting:

To do this, I tried to implement the pseudo-code presented here:

While doing this, I wondered how to carry out the step A <- action given by π for S: how can I choose the optimal action A for my current state S? Since the value function V(S) depends only on the state and not on the action, I do not really know how this can be done.

I found this question (where I got the images from), which deals with the same exercise, but there the action is just picked randomly and not chosen by an action policy π.

Edit: Or is this pseudo-code incomplete, so that I also have to approximate the action-value function Q(s, a) in some other way?

**Pablo EM** · Accepted Answer · 2017-07-21T07:48:32.057000

You are right, you cannot choose an action (nor derive a policy π) from a state-value function V(s) alone because, as you noticed, it depends only on the state s.

The key concept that you are probably missing here is that TD(0) learning is an algorithm for computing the value function of a given policy. Thus, you are assuming that your agent is following a known policy. In the case of the Random Walk problem, the policy consists of choosing actions randomly.
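To make this concrete, here is a minimal sketch of TD(0) prediction, assuming a 5-state random walk with terminal states at both ends and a reward of 1 only on reaching the right end. The function name `td0_random_walk`, the state numbering, and the parameter defaults are illustrative assumptions, not taken from the book's pseudo-code. Note that the action is simply drawn from the fixed random policy; V is never consulted to pick it.

```python
import random

def td0_random_walk(episodes=1000, alpha=0.05, gamma=1.0, seed=0):
    """Estimate V for the random policy on a 5-state random walk with TD(0)."""
    rng = random.Random(seed)
    n_states = 5                      # assumed layout: states 1..5, terminals 0 and 6
    V = [0.0] * (n_states + 2)        # terminal values are never updated and stay 0
    for _ in range(episodes):
        s = (n_states + 1) // 2       # start each episode in the middle state
        while s not in (0, n_states + 1):
            a = rng.choice((-1, 1))   # the given policy pi: step left or right at random
            s_next = s + a
            r = 1.0 if s_next == n_states + 1 else 0.0
            V[s] += alpha * (r + gamma * V[s_next] - V[s])  # TD(0) update
            s = s_next
    return V[1:-1]                    # values of the non-terminal states only

if __name__ == "__main__":
    print(td0_random_walk())
```

Under these assumptions the true values are 1/6, 2/6, ..., 5/6, so the estimates should settle roughly around those numbers, with some noise because the step size is constant.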

If you want to be able to learn a policy, you need to estimate the action-value function Q(s,a). Several methods exist for learning Q(s,a) based on temporal-difference learning, for example SARSA and Q-learning.
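As a rough illustration of the control case, the following sketch applies tabular Q-learning to the same kind of 5-state walk. Everything specific here is an assumption made for the example: the name `q_learning_walk`, a discount of 0.9 (so that moving right is strictly better), epsilon-greedy exploration with random tie-breaking, and a random start state so that every state gets visited. Here the action is chosen from Q(s, a), which is exactly what V(s) alone cannot provide.

```python
import random

ACTIONS = (-1, +1)  # index 0 = step left, index 1 = step right

def q_learning_walk(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn Q(s, a) on a 5-state walk and return it with the greedy policy."""
    rng = random.Random(seed)
    n_states = 5                                   # states 1..5, terminals 0 and 6
    Q = [[0.0, 0.0] for _ in range(n_states + 2)]  # terminal rows stay at 0
    for _ in range(episodes):
        s = rng.randint(1, n_states)               # assumed: random start state
        while s not in (0, n_states + 1):
            if rng.random() < epsilon:             # explore
                a = rng.randrange(2)
            else:                                  # exploit, breaking ties at random
                best = max(Q[s])
                a = rng.choice([i for i in range(2) if Q[s][i] == best])
            s_next = s + ACTIONS[a]
            r = 1.0 if s_next == n_states + 1 else 0.0
            # Q-learning update: bootstrap from the best action in the next state
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    # Greedy policy for the non-terminal states, expressed as -1 (left) or +1 (right)
    policy = [ACTIONS[max(range(2), key=lambda i: Q[s][i])]
              for s in range(1, n_states + 1)]
    return Q, policy

if __name__ == "__main__":
    _, policy = q_learning_walk()
    print(policy)
```

SARSA would differ only in the update target: it would bootstrap from the action actually taken in the next state instead of the maximum over actions.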

In Sutton's RL book, the authors distinguish between two kinds of problems: prediction problems and control problems. The former refers to estimating the value function of a given policy, and the latter to finding good policies (often by means of action-value functions). You can find a reference to these concepts at the start of Chapter 6:

As usual, we start by focusing on the policy evaluation or prediction problem, that of estimating the value function for a given policy π. For the control problem (finding an optimal policy), DP, TD, and Monte Carlo methods all use some variation of generalized policy iteration (GPI). The differences in the methods are primarily differences in their approaches to the prediction problem.
