Updates in Temporal Difference Learning

I read about Tesauro's TD-Gammon program and would love to implement something similar for tic-tac-toe, but almost all of the information I can find is inaccessible to me as a high school student because I don't know the terminology.

The first equation at http://www.stanford.edu/group/pdplab/pdphandbook/handbookch10.html#x26-1310009.2 gives the "general supervised learning paradigm." It says that the w_t on the left side of the equation is the parameter vector at time step t. What exactly does "time step" mean? Within the framework of a tic-tac-toe neural network designed to output the value of a board state, would the time step refer to the number of pieces played in a given game? For example, would the board represented by the string "xoxoxoxox" be at time step 9 and the board "xoxoxoxo " at time step 8? Or does the time step refer to the amount of time elapsed since training began?

Since w_t is the weight vector for a given time step, does this mean that every time step has its own evaluation function (neural network)? In other words, would a board state with one move be fed into a different NN than a board state with two moves? I think I am misinterpreting something here, because as far as I know Tesauro used only one NN to evaluate all board states (though it is hard to find reliable information about TD-Gammon).

Why is the gradient of the output taken with respect to w and not w_t?

Thanks in advance for clarifying these ideas. I would appreciate any advice regarding my project or suggestions for accessible reading material.

1 Answer

TD deals with learning within the framework of a Markov Decision Process. That is, you begin in some state s_t, perform an action a_t, receive a reward r_t, and end up in another state, s_{t+1}. The initial state is called s_0. The subscript t is the time step.
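
To make the indexing concrete, here is a rough sketch of that loop in code. The env object, its reset/step methods, and choose_action are placeholder names I made up for illustration, not part of any library; the point is only that t counts decisions within one game, so each pass through the loop takes you from (s_t, a_t) to (r_t, s_{t+1}).

```
# One episode of the agent-environment loop (env and choose_action are
# placeholders supplied by the caller; this is only an illustration).
def play_one_game(env, choose_action):
    t = 0
    state = env.reset()                              # s_0: the starting state
    done = False
    while not done:
        action = choose_action(state)                # a_t
        next_state, reward, done = env.step(action)  # r_t and s_{t+1}
        # ... a TD update using (state, reward, next_state) would go here ...
        state = next_state
        t += 1                                       # advance the time step
    return t                                         # number of time steps taken
```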

A TD agent begins without knowing what rewards its actions will produce or which states those actions will lead to. Its goal is to learn which actions maximize long-term total reward.
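
To pin down "long-term total reward" (this is standard TD notation, not anything specific to the handbook chapter): the return from step t is the discounted sum of future rewards, and the simplest TD rule, TD(0), nudges the value estimate of the current state toward the reward plus the discounted value of the next state:

G_t = r_t + γ·r_{t+1} + γ^2·r_{t+2} + ...

V(s_t) ← V(s_t) + α·(r_t + γ·V(s_{t+1}) − V(s_t))

Here α is a learning rate and γ (between 0 and 1) is the discount factor; for an episodic game like tic-tac-toe you can simply take γ = 1.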

The state could represent the state of the tic-tac-toe board. So, s_0 in your case would be the empty board: "---------", s_1 might be "-o-------", s_2 might be "-o--x----", etc. The action might be the index of the cell to mark. With this representation you would have 3^9 = 19,683 possible states and 9 possible actions. After learning the game, you would have a table telling you which cell to mark on each possible board.
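
As a sketch of that tabular idea (the names V, td0_update, and simulate_move are mine, and the update shown is the ordinary tabular TD(0) rule, not anything TD-Gammon-specific):

```
from collections import defaultdict

ALPHA = 0.1   # learning rate
GAMMA = 1.0   # no discounting needed within a single game

# V maps a 9-character board string such as "-o--x----" to an estimated value.
V = defaultdict(float)

def td0_update(state, reward, next_state, done):
    """Tabular TD(0): V(s_t) += alpha * (r_t + gamma * V(s_{t+1}) - V(s_t))."""
    target = reward if done else reward + GAMMA * V[next_state]
    V[state] += ALPHA * (target - V[state])

def greedy_action(state, legal_moves, simulate_move):
    """Pick the cell whose resulting board has the highest estimated value.
    simulate_move(state, cell) is a placeholder returning the new board string."""
    return max(legal_moves, key=lambda cell: V[simulate_move(state, cell)])
```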

The same kind of direct, tabular representation would not work for backgammon, because there are far too many possible states. What TD-Gammon does instead is approximate the value of a board position with a neural network. The weights of that network are what get learned, and the reward is always 0, except upon winning.

The tricky part here is that what the TD-Gammon algorithm learns is the state of the neural network used to evaluate board positions, not the state of the board, and there is only one such network. At t = 0 you have not played a single game and the network is in its initial state. Every time you have to make a move, you use the network to choose the best possible one, and then update the network's weights based on how its prediction changes from one position to the next (the win/loss reward only arrives at the end of the game). Before making the move the network has weights w_t; afterwards it has weights w_{t+1}. After playing hundreds of thousands of games, you end up with weights that allow the neural network to evaluate board positions quite accurately.
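
To tie this back to the equation you linked, here is a rough NumPy sketch of a semi-gradient TD(0) weight update for a tiny value network. The layer sizes, variable names, and 9-cell input encoding are illustrative assumptions, and real TD-Gammon uses TD(λ) with eligibility traces rather than the plain TD(0) step shown here. Note that there is a single set of weights throughout; w_t just means their values at step t, and that is the point where the gradient of the output is evaluated, which is why it is written with respect to w.

```
# Sketch of a TD(0) weight update for a small value network (NumPy only).
# TD-Gammon itself uses TD(lambda) with eligibility traces; this drops the
# traces (lambda = 0) to keep the gradient step easy to follow.

import numpy as np

rng = np.random.default_rng(0)

N_INPUTS, N_HIDDEN = 9, 20        # e.g. one input per tic-tac-toe cell
ALPHA = 0.05                      # learning rate

# One set of weights for the whole program: this is the w_t that gets updated.
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUTS))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.1, size=N_HIDDEN)
b2 = 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def value(x):
    """Estimated chance of winning from board encoding x (shape: N_INPUTS)."""
    h = np.tanh(W1 @ x + b1)
    return sigmoid(W2 @ h + b2), h

def td_update(x, x_next, reward, done):
    """Move the weights from w_t to w_{t+1} using the TD error."""
    global W1, b1, W2, b2
    v, h = value(x)
    # Reward is 0 until the game ends, so it only enters at the terminal step.
    target = reward if done else value(x_next)[0]
    delta = target - v                      # TD error

    # Gradient of the output with respect to the current weights w_t.
    dv_dz = v * (1.0 - v)                   # sigmoid derivative at the output
    grad_W2 = dv_dz * h
    grad_b2 = dv_dz
    dh = dv_dz * W2 * (1.0 - h ** 2)        # back through the tanh layer
    grad_W1 = np.outer(dh, x)
    grad_b1 = dh

    # w_{t+1} = w_t + alpha * delta * grad_w V(s_t)
    W2 += ALPHA * delta * grad_W2
    b2 += ALPHA * delta * grad_b2
    W1 += ALPHA * delta * grad_W1
    b1 += ALPHA * delta * grad_b1
```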

I hope this clears things up.