I am reading Silver et al. (2012), "Temporal-Difference Search in Computer Go", and I am trying to understand the update order in the eligibility-trace algorithm.
In Algorithms 1 and 2 of the paper, the weights are updated before the eligibility trace is updated. I wonder whether this order is correct (lines 11 and 12 of Algorithm 1, and lines 12 and 13 of Algorithm 2).
Consider the extreme case with lambda=0: the parameters are not updated with the initial state-action pair (since e is still 0 at that point). So I suspect the order should perhaps be the opposite.
Could someone clarify this point?
I find the paper very instructive for learning about reinforcement learning, so I would like to understand it in detail.
If there is a more suitable platform to ask this question, please kindly let me know as well.


It looks to me like you're correct: e should be updated before theta. That's also what should happen according to the math in the paper. See, for example, Equations (7) and (8), where e_t is first computed using phi(s_t), and only then is theta updated using delta V_t (which would be delta Q in the control case).

Note that what you wrote about the extreme case with lambda=0 is not entirely correct. The initial state-action pair will still be involved in an update (not in the first iteration, but it will be incorporated in e during the second iteration). However, it looks to me like the very first reward r will never be used in any update (because it only appears in the very first iteration, when e is still 0). Since this paper is about Go, I suspect it will not matter, though; unless they're doing something unconventional, they probably only use a non-zero reward for the terminal game state.
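
To make the corrected order concrete, here is a minimal sketch of one episode of linear SARSA(lambda) in Python, with the trace updated before the weights. This is only illustrative, not the paper's implementation: the env (with reset()/step()), the feature function phi(s, a), and the actions list are assumed interfaces, and Go-specific details from the paper (self-play, local shape features, etc.) are omitted.

```python
import numpy as np

def run_episode(env, phi, theta, actions, alpha=0.1, lam=0.8, gamma=1.0, eps=0.1):
    """One episode of linear SARSA(lambda) with accumulating traces.

    Sketch only: `env`, `phi(s, a)` (feature vector), and `actions`
    are assumed interfaces, not taken from the paper.
    """
    def epsilon_greedy(s):
        if np.random.rand() < eps:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: phi(s, a) @ theta)

    e = np.zeros_like(theta)          # eligibility trace, starts at 0
    s = env.reset()
    a = epsilon_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)             # assumed to return (state, reward, done)
        a_next = None if done else epsilon_greedy(s_next)

        # TD error: delta = r + gamma * Q(s', a') - Q(s, a)
        q_sa = phi(s, a) @ theta
        q_next = 0.0 if done else phi(s_next, a_next) @ theta
        delta = r + gamma * q_next - q_sa

        # Update the trace FIRST, so the current state-action pair
        # (including the very first one, and hence the first reward)
        # is credited in this step's weight update...
        e = gamma * lam * e + phi(s, a)
        # ...and only THEN update the weights using the updated trace.
        theta += alpha * delta * e

        s, a = s_next, a_next
    return theta
```

With this order, even for lambda=0 the very first transition already contributes to a weight update, which is the behaviour the equations in the paper describe.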