How does MCTS work with 'precise lines'


So I am familiar with more basic tree search algorithms like game search with minimax, but I've been trying to learn more about the Monte Carlo Tree Search (MCTS) algorithm, and was wondering how it deals with 'precise lines.'

In the context of chess, you might be in a position where you have 30 losing moves but 1 winning line. How would the MCTS algorithm, and more specifically the UCB1 function, deal with this? The way I understand UCB1 is that it essentially averages over its child nodes, so the UCB1 value of a line of chess where you have 30 losing moves and one winning one should be deceptively low?

I'm still learning about MCTS but I've always had this question and was hoping someone could explain how MCTS still converges to minimax even if a UCB1 value might be very low.

Any knowledge would be appreciated! Thanks


There are 2 best solutions below


> The way I understand UCB1 is that it essentially averages over its child nodes, so the UCB1 value of a line of chess where you have 30 losing moves and one winning one should be deceptively low?

From the UCT formula w_i/n_i + c*sqrt(ln(N)/n_i) we can see that the exploration term is proportional to the inverse square root of the child's visit count n_i. What this means is that the child node with the best win rate will be heavily favored and will therefore receive far more visits. So the parent's value estimate, which is a visit-weighted average over its children, will be weighted heavily toward the win rate of the best child node.
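To make the formula concrete, here is a minimal Python sketch of UCB1-based child selection (my own illustration; the `Node` class and function names are hypothetical, not from any particular library):

```python
import math

class Node:
    """Minimal MCTS node; only the statistics UCB1 needs."""
    def __init__(self):
        self.wins = 0.0     # w_i: total reward backed up through this node
        self.visits = 0     # n_i: number of times this node was visited
        self.children = []

def ucb1(child, parent_visits, c=math.sqrt(2)):
    """UCB1 score = exploitation term (w_i/n_i) + exploration term."""
    if child.visits == 0:
        return math.inf     # unvisited children are tried first
    exploit = child.wins / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_child(parent):
    """Selection phase: descend to the child with the highest UCB1 score."""
    return max(parent.children, key=lambda ch: ucb1(ch, parent.visits))
```

Note that it is the exploration term, not the average, that guarantees the single winning move keeps getting retried until its high win rate is discovered.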

This effect will propagate back up the tree, leading to the best line having the most visits and an accurate win rate at each node. In this way MCTS converges to the minimax result as the number of simulations increases.
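The visit-weighting effect can be seen in a toy bandit experiment (again my own illustration, with made-up win probabilities): give a node 30 'losing' children and 1 'winning' child, run UCB1, and look at where the visits go:

```python
import math
import random

random.seed(0)

# Hypothetical position: 30 losing moves (win prob 0.05), 1 winning move (0.95).
win_prob = [0.05] * 30 + [0.95]
wins = [0.0] * len(win_prob)
visits = [0] * len(win_prob)

def ucb1_score(i, t):
    if visits[i] == 0:
        return math.inf
    return wins[i] / visits[i] + math.sqrt(2 * math.log(t) / visits[i])

for t in range(1, 100_001):
    i = max(range(len(win_prob)), key=lambda j: ucb1_score(j, t))
    wins[i] += random.random() < win_prob[i]   # simulate the move's outcome
    visits[i] += 1

print(visits[-1] / sum(visits))   # ~0.99: the winning move gets nearly all visits
print(sum(wins) / sum(visits))    # ~0.94: the parent's average tracks the best child
```

The naive unweighted mean over the 31 children would be about 0.08, but because almost every visit goes to the winning child, the parent's average converges toward 0.95 instead.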

For a more theoretical discussion, see the main result of Bandit Based Monte-Carlo Planning (Kocsis & Szepesvári, 2006):

> Theorem 6 Consider a finite-horizon MDP with rewards scaled to lie in the [0, 1] interval. Let the horizon of the MDP be D, and the number of actions per state be K. Consider algorithm UCT such that the bias terms of UCB1 are multiplied by D. Then the bias of the estimated expected payoff, X̄_n, is O(log(n)/n). Further, the failure probability at the root converges to zero at a polynomial rate as the number of episodes grows to infinity.
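In symbols (my paraphrase of the bias claim; here X̄_n is the payoff estimate at the root after n episodes and μ* the optimal expected payoff):

```latex
\bigl|\, \mathbb{E}[\bar{X}_n] - \mu^* \,\bigr| \;=\; O\!\left( \frac{\log n}{n} \right)
```

So the bias of the root estimate vanishes as the number of episodes grows, which is the formal version of the convergence-to-minimax claim above.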

0
On

Imran's answer is correct in that, from a theoretical point of view, the UCB1 strategy typically used in the Selection phase of MCTS should eventually be able to handle the kinds of situations you describe, and that MCTS (assuming we use something like UCB1 for the Selection phase) will eventually converge to minimax evaluations.

However, "eventually" here means "after an infinite number of MCTS iterations". We need an infinite amount of processing time because only the Selection phase of MCTS can adequately handle the types of situations you describe (the Playout phase can't), and the Selection phase is only actually used in a slowly-growing part of the tree around the root node. So, if the situations you describe are "located" relatively close to the root node, then we can expect that strategies like UCB1 can adequately handle them. If they are very deep / far away from the root, so deep that we don't manage to grow the search tree that far in the processing time we have... then MCTS indeed does not tend to handle these situations well.
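To illustrate the difference between the two phases (a sketch with made-up payoffs, not real chess evaluation): a random playout through a trap node averages over the moves, while minimax, and eventually UCB1 selection, takes the max:

```python
import random

random.seed(0)

# Hypothetical "trap" node: 31 moves, of which 30 lose (payoff 0) and 1 wins (payoff 1).
outcomes = [0] * 30 + [1]

# Playout phase: a uniformly random rollout AVERAGES over the moves, so it
# evaluates the node near 1/31, no matter how many rollouts we run.
rollouts = 10_000
playout_estimate = sum(random.choice(outcomes) for _ in range(rollouts)) / rollouts
print(playout_estimate)   # ~0.03

# Minimax value: the player to move takes the MAX, so the true value is 1.
# UCB1 only starts behaving like this max once the tree has grown deep enough
# that this node's children sit inside the tree (Selection phase).
print(max(outcomes))      # 1
```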

Note that a similar thing can be said for minimax-based approaches; if they don't manage to search deep enough, they can also result in poor evaluations. The story tends to be much more binary in the case of minimax-like algorithms though; either they manage to search sufficiently deep for good evaluations, or they don't. In the case of MCTS, it will always evaluate these types of situations poorly initially, and might gradually improve as the search tree gradually grows.

In practice, minimax/alpha-beta and related algorithms were believed to outperform MCTS-based methods for about a full decade in games with many "trap" situations like the ones you describe, which includes chess-like games. During the same period, MCTS was already much more promising in games like Go. Only recently did a combination of MCTS, deep reinforcement learning, and ridiculous amounts of hardware (the AlphaZero approach) beat minimax-based approaches in chess-like games.