Minimizing Simple and Cumulative Regret in Monte-Carlo Tree Search [chapter]

Tom Pepels, Tristan Cazenave, Mark H. M. Winands, Marc Lanctot
2014 Communications in Computer and Information Science  
Regret minimization is important in both the Multi-Armed Bandit (MAB) problem and Monte-Carlo Tree Search (MCTS). Recently, simple regret, i.e., the regret of not recommending the best action, has been proposed as an alternative in MCTS to cumulative regret, i.e., the regret accumulated over time. Each type of regret is appropriate in a different context. Although the majority of MCTS research applies the UCT selection policy to minimize cumulative regret in the tree, this paper introduces a new MCTS variant, Hybrid MCTS (H-MCTS), which minimizes both types of regret in different parts of the tree. H-MCTS uses SHOT, a recursive version of Sequential Halving, to minimize simple regret near the root, and UCT when descending further down the tree. We discuss the motivation for this new search technique, and show the performance of H-MCTS in six distinct two-player games: Amazons, AtariGo, Ataxx, Breakthrough, NoGo, and Pentalath.

Recently, simple regret has been proposed as a new criterion for assessing the performance of both MAB [2, 6] and MCTS [7, 9, 18] algorithms. Simple regret is defined as the expected error between an algorithm's recommendation and the optimal decision. It is a natural quantity to optimize in the MCTS setting, because all simulations executed by MCTS serve the sole purpose of learning good moves, and the final move chosen after all simulations are performed, i.e., the recommendation, is the one with real consequence. Nonetheless, since the introduction of Monte-Carlo Tree Search (MCTS) [11] and its subsequent adoption by games researchers, UCT [11], or some variant thereof, has become the "default" selection policy (cf. [5]).

In this paper we present a new MCTS technique, named Hybrid MCTS (H-MCTS), that utilizes both UCT and Sequential Halving [10]. As such, the new technique uses both simple and cumulative regret minimizing policies to their best effect. We test H-MCTS in six distinct two-player games: Amazons, AtariGo, Ataxx, Breakthrough, NoGo, and Pentalath.

The paper is structured as follows. First, MCTS and UCT are introduced in Section 2. Section 3 explains the difference between cumulative and simple regret, and how it applies to MCTS. Next, Section 4 discusses Sequential Halving [10], a recently introduced simple regret minimizing technique for the MAB problem. Sequential Halving is used recursively in SHOT [7], which is described in detail in Section 5. Together, SHOT and UCT form the basis for the new hybrid MCTS technique discussed in Section 6. This is followed by the experiments in Section 7, and finally by the conclusion and an outline of future research in Section 8.
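
For concreteness, the two regret notions contrasted above can be stated in standard MAB notation (the notation here is ours, not necessarily the paper's): let \mu^\ast be the mean reward of the best arm, I_t the arm played at time t, and J_n the arm recommended after n plays.

    % Cumulative regret after n plays: loss accumulated while exploring.
    R_n = n\mu^\ast - \mathbb{E}\Big[\sum_{t=1}^{n} \mu_{I_t}\Big]

    % Simple regret: expected error of the final recommendation only.
    r_n = \mu^\ast - \mathbb{E}\big[\mu_{J_n}\big]

Cumulative regret penalizes every suboptimal pull, which is what UCT's UCB1-style selection (choosing the child maximizing \bar{X}_j + C\sqrt{\ln n / n_j}) targets; simple regret penalizes only the quality of the final recommendation, which is what Sequential Halving and SHOT target.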
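Since Sequential Halving [10] is central to SHOT and H-MCTS, the following is a minimal sketch of the flat-MAB version, assuming a fixed pull budget, rewards in [0, 1], and fresh samples each round; the function name pull and the tie-breaking details are illustrative assumptions, not taken from the paper.

    import math
    import random

    def sequential_halving(arms, budget, pull):
        # Sequential Halving sketch (after Karnin et al. [10]): repeatedly
        # spread an equal share of the budget over the surviving arms and
        # discard the worse-scoring half, until one arm remains.
        survivors = list(arms)
        rounds = max(1, math.ceil(math.log2(len(survivors))))
        for _ in range(rounds):
            # This round's budget share, at least one pull per arm.
            pulls = max(1, budget // (len(survivors) * rounds))
            means = {a: sum(pull(a) for _ in range(pulls)) / pulls
                     for a in survivors}
            # Keep the better half (rounding up keeps at least one arm).
            survivors.sort(key=lambda a: means[a], reverse=True)
            survivors = survivors[:max(1, math.ceil(len(survivors) / 2))]
            if len(survivors) == 1:
                break
        return survivors[0]

    # Hypothetical usage: eight Bernoulli arms with hidden means.
    probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    best = sequential_halving(range(8), budget=4000,
                              pull=lambda a: float(random.random() < probs[a]))

Note that this flat version selects among a fixed set of arms; SHOT applies the same halving scheme recursively at the nodes of a search tree rather than on a single flat set.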
doi:10.1007/978-3-319-14923-3_1