Monte-Carlo Planning: Theoretically Fast Convergence Meets Practical Efficiency

2013·arXiv

Abstract

1 INTRODUCTION

Markov decision processes (MDPs) are a standard model for planning under uncertainty [19]. An MDP is defined by a set of possible agent states S, a set of agent actions A, a stochastic transition function [0, 1], and a reward function . The current state of the agent is fully observable, and the objective of the agent is to act so to maximize its accumulated reward. In the finite horizon setting considered here, the reward is accumulated over some predefined number of steps H. The description of the MDP is assumed to be concise, and, depending on the problem domain and the representation language, it can be either declarative or generative (or mixed). While declarative models provide the agents with greater algorithmic flexibility, generative models are more expressive, and both types of models allow for simulated execution of all feasible action sequences, from any state of the MDP.

In online planning for MDPs, the agent focuses on its current state only, deliberates about the set of possible policies from that state onwards and, when interrupted, uses the outcome of that exploratory deliberation to choose what action to perform next. The quality of the action a, chosen for state s with H steps-to-go, is assessed in terms of the probability that a is sub-optimal, or in terms of the (closely related) measure of simple regret. The latter captures the performance loss that results from taking a and then following an optimal policy for the remaining 1 steps, instead of following from the beginning [5].

With a few recent exceptions developed for declarative MDPs [4, 16, 7], most algorithms for online MDP planning constitute variants of what is called Monte-Carlo tree search (MCTS) [23, 18, 15, 9, 8, 21, 25]. Most MCTS algorithms for online planning, such as tree search and UCT, aim at rapidly identifying a reasonably good action, but offer only polynomial-rate reduction of simple regret over the deliberation time. In contrast, a recently introduced MCTS algorithm BRUE guarantees exponential-rate reduction of simple regret over time, yet it does not make special efforts to home in on a reasonable alternative fast [11]. Of course, “good” is often the best one can hope for in large MDPs of interest under practically reasonable deliberation-time allowances. Therefore, as we recon-firm by an evaluation on benchmarks from the recent probabilistic planning competition, BRUE is often empirically inferior to its (guarantees-wise inferior) competitors.

Reflecting on the differences between the two types of algorithms, here we show that BRUE can be redesigned to perform extremely well also under early interruptions of planning, and this without compromising much neither the long-term empirical performance nor theoretical guarantees. We do that in two steps. First, we connect between the iterative tree expansion of the standard MCTS scheme and the “separation of concerns” principle that underlies BRUE. The resulting modification of , already substantially improves over BRUE in short-term effec-tiveness. Building upon , we then introduce a machinery of selective tree expansion that further pushes the boundaries of online MDP planning. Viewing MCTS as a message passing within the hierarchy of forecasters, this mechanism is based on (i) classifying the roles that different forecasters in the hierarchy should fulfill in order to improve the quality of the decision at the root, and on (ii) exploit this classification to adaptively decide how each forecaster should aim at fulfilling its role. As testified by our empirical evaluation, the resulting algorithm, , favorably and robustly competes with other MCTS algorithms under short planning times, while preserving both the attractive formal properties of BRUE, as well as the empirical strength of the latter under permissive deliberation-time allowances.

2 BACKGROUND

Henceforth, denotes the set of all valid policies for the MDP in question, denotes the actions applicable in state s, the operation of drawing a sample from a distribution D over set is denoted by ], U denotes uniform distribution, and for denotes the set {1, . . . , n}.

2.1 CANONICAL MCTS SCHEME

MCTS, a canonical scheme underlying various MCTS algorithms for online MDP planning, is depicted in Figure 1a. Starting with the current state performs an iterative construction of a treerooted at . At each iteration, MCTS rollouts a state-space sample from , which is then used to update T . First, each state/action pair (s, a) is associated with a counter n(s, a) and a value accumulator ), both initialized to 0. When a sample is rolled out, for all states ) and ) are updated on the basis of by the update-node procedure. Second, T can also be expanded with any part of ; The standard choice is to expand T with only the first state along that is new to T . In any case, once the sampling is interrupted, MCTS uses the information stored at the tree’s root to recommend an action to perform in .

Figure 1: (a) Monte-Carlo tree search template, and (b) the UCT specifics.

Numerous concrete instances of MCTS have been proposed, with -greedy [23] probably being the most widely known, and UCT [15] and its modifications [9, 24] being the most popular such instances these days [12, 22, 3, 2, 10, 14]. Concrete instances of MCTS vary mostly along the implementation of the rollout-policy sub-routine, that is, in their policies for directing the rollout within T . For instance, the specific rollout-policy of UCT is shown in Figure 1b. This policy is based on the deterministic decision rule UCB1 [1], originally proposed for optimal balance between exploration and exploitation for cumulative regret minimization in stochastic multi-armed bandit (MAB) problems [20]. In general, different instances of MCTS vary in their balance between exploration and exploitation. However, it has already been noticed that exploitation may considerably slow down the reduction of simple regret over time [6]. In-

Figure 2: Monte-Carlo tree search with “separation of concerns” (a), and the BRUE specifics (b).

deed, UCB1 (and thus UCT) achieves only polynomial-rate reduction of simple regret over time [6], and the number of samples after which the bounds of UCT on simple regret become meaningful might be as high as hyper-exponential in H [9]. In fact, no instance of the MCTS scheme (Figure 1a) suggested so far breaks the barrier of the worst-case polynomial-rate reduction of simple regret over time.

2.2 SEPARATION OF CONCERNS

If fast convergence to optimal choice is of interest, then Monte-Carlo planning should be as exploratory as possible [6]. However, what it means to be “as exploratory as possible” with MDPs is less straightforward than it is in MABs. In particular, recently it was observed that “forecasters” should be devoted to two, somewhat competing, exploratory objectives, namely identifying an optimal action ), and estimating the value of that action, because this information is needed by the predecessor(s) of s in T [11].

Following this observation, Feldman and Domshlak [11] introduced MCTS2e, a refinement of MCTS scheme that implements the principle of “separation of concerns,” whereby different parts of each sample are devoted to different exploration objectives. In MCTS2e (Figure 2a), rollouts are generated by a twophase process in which the actions are selected according to an exploratory policy until an (iteration-specific) switching point, and from that point on, the actions are selected according to an estimation policy. The sub-routines end-of-probe and update in MCTS2e are trivial, the update-node sub-routine extends this of MCTS with tree expansion, and, instead of rollout-policy of MCTS, specific instances of MCTS2e are achieved by instantiating three new sub-routines that determine the switching point for a rollout, and the action selection protocols for the two phases of the rollouts.

Feldman and Domshlak [11] show that a specific instance of MCTS2e, dubbed BRUE, achieves an exponential-rate reduction of simple regret over time, with the bounds on simple regret becoming meaningful after only exponential in number of samples. The specific MCTS2e sub-routines that define the BRUE algorithm are shown in Figure 2b. Similarly to UCT, each node/action pair (s, a) is associated with variables n(s, a) and ), but with the latter being initialized to instantiates MCTS2e by choosing actions uniformly at the exploration phase of the sample, choosing the best empirical actions at the estimation phase, and changing the switching point in a round-robin fashion over the entire horizon. Importantly, if the switching point of a rollout is , then only the state/action pair () is updated by the information collected by . That is, the information obtained by the estimation phase of is used only for improving the estimate at state , and is not pushed further up the sample. While that may appear wasteful and counterintuitive, this locality of update is required to satisfy the formal guarantees of BRUE on exponential-rate reduction of simple regret over time [11].

3 FAST OPTIMAL VS. FAST GOOD

A comparative evaluation on Sailing [18] and PGame [15] domains showed that BRUE is continually improving towards an optimal solution, rather quickly obtaining results better than UCT [11]. However, that evaluation also showed that UCT sometimes

Figure 3: IPPC-2011 scores for the different MCTS algorithms under the (unknown to the algorithms) average deliberation time allowance of 5 seconds per step.

manages to identify reasonably good actions rather quickly, while BRUE is still “warming up”. Focusing on tight deliberation deadlines, we conducted a wider empirical evaluation on MDP benchmarks from the last International Probabilistic Planning Competition (IPPC-2011). These benchmarks appear ideal for our purpose of evaluating algorithms under tight time constraints: Most IPPC-2011 domains induce very large branching factors, and thus allow only a very shallow sampling of the underlying search tree in reasonable time.

We used five IPPC-2011 domains, Game-of-Life, SysAdmin, Traffic, Crossing, and Navigation, with five randomly picked problem instances from each domain. Instances of Game-of-Life, SysAdmin, and Traffic were ran with planning horizon of 20 steps, whereas instances of (goal-driven) Crossing, and Navigation were ran with horizon of 40 steps. Each algorithm was allowed a (rather arbitrary chosen) deliberation time that started with 10 seconds for the first step, and decreased linearly to 1 second at the last step. In addition to UCT and BRUE, the comparison includes a trivial baseline of random action selection (Random), as well as MAB-Uniform, a simple algorithm that chooses actions everywhere uniformly at random, estimating ) for each ) by the value of taking a and then choosing actions uniformly at random for the remaining 1 steps. Obviously, the estimates of MAB-Uniform are typically erroneous, and so a priori, its only positive property appears to be its computational efficiency.

The results are presented in Figure 3; the names of the problem instances on the x-axis are these from the IPPC-2011 repository, and y-axis captures the IPPC-2011 scoring scheme: The algorithms are scored relatively to each other on each problem instance, with the relative scores being then averaged over 300 runs on that specific problem instance. The relative score of a particular algorithm in a specific run is the difference between the total reward achieved by that algorithm and the worst total reward among all the algorithms, divided by the difference between such best and worst total rewards.

According to Figure 3, all the examined MCTS algorithms, including the seemingly naive MAB-Uniform, performed much better than Random, with SysAdmin and Traffic being the most prominent examples for that. In other words, even under severely limited deliberation time allowance, the value of deliberation in the benchmarks in use was substantial. Likewise, the experiment reconfirmed that UCT is typically more effective than BRUE under short deliberation times. This suggests that the “fast optimal” scheme of BRUE lacks some ingredients that make its MCTS competitors “fast good”. Having said that, note that different problems favored different approaches, with MAB-Uniform being surprisingly superior on many of the examined benchmarks. Hence, the quest for a clear recipe for “fast good” remains open, especially if we do not want to neglect striving for optimality, and even more so, if we want that recipe to be robust on a wide palette of MDPs. This is precisely the quest we consider in what comes next.

4 TWO TYPES OF FORECASTERS

Consider the state/steps-to-go pairs (s, h) as a hierarchy of forecasters, all acting on behalf of the root forecaster () that aims at minimizing its own simple regret in a stochastic MAB induced by the applicable actions ). In the setup of online planning, there is a conceptual difference between the exploration objective of the root forecaster and this of all other forecasters in the hierarchy. To see that, suppose there is an oracle that can provide each forecaster (s, h) either with the identity of the optimal action ) but without revealing its value )), or with the value )) but without revealing the identity of ). For the root forecaster (), the first type of information is all he needs, while the second type of information buys him very little, if anything. In contrast, even if the oracle provides all the forecasters () with (only) the identities of the respective optimal actions ), then the root forecaster () in some sense remains as clueless as it was before, and needs to explore the state space in order to obtain at least some ordinal information about the expected value of the alternative choices ). However, if the oracle provides a non-root forecaster (s, h) (only) with the best Q-value among its alternative choices A(s), then (s, h) can stop working since no further exploration of the sub-hierarchy rooted in (s, h) is needed.

In sum, what matters to the root forecaster is only what to execute, while all other forecaster care only about the value they can provide to their ancestors in the hierarchy, and not about how this value can actually be acquired. Of course, the reader may question this classification by arguing that these two objectives are just two sides of the same coin: estimating the value of optimal action assumes aiming at identifying an optimal action and vice versa. To some extent, that is true, but only to some extent. For instance, the very realization that this coin has two sides, and that these two sides are somewhat competing, is precisely what motivates the “separation of concerns” principle behind the MCTS2e scheme. Turns out that this clas-sification of objectives suggests further insights into the dynamics of MCTS algorithms.

In all MCTS algorithms for online MDP planning, each iteration corresponds to examining a chain of forecasters within the overall hierarchy under (), with the difference between the algorithm boiling down to two decisions:

(II) how to estimate )) for each forecaster (s, h) in the hierarchy.

At first view, choosing the right strategy for (I) seems to be the key to rapid homing in on “good” decisions. The details of various MCTS algorithms suggest that their design was indeed primarily guided by choices for (I), with choices for (II) being implied by the former. Here, however, we suggest that decoupling these two decisions is important, and that the key to the quest of our interest actually lies in decision (II).

A closer look at different Monte-Carlo planning algorithms for MDPs reveals an interesting generalizing perspective on the way they all approach decision (II). Let (s) be the value of (s, h) under policy , ()) be the value of (s, h) under the optimal policy, and let ((s) denote em- pirical estimates of these two quantities, respectively. In all MCTS algorithms, at each point of time, the entire hierarchy of forecasters can be seen as consisting of two types of forecasters.

forecaster (s, h) (possibly schematically) estimate (s) by an estimate of (s), that is, of the expected total reward of a policy sampled from uniformly at random.

forecaster (s, h) distinguishes between its choices A(s), and estimates (s) by an estimate of max) + where the estimate of ) is based on the in- formation provided by to s.

Consider the way in which the specific MCTS algorithms approach decision (II) in terms of this Tpartition of the forecasters. In both UCT and T , while all other state/steps-to-go pairs correspond to . Note that these T-forecasters are very much not virtual. For instance, in UCT they are queried by the evaluate sub-routine, and in BRUE they are queried, possibly in interleaving with T-forecasters, by both exploration-policy and estimation-policy sub-routines.

At first view, T-forecasters appear to be strangely lazy and potentially very misleading, while T-forecasters seem to be doing the right thing. However, it is not all that simple. First, while each T-forecaster samples a single random variable, each T-forecaster (s, h) has to sample |A(s)| random variables. Thus, T-forecasters converge to quality estimates of quantities of their interest much faster than their Tcounterparts. Second, while T-forecasters try to estimate the right thing, their success totally depends on the quality of estimates of ) they re- ceive from their successors. Hence, it is not clear that forecasters of type Tare always more effective.

We return to this issue in more detail later on. For now, note only that both UCT and BRUE can be seen as continuously reconsidering the typing of the forecasters. Specifically, in both UCT and BRUE, (at most) a single forecaster is “converted” from Tto Tat every iteration: in UCT it is the -forecaster found along the rollout, and in BRUE, it is the T-forecaster that happens to lie at the rollout’s switching point . This way, the set of T-forecasters in UCT grows incrementally as a single community connected to the root forecaster (). In contrast, T-forecasters in BRUE evolve in H independent sets, each distributed over the respective depth level of the forecast hierarchy according to the transition distribution induced by the uniform action selection at the preceding levels.

This specific difference between UCT and BRUE is directly related to their relative efficiency under differ-ent orders of deliberation time allowance. Populating T-forecasters at all levels of the hierarchy is generally necessary to guarantee fast convergence to optimal choice at the root. However, the marginal value of T-forecasters at different levels vary with the deliberation time allowance: Information gathered by T-forecasters at deep levels takes time to be propagated to the root, making their near-term influence on the choices at () smaller than this of the T-forecasters closer to the root.

In that respect, a modification of BRUE that suggests itself almost immediately is as simple as it gets: Instead of converting the T-forecaster at the switching point , we can resort to converting the shallowest T-forecaster on the exploratory part of the rollout, that is, up to the level . By offering both exponential-rate reduction of the simple regret at the root, as well as incremental conversion of T-forecasters as a connected set around (), the resulting algorithm, , substantially improves over BRUE in short-term effectiveness. (The specific empirical results for are shown later in the paper. ) However, this simple modification of BRUE is not our final destination, and next we show that this simple bridge between MCTS and MCTS2e opens a much wider window of opportunity.

5 SELECTIVE TREE EXPANSION

Similarly to UCT and BRUE, each iteration of either finds no candidate for type conversion, or unconditionally converts a concrete single T-forecaster (s, h) to type T. However, suppose we know that (s) equals max) +

Since direct Monte-Carlo estimation of the quantity on the left-hand side is substantially easier than this of the right-hand side, converting (s, h) to type Tis clearly not a good idea. In fact, both (s, h) and all of its exclusive descendants in the hierarchy would better remain T-forecasters for the entire deliberation process, no matter how long it is. Of course, this equality rarely holds, and, more importantly, we have no prior knowledge about the size of the gap between the quality of the best policy under (s, h) and the expected quality of the randomly picked policy. However, this extreme example still hints on the promise of selective type conversion, and below we examine the prospects of this direction.

The variance of a Monte-Carlo estimator ) of the value of action a at state s stems from two sources. The first source of variance comes from following different policies (aka action selections) along different rollouts. The other source of variance comes from the stochastic nature of the action outcomes. That is, if r is the reward obtained by following policy for h steps starting from state s, then

At one extreme, we have all policies yielding the same expected reward, and thus all the variance comes from the action outcomes. In that case, distinguishing between the policies under (s, h) is not only useless, but also computationally harmful. Thus both (s, h) and its descendants should be left as type T, that is, not added to T . At the other extreme, we have all actions being deterministic, but different policies yield very different reward. In that case, it may be valuable to convert (s, h) to T, increasing the resolution at which the policies under (s, h) are examined. Unsurprisingly, in between these two extremes, the “value of conversion” is less straightforward. In the absence of any information about the stopping time, online planning algorithms should strive to do well under the assumption that termination point is near, while ensuring continuous improvement as more time is allowed. And as long as the precise mixture of these two desiderata remains vague, so remains the precise formulation of the value of conversion.

Having said that, as we do understand the high-level factors that affect the value of conversion, we can try estimating and combining these factors so to reflect the purported value of conversion. Here we propose and evaluate a simple and intuitive rule: The candi- variance of the expected reward over different policies under (s, h) exceeds the average variance of the policies, that is, iff

which is equivalent to

where r is the reward obtained by following policy for h steps starting from state s, and ) is the number of samples that induce the estimate (s).

The quantity on the left indicates the distance between the average quality of the policies and the quality of the optimal one, that is

The quantity on the right indicates the distance between the estimator and the mean of the policies value, where the division by ) captures the effect of averaging on the variance. As the number of samples from each policy grows, estimators (s) approaches their means (s), and thus the gap captured by Eq. 4 becomes the dominant factor, in which case converting (s, h) to Tis estimated as valuable.

Based on this decision rule, we suggest a new instance of (standing for BRUE with incremental and selective type conversion). The respective MCTS2e sub-routines of are depicted in Figure 4.

• Similarly to BRUE, the actions are selected uniformly at random at the exploration phase of the rollout, and the empirically best actions are selected at the estimation phase.

• Similarly to UCT and , the tree is expanded in an incremental fashion, maintaining the set of Tforecasters connected to the root.

• Unlike UCT and , after a forecaster (s, h) is added to the tree, it goes through an “evaluation period”, and remains of type Tuntil it passes that evaluation. Hence, the leaves of T consist of some T-forecasters, but also of all Tcandidates for T.

The evaluation of the “conditional candidates” (s, h) is captured by the convert procedure depicted in Figure 5. This procedure estimates the two sources of variance at (s, h) (lines 1-4, explained below), and only when the variance of the policies’ values exceeds the average variance of the policies value in the sense of Eq. 2 (the condition in line 5 fails), (s, h) is converted to a T-forecaster. At the actual conversion (lines 6-9), the information gathered at the evaluation period is used to initialize the standard node variables ) and n(s, a).

Figure 4: MCTS2e sub-routines for . For the convert procedure called by end-of-probe, see Figure 5.

Technically, candidate evaluation is performed as follows. For each Tcandidate (s, h), the algorithm maintains a set of policies Π, as well as a subset of “active” policies Π, size of which is bounded by a fixed parameter . The set Πis used by the sub- routine evaluate that selects a policy from Πfor evaluation uniformly at random. In case allows expanding the active subset Π, evaluate uses a newly generated policy (e.g., by selecting actions uniformly at all states that can be reached by following the actions at preceding states). The new policy is then added to Π(and thus to Π).

After a policy is selected by evaluate, it is exe-

Figure 5: Implementation of the decision rule of for type conversion.

cuted once starting at (s, h). The resulting cumulative reward r is then used to update statistics about the particular policy , including the empirical mean ) (which is just a more convenient for us naming for the estimator (s)), the empirical variance

Var(), and the counter ). The reward r is also used later on, in the recursive rollback of probe (Figure 2). When the variance of decreases below a predefined threshold , it is removed from the active subset Πto be replaced by a new policy. This mechanism ensures that all policies will eventually be sampled, which is essential to guarantee convergence of our estimates of the two sources of variance.

The sub-routine end-of-probe bares similarity to this of UCT, but it is extended with enforcing T-candidates to remain leaves as long as the respective convert attempts come out unsuccessful. In convert, the statistics about policies are used to estimate the mean of the policies’ variance EV, the variance in the policies value VE, the overall mean value of the policies EE, and m, the total number of samples from all the policies in Π. If the evaluated T-candidate (s, h) meats the conversion condition, convert returns true, and the newborn T-forecaster (s, h) is no longer forced to remain a leaf. Just before that, the standard node variables ) and n(s, a) are initialized with the information gathered around the empirically best policy that starts with the respective action a. Importantly, since the candidate evaluation criterion is not blocking, all candidates eventually convert to Tand act as in BRUE. This provides with BRUE’s quality convergence rate in long term.

At last, borrows its incremental tree expansion mechanism from : When a leaf is encountered before the switching point , no node is updated, and a global flag retract is set to true, signal-

Table 1: Aggregated IPPC-2011 scores from Figure 6; boldfacing indicates top performance in the respective domain (and overall, in the last row).

ing switch-function to set the switching point to 0 in the subsequent iteration. This way, the switching point is selected systematically as in BRUE, but now ranging between the root and a leaf rather than on the entire horizon.

This finalizes the description of . To examine its relative effectiveness, we conducted a comparative evaluation under the same experimental setup as described in Section 3. We evaluated five algorithms: , and . The results are presented in Figure 3 and summarized in Table 1; all scores in Table 1 are within confidence bounds of 01. For the sake of readability, here we excluded Random from the presentation. Overall, exhibited a robustly good performance across the domains, finishing top performer on most problem instances. is also consistently better than , indicating that selective type conversion plays an important role in its performance. We also notice that, in most cases, MAB-Uniform was not very far off from the leader, falling short only in few instances of Navigation. Since MAB-Uniform can be seen as with an ultra-conservative, unsatisfi-able condition for type conversion, this suggests that further introspection into the specific decision rule of , as well as into the impact of the parameters in its implementation, should be valuable.

Finally, we examine the simple regret reduction of under varying time budgets, using the experimental settings for Sailing domain from [11]. In the bottom part of Figure 6, we plot the simple regret of the actions recommended by MAB-Uniform, UCT, BRUE, and under different planning time windows, averaged over 300 instances of 10x10 and 20x20 grid sizes. The results reveal that not only does significantly outperforms UCT uniformly over time, right from the beginning of deliberation when BRUE is still lagging behind, but also that its simple regret reduction rate is rather comparable to BRUE’s in the longer term.

Figure 6: Performance of the MCTS algorithms with and without selective tree expansion. Top: Per-instance IPPC-2011 scores under the experimental setup as in Section 3. Bottom: Absolute performance of the algorithms in terms of average simple regret on the Sailing domain problems with 1010 and 2020 grid maps.

6 SUMMARY

Considering online planning for generative MDPs, we have investigated and combined the high-level principles that underly different computational schemes for this problem, and showed that their individual strengths can be put together at work. We then rationalized a principle of selective tree expansion that aims at automatically adapting Monte-Carlo exploration to the specifics of the MDP in hand, and suggested a concrete implementation of this principle within MonteCarlo tree search methods. The resulting algorithm, , favorably competes with other MCTS algorithms under short planning times, while preserving the attractive convergence properties of the (not so effective under short planning windows) algorithm BRUE [11], as well as the empirical strength of the latter under permissive planning windows.

In previous works, some forms of selective tree expansion has been shown extremely effective in forwardsearch planning for declarative MDPs [7, 4]. In that context, our work can be seen as the first selective tree expansion framework for Monte-Carlo planning, which is applicable in generative MDPs as well. Likewise, our work joins some other recent techniques for enhancing Monte-Carlo planning with adaptivity to the given problem, such as hindsight optimization for declarative MDPs [17, 26], and metalevel decision procedures for sample routing for both declarative and generative MDPs [13]. At the high level, these techniques appear complementary to selective tree expansion, and thus studying their interplay and cumulative value is a promising venue for future research.

Acknowledgements This work was partially supported by the EOARD grant FA8655-12-1-2096, and the ISF grant 1045/12.

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finitetime analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[2] R. Balla and A. Fern. UCT for tactical assault planning in real-time strategy games. In IJCAI, pages 40–45, 2009.

[3] R. Bjarnason, A. Fern, and P. Tadepalli. Lower bounding Klondike Solitaire with Monte-Carlo planning. In ICAPS, 2009.

[4] B. Bonet and H. Geffner. Action selection for MDPs: Anytime AOvs. UCT. In AAAI, 2012.

[5] S. Bubeck and R. Munos. Open loop optimistic planning. In COLT, pages 477–489, 2010.

[6] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theor. Comput. Sci., 412(19):1832–1852, 2011.

[7] L. Busoniu and R. Munos. Optimistic planning for Markov decision processes. In AISTATS, number 22 in JMLR (Proceedings Track), pages 182– 189, 2012.

[8] T. Cazenave. Nested Monte-Carlo search. In IJCAI, pages 456–461, 2009.

[9] P-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI), pages 67–74, Vancouver, BC, Canada, 2007.

[10] P. Eyerich, T. Keller, and M. Helmert. Highquality policies for the Canadian Traveler’s problem. In AAAI, 2010.

[11] Z. Feldman and C. Domshlak. Simple regret optimization in online planning for markov decision processes. CoRR, arXiv:1206.3382v2 [cs.AI], 2012.

[12] S. Gelly and D. Silver. Monte-Carlo tree search and rapid action value estimation in computer Go. AIJ, 175(11):1856–1875, 2011.

[13] N. Hay, S. E. Shimony, D. Tolpin, and S. Russell. Selecting computations: Theory and applications. In UAI, 2012.

[14] T. Keller and P. Eyerich. Probabilistic planning based on UCT. In ICAPS, 2012.

[15] L. Kocsis and C. Szepesv´ari. Bandit based MonteCarlo planning. In ECML, pages 282–293, 2006.

[16] A. Kolobov, Mausam, and D. Weld. LRTDP vs. UCT for online probabilistic planning. In AAAI, 2012.

[17] A. Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In UAI, 2000.

[18] L. P´eret and F. Garcia. On-line search for solving Markov decision processes via heuristic sampling. In ECAI, pages 530–534, 2004.

[19] M. Puterman. Markov Decision Processes. Wiley, 1994.

[20] H. Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527535, 1952.

[21] C. D. Rosin. Nested rollout policy adaptation for Monte Carlo tree search. In IJCAI, pages 649– 654, 2011.

[22] N. Sturtevant. An analysis of UCT in multi-player games. In CCG, page 3749, 2008.

[23] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[24] D. Tolpin and S. E. Shimony. Doing better than UCT: Rational Monte Carlo sampling in trees. , 2011.

[25] D. Tolpin and S. E. Shimony. MCTS based on simple regret. In AAAI, 2012.

[26] S. W. Yoon, A. Fern, R. Givan, and S. Kambhampati. Probabilistic planning via determinization in hindsight. In AAAI, pages 1010–1016, 2008.

Designed for Accessibility and to further Open Science