Why Non-myopic Bayesian Optimization is Promising and How Far Should We Look-ahead? A Study via Rollout

2019·Arxiv

Abstract

Abstract

Lookahead, also known as non-myopic, Bayesian optimization (BO) aims to find optimal sampling policies through solving a dynamic program (DP) that maximizes a long-term reward over a rolling horizon. Though promising, lookahead BO faces the risk of error propagation through its increased dependence on a possibly mis-specified model. In this work we focus on the rollout approximation for solving the intractable DP. We first prove the improving nature of rollout in tackling lookahead BO and provide a sufficient condition for the used heuristic to be rollout improving. We then provide both a theoretical and practical guideline to decide on the rolling horizon stagewise. This guideline is built on quantifying the negative effect of a mis-specified model. To illustrate our idea, we provide case studies on both single and multi-information source BO. Empirical results show the advantageous properties of our method over several myopic and non-myopic BO algorithms.

1 Introduction

Bayesian optimization is a popular technique to optimize an unknown and expensive-to-evaluate objective function through sequential sampling strategies. Traditionally BO has focused on myopic (also referred to as greedy) algorithms, where sampling points are decided based on a one-step lookahead utility function, oblivious to how this design will affect the future steps of the optimization and the remaining budget.

Recently, motivated by reinforcement learning, there

Proceedings of the 23International Conference on Artifi-cial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

have been attempts to extend greedy BO methods into multi-step lookahead algorithms that maximize a reward over a rolling horizon. Though it seems promising to look further into the future, this approach might sabotage performance due to accumulated errors and increased dependence on a possibly mis-specified model. This raises the question: is a practical implementation of non-myopic approaches indeed useful? Although we cannot give a universal answer, we can shed light on a specific class of non-myopia: rollout dynamic programming.

Rollout is a sub-optimal approximation algorithm to sequentially solve intractable dynamic programming problems. It utilizes problem-dependent heuristics to approximate the future reward using simulations over several future steps (i.e., the rolling horizon). Indeed, rollout has been successfully applied to the non-myopic BO scenario (Lam et al., 2016; Lam and Willcox, 2017). Yet, rollout still faces two challenges: theoretical justification/guarantees and error propagation as errors from a mis-specified model will accumulate as we look further into the future. These challenges raise the question whether long term planning in BO is necessary.

In this work, we first provide theoretical justification for rollout in BO settings. Specifically we show that under the class of sequentially improving heuristics, rollout is guaranteed to outperform its myopic counterpart. We then provide a guideline to carefully choose a rolling horizon at each stage of the discounted DP. Based on these facts, we argue that a short horizon is beneficial and also computationally economical. Therefore, using non-myopia is promising and deserves further research attention.

We organize the remaining paper as follows. In Sec. 2, we briefly review BO, DP and rollout. We then prove the performance guarantee of rollout in Sec. 3 and give a practical guideline on choosing the rolling horizon in Sec. 4. In Sec. 5, we provide case studies to evidence our theoretical argument. Detailed literature review can be found in Sec. 6. Some algorithmic details and multi-information source BO are included in the supplementary material.

2 Background

In this section, we provide the problem description and a brief review on the technical background needed for this paper.

2.1 Bayesian Optimization

Let be an objective function which is expensive to evaluate. We consider the optimization problem:

where x is a d-dimensional input/design vector and X is a compact set in . Given limited budget B, BO aims to search for the optimal by iteratively updating a surrogate model of f(x), where this surrogate is used to find the next design point. Typically, in BO, the surrogate model is a Gaussian process (GP), due to its Bayesian interpretation and uncertainty quan-tification capability (see Rasmussen (2003) for more information).

Without loss of generality, suppose we can sample N design points. Given the current data {1, . . . , N}, BO aims to determine the next informative sampling point by solving the auxiliary problem:

where is a acquisition/utility function that only involves evaluating the surrogate and not the expensive objective function f. Typically, evaluation of an acquisition function is relatively cheap. The rationale is to seek design points that produce maximum increment in the objective function. After Eq. (2) is solved, we sample at location and observe the output . The iterative algorithm proceeds by augmenting the current training data with a new observation to obtain . Popular choices of acquisition functions are entropy search (ES) (Hennig and Schuler, 2012), predictive entropy search (PES) (Hern´andez-Lobato et al., 2014) and expected improvement (EI) (Lam et al., 2016). All aforementioned functions exploit myopic strategies and ignore the future information.

2.2 Dynamic Programming

Lookahead BO can be directly viewed as an instance of DP. In such settings the non-myopic acquisition function quantifies rewards over future steps. Due to limited budge or sampling capacity, we consider a finite N-stage DP formulation. Denote by . At each stage k, define the state space as ) and denote by dataset the current state, where is the state in the state space . A policy is a sequence of rules (i.e., sampling actions) mapping the state space to the design space X, where is a policy space. We use to emphasize the rule under policy . Let .

Now denote by the reward function at stage k. The reward function ) quantifies the benefits of sampling at location given the current dataset . For example, one popular choice of the reward function is the expected improvement function (Frazier, 2018).

As there is no sampling action at the end-stage (i.e., k = N + 1), we define the end-stage reward as : . As a result, the discounted expected cumulative reward of a finite N-step horizon under policy given initial dataset can be expressed as

where [0, 1] is the discount factor. The discount factor plays an important role in this setting, as it controls the effect of error propagation. In the greedy In the policy space , we are interested in the optimal policy which maximizes Eq. (3). Specifically,

Based on the Bellman optimality equation, we can then formulate (3) and (4) as a recursive DP:

where ) is known as the reward-to-go function. Without loss of generality, we set Therefore, at stage k, the next sampling location is decided by maximizing )]. In the context of BO, we can naturally set the acquisition function to be

.

2.3 Rollout

The DP formulation in Sec. 2.2 is subject to a huge computational burden and curse of dimensionality due to the uncountable state and action space. Furthermore, the formulation assumes that data in the last step is available and computes the acquisition function in a backward manner, which is impractical in BO. In order to solve the intractable DP, an approximate dynamic programming (ADP) approach - rollout (Bertsekas, 1995) has been proposed. Rollout has recently enjoyed success across a variety of domains as it builds on several heuristic rules ˜(details later) (Lam et al., 2016) and is efficient for large-scale and finite-horizon DP problems. Instead of solving DP in a backward manner, rollout solves DP in a forward manner. Here we briefly describe the rollout algorithm. We first define a key component - the rolling horizon 1) and let ˜N = min{k + h, N}. At stage k, rollout first decides a sampling location ˜using a heuristic policy and collect a simulated ˜from the surrogate model. For example, the heuristic policy and the simulated output ˜can be the sampling action and the output generated from the expected improvement function. Afterward, we create a simulated dataset ˜. Based on this simulated dataset, one can use a similar aforementioned procedure to collect . Using the simulated dataset ˜˜, one can further quantify rewards . As a result, we select the optimal sampling location that maximizes the accumulated reward over a rolling horizon. Mathematically, we are optimizing the following reward-to-go functions:

where ˜is the heuristic rule at every iteration [ ˜N] = {1, . . . , ˜N} such that ˜. For example, if one uses the EI acquisition function, the heuristic rule is “ sampling at location ˜that provides the maximal improvement”. Here note that the heuristic rule only samples at the location that maximizes the current acquisition function and ignores the long-term reward. However, this does not indicates rollout is myopic. In fact, ˜is a simulated sampling location that will be used to create simulated datasets ˜˜. The final decision on the sampling location (without the tilde notation) is selected to maximize the accumulated reward over a rolling horizon. In essence, this feature makes rollout a non-myopic algorithm.

Eq. 5 and Eq. 6 have a key difference: the former one has a maximization operator. In Eq. 5, to compute the optimal sampling location at stage k, one needs to know the optimal at stage k + 1. This is apparently infeasible as we do not have any information about at stage k. Eq. 6, on the other hand, circumvents this situation. It removes the maximization operator so that the sampling action at stage k is independent of the future. Therefore, the intractable acquisition function ) from Sec. 2.2 can be approximated by an approximate acquisition function ˜). At the end stage, we define policy ˜such that (x), where is the updated mean function from data . For instance, if the surrogate model is a GP, then is the posterior mean of a GP (Rasmussen, 2003).

3 Rollout Performance Guarantees

Without loss of generality and for the sake of neatness, we omit the discount factor and assume Given a state s, an algorithm H(s) is a method to select a sequence of feasible rules and policy which generates states . Now, to estab- lish theoretical guarantees, we first provide the following definitions (Bertsekas et al., 1997; Goodson et al., 2017).

Definition 1. Consider a maximization problem. Algorithm H is said to be sequentially consistent if for every state , whenever H generates the state path () starting at state also generates the path () starting at state .

In the context of DP, let and let be a state on a path generated by policy using algorithm H(s). Denote this policy as . Consequently, sequential consistency can equivalently be defined as

Definition 2. Consider a maximization problem. Algorithm H is said to be sequentially consistent if and subsequent , we have

Definitions 1 implies that an algorithm is sequentially consistent if it produces the same subsequent states when started at any intermediate state of a path that it generates. Equivalently, by Definition 2, the algorithm will generate the same subsequent rules ().

Now, consider a probability space (Ω, F, P). Define ) as the sub -algebra generated by the state s, then we have the following definition:

Definition 3. Consider a maximization problem. Algorithm H is said to be sequentially improving if for every state , whenever H generates the path () starting at state , the following

property will hold

It directly follows that if H is sequentially consistent, then the equality will hold in (8). Therefore, a sequentially consistent algorithm is also sequentially improving. However, the converse is not true. Next, we will present our theorem about rollout improving.

Theorem 1. The sequentially improving algorithm H is also rollout improving. Formally, given the rollout policy , we have the following property

Proof. We will prove this theorem by mathematical induction. When , this statement is trivial. Now assume this statement holds for + 11. Then, when , define ) as the sub -algebra generated by state . Since each subsequent state is an augmented , we have F. By the law of total expectation, we have

By assumption, since the algorithm is sequentially im-

proving, we have

The last equality follows the definition of the rollout algorithm. The rest of the proof is completed by the induction hypothesis.

Theorem 1 shows that the rollout approach is guaranteed to perform better than its myopic counterpart under the same base heuristic rules. Intuitively, when rollout generates a path, it exploits the base heuristic to generate a collection of other paths and picks up the best one. In the next section, we will provide a guideline on choosing a sequence of feasible rolling horizons.

4 Deciding on the Rolling Horizon

One interesting question remains: how to decide the rolling horizon h? In most of the literature, h is chosen to be a fixed value within 2 and 5 (Lam and Willcox, 2017; Ulmer et al., 2018) in order to alleviate computational burden. Though those choices give very promising results, those decisions are very subjective. Fortunately, based on the rollout theory (Bertsekas et al., 1997; Bertsekas, 2005), we can provide a practical guideline to select a stagewise feasible h. The big picture is as follows: we quantitatively obtain the ben-efits of rollout given a modeling error and discount factor, we then compare this long-term discounted benefit with the reward from the greedy algorithm counterpart and decide a feasible rolling horizon accordingly. We provide a detailed argument below.

At each stage k, define a profit function related to the rolling horizon h such that

where ) is a non-negative function. The rolling profit function can be viewed as the total benefits incurred when choosing a rolling horizon h at stage k. For example, at stage (1) is the reward function using rolling horizon 1 and (2) is the accumulated reward function using h = 2. Although long horizons provide more future information, it is not guaranteed to be helpful. In practice we are running the risk of model mis-specification due to modeling the objective function using a GP and then using this surrogate to simulate scenarios over future steps. Therefore, a larger rolling horizon implies an increased dependence on a possibly erroneous model which might in turn cause adverse effects compared to myopic algorithms where errors accumulate only from a one-step lookahead. However, if we can arbitrarily quantify the error from mis-specified model, then we can utilize the rollout improving nature and accordingly decide on the feasible rolling horizon.

In order to quantify the aforementioned error, we de-fine an error function E(x) bounded by a constant ¯. The ) is a metric to quantify the negative effect from model mis-specification. In the next section, we will provide an error bound on GP prediction and use this error bound as an error.

4.1 Error Bound on the GP

The recent work of Wang et al. (2019) sheds light on the model mis-specification issue.

Corollary 1. (Wang et al., 2019) Assume a GP with zero mean and stationary convariance function. Then, under some regularity conditions, the interpolation error is (non-asymptotically)

with probability 1 , where is a function of and ) is the true output at input x and ) is a power function with mis-specified covariance function at observation x, K and u are some constants and is the variance parameter.

In practice, if one uses the Mat´ern kernel with smooth parameter v (Rasmussen, 2003), then the upper bound can be approximated by (Wang et al., 2019)

where minand X is the current dataset that contains all design points (Johnson et al., 1990). In this paper, we only focus on the GP with Mat´ern kernel as it is robust to model mis-specifications (Wang et al., 2019; Burt et al., 2019).

Given this result, at each stage k, we can define ˆsupˆ. One regularity condition in Corollary 1 is that the mis-specified kernel is no smoother than the true kernel. The mat´ern kernel is one of the perfect candidates to this requirement (Burt et al., 2019). In the next section we use this error function to find feasible h stagewise. We note that the proposed framework can be substituted with a different error function. For example, one can use the standard deviation obtained from the surrogate model. In this paper, we focus on GPs as they are the most commonly used surrogate in BO.

4.2 Deciding on the Rolling Horizon

In this section, we provide our main theorem on deciding h stagewise. Define a function

where ˜(2) (1) ) + ˜) is a modified rollout reward function to quantify both profit and error effects. Note that we do not consider ) since it is shared by both algorithms when the rolling horizon is 1. denotes the optimal reward from stage k to N, given the current state and an unknown rolling horizon h. Eq. (13) returns the maximum element between two values: the first one is the reward when we consider a greedy algorithm (i.e., h = 1) and the second one is the reward when we consider the rollout algorithm given a certain error function. Based on Eq. (13), we can obtain the following theorem.

Theorem 2. The set of feasible rolling horizons at stage k is defined as

Theorem 2 implies that for any h within this set the rollout is more beneficial than a greedy algorithm. In other words, the benefits gained from looking further ahead outweigh that of the error effects. However, calculating is hard to implement in practice. In the next theorem, we will provide an equivalent but more practical equation.

Theorem 3. The Rolling Horizon Theorem Given a constant ¯on the error function E(x) and the profit function defined in Eq. (12). The feasible rolling horizon at stage k is defined as

Proof. Based on Theorem 2 and Eq. (13), it is equivalent to consider

By definition, max(1). Since ˜, we have ˜(2)(1)) + ˜(2) . Therefore, Eq. (15) can be simplified as (2)

. The remaining part can be obtained by induction.

In practice, we can pick up the minimal j from the set . We can also set an upper bound on the rolling horizon. Denote by it ¯h. In the Theorem 3, if we could not find feasible h till , we stop searching and use h = 1 at the current stage.

5 Case Study

In this section we provide two case studies to evidence our theoretical arguments. We use the well-known knowledge gradient (KG, see Appendix) (Poloczek et al., 2017) as the base algorithm in our rollout algorithm. More specifically, we will use sampling actions generated by KG as heuristic rules. We then show that KG is both sequentially consistent and improving, and thus it is rollout improving as shown in Theorem 1. We then illustrate that, under Theorem 3 and through carefully choosing the rolling horizon, non-myopic BO has strong advantages over greedy BO. Our algorithm is tested for both single source and multi-information source BO (misoBO). In the misoBO setting, we sample from auxiliary information sources to make inference. Here we note that the details for misoBO are deferred to the appendix due to space limitation and similar conclusions to that of single source BO.

5.1 Setting

We use the same setting in Sec. 2.1. Specifically, when sampling from original function at input x, we observe an outcome y(x). We assume the observation y(x) is normally distributed with mean f(x) and variance ). For the purpose of robustness, we assume that the covariance belongs to some non-smooth parametric family. Specifically, we will use the Mat´ern kernel. Parameters are estimated using maximum likelihood estimation (MLE).

5.2 Algorithm

We utilize the non-greedy acquisition function de-fined in Sec. 2.2. This acquisition function considers far horizon planning and is given by the DP formulation. Specifically, the original acquisition function can be defined as ) + . This is solved by the rollout with KG as the base heuristic. We denote our algorithm as DPsingleBO. The general procedure for DP-singleBO is listed in Algorithm 2. We also extend this algorithm to the multi-information source scenario and denote it as DP-misoBO (see Appendix).

5.3 Guarantees

In order to apply Theorem 1, we need to show that the heuristic greedy KG is sequentially consistent and thus sequentially improving.

Corollary 2. The KG algorithm is sequentially consistent and sequentially improving.

Proof. Remember that state is the dataset . Assume KG algorithm starts at a state (i.e., current dataset ). At each iteration of KG, given a path () and is not the state at the end, the next state is obtained by solving the acquisition function of KG (see appendix) and augmenting with (). If is not the terminating state, the algorithm will then start with the path (). Otherwise, the algorithm will terminate with state and N = m + 1. Therefore, KG is sequentially consistent.

Let () be the path generated by the rollout starting from . Define ) as the sub -algebra generated by state s. Since KG is se-

quentially consistent, we have

Therefore, KG is sequentially improving and we complete our proof.

5.4 Results

5.4.1 Performance Comparison

In this section, we apply algorithms DP-singleBO and DP-misoBO to a variety of classical functions with a range of dimensions, support sets and information sources. We provide three information sources in this experiment: original objective function y(x), biased source one y(1, x) and biased source two y(2, x). Following the setting from Poloczek et al. (2017), we de-fine y(1, x) = y(x) + 2 sin(10+ 5) in the two dimensional space and y(1, x) = y(x) + 2 sin(10+ 5+ 3) in the three dimensional space. We de-fine y(2, x) = y(x) + ), where ) is simulated from GP with radial basis function (RBF) kernel with length-scale variance 5. The RBF kernel is defined as exp). See Table 1 for more information. For the Goldsteinprice and Bohachevsky functions, we provide two biased sources and run DP-misoBO algorithm. For the Branin-Hoo, Six-Hump and Griewant, we run DPsingleBO algorithm. These objective functions have two notable challenges: (1) six-hump and Goldsterinprice have several local maxima; (2) Griewant function has a large design space. We benchmark our algorithms with several state-of-the-art techniques.

Table 1: Functions used in the experiment. More information about each function can be found at the open source library http://www.sfu.ca/~ssurjano/ optimization.html.

Experimental Details To mitigate the negative effect of model mis-specification, we fit GPs with the mat´ern p + kernel and all hyperparameters are opti- mized by MLE. We set discount factor to be 0.9. The optimal rolling horizon h is calculated at each stage. The initial 9 sampling points are chosen by the fill distance design. For a fixed dimension d, we set an upper limit for sampling budget B and only allow around 10d evaluations of each algorithm. We set , cost c = 5d and . For each algorithm, we conduct 30 experiments with different initial points. In Table 2 we provide the testing results in terms of the mean and median of Gap, defined in Eq. (17).

Benchmark Models There is a limited literature on the non-greedy BO. We will benchmark our model with the state-of-the-art GLASSES algorithm with fixed horizon, a DP-based algorithm using M-EI with fixed rolling horizon (Lam et al., 2015), Markov chain Monte Carlo (MCMC) based maximum probability of improvement (MPI) (Snoek et al., 2012), MCMC based lower confidence bound (LCB) (Snoek et al., 2012) and the misoKG (Poloczek et al., 2017). Note that GLASSES, MPI and LCB cannot be applied to the miso setting. We refer to section 6 for more details on the benchmarked models.

Table 2: Mean and median Gap G over 30 experiments with different initial points. The best result for each function is bolded. “NA” indicates not applicable. The discount factor is set to be 0.9.

Performance The performance is measured in terms of Gap G, which is a common metric in many BO literature (Huang et al., 2006; Gonz´alez et al., 2016; Lam et al., 2016). Specifically,

where and are optimal values given the ini- tial and augmented data at stage N respectively and is the global maximum of the testing function. Table 2 shows the comparative results across differ-ent functions and algorithms. Furthermore, we collect the selected over an experiment and plot the distribution of those rolling horizons in Figure 1.

Based on Table 2 and Figure 1, we can obtain some important insights. First, the results indicate that our model clearly outperforms the state-of-the-art methods including non-myopic algorithms. The average and median Gaps of our algorithm are above 0.85, indicating that the estimations are improved 85% compared to the initial iteration. The key reason is that GLASSES and M-EI only consider fixed rolling horizon h, which is risky: the error propagation might eliminate the benefits of looking ahead. Indeed, choosing h stagewise allows us to carefully avoid the negative effect of model mis-specification. As shown in Table 3, when we choose fixed rolling horizon at each stage, the resulting Gap will be affected. When h = 4, 5, the non-greedy algorithm will even sabotage the performance. Here we note that we believe a dynamic rolling horizon can also improve the performance of GLASSES and M-EI. However, this requires further analysis and theoretical inquiries.

Second, non-myopic algorithms are capable of beating greedy algorithms. Interestingly, the feasible rolling horizon is usually not large (Figure 1). This result is encouraging as it implies that the computational burden does not need to increase significantly since a short horizon is most beneficial. Therefore, it is overpessimistic to discard non-myopia if one is afraid of error accumulation and computational complexity.

Lastly, the results indicate that the benefits of our method become increasingly significant for the high dimensional scenarios. This is intuitively understandable, due to ability of the non-greedy algorithm to ef-ficiently explore the horizon.

Figure 1: Distribution of Rolling Horizon

5.4.2 Discount Factor

We study the effect of different discount factors. Specifically, we choose from set {0.6, 0.7, 0.8, 0.9}. The discount factor plays a role in ceiling the value of the rolling horizon as shown in Theorem 3. An extreme case is when immediately (i.e., greedily). Based on Table 2 and 4, it seems that when , the performance is promising. This result is intuitive as a moderate discount factor encourages an algorithm to consider collecting future reward and is capable of generating improving results.

6 Literature Review

6.1 Nonmyopia

Few literature has focused on the non-myopic BO. Ginsbourger and Le Riche (2010) propose an expectation improvement (EI) criterion to derive sequen-

Table 3: Mean Gap G with respect to different fixed rolling horizon over 30 experiments with different initial points.

Table 4: Mean Gap G with respect to different discount factor over 30 experiments with different initial points.

tial sampling strategies using Monte-Carlo simulation. Later, some approximation algorithms have been proposed that provide theoretical guarantees when sampling spaces are finite (Marchant et al., 2014; Ling et al., 2016). Unfortunately such algorithms scale poorly with the number of rolling horizon considered. Later, Gonz´alez et al. (2016) provided the GLASSES algorithm that relieves the myopia assumption of BO and can efficiently tackle an uncountable sampling space. GLASSES utilizes the long-sight loss function in Osborne (2010) and then propose an ef-ficient optimization-marginalization scheme to solve that loss. Despite its strength, this approach assumes that the objective function is L-Lipschitz continuous. Besides the aforementioned methods, there exists some efficient multi-step look-ahead algorithms in the area of Bayesian feasibility determination and root-finding problems (Waeber et al., 2013; Cashore et al., 2016). Nevertheless, they are only applicable to a very spe-cific physical setting and cannot be easily generalized to a general framework. More Recently, Lam et al. (2016); Lam and Willcox (2017) proposed a look-ahead DP formulation using EI as a heuristic reward function. A direct extension to this work includes using the modified-EI (M-EI) (Groot et al., 2010; Lam et al., 2015) instead of EI to handle multi-information sources. However, a crucial drawback of the M-EI is that its selects sampling point and query sources separately. This might lead to reduced accuracy as joint optimality is not considered. Recently, Jian and Peter (2019) has proposed a practical two-step lookahead BO algorithm. This is one successful example that illustrates the benefits of looking sightly ahead.

6.2 Multi-information Source

We provide a short review on misoBO for completeness. Multi-information source optimization was thoroughly studied by Swersky et al. (2013). The authors argue that auxiliary tasks can aid in solving some expensive optimization problems. Swersky et al. (2013) utilize a multivariate Gaussian process GP (Seeger et al., 2005; Bonilla et al., 2008) to model uncertainties in the objective function and predictive entropy search to decide on the next sampling location. Very recently, Poloczek et al. (2017) improved the misoBO algorithm through utilizing a more flexible GP construction, using the linear model of coregionalization, and extending the KG algorithm to the setting with multiple information sources. They showed that the improved method (denoted as misoKG) can find sampling locations with higher value at reduced cost. Despite this seminal work, the misoKG does not consider far horizon planning since it uses a one-step look-ahead approach that only considers reducing regret at the next step. Besides misoBO, other closely related work belong to the problem of multi-fidelity optimization (McLeod et al., 2017; Kandasamy et al., 2016; Cutajar et al., 2019). These models have been mainly based on hierarchical model structures that restrict the information to be shared from low fidelity models. Also, they implement a myopic approach and fail to account for the future information such as remaining budget.

Conclusion

We provide a theoretical proof of the “improving” nature of the rollout DP algorithm and a practical guideline on choosing a sequence of rolling horizons. We argue that rollout with a well chosen rolling horizon is beneficial in the sense that the error propagation is not catastrophic and the profits from the rollout improving nature remain. Therefore, the rollout DP has great promise in BO theory and applications. One possible future work is to generalize our analysis and apply it to other non-myopic methods. We hope our work will help inspire continued exploration into the non-myopic algorithms.

Appendix

7 Formulation

In misoBO scenario, we have access to several sampling sources and we are interested in deciding both optimal sampling points and sampling sources.

7.1 Setting

We want to solve the unconstrained optimization problem ). Due to limited budget, sampling from the original source is expensive and incurs a cost c(x) : . Now suppose we have access to I possibly biased auxiliary sources indexed by I = {1, . . . , I}. Each source has a query cost . When sampling from source at point x, we observe a noisy and biased outcome y(i, x). We assume the observation y(i, x) is normally distributed with mean f(i, x) and variance (x). Denote by ) : the bias term and ) from each auxiliary source . We set )) and )). Therefore, f(i, x) is a GP with mean function ) and covariance function Σ(()). Specifi-cally, Σ(() + ), where . Here we note that a mean function (or a constant) can be added to model systematic discrepancy in the bias (Higdon et al., 2008).

Given data , we would like to determine the next sampling duplet () by solving the following optimization problem: (arg max). After observing the optimal sampling duplet, we augment the current training data with the new observation and obtain .

7.2 Dynamic Programming

Denote by . At each stage k, define the state space as ) and denote by dataset the current state, where is the potential state in the state space . A policy is a sequence of rules mapping the state space to the design space X and sources I. We use to emphasize the rule under pol- icy . Let ). Now denote by the reward function at stage k. Define the end-stage reward as . The discounted expected cumulative reward of a finite N-step horizon under policy given initial dataset

can be expressed as

In the policy space , we are interested in the optimal policy which maximizes Eq. (18). Specifically,

Based on the Bellman optimality equation, we can formulate (18) as a recursive DP:

7.3 Knowledge Gradient

The reward function at each stage k quantifies the gains of applying rule given state . To handle multi-information source BO efficiently, we will adopt a normalized KG as our expected stage-reward function (Ryzhov et al., 2012; Poloczek et al., 2017). Specifically,

The first part in the expected KG can be expressed as

where Z is a standard normal random variable and

such that Σis the posterior covariance function of f given current data . Since we are taking expectation with respect to Gaussian random variables, equations (22) and (23) are easy to compute and can be efficiently estimated by a Gauss-Hermite quadrature with n nodes. Under the single information source scenario, we simply let I = 1. We summarize our misoKG algorithm in Algorithm 1.

8 Algorithm

The algorithm for the multi-information source BO is lised in Algorithm 1.

9 Performance Guarantees

Under the multi-information source setting, the heuristic KG is also sequentially consistent and sequentially improving.

Corollary 3. The KG algorithm is sequentially consistent and sequentially improving.

Proof. Remember that state is the dataset . Assume KG algorithm starts at a state (i.e., current dataset ). At each iteration of KG, given a path () and is not the state at the end, the next state is obtained by solving the acquisition function of KG and augmenting with (). If is not the terminating state, the algorithm will start with the path (). Otherwise, the algorithm will terminate with state and N = m+1. Therefore, KG is sequentially consistent.

Let () be the path generated by rollout starting from . Define ) as the sub -algebra generated by state s. Since KG is sequentially

consistent, we have

where is the subsequent state of s. Therefore, the rollout is sequentially improving and we complete our proof.

Reference

D. P. Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 1995.

D. P. Bertsekas. Rollout algorithms for constrained dynamic programming. Lab. for Information and Decision Systems Report, 2646, 2005.

D. P. Bertsekas, J. N. Tsitsiklis, and C. Wu. Rollout algorithms for combinatorial optimization. Journal of Heuristics, 3(3):245–262, 1997.

E. V. Bonilla, K. M. Chai, and C. Williams. Multitask gaussian process prediction. In Advances in neural information processing systems, pages 153– 160, 2008.

D. Burt, C. E. Rasmussen, and M. Van Der Wilk. Rates of convergence for sparse variational gaussian process regression. In International Conference on Machine Learning, pages 862–871. PMLR, 2019.

J. M. Cashore, L. Kumarga, and P. I. Frazier. Multi-step bayesian optimization for onedimensional feasibility determination. arXiv preprint arXiv:1607.03195, 2016.

K. Cutajar, M. Pullin, A. Damianou, N. Lawrence, and J. Gonz´alez. Deep gaussian processes for multi-fidelity modeling. arXiv preprint arXiv:1903.07320, 2019.

P. I. Frazier. A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

D. Ginsbourger and R. Le Riche. Towards gaussian process-based optimization with finite time horizon. In mODa 9–Advances in Model-Oriented Design and Analysis, pages 89–96. Springer, 2010.

J. Gonz´alez, M. Osborne, and N. D. Lawrence. Glasses: Relieving the myopia of bayesian optimisation. 2016.

J. C. Goodson, B. W. Thomas, and J. W. Ohlmann. A rollout algorithm framework for heuristic solutions to finite-horizon stochastic dynamic programs. European Journal of Operational Research, 258(1):216– 229, 2017.

P. Groot, A. Birlutiu, and T. Heskes. Bayesian monte carlo for the global optimization of expensive functions. In ECAI, pages 249–254, 2010.

P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809–1837, 2012.

J. M. Hern´andez-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for effi-cient global optimization of black-box functions. In

Advances in neural information processing systems, pages 918–926, 2014.

D. Higdon, J. Gattiker, B. Williams, and M. Rightley. Computer model calibration using high-dimensional output. Journal of the American Statistical Association, 103(482):570–583, 2008.

D. Huang, T. T. Allen, W. I. Notz, and N. Zeng. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of global optimization, 34(3):441–466, 2006.

W. Jian and F. Peter. Practical two-step lookahead bayesian optimization. In Advances in neural information processing systems, 2019.

M. E. Johnson, L. M. Moore, and D. Ylvisaker. Minimax and maximin distance designs. Journal of statistical planning and inference, 26(2):131–148, 1990.

K. Kandasamy, G. Dasarathy, J. B. Oliva, J. Schneider, and B. P´oczos. Gaussian process bandit optimisation with multi-fidelity evaluations. In Advances in Neural Information Processing Systems, pages 992–1000, 2016.

R. Lam and K. Willcox. Lookahead bayesian optimization with inequality constraints. In Advances in Neural Information Processing Systems, pages 1890–1900, 2017.

R. Lam, D. L. Allaire, and K. E. Willcox. Multifidelity

R. Lam, K. Willcox, and D. H. Wolpert. Bayesian optimization with a finite budget: An approximate dynamic programming approach. In Advances in Neural Information Processing Systems, pages 883– 891, 2016.

C. K. Ling, K. H. Low, and P. Jaillet. Gaussian process planning with lipschitz continuous reward functions: Towards unifying bayesian optimization, active learning, and beyond. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

R. Marchant, F. Ramos, S. Sanner, et al. Sequential bayesian optimisation for spatial-temporal monitoring. In Uncertainty in Artificial Intelligence, pages 553–562, 2014.

M. McLeod, M. A. Osborne, and S. J. Roberts. Practical bayesian optimization for variable cost objectives. arXiv preprint arXiv:1703.04335, 2017.

M. A. Osborne. Bayesian Gaussian processes for sequential prediction, optimisation and quadrature. PhD thesis, Oxford University, UK, 2010.

M. Poloczek, J. Wang, and P. Frazier. Multiinformation source optimization. In Advances

in Neural Information Processing Systems, pages 4288–4298, 2017.

C. E. Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.

I. O. Ryzhov, W. B. Powell, and P. I. Frazier. The knowledge gradient algorithm for a general class of online learning problems. Operations Research, 60 (1):180–195, 2012.

M. Seeger, Y.-W. Teh, and M. Jordan. Semiparametric latent factor models. Technical report, 2005.

J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.

K. Swersky, J. Snoek, and R. P. Adams. Multi-task bayesian optimization. In Advances in neural information processing systems, pages 2004–2012, 2013.

M. W. Ulmer, J. C. Goodson, D. C. Mattfeld, and M. Hennig. Offline–online approximate dynamic programming for dynamic vehicle routing with stochastic requests. Transportation Science, 53 (1):185–202, 2018.

R. Waeber, P. I. Frazier, and S. G. Henderson. Bisection search with noisy responses. SIAM Journal on Control and Optimization, 51(3):2261–2279, 2013.

W. Wang, R. Tuo, and C. Jeff Wu. On prediction properties of kriging: Uniform error bounds and robustness. Journal of the American Statistical Association, pages 1–27, 2019.

designed for accessibility and to further open science