Parallel Predictive Entropy Search for Batch Global Optimization of Expensive Objective Functions

2015·Arxiv

Abstract

Abstract

We develop parallel predictive entropy search (PPES), a novel algorithm for Bayesian optimization of expensive black-box objective functions. At each iteration, PPES aims to select a batch of points which will maximize the information gain about the global maximizer of the objective. Well known strategies exist for suggesting a single evaluation point based on previous observations, while far fewer are known for selecting batches of points to evaluate in parallel. The few batch selection schemes that have been studied all resort to greedy methods to compute an optimal batch. To the best of our knowledge, PPES is the first non-greedy batch Bayesian optimization strategy. We demonstrate the benefit of this approach in optimization performance on both synthetic and real world applications, including problems in machine learning, rocket science and robotics.

1 Introduction

Finding the global maximizer of a non-concave objective function based on sequential, noisy observations is a fundamental problem in various real world domains e.g. engineering design [1], finance [2] and algorithm optimization [3]. We are interesed in objective functions which are unknown but may be evaluated pointwise at some expense, be it computational, economical or other. The challenge is to find the maximizer of the expensive objective function in as few sequential queries as possible, in order to minimize the total expense.

A Bayesian approach to this problem would probabilistically model the unknown objective function, f. Based on posterior belief about f given evaluations of the the objective function, you can decide where to evaluate f next in order to maximize a chosen utility function. Bayesian optimization [4] has been successfully applied in a range of difficult, expensive global optimization tasks including optimizing a robot controller to maximize gait speed [5] and discovering a chemical derivative of a particular molecule which best treats a particular disease [6].

Two key choices need to be made when implementing a Bayesian optimization algorithm: (i) a model choice for f and (ii) a strategy for deciding where to evaluate f next. A common approach for modeling f is to use a Gaussian process prior [7], as it is highly flexible and amenable to analytic calculations. However, other models have shown to be useful in some Bayesian optimization tasks e.g. Student-t process priors [8] and deep neural networks [9]. Most research in the Bayesian optimization literature considers the problem of deciding how to choose a single location where f should be evaluated next. However, it is often possible to probe several points in parallel. For example, you may possess 2 identical robots on which you can test different gait parameters in parallel. Or your computer may have multiple cores on which you can run algorithms in parallel with different hyperparameter settings.

Whilst there are many established strategies to select a single point to probe next e.g. expected improvement, probability of improvement and upper confidence bound [10], there are few well known strategies for selecting batches of points. To the best of our knowledge, every batch selection strategy proposed in the literature involves a greedy algorithm, which chooses individual points until the batch is filled. Greedy choice making can be severely detrimental, for example, a greedy approach to the travelling salesman problem could potentially lead to the uniquely worst global solution [11]. In this work, our key contribution is to provide what we believe is the first non-greedy algorithm to choose a batch of points to probe next in the task of parallel global optimization.

Our approach is to choose a set of points which in expectation, maximally reduces our uncertainty about the location of the maximizer of the objective function. The algorithm we develop, parallel predictive entropy search, extends the methods of [12, 13] to multiple point batch selection. In Section 2, we formalize the problem and discuss previous approaches before developing parallel predictive entropy search in Section 3. Finally, we demonstrate the benefit of our non-greedy strategy on synthetic as well as real-world objective functions in Section 4.

2 Problem Statement and Background

Our aim is to maximize an objective function , which is unknown but can be (noisily) evaluated pointwise at multiple locations in parallel. In this work, we assume X is a compact subset of . At each decision, we must select a set of Q points , where the objective function would next be evaluated in parallel. Each evaluation leads to a scalar observation , where we assume i.i.d. We wish to minimize a future regret, , where is an optimal decision (assumed to exist) and is our guess of where the maximizer of f is after evaluating T batches of input points. It is highly intractable to make decisions T steps ahead in the setting described, therefore it is common to consider the regret of the very next decision. In this work, we shall assume f is a draw from a Gaussian process with constant mean and differentiable kernel function .

Most Bayesian optimization research focuses on choosing a single point to query at each decision i.e. Q = 1. A popular strategy in this setting is to choose the point with highest expected improvement over the current best evaluation, i.e. the maximizer of , where D is the set of ob-

servations, is the best evaluation point so far, , and and are the standard Gaussian p.d.f. and c.d.f.

Aside from being an intuitive approach, a key advantage of using the expected improvement strategy is in the fact that it is computable analytically and is infinitely differentiable, making the problem of finding amenable to a plethora of gradient based optimization methods. Unfortunately, the corresponding strategy for selecting Q > 1 points to evaluate in parallel does not lead to an analytic expression. [14] considered an approach which sequentially used the EI criterion to greedily choose a batch of points to query next, which [3] formalized and utilized by defining

A similar but different approach called simulated matching (SM) was introduced by [15]. Let be a baseline policy which chooses a single point to evaluate next (e.g. EI). SM aims to select a batch of size Q, which includes a point ‘close to’ the best point which would have chosen when applied sequentially Q times, with high probability. Formally, SM aims to maximize

where is the set of Q points which policy would query if employed sequentially. A greedy k-medoids based algorithm is proposed to approximately maximize the objective, which the authors justify by the submodularity of the objective function.

The upper confidence bound (UCB) strategy [16] is another method used by practitioners to decide where to evaluate an objective function next. The UCB approach is to maximize , where is a domain-specific time-varying positive parameter which trades off exploration and exploitation. In order to extend this approach to the parallel setting, [17] noted that the predictive variance of a Gaussian process depends only on where observations are made, and not the observations themselves. Therefore, they suggested the GP-BUCB method, which greedily populates the set by maximizing a UCB type equation Q times sequentially, updating at each step, whilst maintaining the same for each batch. Finally, a variant of the GP-UCB was proposed by [18]. The first point of the set is chosen by optimizing the UCB objective. Thereafter, a ‘relevant region’ which contains the maximizer of f with high probability is defined. Points are greedily chosen from this region to maximize the information gain about f, measured by expected reduction in entropy, until . This method was named Gaussian process upper confidence bound with pure exploration (GP-UCB-PE).

Each approach discussed resorts to a greedy batch selection process. To the best of our knowledge, no batch Bayesian optimization method to date has avoided a greedy algorithm. We avoid a greedy batch selection approach with PPES, which we develop in the next section.

3 Parallel Predictive Entropy Search

Our approach is to maximize information [19] about the location of the global maximizer , which we measure in terms of the negative differential entropy of . Analogous to [13], PPES aims to choose the set of Q points, , which maximizes

where is the differential entropy of its argument and the expectation above is taken with respect to the posterior joint predictive distribution of given the previous evaluations, D, and the set . Evaluating (1) exactly is typically infeasible. The prohibitive aspects are that would have to be evaluated for many different combinations of , and the entropy computations are not analytically tractable in themselves. Significant approximations need to be made to (1) before it becomes practically useful [12]. A convenient equivalent formulation of the quantity in (1) can be written as the mutual information between and given D [20]. By symmetry of the mutual information, we can rewrite as

where is the joint posterior predictive distibution for given the ob- served data, D and the location of the global maximizer of f. The key advantage of the formulation in (2), is that the objective is based on entropies of predictive distributions of the observations, which are much more easily approximated than the entropies of distributions on .

In fact, the first term of (2) can be computed analytically. Suppose is multivariate Gaussian with covariance K, then . We develop an approach to approximate the expectation of the predictive entropy in (2), using an expectation propagation based method which we discuss in the following section.

3.1 Approximating the Predictive Entropy

Assuming a sample of , we discuss our approach to approximating in (2) for a set of query points . Note that we can write

where is the posterior distribution of the objective function at the locations , given previous evaluations D, and that is the global maximizer of f. Recall that is Gaussian for each q. Our approach will be to derive a Gaussian approximation to , which would lead to an analytic approximation to the integral in (3).

The posterior predictive distribution of the Gaussian process, , is multivariate Gaussian distributed. However, by further conditioning on the location , the global maximizer of f, we impose the condition that for any . Imposing this constraint for all is extremely difficult and makes the computation of highly intractable. We instead impose the following two conditions (i) for each , and (ii) , where is the largest observed noisy objective function value and . Constraint (i) is equivalent to imposing that is larger than objective function values at current query locations, whilst condition (ii) makes larger than previous objective function evaluations, accounting for noise. Denoting the two conditions C, and the variables and , where , we incorporate the conditions as follows

where I(.) is an indicator function. The integral in (4) can be approximated using expectation propagation [21]. The Gaussian process predictive is . We approximate the integrand of (4) with , where each and are positive, and for is a vector of length Q + 1 with entry entry 1, and remaining entries 0, whilst . The approximation approximates the Gaussian CDF, , and each indicator function, I(.), with a univariate, scaled Gaussian PDF. The site parameters, , are learned using a fast EP algorithm, for which details are given in the Appendix, where we show that , where

and hence . Since multivariate Gaussians are consistent under marginalization, a convenient corollary is that , where is the vector containing the first Q elements of , and is the matrix containing the first Q rows and columns of . Since sums of independent Gaussians are also Gaussian distributed, we see that . The final convenient attribute of our Gaussian approximation, is that the differential entropy of a multivariate Gaussian can be computed analytically, such that

3.2 Sampling from the Posterior over the Global Maximizer

So far, we have considered how to approximate , given the global maximizer, . We in fact would like the expected value of this quantity over the posterior distribution of the global maximizer, . Literally, , the posterior probability that is the global maximizer of f. Computing the distribution is intractable, but it is possible to approximately sample from it and compute a Monte Carlo based approximation of the desired expectation. We consider two approaches to sampling from the posterior of the global maximizer: (i) a maximum a posteriori (MAP) method, and (ii) a random feaure approach.

MAP sample from . The MAP of is its posterior mode, given by . We may approximate the expected value of the predictive entropy by replacing the posterior distribution of with a single point estimate at . There are two key advantages to using the MAP estimate in this way. Firstly, it is simple to compute , as it is the global maximizer of the posterior mean of f given the observations D. Secondly, choosing to use assists the EP algorithm developed in Section 3.1 to converge as desired. This is because the condition for is easy to enforce when , the global maximizer of the poserior mean of f. When is sampled such that the posterior mean at is significantly suboptimal, the EP approximation may be poor. Whilst using the MAP estimate approximation is convenient, it is after all a point estimate and fails to characterize the full posterior distribution. We therefore consider a method to draw samples from using random features.

Random Feature Samples from . A naive approach to sampling from would be to sample , and choosing . Unfortunately, this would require sampling g over an uncountably infinite space, which is infeasible. A slightly less naive method would be to sequentially construct g, whilst optimizing it, instead of evaluating it everywhere in X. However, this approach would have cost where m is the number of function evaluations of g necessary to find its optimum. We propose as in [13], to sample and optimize an analytic approximation to g.

By Bochner’s theorem [22], a stationary kernel function, k, has a Fourier dual s(w), which is equal to the spectral density of k. Setting , a normalized density, we can write

where . Let denote an m-dimensional feature mapping where W and b consist of m stacked samples from p(w, b), then the kernel k can be approximated by the inner product of these features, [23]. The linear model g(x) = where is an approximate sample from p(f|D), where y is a vector of objective function evaluations, and . In fact, is a true sample from p(f|D) [24].

The generative process above suggests the following approach to approximately sampling from : (i) sample random features and corresponding posterior weights using the process above, (ii) construct , and (iii) finally compute using gradient based methods.

3.3 Computing and Optimizing the PPES Approximation

Let denote the set of kernel parameters and the observation noise variance, . Our posterior belief about is summarized by the posterior distribution , where is our prior belief about and is the GP marginal likelihood given the parameters . For a fully Bayesian treatment of , we must marginalize with respect to . The expectation with respect to the posterior distribution of is approximated with Monte Carlo samples. A similar approach is taken in [3, 13]. Combining the EP based method to approximate the predictive entropy with either of the two methods discussed in the previous section to approximately sample from , we can construct an approximation to (2), defined by

where is constructed using the sample of M from is constructed as in Section 3.1, assuming the global maximizer is . The PPES approximation is simple and amenable to gradient based optimization. Our goal is to choose which maximizes in (7). Since our kernel function is differentiable, we may consider taking the derivative of with respect to , the component of ,

Computing is simple directly from the definition of the chosen kernel function. is a function of and , and we know how to compute , and that each is a constant vector. Hence our only concern is how the EP site parameters, , vary with . Rather remarkably, we may invoke a result from Section 2.1 of [25], which says that converged site parameters, , have 0 derivative with respect to parameters of . There is a key distinction between explicit dependencies (where actually depends on K) and implicit dependencies where a site parameter, , might depend implicitly on K. A similar approach is taken in [26], and discussed in [7]. We therefore compute

On first inspection, it may seem computationally too expensive to compute derivatives with respect to each q and d. However, note that we may compute and store the matrices , and once, and thatis symmetric with exactly one non-zero row and non-zero column, which can be exploited for fast matrix multiplication and trace computations.

Figure 1: Assessing the quality of our approximations to the parallel predictive entropy search strategy. (a) Synthetic objective function (blue line) defined on [0, 1], with noisy observations (black squares). (b) Ground truth defined on , obtained by rejection sampling. (c) Our approximation using expectation propagation. Dark regions correspond to pairs with high utility, whilst faint regions correspond to pairs with low utility.

4 Empirical Study

In this section, we study the performance of PPES in comparison to aforementioned methods. We model f as a Gaussian process with constant mean and covariance kernel k. Observations of the objective function are considered to be independently drawn from . In our experiments, we choose to use a squared-exponential kernel of the form . Therefore the set of model hyperparameters is , a broad Gaussian hyperprior is placed on and uninformative Gamma priors are used for the other hyperparameters.

It is worth investigating how well (7) is able to approximate (2). In order to test the approximation in a manner amenable to visualization, we generate a sample f from a Gaussian process prior on X = [0, 1], with and , and consider batches of size Q = 2. We set M = 200. A rejection sampling based approach is used to compute the ground truth , defined on . We first discretize , and sample in (2) by evaluating samples from p(f|D) on the discrete points and choosing the input with highest function value. Given , we compute using rejection sampling. Samples from p(f|D) are evaluted on discrete points in and rejected if the highest function value occurs not at . We add independent Gaussian noise with variance to the non rejected samples from the previous step and approximate using kernel density estimation [27].

Figure 1 includes illustrations of (a) the objective function to be maximized, f, with 5 noisy observations, (b) the ground truth obtained using the rejection sampling method and finally (c) using the EP method we develop in the previous section. The black squares on the axes of Figures 1(b) and 1(c) represent the locations in X = [0, 1] where f has been noisily sampled, and the darker the shade, the larger the function value. The lightly shaded horizontal and vertical lines in these figures along the points The figures representing and appear to be symmetric, as is expected, since the set is not an ordered set, since all points in the set are probed in parallel i.e. . The surface of is similar to that of . In paticular, the approximation often appeared to be an annealed version of the ground truth , in the sense that peaks were more pronounced, and non-peak areas were flatter. Since we are interested in , our key concern is that the peaks of occur at the same input locations as . This appears to be the case in our experiment, suggesting that the is a good approximation for .

We now test the performance of PPES in the task of finding the optimum of various objective functions. For each experiment, we compare PPES (M = 200) to EI-MCMC (with 100 MCMC samples), simulated matching with a UCB baseline policy, GP-BUCB and GP-UCB-PE. We use the random features method to sample from , rejecting samples which lead to failed EP runs. An experiment of an objective function, f, consists of sampling 5 input points uniformly at random and running each algorithm starting with these samples and their corresponding (noisy) function values. We measure performance after t batch evaluations using immediate regret, , where is the known optimizer of f and is the recommendation of an algorithm after t batch

Figure 2: Median of the immediate regret of the PPES and 4 other algorithms over 100 experiments on benchmark synthetic objective functions, using batches of size Q = 3.

evaluations. We perform 100 experiments for each objective function, and report the median of the immediate regret obtained for each algorithm. The confidence bands represent one standard deviation obtained from bootstrapping. The empirical distribution of the immediate regret is heavy tailed, making the median more representative of where most data points lie than the mean.

Our first set of experiments is on a set of synthetic benchmark objective functions including BraninHoo [28], a mixture of cosines [29], a Shekel function with 10 modes [30] (each defined on ) and the Hartmann-6 function [28] (defined on ). We choose batches of size Q = 3 at each decision time. The plots in Figure 2 illustrate the median immediate regrets found for each algorithm. The results suggest that the PPES algorithm performs close to best if not the best for each problem considered. EI-MCMC does significantly better on the Hartmann function, which is a relatively smooth function with very few modes, where greedy search appears beneficial. Entropy-based strategies are more exploratory in higher dimensions. Nevertheless, PPES does significantly better than GP-UCB-PE on 3 of the 4 problems, suggesting that our non-greedy batch selection procedure enhances performance versus a greedy entropy based policy.

We now consider maximization of real world objective functions. The first, boston, returns the negative of the prediction error of a neural network trained on a random train/text split of the Boston Housing dataset [31]. The weight-decay parameter and number of training iterations for the neural network are the parameters to be optimized over. The next function, hydrogen, returns the amount of hydrogen produced by particular bacteria as a function of pH and nitrogen levels of a growth medium [32]. Thirdly we consider a function, rocket, which runs a simulation of a rocket [33] being launched from the Earth’s surface and returns the time taken for the rocket to land on the Earth’s surface. The variables to be optimized over are the launch height from the surface, the mass of fuel to use and the angle of launch with respect to the Earth’s surface. If the rocket does not return, the function returns 0. Finally we consider a function, robot, which returns the walking speed of a bipedal robot [34]. The function’s input parameters, which live in , are the robot’s controller. We add Gaussian noise with to the noiseless function. Note that all of the functions we consider are not available analytically. boston trains a neural network and returns test error, whilst rocket and robot run physical simulations involving differential equations before returning a desired quantity. Since the hydrogen dataset is available only for discrete points, we define hydrogen to return the predictive mean of a Gaussian process trained on the dataset.

Figure 3 show the median values of immediate regret by each method over 200 random initializations. We consider batches of size Q = 2 and Q = 4. We find that PPES consistently outperforms competing methods on the functions considered. The greediness and nonrequirement of MCMC sampling of the SM-UCB, GP-BUCB and GP-UCB-PE algorithms make them amenable to large batch experiments, for example, [17] consider optimization in with batches of size 10. However, these three algorithms all perform poorly when selecting batches of smaller size. The performance on the hydrogen function illustrates an interesting phenemona; whilst the immediate regret of PPES is mediocre initially, it drops rapidly as more batches are evaluated.

This behaviour is likely due to the non-greediness of the approach we have taken. EI-MCMC makes good initial progress, but then fails to explore the input space as well as PPES is able to. Recall that after each batch evaluation, an algorithm is required to output , its best estimate for the maximizer

Figure 3: Median of the immediate regret of the PPES and 4 other algorithms over 100 experiments on real world objective functions. Figures in the top row use batches of size Q = 2, whilst figues on the bottom row use batches of size Q = 4.

of the objective function. We observed that whilst competing algorithms tended to evaluate points which had high objective function values compared to PPES, yet when it came to recommending , PPES tended to do a better job. Our belief is that this occured exactly because the PPES objective aims to maximize information gain rather than objective function value improvement.

The rocket function has a strong discontinuity making if difficult to maximize. If the fuel mass, launch height and/or angle are too high, the rocket would not return to the Earth’s surface, resulting in a 0 function value. It can be argued that a stationary kernel Gaussian process is a poor model for this function, yet it is worth investigating the performance of a GP based models since a practitioner may not know whether or not their black-box function is smooth apriori. PPES seemed to handle this function best and had fewer samples which resulted in 0 function value than each of the competing methods and made fewer recommendations which led to a 0 function value. The relative increase in PPES performance from increasing batch size from Q = 2 to Q = 4 is small for the robot function compared to the other functions considered. We believe this is a consequence of using a slightly naive optimization procedure to save computation time. Our optimization procedure first computes at 1000 points selected uniformly at random, and performs gradient ascent from the best point. Since is defined on , this method may miss a global optimum. Other methods all select their batches greedily, and hence only need to optimize in . However, this should easily be avoided by using a more exhaustive gradient based optimizer.

5 Conclusions

We have developed parallel predictive entropy search, an information theoretic approach to batch Bayesian optimization. Our method is greedy in the sense that it aims to maximize the one-step information gain about the location of , but it is not greedy in how it selects a set of points to evaluate next. Previous methods are doubly greedy, in that they look one step ahead, and also select a batch of points greedily. Competing methods are prone to under exploring, which hurts their perfomance on multi-modal, noisy objective functions, as we demonstrate in our experiments.

References

[1] G. Wang and S. Shan. Review of Metamodeling Techniques in Support of Engineering Design Optimization. Journal of Mechanical Design, 129(4):370–380, 2007.

[2] W. Ziemba & R. Vickson. Stochastic Optimization Models in Finance. World Scientific Singapore, 2006.

[3] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learn- ing Algorithms. NIPS, 2012.

[4] J. Mockus. Bayesian Approach to Global Optimization: Theory and Applications. Kluwer, 1989.

[5] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Automatic Gait Optimization with Gaussian Process Regression. IJCAI, pages 944–949, 2007.

[6] D. M. Negoescu, P. I. Frazier, and W. B. Powell. The Knowledge-Gradient Algorithm for Sequencing Experiments in Drug Discovery. INFORMS Journal on Computing, 23(3):346– 363, 2011.

[7] Carl Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[8] A. Shah, A. G. Wilson, and Z. Ghahramani. Student-t Processes as Alternatives to Gaussian Processes. AISTATS, 2014.

[9] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, Mr Prabat, and R. P. Adams. Scalable Bayesian Optimization Using Deep Neural Networks. ICML, 2015.

[10] E. Brochu, M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Applications to Active User Modeling and Hierarchical Reinforcement Learning. Technical Report TR-2009-23, University of British Columbia, 2009.

[11] G. Gutin, A. Yeo, and A. Zverovich. Traveling salesman should not be greedy:domination analysis of greedy-type heuristics for the TSP. Discrete Applied Mathematics, 117:81–86, 2002.

[12] P. Hennig and C. J. Schuler. Entropy Search for Information-Efficient Global Optimization. JMLR, 2012.

[13] J. M. Hern´andez-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. NIPS, 2014.

[14] D. Ginsbourger, J. Janusevskis, and R. Le Riche. Dealing with Asynchronicity in Parallel Gaussian Process Based Optimization. 2011.

[15] J. Azimi, A. Fern, and X. Z. Fern. Batch Bayesian Optimization via Simulation Matching. NIPS, 2010.

[16] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML, 2010.

[17] T. Desautels, A. Krause, and J. Burdick. Parallelizing Exploration-Exploitation Tradeoffs with Gaussian Process Bandit Optimization. ICML, 2012.

[18] E. Contal, D. Buffoni, D. Robicquet, and N. Vayatis. Parallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration. In Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer Berlin Heidelberg, 2013.

[19] D. J. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590–604, 1992.

[20] N. Houlsby, J. M. Hern´andez-Lobato, F. Huszar, and Z. Ghahramani. Collaborative Gaussian Processes for Preference Learning. NIPS, 2012.

[21] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Masachusetts Institute of Technology, 2001.

[22] S. Bochner. Lectures on Fourier Integrals. Princeton University Press, 1959.

[23] A. Rahimi and B. Recht. Random Features for Large-Scale Kernel Machines. NIPS, 2007.

[24] R. M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.

[25] M. Seeger. Expectation Propagation for Exponential Families. Technical Report, U.C. Berkeley, 2008.

[26] J. P. Cunningham, P. Hennig, and S. Lacoste-Julien. Gaussian Probabilities and Expectation Propagation. arXiv, 2013. http://arxiv.org/abs/1111.6832.

[27] I. Ahmad and P. E. Lin. A Nonparametric Estimation of the Entropy for Absolutely Continuous Distributions. IEEE Trans. on Information Theory, 22(3):372–375, 1976.

[28] D. Lizotte. Practical Bayesian Optimization. PhD thesis, University of Alberta, 2008.

[29] B. S. Anderson, A. W. Moore, and D. Cohn. A Nonparametric Approach to Noisy and Costly Optimization. ICML, 2000.

[30] J. Shekel. Test Functions for Multimodal Search Techniques. Information Science and Systems, 1971.

[31] K. Bache and M. Lichman. UCI Machine Learning Repository, 2013.

[32] E. H. Burrows, W. K. Wong, X. Fern, F.W.R. Chaplen, and R.L. Ely. Optimization of ph and nitrogen for enhanced hydrogen production by synechocystis sp. pcc 6803 via statistical and machine learning methods. Biotechnology Progress, 25(4):1009–1017, 2009.

[33] J. E. Hasbun. In Classical Mechanics with MATLAB Applications. Jones & Bartlett Learning, 2008.

[34] E. Westervelt and J. Grizzle. Feedback Control of Dynamic Bipedal Robot Locomotion. Control and Automation Series. CRC PressINC, 2007.

[35] T. Minka. EP: A Quick Reference. Technical Report, 2008.

Appendix

Recall that and , where and is the global maximizer of f. By being a Gaussian process predictive distribution, we know that follows a multivariate Gaussian distribution of the form .

We impose two conditions, (i) that is larger than f(x) for each x in the query set and (ii) that is larger than previous observations, accounting for Gaussian noise. We denote the conditions C. Our goal is to make a Gaussian approximation to

An elegant approach to making such a Gaussian approximation, is by using expectation propagation. We approximate the term involving and each I(.) with a univariate scaled Gaussian p.d.f. such that our unnormalized approximation to is

where each and is positive, and for is a vector of length Q + 1 with entry entry 1, and remaining entries 0, whilst . We have approximated each indicator function and Gaussian c.d.f. with a scaled Gaussian p.d.f. The site parameters, , are to be optimized such that the Kullback-Leibler divergence of from is minimized.

Since products of Gaussian p.d.f.s lead to Gaussian p.d.f.s, , where

We now describe the steps required to update the site parameters. We closely follow the derivations in [26]. We first compute the cavity distributions,

and compute their Gaussian parameters. Since we are dividing a Gaussian p.d.f. by another Gaussian p.d.f. we have simple parameter updates given by

The next step of EP is the projection step and requires moment matching with , where is the true factor be- ing approximated. We use derivatives of the logarithm of the zeroth moment [35] to compute the

parameters

where for and

update the site parameters to achieve the moments computed above by setting

Finally we update the parameters and as in equations (12) and (13), and repeat the proces until convergence.

designed for accessibility and to further open science