b

DiscoverSearch
About
My stuff
Adaptive Expansion Bayesian Optimization for Unbounded Global Optimization
2020·arXiv
Abstract
Abstract

Bayesian optimization is normally performed within fixed variable bounds. In cases like hyperparameter tuning for machine learning algorithms, setting the variable bounds is not trivial. It is hard to guarantee that any fixed bounds will include the true global optimum. We propose a Bayesian optimization approach that only needs to specify an initial search space that does not necessarily include the global optimum, and expands the search space when necessary. However, over-exploration may occur during the search space expansion. Our method can adaptively balance exploration and exploitation in an expanding space. Results on a range of synthetic test functions and an MLP hyperparameter optimization task show that the proposed method out-performs or at least as good as the current state-of-the-art methods.

Keywords Bayesian optimization  ·Efficient global optimization  ·Surrogate model  ·Unknown search space  ·Experimental design

This work was funded by The Defense Advanced Research Projects Agency (DARPA-16-63-YFA-FP-059) via the Young Faculty Award (YFA) Program.

image

Bayesian optimization (BO) is a global optimization technique targeted for expensive black-box functions (Shahriari et al, 2015). Particularly, one of its important applications in the machine learning community is automated hyperparameter tuning (Snoek et al, 2012; Swersky et al, 2013; Springenberg et al, 2016). In a standard BO process, the objective function is modeled as a random function with a prior distribution. This prior updates to form a posterior after new observations (i.e., a Gaussian process or GP (Rasmussen and Williams, 2006)). The decision about which observation to collect next is made by globally maximizing an acquisition function based on the posterior. This step requires fixed variable bounds, which are sometimes not trivial to set. It is hard to guarantee that any fixed bounds will include the true global optimum.

In this paper, we modified the standard BO approach so that the fixed variable bounds are not required. When the search space is unbounded, the acquisition function can have suprema at infinity, where the uncertainty is maximized. Thus we search only in the region with sufficiently low uncertainty, which we referred to as the low-uncertainty region. This low-uncertainty region expands as we add more observations. We call this method Adaptive Expansion Bayesian Optimization (AEBO).

The main technical contributions of this paper are:

1. An acquisition strategy that bounds the GP model uncertainty and adaptively expands the low-uncertainty region; and

2. Theoretical results regarding how to adaptively set the threshold of the uncertainty bound to avoid the over-exploration problem that occurs in an expanding search space.

Bayesian Optimization uses a sequential strategy to search for the global optimum of expensive black-box functions. Assuming we have an objective function: f : Rd → R, and the observation of its output has Gaussian noise:  y ∼ N(f(x), σ2n). It is expensive to evaluate either the function f or its gradient (assuming we can only approximate the gradient by the fi-nite difference method when f is a black-box function and that Automatic Differentiation methods cannot be used). Thus the goal of BO is to minimize the number of evaluations needed to find the global minimum solution. BO treats the objective as a random function that has a prior distribution, and update this prior to form a posterior distribution over the function after observing data. This can be done by using a Gaussian process (GP). The posterior distribution can then be used to form an acquisition criterion that proposes to evaluate f at a promising point so that the regret is minimized. The GP posterior can then be updated after the new observation. This process repeats until the evaluation budget runs out or a satisfied solution is achieved. We will elaborate on the Gaussian process and the acquisition function in the following sections.

2.1 Gaussian Process

The Gaussian process (GP) (Rasmussen and Williams, 2006) estimates the distribution of the objective function. A kernel (covariance) function  k(x, x′) is used to measure the similarity between two points x and  x′. It encodes the assumption that “similar inputs should have similar outputs”. The specific choice of the kernel is not central to the core contributions of the paper.

Given N observations D = (X, y) =  {(xi, yi)|i= 1, ..., N}, the GP posterior f(x) at any point x is a Gaussian distribution:  f(x)|D, x ∼ N(µ(x), σ2(x)) with the mean and the variance expressed as

image

where k(x) is an N-dimensional vector with the i-th dimension being  k(x, xi), and K is an  N ×Ncovariance matrix with  Kij = k(xi, xj).

2.2 Acquisition Function

Bayesian optimization picks the next point to evaluate by maximizing an acquisition function, which is computed based on the GP posterior. Common acquisition functions include the probability of improvement (PI) Kushner (1964), the expected improvement (EI) Jones et al (1998), the Gaussian Process upper confidence bound (GP-UCB) Srinivas et al (2009), and those based on entropy search Hennig and Schuler (2012); Hern´andez-Lobato et al (2014); Wang and Jegelka (2017).

In this paper, we use EI as our acquisition function. It measures the expected amount of improvement over the current best solution based on the learned GP model:

image

where  f ′is the current best objective function value, u = (µ(x) − f ′)/σ(x), and  Φand  φare the cumulative density function (CDF) and probability density function (PDF) of the standard normal distribution, respectively.

image

Normally Bayesian optimization is performed within fixed variable bounds. But in cases such as algorithm hyperparameter tuning (Shahriari et al, 2016; Swersky et al, 2013; Springenberg et al, 2016) and shape optimization (Palar and Shimoyama, 2019), setting the variable bounds are not trivial. It is hard to guarantee that any fixed bounds will include the true global optimum.

Two types of solutions were proposed to handle this problem: 1) performing BO in an unbounded space by regularization via non-stationary prior means so that the acquisition function’s suprema will not be at in-finity (regularization-based methods) (Shahriari et al, 2016; Rainforth et al, 2016); and 2) performing BO in “soft bounds” that are gradually expanded over iterations (expansion-based methods) (Shahriari et al, 2016; Nguyen et al, 2017, 2018). The first solution computes an acquisition function that is biased toward regions near some user-specified center point, thus insufficient exploitation may occur when the optimal solution is far from the center. The second solution either expands each direction equally, which often yields to unnecessarily large search spaces (Shahriari et al, 2016; Rain- forth et al, 2016); or expands only to the promising region where the upper confidence bound (UCB) is larger than the lower confidence bound (LCB) of the current best solution (Nguyen et al, 2017, 2018). But the latter approach has to perform global search twice at each iteration—one for the maximum LCB to filter out the non-promising region; and one for the maximum acquisition function value. It expands the bounds of the first search according to a hard-coded rule, and hence may show a lack of adaptability to different optimization problems.

The adaptive expansion Bayesian optimization is different from the previous methods in that it adaptively expands the search space based on the uncertainty of the GP model. It can essentially avoid the aforementioned issues by employing a strategy that we will introduce in the following section.

In this section, we will introduce the main ingredients of AEBO, namely, its acquisition strategy (Sect. 3.1), global optimization of the acquisition function (Sect. 3.2), the way of adaptively balancing exploration and exploitation (Sect. 3.3), and a trick to improve exploitation when expanding the search space (Sect. 3.4).

3.1 Acquisition Strategy

Our acquisition strategy can be expressed as the following constrained optimization problem:

image

where  τ ∈(0, 1) is a coefficient controlling the aggressiveness/conservativeness of exploration, and  k0= σ2(x∞) with  x∞denotes a point infinitely far away from the observations. Based on Eq. 2, we have  k0= σ2(x∞) = k(x, x). When using a RBF or a Mat´ern kernel, for example, simply we have  k0= 1. Under this acquisition strategy, only points with low GP model uncertainty will be picked for evaluation. When a GP adds evaluations, the  σ2(x) ≤ τk0region will always expand, because the predictive variance near the added evaluations will decrease. Thus our method guarantees a growing search space.

To avoid excessive local exploitation, we can modify the equation for EI (Eq. 3) as

image

where  ϵ >0 is the minimum improvement parameter (Jones, 2001; Shahriari et al, 2016; Shahriari, 2016).

In many real-world cases, we need to deal with constrained Bayesian optimization problems of the following two kinds: 1) there are infeasible regions in the input space (e.g., some experimental configurations are infeasible); and 2) the objective function f does not have definition in some regions of the input space (e.g., when the hyperparameters of a neural network are not properly chosen, exploding gradients may occur, which may lead to NaN weight values and hence the NaN accuracy). This is especially common when the we have an unbounded or expanding search space. Therefore, it is worth extending AEBO to make it suitable for constrained BO problems. Specifically, we can modify Eq. 4 based on Refs. (Basudhar et al, 2012) and (Gel- bart et al, 2014):

image

where C(x) is an indicator of whether the constraints are satisfied or whether the objective function has definition.

3.2 Feasible Domain Bounds

image

Fig. 1: Feasible domain bounds. In each iteration, we expand the minimum bounding box of evaluated samples along the i-th axis by  ri.

The feasible domain defined by Eq. 4 is bounded by the isocontour  σ2(x) =  τk0. However, it is easier to search inside a bounding box instead of an irregular isocontour when solving the global optimization problem in Eq. 4. We can show that the solution to Eq. 4 is inside a bounding box, which we call the feasible domain bounds. The feasible domain bounds can be derived by expanding the minimum bounding box of evaluated samples along the i-th axis by an expansion rate  ri(Fig. 1). Then constrained global optimization of the acquisition function can be performed within that feasible domain bounds. The derivation of  riis included in the supplementary material.

3.3 Adaptive Exploration-Exploitation Trade-off

A problem of an expanding search space is that a new evaluation may get too far away from the region of interest, due to the high uncertainty and hence the high EI in far-away regions (Fig. 2). The informativeness of those high-uncertainty regions, however, is low because of the sparsity of observed data near them. Thus sampling at those regions is like shooting in the dark. This is fine in the fixed-bound BO, because it will exploit the region of interest eventually after finishing exploring those high-uncertainty regions. However, with expanding bounds, the high-uncertainty regions are expanding and BO could continuously sample in those regions and never head back to exploit the region of interest. As a result, the algorithm will spend too much budget on randomly exploring the search space but have insuffi-cient exploitation, as shown in the left plot of Fig. 3. We call this over-exploration.

This over-exploration problem can exist in every unbounded Bayesian optimization algorithm with an aggressive expansion strategy. Adaptive Expansion BO can solve this problem by avoiding exploring in regions where our estimated model is uncertain (i.e., constraining the GP’s predictive variance  σ2(x), see Eq. 4). However, one has to choose a proper coefficient  τto set the uncertainty threshold. In this section, we derive a way of setting  τadaptively to balance exploration and exploitation as the search space expands.

The simplest way to avoid over-exploration is to force the algorithm to stop exploring and start to refine the solution by exploiting near the current best point. In AEBO, exploration is performed by sampling points along the feasible domain boundary (i.e., σ2(x) =  τk0). Thus, we can avoid over-exploration by decreasing  τso that the expected improvement on the boundary is lower than that near the current best solution  x′.

The expected improvement on the boundary can be expressed via the predictive mean  µτ:

EIτ(µτ) = (µτ −f ′)Φ�µτ − f ′√τk0

image

Also we have

image

since EI monotonically increases with the predictive mean.

The expected improvement near the current best solution  x′is

EI+= (µ+ −f ′)Φ�µ+ − f ′σ+

where  µ+and  σ+are the predictive mean and standard deviation respectively at a point  x+near the current best solution. Assuming that the GP mean function µ(x) is Lipschitz continuous, we have  f ′ − µ+ = δ, where  δis a small positive real number. Thus we have u+ = −δ/σ+.

Now we can set EI+ >EIτ(µm) to encourage exploitation. However, we do not want pure exploitation. Specifically, we want to stop exploitation at  x+whenever the room for improvement over the current solution  f ′within the neighborhood of  x′is sufficiently low with a high probability:

image

where  f+ ∼ N(µ+, σ2+), and  ξ ≥0 and 0  < κ <1 are small real numbers. From Eq. 10 we can derive

image

Thus we only need to exploit at  x+when  σ+ > σ0. By substituting it into Eq. 9, we get a lower bound for EI+:

image

since EI monotonically increases with the predictive variance. We can set this lower bound EI0equal to EIτ(µm) to enforce EI+ >EIτ(µm) when  σ+ > σ0 (i.e., when exploitation is necessary). Using Eq. 7, we can write EIτ(µm) = EI0as

(µm −f ′)Φ�µm − f ′√τk0

image

We can solve for  τby using any root finding algorithm (e.g., Newton’s method).

At the beginning of the optimization process, we do not need to make sure the room for improvement over f ′is small within the neighborhood of  x′. Rather, we want to explore other regions that may contain better local optima. Thus we can set  ξ = ξ0at the beginning, where a larger  ξ0allows more exploration, and then linearly anneal  ξover iterations until  ξ= 0 (e.g., towards the end of a computational budget). As a result, AEBO’s focus gradually switches from exploration to exploitation.

image

Fig. 2: Over-exploration in FBO-EI (left) and AEBO with large  τ(middle); By adaptively setting  τ, AEBO enforces exploitation based on an accuracy criterion (right).

image

Fig. 3: In contrast with AEBO (right), FBO-EI (left) spends too much budget on randomly exploring the search space (left).

In practice, we can set  µmas the prior mean (0 by default), since it is usually the case when over-exploration occurs. Thus this adaptive approach can effectively avoid over-exploration, as shown in Fig. 2 and Fig. 3. If in reality  µm <0, then EIτ(µm) <EI0 <EI+, AEBO will exploit near the current best solution even when it is unnecessary (i.e., σ+ < σ0); while if  µm >0, then EIτ(µm) >EI0, AEBO may explore when it should exploit.

3.4 Local Search for Better Exploitation

In practice, to perform the global optimization of Eq. 4, we can sample initial candidate solutions within the bounds derived in Sect. 3.2, and refine those solutions using a constrained optimization method (e.g., Sequential Least Squares Programming). If the GP kernel is fixed, the search space is always expanding, because  σis monotonically non-increasing as the number of observations increases. Specifically, the space near the queried point will be added to the search space volume. This results in a volume increase that is exponential with respect to the search space dimensionality. It will become harder for the candidate solutions to maintain the coverage of the search space as the optimization proceeds, especially when the problem has high dimensionality. Although we keep increasing exploitation by annealing  ξ, it does not guarantee that we will exploit near the current best solution in a large search space. This problem was not addressed in previous unbounded Bayesian optimization methods (Shahriari et al, 2016; Nguyen et al, 2018). A straight-forward way to solve the problem is to increase the density of search algorithms, but this continuously increases the computational cost for each iteration. Alternatively, we propose local search near the current best solution to allow better exploitation. Specifically, in each iteration, we generate the same number of candidate solutions but divide it for two tasks—global search and local search. Global search tries to find a promising point in the entire feasible domain in Eq. 4; while local search tries to find a promising point near the current best solution. This avoids insufficient exploitation but will not increase the computational cost.

The optimization process is summarized in Algorithm 1.

Table 1: Optimization results for synthetic benchmarks

image

Fig. 4: Optimization history for synthetic benchmarks

We evaluate the AEBO on both a range of synthetic test functions and an MLP hyperparameter optimization task. We also demonstrate the effect of dimensionality on AEBO’s performance and the significance of adaptive exploration-exploitation trade-off in AEBO.

4.1 Experimental Protocol

The evaluation budget was set to 50d, and an initial sample size of 5d was drawn by using Latin hypercube sampling (McKay et al, 1979; Jones et al, 1998). For simplicity we used an isotropic RBF kernel for the GP (i.e., k(x, x′) = exp�− �di=1(xi − x′i)2/(2l2i)�, where l1 = ... = ld). We normalized the observed function outputs before fitting a GP regression model. We set ξ0= 0.1, κ= 0.1, ϵ= 0.01, and  δ= 0.01. For each test function, we set the initial bounds to be [10%, 30%] of its original bounds, as was also configured in Ref. (Nguyen et al, 2018). All the initial bounds do not include global optima. We compared AEBO to methods from Ref. (Shahriari et al, 2016) (i.e., EI-Q and EI-H) and Ref. (Nguyen et al, 2018) (i.e., FBO-EI and FBO-UCB).

4.2 Synthetic Benchmarks

We used seven standard global optimization test functions. As shown in Table 1, AEBO out-performs other

image

methods on most test functions. Note that AEBO’s results have lower variance compared to other methods, which is an indication of robustness. Figure 4 shows the optimization history on benchmark functions. It shows that compared to the other two state-of-the-art methods, AEBO converged faster and achieved a better solution in most cases.

We demonstrated the effects of problem dimensionality on AEBO’s performance by using two synthetic benchmarks, as shown in Fig. 5. We compared AEBO to: 1) the other two state-of-the-art methods—FBO-EI and EI-H, and 2) the standard BO with the original function bounds (which include global optima). Here we define the  optimality gap e = ysol − yopt, where  ysoland  yoptare the minimal observation and the true minimum of the objective function, respectively. The results show that the optimality gap increases with the problem dimension, which can be explained by the curse of dimensionality (Bellman, 1957). AEBO demonstrated the best performance among methods dealing with an unbounded or expanding search space, and is almost as good as the standard BO on the Rastrigin function. Since the global optimum of the Rosenbrock function is inside a narrow flat valley, it is trivial to find the valley but difficult to converge to the global optimum. Thus it requires large budget for exploitation in that valley (i.e., exploitation-intense). The standard BO and expansion-based methods like FBO-EI may have unnecessarily large search space and hence waste budget on exploring regions far from the global optimum, rather than exploiting the valley. Thus compared to AEBO, it is more difficult for these three methods to find good solutions on the Rosenbrock function, especially when the dimensionality is high (Fig. 5).

image

Fig. 5: The effect of problem dimension on the optimality gap (top) and the reachability of optimal solutions found by different methods (bottom). Note that the standard BO, unlike other methods, was performed within the bounds that include the actual global optima.

image

Fig. 6: Optimality gap results on noisy test functions.

We examined the reachability of optimal solutions in different methods by measuring the distance-to-center metric  s = ∥x∗ − c∥2, where  x∗is the optimal solution found and c is a center point (Fig. 5). In EI-Q and EI-H, c is the user-specified center; and in AEBO or FBO, c represents the center of initial bounds. Regularization-based methods like EI-Q and EI-H are biased toward regions near some user-specified center point, thus insufficient exploitation may occur when the optimal solution is far from the center. This is demonstrated in Fig. 5, where EI-H shows a relatively smaller distance-to-center and higher optimality gap comparing to AEBO and FBO.

image

Fig. 7: The effect of fixed and adaptive  τon optimality gaps. Here FBO-EI is shown as a baseline.

To evaluate the robustness of AEBO under noise, we tested the case where the observations are corrupted by Gaussian noise with a standard deviation of 0.1. The results are shown in Fig. 6.

To demonstrate the effectiveness of the adaptive exploration-exploitation trade-off, we ran AEBO with both fixed  τand adaptive  τsolved from Eq. 12. The two test functions, Rosenbrock and Rastrigin, have different characteristics and hence prefer different exploration-exploitation trade-offs. Since the Rastrigin function has a large number of local optima, the difficulty for optimizing on Rastrigin is to avoid getting stuck in those local optima. Thus an algorithm with a higher search space expanding rate (e.g., AEBO with a  large τ) is likely to perform better since it will spend less budget exploiting local optima and more budget expanding towards the global optimum (i.e., exploration-intense). In contrast, due to the narrow flat valley in the Rosenbrock function, the difficult part is exploiting near the global optimum to refine the solution. Thus a lower expanding rate (e.g., AEBO with a  small τ) is likely to be preferred since less budget will be wasted for exploration. The results shown in Fig. 7 are consistent with our expectation: the optimality gap increases with the value of the fixed  τon the Rosenbrock function, while the opposite behavior was observed on the Rastrigin function. However, by using an adaptive  τ, AEBO performs better than most other configurations on both test functions. Note that the behavior of FBO-EI is similar to AEBO with a large  τ(without considering the high performance variance on the Rosenbrock function).

As every objective function weights exploitation and exploration differently, BO methods with a fixed expansion schedule may succeed for one function, but fail for another. The AEBO can avoid this by adaptively balancing exploitation-exploration while expanding the search space.

We include the experimental results for constrained BO problems in the supplementary material.

4.3 MLP on MNIST

image

Fig. 8: Optimization history for the hyperparameter tuning of a MLP trained on MNIST.

We use the hyperparameter optimization of a multilayer perceptron (MLP) as a real-world example to demonstrate the performance of the AEBO. MNIST was used as the training data. The MLP has three hidden layers, each having 64 hidden units with ReLU activations and was implemented using TensorFlow (Abadi et al, 2015). We used Adam (Kingma and Ba, 2015) as the MLP’s optimizer. We optimized 13 hyperparameters, namely the learning rate, the learning rate decay, the dropout rate for each hidden layer, and the L1 and L2 regularization coefficients for each hidden layer and the output layer. We performed the AEBO, FBO-EI, and EI-H in the log space (base 10) with the initial bounds of [−5, −4]7. We also compared them to standard BO with bounds fixed at [−5, −4]7(small bounds) and [−10,1]7(large bounds). Note that large and small bounds are different in whether they cover solutions found by unbounded methods, such that the size of the bounds will (for small bounds) or will not (for large bounds) be a factor that limits the solution of BO. The objective is to maximize the accuracy of the MLP. As shown in Fig. 8, AEBO, FBO-EI, and EI-H found better solutions than BO with small bounds. AEBO outperformed EI-H and BO with large bounds, and was at least as good as FBO-EI.

We proposed a Bayesian optimization method, AEBO, that gradually expands the search space, so that we can find the global optimum without having to specify the input space bounds that include it. The proposed method only evaluates samples at regions with low GP model uncertainty, and expands the search space adaptively to avoid over-exploration in an expanding search space. This method is useful in cases where we are not confident about the range of the global optimum. The experimental results show that our method out-performs the other state-of-the-art methods in most cases.

In the standard BO, even if the input space bounds are set large enough to cover the global optimum, too much budget may be spent on needlessly exploring the large space. This will result in bad solutions when optimizing an exploitation-intense objective function, as shown by the Rosenbrock and the MLP examples.

Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Man´e D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Vi´egas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. URL https://www.tensorflow.org/, software available from tensorflow.org

Basudhar A, Dribusch C, Lacaze S, Missoum S (2012) Constrained efficient global optimization with support vector machines. Structural and Multidisciplinary Optimization 46(2):201–221

Bellman R (1957) Dynamic programming. Princeton University Press, Princeton, NY

Gelbart MA, Snoek J, Adams RP (2014) Bayesian opti- mization with unknown constraints. In: Proceedings of the Thirtieth Conference on Uncertainty in Artifi-cial Intelligence, AUAI Press, pp 250–259

Hennig P, Schuler CJ (2012) Entropy search for information-efficient global optimization. Journal of Machine Learning Research 13(Jun):1809–1837

Hern´andez-Lobato JM, Hoffman MW, Ghahramani Z (2014) Predictive entropy search for efficient global optimization of black-box functions. In: Advances in neural information processing systems, pp 918–926

Jones DR (2001) A taxonomy of global optimization methods based on response surfaces. Journal of global optimization 21(4):345–383

Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive black-box functions. Journal of Global optimization 13(4):455–492

Kingma DP, Ba J (2015) Adam: A method for stochas- tic optimization. In: Proceedings of the 3rd International Conference on Learning Representations

Kushner HJ (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering 86(1):97–106

McKay MD, Beckman RJ, Conover WJ (1979) Com- parison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21(2):239–245

Nguyen V, Gupta S, Rane S, Li C, Venkatesh S (2017) Bayesian optimization in weakly specified search space. In: 2017 IEEE International Conference on Data Mining, IEEE, pp 347–356

Nguyen V, Gupta S, Rana S, Li C, Venkatesh S (2018) Filtering bayesian optimization approach in weakly specified search space. Knowledge and Information Systems pp 1–29

Nowacki H (1980) Modelling of design decisions for cad. In: Computer Aided Design Modelling, Systems Engineering, CAD-Systems, Springer, pp 177–223

Palar PS, Shimoyama K (2019) Efficient global opti- mization with ensemble and selection of kernel functions for engineering design. Structural and Multidisciplinary Optimization 59(1):93–116

Rainforth T, Le TA, van de Meent JW, Osborne MA, Wood F (2016) Bayesian optimization for probabilistic programs. In: Advances in Neural Information Processing Systems, pp 280–288

Rasmussen C, Williams C (2006) Gaussian Processes for Machine Learning. the MIT Press

Shahriari B (2016) Practical bayesian optimization with application to tuning machine learning algorithms. PhD thesis, University of British Columbia

Shahriari B, Swersky K, Wang Z, Adams RP, De Freitas N (2015) Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE 104(1):148–175

Shahriari B, Bouchard-Cˆot´e A, Freitas N (2016) Un- bounded bayesian optimization via regularization. In: Artificial Intelligence and Statistics, pp 1168–1176

Singh P, Van Der Herten J, Deschrijver D, Couckuyt I, Dhaene T (2017) A sequential sampling strategy for adaptive classification of computationally expensive data. Structural and Multidisciplinary Optimization 55(4):1425–1438

Snoek J, Larochelle H, Adams RP (2012) Practi- cal bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems, pp 2951–2959

Springenberg JT, Klein A, Falkner S, Hutter F (2016) Bayesian optimization with robust bayesian neural networks. In: Advances in Neural Information Processing Systems, pp 4134–4142

Srinivas N, Krause A, Kakade SM, Seeger M (2009) Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:09123995

Swersky K, Snoek J, Adams RP (2013) Multi-task bayesian optimization. In: Advances in neural information processing systems, pp 2004–2012

Wang Z, Jegelka S (2017) Max-value entropy search for efficient bayesian optimization. In: Proceedings of the 34th International Conference on Machine LearningVolume 70, JMLR. org, pp 3627–3635

Appendix A: Derivation of the Feasible Domain Bounds

In this section, we derive the bounding box that contains the feasible domain of Eq. 4.

image

Fig. A1: Feasible domain bounds. In each iteration, we expand the minimum bounding box of evaluated samples along the i-th axis by  ri.

For any point  xτon the isocontour (Fig. A1), i.e., σ2(xτ) =  τk0, based on Eq. 2 we have  k0 − k⊤τ Akτ= τk0, or

k⊤τ Akτ= (1  − τ)k0(A1) where A = (K + σ2nI)−1and  kτ=  k(xτ).

Since A is symmetric, we have  λmink⊤τ kτ ≤k⊤τ Akτ ≤ λmaxk⊤τ kτ, where  λminand  λmaxare the smallest and largest eigenvalues of A, respectively. Thus k⊤τ kτhave the following bounds for any  xτ:

(1  − τ)k0/λmax ≤ k⊤τ kτ ≤(1  − τ)k0/λmin(A2)

Suppose  xnlis the nearest evaluated point to  xτ(Fig. 1), the following inequality holds:

k⊤τkτ < Nk2(xnl, xτ) (A3)

where N > 1 is the number of evaluated points. According to Eq. A2 and Eq. A3, we have

Nk2(xnl, xτ) > (1  −τ)k0/λmax(A4)

for any  xτ. Given any stationary kernel  k(x, x′) =  f(δ), where  δ = x−x′, we can find the upper bound of  δifor the i-th axis. Then we can set that upper bound as the expansion rate  ri.

For example, when using the RBF kernel, we have

image

k2(xnl, xτ) = exp

image

where  δ = xnl − xτ. Substituting Eq. A5 into Eq. A4, we get the following inequality

image

where  C = −log((1  − τ)k0/(Nλmax)). Equation A6 shows that  xτis inside a d-dimensional hyperellipsoid that centered at  xnlwith  ri =√Clicorresponding to half the length of the i-th principal axis.

image

�minj{x(j)i } − ri,maxj{x(j)i } + ri�, we can include the entire feasible domain of Eq. 4. This means that in each iteration, we get the minimum bounding box of all evaluated samples, and expand the bounding box along the i-th axis by  ri(Fig. 1). Then constrained global optimization of the acquisition function is performed within the new bounds. In practice, because Eq. A3 is usually quite loose (especially when N is large), the above derived bounds are usually unnecessarily large, causing large volume of infeasible domain inside the bounds. In that case, we can replace  λmaxwith  λminin Eq. A4, i.e., substituting the upper bound of  k⊤τ kτ(Eq. A2) into Eq. A4.

image

We created two test problems to evaluate the performance of AEBO in dealing with constrained BO problems. Specifically, the constrained Rastrigin problem uses the Rastrigin function as the objective function, and the feasible domain is defined by an ellipse 0.01x21+ (x2+ 2)2 ≤1 (Fig. B1). The Nowacki beam problem is a real-world test problem originally described by Nowacki (Nowacki, 1980; Singh et al, 2017). The goal is to minimize the cross-sectional area of a tiploaded cantilever beam subject to certain constraints.1

The results of the two problems are shown in Figs. B1 and B2. For the constrained Rastrigin problem, AEBO achieved a better solution than the other two methods. For the Nowacki beam problem, FBOEI’s solution has the lowest mean value but a very high variance; while AEBO found a fairly close optimal solu- tion with much lower variance. The evaluated points by AEBO were dense near optima (either local or global). This behavior was, however, not obvious for the other two methods. This is likely because that FBO-EI and EI-H over-trusted the GP posterior even where its uncertainty was high. This resulted in sampling patterns with too much randomness, and hence higher variance of optimal solutions.

image

Fig. B1: Optimization history and evaluated points for the constrained Rastrigin problem

image

Fig. B2: Optimization history and evaluated points for the Nowacki beam problem


Designed for Accessibility and to further Open Science