On Hyper-parameter Tuning for Stochastic Optimization Algorithms

2020·Arxiv

Abstract

Abstract

This paper proposes the first- ever algorithmic framework for tuning hyper-parameters of stochastic optimization algorithm based on reinforcement learning. Hyperparameters impose significant influences on the performance of stochastic optimization algorithms, such as evolutionary algorithms (EAs) and meta-heuristics. Yet, it is very time-consuming to determine optimal hyper-parameters due to the stochastic nature of these algorithms. We propose to model the tuning procedure as a Markov decision process, and resort the policy gradient algorithm to tune the hyper-parameters. Experiments on tuning stochastic algorithms with different kinds of hyper-parameters (continuous and discrete) for different optimization problems (continuous and discrete) show that the proposed hyper-parameter tuning algorithms do not require much less running times of the stochastic algorithms than bayesian optimization method. The proposed framework can be used as a standard tool for hyper-parameter tuning in stochastic algorithms.

I. INTRODUCTION

Hyper-parameters impose significant influences on the performance of stochastic optimization algorithms, such as evolutionary algorithms (EAs) and meta-heuristics. Assume that we have a stochastic algorithm A with parameters . The solution obtained by applying A to minimizing a function f from an initial solution can be considered as a random vector with unknown distribution. Without loss of generality, we assume that the stochastic algorithm’s return follows an additive noise model, i.e.

where follows an unknown distribution P and is the search space. The randomness of comes from the intrinsic stochasticity of A. Therefore, the hyper-parameter tuning can be formalized in terms of minimization as follows:

where is the decision space of the hyper-parameters and is a deterministic metric to measure the performance of the found solution. That is, to find optimal hyper-parameter, we shall minimize the expectation of the metric over the

HZ, JS and ZX are all with the School of Mathematics and Statistics and National Engineering Laboratory for Big Data Analytics, Xi’an Jiaotong University, Xi’an, China. This work was supported by Major Project of National Science Foundation of China (grant no. U1811461), Key Project of National Science Foundation of China (grant no. 11690011) and Project of National Science Foundation of China (grant no. 61721002). Corresponding author: Jianyong Sun, email: jy.sun@xjtu.edu.cn

stochasticity, including the randomness of the initial solution and the unknown distribution P.

Usually, for each hyper-parameter configuration , we need to compute the expected metric over a sample of initial vectors by Monte Carlo approximation2. The averaged metric value is used to approximate the expectation. That is,

where is the number of runs of the stochastic algorithm. Obviously, for each hyper-parameter configuration, considerable amount of computational resources are required to compute the average. Therefore, it is prohibitively time-consuming to determine optimal hyper-parameters.

In literature, hyper-parameter tuning methods include grid search, random search, orthogonal experimental design method, and model-based methods such as Bayesian optimization algorithm (BOA) [1], response surface [], etc. Among these methods, grid search is mostly used simply because it is easy to implement, BOA is most used and studied. And The study of hyper-parameter tuning for stochastic algorithm is very rare if not at all. The only paper we have found is [], in which the authors simply used BOA for hyper-parameter tuning. They studied how different acquisition functions affect the performance of BOA for hyper-parameter tuning in a stochastic algorithm (DE is considered in the paper). Further, to the best of our knowledge, there is no standard tool in literature for hyper-parameter tuning in the computational intelligence community.

Bayesian optimization is a kind of optimization method to find global minimizer of a ’black box’ function without knowing its convexity, differentiability or continuity. Bayesian optimization iteratively samples new point by an acquisition function, and uses Gaussian process [2], [3] to estimate a surrogate model to fit f(x) based on the observed points. When using specific acquisition function, its regret bound has been proved as is the number of sampled points) in [4]. [5] [6] proposed methods that use Bayesian

2In some papers, the optimal configuration is expected to be obtained also over a set of test functions, in which case the formulation can be written as

is the number of sampled functions. In this paper, we do not consider to find the optimal hyper-parameter for a set of functions, though the proposed method is capable. Our philosophy is that a stochastic optimization is casesensitive.

optimization with dividing region, which successfully accelerates convergence rate. In application, Eq. 2 is minimizing a ’black box’ function, thus using BOA to tune hyper-parameters is appropriate. Recently, [7], [8] have used BOA in EA hyper-parameters tuning, and [9] successfully used BOA in machine learning hyper-parameters tuning. Details of BOA will be shown in Section. VI-A.

However, there are many questions when using BOA to tune hyper-parameters: First, BOA has cubic complexity w.r.t. number of sampled points, which means its efficiency cannot be guaranteed. Second, as Gaussian Process is probability model, thus acquisition function should consider mean and variance of the model. Hence much work based on BOA may assume the space of input like [10]. Actually in EAs, function value or fitness value is in a vary large range, which will make BOA hard to use. Third, BOA assumes objective ’black box’ function has prior with zero mean which is easy to implement but sometimes unreasonable. Last but not the least, for stochastic problems (such as EAs), we should use Eq. 3, which means we should cost several times.

Meta learning is an approach to learn intrinsic common knowledge between different tasks. Meta learning aims to transfer knowledge from some tasks called ’meta train set’ to another tasks called ’meta test set’. In meta learning framework, ’knowledge’ can be various things, therefore, it has been widely used. For instance, firstly, meta learning can tell us how to optimize an objective function (learning to optimize): [11],[12],[13] have learned a good optimization rule (descent direction) from meta train set (some deep neural networks) and perform better on meta test set (another neural networks)than common methods (like ADAM and SGD). [14] learned a method for ’black box’ function. Secondly, few-shot learning is an application to use meta learning [15],[16]. Thirdly, common knowledge can be an initialization of a model: [17] [18] learned the initialization of a model without any extra parameters. Then domain adaptation (DA) [19] naturally can use meta learning as DA aims to transfer knowledge from source domain to target domain which is corresponding to meta train and meta test. Recently, even disentangling causal mechanisms [20] can be mined by using meta learning. Details of [17] [18] will be talked about in Section. II-C.

When meta learning meeting bayesian optimization, it can be used to handle the first problem and the third problem in BOA, then it is called bayesian optimization warm starting. [10] used neural network to substitute mean and variance of Gaussian Process, which has only linear complexity. [21] used handcrafted meta feature to measure distance of tasks to choose good initial point for BOA. [22] used metric learning to learn meta feature to measure distance of tasks. [23] learned a distribution of several tasks to handle zero prior problem. [24],[25] used ranking based method to learn a prior of BOA to make BOA convergent fast. Reinforcement learning is another method that widely used and studied. The RL algorithms based on Markov Decision Process (MDP) [26] and have succeeded overwhelmingly for playing games such as GO [27], atari [28] , robotics [29] and many others. Details of RL methods willed talked about in Section. II-B.

To make the hyper-parameter tuning efficient and effective,

Fig. 1. The introduction of RL framework.

this paper proposes to use reinforcement learning approach for the hyper-parameter tuning in EAs. We define a scale invariant reward (feedback) to eliminate scale problem in EAs. Furthermore, we use meta-learning to learn the initialization of hyper-parameters from meta train set and we hope this initialization can help our tuning algorithm find optimal hyper-parameters fast. Experimental scene shows that our method perform better than standard BOA and some warm starting methods. We hope the developed hyper-parameter tuning framework could become a standard tool for future research.

In the rest of the paper, Section II-B presents basic preliminaries consists of Differential Evolutionary Algorithms, Reinforcement learning methods and Meta learning for determining initialization. The proposed hyper-parameter tuning framework is presented in Section III. Experimental results on tuning parameters of stochastic algorithms (DE), is presented in Section IV. Section V concludes the paper.

II. PRELIMINARIES

A. Differential Evolutionary Algorithms

Differential evolution (DE) [30],[31] is a global optimization algorithm to minimize function . DE starts with initial population , and iterates , where are randomly chosen from is the best point in k-th iteration, F is hyper-parameter. Then we can use binomial crossover and selection methods to create . We generalize this iteration formulation to be

using 2 hyper-parameters F and D, which is a general formulation covering most DE formulations. F and D influence optimization performance of DE (e.g., finding minimum of functions by DE) significantly, and thus we can utilize our method to tune these hyper-parameters. We use function set of CEC 2017 for optimization by DE and use our method to find optimal hyper-parameters.

B. Markov Decision Process and Reinforcement learning

Reinforcement learning (RL) aims to find a policy so that agent(s) is able to take actions in an environment so as to maximize its cumulative reward. Fig. 1 shows the typical representation of reinforcement learning. An agent interacts with environment. At each state , it takes an action according to present policy . Given this action, environment responses with a new state and an immediate reward . The goal is to find a policy with parameter so as to maximize the expectation of total rewards where denote time-step dependent weighting factors. In practice, weighting factors is always set to exponential power of constant i.e. .

RL can be modeled as a Markov decision process (MDP) which is a 4-tuple . It consists of a set of agent and environment states S, a set of actions A, and a probability transition which specifies the probability of transition from state s to under action a, while r(s) is the immediate reward with state s.

To train an RL for the parameter of the policy, various RL algorithms, such as temporal difference methods, policy gradient and others have been widely used.

In reinforcement learning, the learner (agent) chooses to take an action at each time step; the action changes the state of environment; (possibly delayed) feedback (reward) returns and affects the learner’s next decision. The learner aims to find an optimal policy so that the actions decided by the policy maximize the cumulative rewards along time.

Consider a finite-horizon MDP with continuous state and action space defined by the tuple where denotes the state space, the action space, the initial distribution of the state, the reward, and T the time horizon, respectively. At each time t, there are and a transition probability R where denotes the transition probability of conditionally based on and . The policy , where is the probability of choosing action when observing current state with as the parameter.

As shown in Fig. 1, starting from a state , the agent chooses ; after executing the action, agent arrives at state . Meanwhile, agent can get reward from the environment. Iteratively, a trajectory can be obtained. The optimal policy is to be found by maximizing the expectation of the cumulative reward :

where the expectation is taken over trajectory where

A variety of reinforcement learning algorithms have been proposed for different scenarios of the state and action spaces, please see [32] for recent advancements. The RL algorithms have succeeded overwhelmingly for playing games such as GO [27], Atari [28] and many others.

We briefly introduce the policy gradient method for continuous state space [33], which will be used in our study. Taking derivative of w.r.t. discarding unrelated terms, we have

The expectation of Eq. 7 can be calculated by sampling trajectories in practice:

where denotes action (state) at time t in the ith trajectory, is the cumulative reward of the ith trajectory. For continuous state and action space, normally assume

where can be any smooth function, like radial basis function, linear function, and even neural networks.

C. Meta learning for determining initialization

MAML [17] and Reptile [18] are two typical meta-learning methods to obtain an initialization of a model from meta train set to fast adapt to new problems. Given data set D, model M(w) and loss function , MAML learns initialization by minimizing . It implies that MAML learns a that can fast adapt by one gradient descent step. Optimization in MAML uses secondorder derivatives of L(w). In our application, Bayesian optimization procedures are non-differentiable. Therefore, MAML can not be applied to our task directly. Reptile is one of first-order meta-learning algorithms. As illustrated in Alg. 1, to obtain an optimal initialization of model parameters w that can adapt well to new problem by k-steps of optimization denoted by , the parameter w is iteratively updated by vectors that enforces w to be close to the updated parameters of different tasks. In this work, we borrow idea from Reptile, but design a weighted Reptile algorithm for learning initialization of hyper-parameters.

4 Update w .

III. METHODS

In this section, we propose a tuning algorithm to find optimal hyper-parameters based on RL framework. We model tuning hyper-parameters process into MDP framework, then finding optimal hyper-parameters is to learn an optimal policy. Then we design a scale invariant reward and use policy gradient method to learn a policy. Actually it will induce a well explainable formulation. Then we use Weighted Reptile Algorithm to learn initial hyper-parameters from meta train set, and we hope it can fast adapt on meta test set. Figure. 2 shows the framework of our work.

The expectation of the metric, defined as

usually exhibits as a smooth and multi-modal function. Fig. 3 shows a demonstration of the effect of the hyper-parameters F and D for a classical DE (rand/best/2) to the metric which is defined as the found optimum. Fig. 3(a) shows the plot for function , Fig. 3(b) shows the contour plot. From the figure, it is seen that there exists many local optima.

Optimizing , like optimizing other functions, needs an efficient local search algorithm and a good initial hyper-parameter configuration .

A. The local search algorithm Hλ

In terms of how a hyper-parameter influences the optimization procedure of a stochastic algorithm, there can be two categories. In the first category, we say the influence is linear in case that solution(s) in the new iteration is obtained through a generalized linear model. For example, in differential evolution (DE), a widely-used mutation operator, called rnd/1/best, can be written as follows:

where is the best solution found so far, and is a randomly selected solution from current population . Alternatively, it can be seen that the whole population

can be formalized as

In the second category, we say that the hyper-parameter is nonlinear to the stochastic algorithm if it is not linear as defined previously. For example, the crossover rate (CR) used in DE, and the population size N used in population-based meta-heuristics. For these parameters, we cannot write down an explicit relationship between the parameter and the iterates.

In the following, we present the hyper-parameter tuning framework. Note that we can readily model the stochastic optimization procedure as a Markov Decision Process (MDP). Formally, we take the current iterate as state , the solution generation process as the action , i.e. , and the best/current objective function value is the reward . The transition can be considered as the selection process.

Note that we use the same notation in Eq. 9, as our action is adding a gaussian noise. Thus, we see that finding optimal hyper-parameters equals to learn an optimal policy whose parameters are hyper-parameters .

B. Tuning formula

Now we assume . For using Reinforcement learning, we need to generate data, thus we set . Noticing that we add which we called balanced coefficients. For linear case (i.e. )), according to policy gradient, Eq. 8

For linear case, we take , then the above equation can be rewritten as

In [33], adding a baseline b can reduce the variance of policy gradient. Then we minus to induce:

For efficiency, we use nonlinear function R(x) .= max(0, x) on , that is, we only calculate direction which can lead to larger reward (The efficiency of adding R(x) will be shown in Sec. IV-C).

This equation tell to which direction we shall tune the parameter depending on the reward. If is positive, we shall tune the parameter towards ; otherwise keep the parameter as current .

For nonlinear case, assume that

Fig. 2. The meta-learning framework for tuning hyper-parameters in EAs.

Fig. 3. The contour of hyper-parameters in DE algorithm.

where is differential in terms of . That is, by Taylor expansion, we have:

φ

As we set is a continuous mapping, thus for , there exists a so that . By using policy gradient, we need to compute which is proportional to compute the gradient of

− ln ∥∇∥−

where . We denote the equation as . The second part can be computed approximately as follows by using Taylor expansion:

Then the distribution of is approximately . For the first part of , its gradient w.r.t. is proportional to , where is the second deriva- tive. In this paper, we assume that the influence of hyper-parameters not so violent in a small neighborhood, otherwise

the tuning process is intractable. Thus we assume is sufficiently small, and we ignore the first term.

Therefore, is approximately a ascent direction of. Thus, we can again use Eq. 13 to approximate the tuning of nonlinear hyper-parameters.

C. tuning algorithm

Actually, in practice, many tricks are guaranteed the results. Then in next subsections, we will introduce tricks in Alg. 2: scale reward, updating variance and momentum gradient descent.

1) scale reward: As mentioned above, reward can significantly influence the direction we tune. Thus we hope our reward does not have scale problem. First of all, the directly choice of reward is the best function value in first t generations for hyper-parameter (ith trajectory). Like [27] we set when t < T and . Then . However, in EAs, if we poorly choose hyper-parameters, then the optimization result is very poor (sometimes ), otherwise, the performance can be good (mostly several hundreds). Therefore, if we use function value can cause serious scale problem. We consider using logarithm to eliminate this problem. Meanwhile, using to constrain the range. Logarithm has two advantages: first it can eliminate large scale, second it can discriminate small scale (near zero).

However, it is not so discriminative because minimum of objective functions in evolutionary algorithm are always not zero. When function value is near to minimum, logarithm can not discriminate values except for minimum is zero. Thus we consider adding history best value: , which is the best function value of all individuals in ith trajectory. We define . Then we use:

The idea of adding is similar to adding baseline in policy gradient: adding mean of reward in policy gradient can make reward can promote discrimination power (reward minus mean can be negative or positive, otherwise reward is all positive). Adding can make is near to zero, then logarithm can discriminate them. And we show the difference of different reward in Section. IV-C.

2) updating variance: In reinforcement learning, we always fix in policy. Here we will talk about how to update variance . In practice, derivative of U w.r.t. is available:

As , where . Then we denote corresponding to and .

Same as before, we use baseline , which induces

λ, σN

Then we have:

The updating of is : . Then explanation of derivative w.r.t. : if is positive and is large, which implies good hyper-parameter is far from current , then prefers to be larger. Otherwise, if is negative and is large, which implies bad hyper-parameter is far from current , then prefers to be smaller. For instance: when current is not good, then much s sampled can get better performance than , then will be larger, that is, explore larger space. Otherwise, when current is almost the best, then few s sampled can get better performance than , then will be smaller, that is, exploit more. And term can be regarded as regularization term, which will make variance is smaller. Then this term can make tuning process more steady. Actually, in Section. IV-C, we will show can benefit our algorithm to be convergent fast and be more steady.

3) momentum gradient descent: Observing the Eq. 13, it is similar to stochastic gradient descent (SGD):

Then as we know, for optimizing neural networks, SGD with momentum[34], Adagrad[35], AdaDelta[36], ADAM[37] were proposed, which can accelerate convergence. Thus we use SGD with momentum:

In Section. IV-C, we will show momentum can benefit our algorithm to accelerate convergence.

4) tuning algorithm: Based on above analysis, we can propose Policy Gradient based Tuning Algorithm (PGTA). Alg. 2 shows the framework. As shown, in line.6-line.8, it

samples new hyper-parameter near given hyper-parameter and evaluate it. Then line. 11 calculates scale reward, line. 12 use on scale reward and line. 14 update sampling variance.

D. Determining λ0

To determine the initial hyper-parameter configuration , we can resort to the history optimal configurations of these hyper-parameters for a set of functions. This can be cast as a meta learning problem. Learning initial parameters has been explored in [17] and [18]. Here we borrow the thought of Reptile, and we proposed a Weighted-Reptile-Algorithm (WRA) Alg. 3. WRA has two difference: first adding noise on current hyper-parameter (line. 5), second updating initial parameter by weighted direction (line. 7). Here we set weight is the same as Eq. 14, where is the best function value of all functions values obtained by setting as hyper-parameter configure.

Here we give a probability aspect: In line. 7, the updating formula inducing , where

. And we can get this update formula by optimizing:

If we regard the and are the observations or data , then optimizing Eq. 18 equals to maximizing posterior function:

Noticing that , thus

can make sense. Thus optimizing Eq. 18 equals to making follows the Gaussian distribution . Meanwhile, the bigger is, the smaller the variance of distribution is, and should be nearer the corresponding mean .

IV. EXPERIMENTS

In this section, we show the performance of tuning hyper-parameters of DE (D, F, crossover). We use CEC 2017 data set. We aim to find best hyper-parameters con-figure of DE when using on CEC 2017. That is, we should minimize hyper-parameter response function

BOA: Basic bayesian optimization with RBF kernel is considered and is set to 1 in our experiment. The acquisition function is chosen to be EI (see Section. VI-A for detail). And Bayesian optimization is sensitive to function value, thus we calculate 20 times and get average function value for one fixed hyper-parameter configure. That is in our experiments. For eliminating scale problem, logarithm of average function value is used.

WRA: For our algorithm, first we use WRA to learn initial hyper-parameters from meta-train set. And then using PGTA to tune hyper-parameters in meta-test set with learned initialization. In experiments, Meta-train set consists of basic function , and Metatest set consists of test function . As we know, test functions are made by basic functions, thus, these two set have some relations but do not intersect.

PGTA: First we initial hyper-parameters by learned initialization, then adding a Gaussian noise, then evaluating it. In experiments, we set which means we will add noise 20 times and get 20 different hyper-parameters, and we only evaluate only once for each hyper-parameter. Therefore, we guarantee the evaluations in BOA and PGTA are the same. Then calculating R to update hyper-parameters. Finally, we will update variance using Eq. 15. Other settings: initial .

A. D and F

As mentioned above, D and F are linear case, in this section we will first tune D and F. We test in 10 and 30 dimension. As We minimize Eq. 4, then it should be stochastic optimization problem. Thus we use ’regret’ cost function as measure like [38], that is, . In this paper, we set maximum tuning steps T = 10. Meanwhile, setting CR = 1, we run BOA and WRA-PGTA for 20 times and calculate median of . Then Fig. 4 and Fig. 5 show the median of of WRA-PGTA and BOA. In figures we can see, WRA-PGTA can get better hyper-parameters than BOA in most functions.

B. Crossover

For CR, we test on 10D functions. Same as D and F, we run 20 times and calculate median. Setting D = F = 0.5, we use WRA-PGTA and BOA to tune the CR. Figure. 6 shows the median of of WRA-PGTA and BOA. In figures we see, for CR, the gap of two methods is very small (please notice the y-axis in each figure).

C. Ablation Study

In this section, we will clarify our tricks are effective, for detail: SGD vs momentum; in Eq. 13; updating ; scale reward and stability of meta-train.

First we use Alg. 2 on test function 1. Figure. 7 shows tuning process by using SGD and SGD with momentum. Here we see that momentum can find good hyper-parameter faster than SGD. Second Figure. 8 shows tuning process of effect of and updating . Here we find that using can make PGTA available (blue line and yellow line). And Using updating can make PGTA more steady. Then we use Alg. 2 on test function 13, whose minimum is 1300 (far from zero). And Figure. 9 shows tuning process by using and without using scale reward. Then we can find by using scale reward (red line), Alg. 2 can get better hyper-parameters than without using scale reward.

Then Figure. 10 shows the D and F and its variance tuning process. From figure we see, variance is first increasing and then decreasing. It means that initial hyper-parameters are not good, thus first increase the searching region then decreasing the searching region as hyper-parameters are good enough. Figure. 11 shows the crossover and its variance tuning process. The trend of crossover and its variance are the same as D and F.

Fig. 4. Performance of WRA-PGTA and BOA on tuning D and F in DE algorithm on test functions in CEC 2017 (10D): y-axis is median of 20 runs, x-axis is tuning steps (iterations).

Fig. 5. Performance of WRA-PGTA and BOA on tuning D and F in DE algorithm on test functions in CEC 2017 (30D): y-axis is median of 20 runs, x-axis is tuning steps (iterations).

Fig. 6. Performance of WRA-PGTA and BOA on tuning CR in DE algorithm on test functions in CEC 2017 (10D): y-axis is median of for 20 runs, x-axis is tuning steps (iterations).

Fig. 7. Ablation study of momentum SGD.

Fig. 8. Ablation study of using updating

At last, we show the stability of meta-train: Figure. 12 shows D and F in 5 epochs (each epoch has 6 iterations) for 5 meta-train functions, that is, every 30 iterations is an epoch and for first epoch 1-6 iterations are for meta-train function 1, 7-12 iterations are for meta-train function 2, ... , 25-30 iterations are for meta-train function 5. From figure we see, the meta-train of hyper-parameters is convergent.

Fig. 9. Ablation study of using scale reward (

D. Discussion

First, it is critical that finding a good initialization. Actually, human beings always make much effort on finding empirical configure by hand. Thus we use WRA to imitate this process. Second when tuning hyper-parameters by hand, we always try some times near empirical configure, and we will try configure which may have better performance. That is what Eq. 13 demonstrates. Finally, after several trials, we hope it can stay at a small region. Then Eq. 15 helps to do this. In other words, we successfully used machine to substitute human beings in

Fig. 10. Tuning process of scaling factor D, F and their variance.

Fig. 11. Tuning process of CR and its variance

tuning hyper-parameters this work.

V. CONCLUSION

In this paper, we proposed a novel meta learning based tuning method WRA-PGTA to find best hyper-parameters con-figure for EAs. This method can learn initial hyper-parameters from meta train set and fast adapt to meta test set. Experiments show that WRA-PGTA tune hyper-parameters faster and more steady than traditional tuning methods. Ablation study further shows efficiency of tricks used in our algorithms. Actually, WRA-PGTA can tune hyper-parameters which are stable w.r.t. time. There are many EAs with hyper-parameters varying by times. Then we consider using more machine learning tools to handle this problem.

Fig. 12. Meta-training process for scaling factor D and F.

REFERENCES

[1] M. Pelikan, D. E. Goldberg, and E. Cantupaz, “BOA: the bayesian optimization algorithm,” in Genetic and evolutionary computation conference, pp. 525–532, 1999.

[2] C. E. Rasmussen, “Gaussian processes in machine learning,” in Lecture notes in computer science, pp. 63–71, Springer, 2003.

[3] J. R. Gardner, M. J. Kusner, Z. E. Xu, K. Q. Weinberger, and J. P. Cunningham, “Bayesian optimization with inequality constraints.,” in ICML, pp. 937–945, 2014.

[4] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, “Information- theoretic regret bounds for gaussian process optimization in the bandit setting,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3250–3265, 2012.

[5] N. D. Freitas, A. Smola, and M. Zoghi, “Exponential regret bounds for gaussian process bandits with deterministic observations,” in ICML, 2012.

[6] K. Kawaguchi, L. P. Kaelbling, and T. Lozano-P´erez, “Bayesian opti- mization with exponential convergence,” in NeurIPS, pp. 2809–2817, 2015.

[7] I. Roman, J. Ceberio, A. Mendiburu, and J. A. Lozano, “Bayesian optimization for parameter tuning in evolutionary algorithms,” in CEC, pp. 4839–4845, 2016.

[8] C. Huang, B. Yuan, Y. Li, and X. Yao, “Automatic parameter tuning using bayesian optimization method,” in CEC, pp. 2090–2097, 2019.

[9] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimiza- tion of machine learning algorithms,” in NeurIPS, 2012.

[10] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. M. A. Patwary, Prabhat, and R. P. Adams, “Scalable bayesian optimization using deep neural networks,” in ICML, pp. 2171–2180, 2015.

[11] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas, “Learning to learn by gradient descent by gradient descent,” in NIPS, pp. 3981–3989, 2016.

[12] S. Wang, J. Sun, and Z. Xu, “Hyperadam: A learnable task-adaptive adam for network training,” in AAAI, 2019.

[13] K. Li and J. Malik, “Learning to optimize,” in ICLR, 2017.

[14] Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. De Freitas, “Learning to learn without gradient descent by gradient descent,” in ICML, pp. 748–756, 2017.

[15] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in ICLR, 2017.

[16] P. H. T. Luca Bertinetto, Joao Henriques and A. Vedaldi, “Meta-learning with differentiable closed-form solvers,” in ICLR, 2019.

[17] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, pp. 1126–1135, 2017.

[18] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” arXiv preprint arXiv:1803.02999, 2018.

[19] Y. Balaji, S. Sankaranarayanan, and R. Chellappa, “Metareg: Towards domain generalization using meta-regularization,” in NeurIPS, pp. 1006– 1016, 2018.

[20] Y. Bengio, T. Deleu, N. Rahaman, R. Ke, S. Lachapelle, O. Bilaniuk, A. Goyal, and C. Pal, “A meta-transfer objective for learning to disentangle causal mechanisms,” arXiv:Learning, 2019.

[21] M. Feurer, J. T. Springenberg, and F. Hutter, “Initializing bayesian hyperparameter optimization via meta-learning,” in AAAI, 2015.

[22] J. Kim, S. Kim, and S. Choi, “Learning to warm-start bayesian hyper- parameter optimization,” arXiv preprint arXiv:1710.06219, 2017.

[23] M. Poloczek, J. Wang, and P. I. Frazier, “Warm starting bayesian optimization,” in Winter simulation conference, pp. 770–781, 2016.

[24] M. Feurer, B. Letham, and E. Bakshy, “Scalable meta-learning for bayesian optimization.,” in ICML AutoML Workshop, 2018.

[25] M. Wistuba, N. Schilling, and L. Schmidt-Thieme, “Two-stage transfer surrogate model for automatic hyperparameter optimization,” in Joint European conference on machine learning and knowledge discovery in databases, pp. 199–214, Springer, 2016.

[26] M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming. John Wiley and Sons, Inc., 1994.

[27] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, d. D. G. Van, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wier- stra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” Computer Science, 2013.

[29] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.

[30] J. Galletly, “Evolutionary algorithms in theory and practice,” Complexity, vol. 2, no. 8, pp. 26–27, 1996.

[31] S. Das and P. N. Suganthan, “Differential evolution: A survey of the state-of-the-art,” IEEE Transactions on Evolutionary Computation, vol. 15, no. 1, pp. 4–31, 2011.

[32] K. Arulkumaran, P. Deisenroth, M, M. Brundage, and et al., “Deep rein- forcement learning: A brief survey,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 226–38, 2017.

[33] R. Sutton and A. Barto, Reinforcement Learning:An Introduction. 1998.

[34] B. T. Polyak, “Some methods of speeding up the convergence of iter- ation methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.

[35] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.

[36] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

[37] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.

[38] M. Zinkevich, “Online convex programming and generalized infinitesi- mal gradient ascent,” in ICML, pp. 928–935, 2003.

[39] C. M. Bishop, Pattern recognition and machine learning. springer, 2006.

[40] J. Mockus, “On bayesian methods for seeking the extremum,” in Optimization Techniques IFIP Technical Conference, pp. 400–404, Springer, 1975.

VI. APPENDIX

A. Bayesian Optimization

Bayesian optimization is to find a minimizer of a function f in a bounded space by iteratively querying the function at sampled points. At n-th iteration, we first construct a probabilistic model on D for estimating function values. Then using an acquisition function a(x) to select next point. We estimate f by surrogate model based on Gaussian process (GP). We assume , where is mean function and is covariance function or kernel. Given D , GP is simply a joint Gaussian: f, where Kis a covariance matrix with elements of defined as kernel function measuring relation between and . We choose RBF kernel expin this work. Then for a new point , we consider joint distribution of previously and newly sampled points:

k. The posterior satisfies Gaussian distribution [39]) with mean and variance as

which is the surrogate model of GP. As we can see, GP estimates function value via given data D based on surrogate model of Eq. (19) and Eq. (20). Then we use acquisition function a(x) to choose which point will be evaluated next. Upper confidence bound (UCB) [4]and expected improvement (EI) [40] have been shown to be efficient. The expected improvement is the most prominent choice for bayesian optimization. Formally, the improvement for x is defined as:

where is the current best function value and Y is a random variable modeling the function value f(x) depending on observation . Then the point with highest expected improvement will be chosen to be next point: