b

DiscoverSearch
About
My stuff
On Hyper-parameter Tuning for Stochastic Optimization Algorithms
2020·arXiv
Abstract
Abstract

This paper proposes the first- ever algorithmic framework for tuning hyper-parameters of stochastic optimization algorithm based on reinforcement learning. Hyperparameters impose significant influences on the performance of stochastic optimization algorithms, such as evolutionary algorithms (EAs) and meta-heuristics. Yet, it is very time-consuming to determine optimal hyper-parameters due to the stochastic nature of these algorithms. We propose to model the tuning procedure as a Markov decision process, and resort the policy gradient algorithm to tune the hyper-parameters. Experiments on tuning stochastic algorithms with different kinds of hyper-parameters (continuous and discrete) for different optimization problems (continuous and discrete) show that the proposed hyper-parameter tuning algorithms do not require much less running times of the stochastic algorithms than bayesian optimization method. The proposed framework can be used as a standard tool for hyper-parameter tuning in stochastic algorithms.

Hyper-parameters impose significant influences on the performance of stochastic optimization algorithms, such as evolutionary algorithms (EAs) and meta-heuristics. Assume that we have a stochastic algorithm A with parameters  λ. The solution x∗obtained by applying A to minimizing a function f from an initial solution  x01can be considered as a random vector with unknown distribution. Without loss of generality, we assume that the stochastic algorithm’s return follows an additive noise model, i.e.

image

where  ϵfollows an unknown distribution P and  Ωis the search space. The randomness of  x∗comes from the intrinsic stochasticity of A. Therefore, the hyper-parameter tuning can be formalized in terms of minimization as follows:

image

where  Ωλis the decision space of the hyper-parameters and M(·)is a deterministic metric to measure the performance of the found solution. That is, to find optimal hyper-parameter, we shall minimize the expectation of the metric over the

HZ, JS and ZX are all with the School of Mathematics and Statistics and National Engineering Laboratory for Big Data Analytics, Xi’an Jiaotong University, Xi’an, China. This work was supported by Major Project of National Science Foundation of China (grant no. U1811461), Key Project of National Science Foundation of China (grant no. 11690011) and Project of National Science Foundation of China (grant no. 61721002). Corresponding author: Jianyong Sun, email: jy.sun@xjtu.edu.cn

stochasticity, including the randomness of the initial solution and the unknown distribution P.

Usually, for each hyper-parameter configuration  λ, we need to compute the expected metric over a sample of initial vectors by Monte Carlo approximation2. The averaged metric value is used to approximate the expectation. That is,

image

where  Nλis the number of runs of the stochastic algorithm. Obviously, for each hyper-parameter configuration, considerable amount of computational resources are required to compute the average. Therefore, it is prohibitively time-consuming to determine optimal hyper-parameters.

In literature, hyper-parameter tuning methods include grid search, random search, orthogonal experimental design method, and model-based methods such as Bayesian optimization algorithm (BOA) [1], response surface [], etc. Among these methods, grid search is mostly used simply because it is easy to implement, BOA is most used and studied. And The study of hyper-parameter tuning for stochastic algorithm is very rare if not at all. The only paper we have found is [], in which the authors simply used BOA for hyper-parameter tuning. They studied how different acquisition functions affect the performance of BOA for hyper-parameter tuning in a stochastic algorithm (DE is considered in the paper). Further, to the best of our knowledge, there is no standard tool in literature for hyper-parameter tuning in the computational intelligence community.

Bayesian optimization is a kind of optimization method to find global minimizer of a ’black box’ function without knowing its convexity, differentiability or continuity. Bayesian optimization iteratively samples new point by an acquisition function, and uses Gaussian process [2], [3] to estimate a surrogate model to fit f(x) based on the observed points. When using specific acquisition function, its regret bound has been proved as  O(√n) (nis the number of sampled points) in [4]. [5] [6] proposed methods that use Bayesian

2In some papers, the optimal configuration is expected to be obtained also over a set of test functions, in which case the formulation can be written as

image

Nfis the number of sampled functions. In this paper, we do not consider to find the optimal hyper-parameter for a set of functions, though the proposed method is capable. Our philosophy is that a stochastic optimization is casesensitive.

optimization with dividing region, which successfully accelerates convergence rate. In application, Eq. 2 is minimizing a ’black box’ function, thus using BOA to tune hyper-parameters is appropriate. Recently, [7], [8] have used BOA in EA hyper-parameters tuning, and [9] successfully used BOA in machine learning hyper-parameters tuning. Details of BOA will be shown in Section. VI-A.

However, there are many questions when using BOA to tune hyper-parameters: First, BOA has cubic complexity w.r.t. number of sampled points, which means its efficiency cannot be guaranteed. Second, as Gaussian Process is probability model, thus acquisition function should consider mean and variance of the model. Hence much work based on BOA may assume the space of input like [10]. Actually in EAs, function value or fitness value is in a vary large range, which will make BOA hard to use. Third, BOA assumes objective ’black box’ function has prior with zero mean which is easy to implement but sometimes unreasonable. Last but not the least, for stochastic problems (such as EAs), we should use Eq. 3, which means we should cost several times.

Meta learning is an approach to learn intrinsic common knowledge between different tasks. Meta learning aims to transfer knowledge from some tasks called ’meta train set’ to another tasks called ’meta test set’. In meta learning framework, ’knowledge’ can be various things, therefore, it has been widely used. For instance, firstly, meta learning can tell us how to optimize an objective function (learning to optimize): [11],[12],[13] have learned a good optimization rule (descent direction) from meta train set (some deep neural networks) and perform better on meta test set (another neural networks)than common methods (like ADAM and SGD). [14] learned a method for ’black box’ function. Secondly, few-shot learning is an application to use meta learning [15],[16]. Thirdly, common knowledge can be an initialization of a model: [17] [18] learned the initialization of a model without any extra parameters. Then domain adaptation (DA) [19] naturally can use meta learning as DA aims to transfer knowledge from source domain to target domain which is corresponding to meta train and meta test. Recently, even disentangling causal mechanisms [20] can be mined by using meta learning. Details of [17] [18] will be talked about in Section. II-C.

When meta learning meeting bayesian optimization, it can be used to handle the first problem and the third problem in BOA, then it is called bayesian optimization warm starting. [10] used neural network to substitute mean and variance of Gaussian Process, which has only linear complexity. [21] used handcrafted meta feature to measure distance of tasks to choose good initial point for BOA. [22] used metric learning to learn meta feature to measure distance of tasks. [23] learned a distribution of several tasks to handle zero prior problem. [24],[25] used ranking based method to learn a prior of BOA to make BOA convergent fast. Reinforcement learning is another method that widely used and studied. The RL algorithms based on Markov Decision Process (MDP) [26] and have succeeded overwhelmingly for playing games such as GO [27], atari [28] , robotics [29] and many others. Details of RL methods willed talked about in Section. II-B.

To make the hyper-parameter tuning efficient and effective,

image

Fig. 1. The introduction of RL framework.

this paper proposes to use reinforcement learning approach for the hyper-parameter tuning in EAs. We define a scale invariant reward (feedback) to eliminate scale problem in EAs. Furthermore, we use meta-learning to learn the initialization of hyper-parameters from meta train set and we hope this initialization can help our tuning algorithm find optimal hyper-parameters fast. Experimental scene shows that our method perform better than standard BOA and some warm starting methods. We hope the developed hyper-parameter tuning framework could become a standard tool for future research.

In the rest of the paper, Section II-B presents basic preliminaries consists of Differential Evolutionary Algorithms, Reinforcement learning methods and Meta learning for determining initialization. The proposed hyper-parameter tuning framework is presented in Section III. Experimental results on tuning parameters of stochastic algorithms (DE), is presented in Section IV. Section V concludes the paper.

A. Differential Evolutionary Algorithms

Differential evolution (DE) [30],[31] is a global optimization algorithm to minimize function  f : Rd → R1. DE starts with initial population  X0 = [x01, x02, · · · , x0Np] ∈ Rd×Np, and iterates  uk+1i = xki +F(xkbest −xki )+F(xkr(1) −xkr(2)), where xkr(i)(i = 1, 2)are randomly chosen from  Xk, xkbestis the best point in k-th iteration, F is hyper-parameter. Then we can use binomial crossover and selection methods to create  Xk+1. We generalize this iteration formulation to be

image

using 2 hyper-parameters F and D, which is a general formulation covering most DE formulations. F and D influence optimization performance of DE (e.g., finding minimum of functions by DE) significantly, and thus we can utilize our method to tune these hyper-parameters. We use function set of CEC 2017 for optimization by DE and use our method to find optimal hyper-parameters.

B. Markov Decision Process and Reinforcement learning

Reinforcement learning (RL) aims to find a policy so that agent(s) is able to take actions in an environment so as to maximize its cumulative reward. Fig. 1 shows the typical representation of reinforcement learning. An agent interacts with environment. At each state  st, it takes an action  ataccording to present policy  π(at|st). Given this action, environment responses with a new state  st+1and an immediate reward  rt+1. The goal is to find a policy  π = p(at|st; θ)with parameter  θso as to maximize the expectation of total rewards  E��Tt=1 αtrt�where  αtdenote time-step dependent weighting factors. In practice, weighting factors is always set to exponential power of constant i.e.  αt = γt.

RL can be modeled as a Markov decision process (MDP) which is a 4-tuple  (s, a, r, pa). It consists of a set of agent and environment states S, a set of actions A, and a probability transition  pa(s, s′)which specifies the probability of transition from state s to  s′under action a, while r(s) is the immediate reward with state s.

To train an RL for the parameter  θof the policy, various RL algorithms, such as temporal difference methods, policy gradient and others have been widely used.

In reinforcement learning, the learner (agent) chooses to take an action at each time step; the action changes the state of environment; (possibly delayed) feedback (reward) returns and affects the learner’s next decision. The learner aims to find an optimal policy so that the actions decided by the policy maximize the cumulative rewards along time.

Consider a finite-horizon MDP with continuous state and action space defined by the tuple  (S, A, µ0, p, r, π, T)where S ∈ RDdenotes the state space,  A ∈ Rdthe action space, µ0the initial distribution of the state,  r : S → Rthe reward, and T the time horizon, respectively. At each time t, there are st ∈ S, at ∈ Aand a transition probability  p : S × A × S →R where  p(st+1|at, st)denotes the transition probability of st+1conditionally based on  stand  at. The policy  π : S ×A×{0, 1, · · · T} → R, where  π(at|st; θ)is the probability of choosing action  atwhen observing current state  stwith  θas the parameter.

As shown in Fig. 1, starting from a state  s0 ∼ µ0, the agent chooses  a0 ∼ π(a0|s0); after executing the action, agent arrives at state  s1 ∼ p(s1|a0, s0). Meanwhile, agent can get reward  r(s1)from the environment. Iteratively, a trajectory τ = {s0, a0, s1, a1, · · · , aT −1, sT }can be obtained. The optimal policy  π∗is to be found by maximizing the expectation of the cumulative reward  R(τ) =��T −1t=0 r(st+1)�:

image

where the expectation is taken over trajectory  τ ∼ q(τ; θ)where

image

A variety of reinforcement learning algorithms have been proposed for different scenarios of the state and action spaces, please see [32] for recent advancements. The RL algorithms have succeeded overwhelmingly for playing games such as GO [27], Atari [28] and many others.

We briefly introduce the policy gradient method for continuous state space [33], which will be used in our study. Taking derivative of  U(θ)w.r.t.  θdiscarding unrelated terms, we have

image

The expectation of Eq. 7 can be calculated by sampling trajectories  τ1, · · · , τNin practice:

image

where  a(i)t (s(i)t )denotes action (state) at time t in the ith trajectory,  R(τ i)is the cumulative reward of the ith trajectory. For continuous state and action space, normally assume

image

where  φcan be any smooth function, like radial basis function, linear function, and even neural networks.

C. Meta learning for determining initialization

MAML [17] and Reptile [18] are two typical meta-learning methods to obtain an initialization of a model from meta train set to fast adapt to new problems. Given data set D, model M(w) and loss function  L(·, ·), MAML learns initialization w∗by minimizing  L (M(w − α∇L(M(w), D)), D). It implies that MAML learns a  w∗that can fast adapt by one gradient descent step. Optimization in MAML uses secondorder derivatives of L(w). In our application, Bayesian optimization procedures are non-differentiable. Therefore, MAML can not be applied to our task directly. Reptile is one of first-order meta-learning algorithms. As illustrated in Alg. 1, to obtain an optimal initialization of model parameters w that can adapt well to new problem  τby k-steps of optimization denoted by  U kτ (w), the parameter w is iteratively updated by vectors  ( �w − w)that enforces w to be close to the updated parameters  �wof different tasks. In this work, we borrow idea from Reptile, but design a weighted Reptile algorithm for learning initialization of hyper-parameters.

image

4 Update w  ← w + ϵ( �w − w).

In this section, we propose a tuning algorithm to find optimal hyper-parameters based on RL framework. We model tuning hyper-parameters process into MDP framework, then finding optimal hyper-parameters is to learn an optimal policy. Then we design a scale invariant reward and use policy gradient method to learn a policy. Actually it will induce a well explainable formulation. Then we use Weighted Reptile Algorithm to learn initial hyper-parameters from meta train set, and we hope it can fast adapt on meta test set. Figure. 2 shows the framework of our work.

The expectation of the metric, defined as

image

usually exhibits as a smooth and multi-modal function. Fig. 3 shows a demonstration of the effect of the hyper-parameters F and D for a classical DE (rand/best/2) to the metric which is defined as the found optimum. Fig. 3(a) shows the plot for function  F1, Fig. 3(b) shows the contour plot. From the figure, it is seen that there exists many local optima.

Optimizing  Ef(λ), like optimizing other functions, needs an efficient local search algorithm  Hλand a good initial hyper-parameter configuration  λ0.

A. The local search algorithm Hλ

In terms of how a hyper-parameter influences the optimization procedure of a stochastic algorithm, there can be two categories. In the first category, we say the influence is linear in case that solution(s) in the new iteration is obtained through a generalized linear model. For example, in differential evolution (DE), a widely-used mutation operator, called rnd/1/best, can be written as follows:

image

where  xtbestis the best solution found so far, and  xtr1is a randomly selected solution from current population  Xt. Alternatively, it can be seen that the whole population  Xt+1

can be formalized as

image

In the second category, we say that the hyper-parameter is nonlinear to the stochastic algorithm if it is not linear as defined previously. For example, the crossover rate (CR) used in DE, and the population size N used in population-based meta-heuristics. For these parameters, we cannot write down an explicit relationship between the parameter and the iterates.

In the following, we present the hyper-parameter tuning framework. Note that we can readily model the stochastic optimization procedure as a Markov Decision Process (MDP). Formally, we take the current iterate  Xtas state  st, the solution generation process as the action  at, i.e.  at = φ(st; λ) .=φ(Xt, λ), and the best/current objective function value is the reward  r(st). The transition can be considered as the selection process.

Note that we use the same notation  φin Eq. 9, as our action is  φ(st; λ)adding a gaussian noise. Thus, we see that finding optimal hyper-parameters equals to learn an optimal policy whose parameters are hyper-parameters  λ.

B. Tuning formula

Now we assume  λ ∈ R1, s ∈ RNP ×d, φ(s, λ) ∈NP ×d. For using Reinforcement learning, we need to generate data, thus we set  at ∼ N(φ(st, λ), σ2∥∇λφ(s, λ)∥2). Noticing that we add  ∥∇λφ(s, λ)∥2which we called balanced coefficients. For linear case (i.e.  a ∼ N(φ(s; λ), σ2∥�φ(s)∥2)), according to policy gradient, Eq. 8

image

For linear case, we take  a(i)t = λi �φ(s(i)t ), λi ∼ N(λ, σ2), then the above equation can be rewritten as

image

In [33], adding a baseline b can reduce the variance of policy gradient. Then we minus  ¯r .= 1T 1N�T −1t=0 �Ni=1 r(s(i)t+1)to induce:

image

For efficiency, we use nonlinear function R(x) .= max(0, x) on �T −1t=0 (r(s(i)t+1) − ¯r), that is, we only calculate direction which can lead to larger reward (The efficiency of adding R(x) will be shown in Sec. IV-C).

image

This equation tell to which direction we shall tune the parameter  λdepending on the reward. If �T −1t=0 r(s(i)t+1) − T ¯ris positive, we shall tune the parameter towards  λi; otherwise keep the parameter as current  λ.

For nonlinear case, assume that

image

image

Fig. 2. The meta-learning framework for tuning hyper-parameters in EAs.

image

Fig. 3. The contour of hyper-parameters in DE algorithm.

where  φis differential in terms of  λ. That is, by Taylor expansion, we have:

φ(s; λ1) − φ(s; λ2) = ∇λφ(s; λ2)(λ1 − λ2) + O�(λ1 − λ2)2�

As we set  φis a continuous mapping, thus for  ∀ai ∼N(φ(s; λ), σ2∥∇λφ(s, λ)∥2), there exists a  λiso that  ai =φ(s; λi). By using policy gradient, we need to compute ∇λU(λ)which is proportional to compute the gradient of

image

ln ∥∇λφ(s, λ)∥−

image

where  φ(s; λi) ∼ N(φ(s; λ), σ2∥∇λφ(s, λ)∥2). We denote the equation as  F(λi; λ). The second part can be computed approximately as follows by using Taylor expansion:

image

Then the distribution of  λiis approximately  N(λ, σ2). For the first part of  F(λ; λi), its gradient w.r.t.  λis proportional to ∇λφ(s, λ)∇2λφ(s, λ), where  ∇2λφ(s, λ)is the second deriva- tive. In this paper, we assume that the influence of hyper-parameters not so violent in a small neighborhood, otherwise

image

the tuning process is intractable. Thus we assume  ∇2λφ(s, λ)is sufficiently small, and we ignore the first term.

Therefore, �∇λFis approximately a ascent direction ofF(λ; λi). Thus, we can again use Eq. 13 to approximate the tuning of nonlinear hyper-parameters.

C. tuning algorithm

Actually, in practice, many tricks are guaranteed the results. Then in next subsections, we will introduce tricks in Alg. 2: scale reward, updating variance and momentum gradient descent.

1) scale reward: As mentioned above, reward can significantly influence the direction we tune. Thus we hope our reward does not have scale problem. First of all, the directly choice of reward  r(s(i)t )is the best function value  f t−1,ibest −f t,ibestin first t generations for hyper-parameter  λi(ith trajectory). Like [27] we set  r(s(i)t ) = 0when t < T and  r(s(i)T ) =f 0,ibest − f T,ibest. Then  Ri ≜ �T −1t=0 r(s(i)t+1) = f 0,ibest − f T,ibest. However, in EAs, if we poorly choose hyper-parameters, then the optimization result is very poor (sometimes  106 − 1010), otherwise, the performance can be good (mostly several hundreds). Therefore, if we use function value can cause serious scale problem. We consider using logarithm to eliminate this problem. Meanwhile, using  Ri = ln(f 0,ibest)−ln(f T,ibest)ln(f 0,ibest)to constrain the range. Logarithm has two advantages: first it can eliminate large scale, second it can discriminate small scale (near zero).

However, it is not so discriminative because minimum of objective functions in evolutionary algorithm are always not zero. When function value is near to minimum, logarithm can not discriminate values except for minimum is zero. Thus we consider adding history best value:  f ibesth, which is the best function value of all individuals in ith trajectory. We define fbesth ≜ min{f ibesth|i = 1, · · · , N}. Then we use:

image

The idea of adding  fbesthis similar to adding baseline in policy gradient: adding mean of reward in policy gradient can make reward can promote discrimination power (reward minus mean can be negative or positive, otherwise reward is all positive). Adding  fbesthcan make  f T,ibest − fbesthis near to zero, then logarithm can discriminate them. And we show the difference of different reward in Section. IV-C.

2) updating variance: In reinforcement learning, we always fix  σin policy. Here we will talk about how to update variance σ. In practice, derivative of U w.r.t.  σis available:

image

As  ln π1 ≈ − (λi−λ)2σ2, where  λi ∼ N(λ, σ2). Then we denote ∇σU = ∇1σU + ∇2σUcorresponding to  ln π1and  ln π2.

image

Same as before, we use baseline  ¯r, which induces

∇1σU(λ, σ) ≈ 1N

Then we have:

image

The updating of  σis :  σ ← σ +∇σU(λ, σ). Then explanation of derivative w.r.t.  σ: if �T −1t=0 (r(s(i)t+1) − ¯r)is positive and (λi − λ)2is large, which implies good hyper-parameter  λiis far from current  λ, then  σprefers to be larger. Otherwise, if �T −1t=0 r(s(i)t+1) − T ¯ris negative and  (λi − λ)2is large, which implies bad hyper-parameter  λiis far from current  λ, then  σprefers to be smaller. For instance: when current  λis not good, then much  λis sampled can get better performance than  λ, then  σwill be larger, that is, explore larger space. Otherwise, when current  λis almost the best, then few  λis sampled can get better performance than  λ, then  σwill be smaller, that is, exploit more. And term  − 12σ2can be regarded as regularization term, which will make variance is smaller. Then this term can make tuning process more steady. Actually, in Section. IV-C, we will show  σcan benefit our algorithm to be convergent fast and be more steady.

3) momentum gradient descent: Observing the Eq. 13, it is similar to stochastic gradient descent (SGD):

image

Then as we know, for optimizing neural networks, SGD with momentum[34], Adagrad[35], AdaDelta[36], ADAM[37] were proposed, which can accelerate convergence. Thus we use SGD with momentum:

image

In Section. IV-C, we will show momentum can benefit our algorithm to accelerate convergence.

4) tuning algorithm: Based on above analysis, we can propose Policy Gradient based Tuning Algorithm (PGTA). Alg. 2 shows the framework. As shown, in line.6-line.8, it

image

samples new hyper-parameter near given hyper-parameter and evaluate it. Then line. 11 calculates scale reward, line. 12 use  max(·, 0)on scale reward and line. 14 update sampling variance.

D. Determining λ0

To determine the initial hyper-parameter configuration  λ0, we can resort to the history optimal configurations of these hyper-parameters for a set of functions. This can be cast as a meta learning problem. Learning initial parameters has been explored in [17] and [18]. Here we borrow the thought of Reptile, and we proposed a Weighted-Reptile-Algorithm (WRA) Alg. 3. WRA has two difference: first adding noise on current hyper-parameter (line. 5), second updating initial parameter by weighted direction (line. 7). Here we set weight is the same as Eq. 14, where  fbesthis the best function value of all functions values obtained by setting  λsTas hyper-parameter configure.

Here we give a probability aspect: In line. 7, the updating formula inducing  ∆λ0 = �Nss=1(λsT − λ0)Ws, where  Ws ≜

image

max{(ws−�Nss=1 ws), 0}. And we can get this update formula by optimizing:

image

If we regard the  λsTand  Wsare the observations or data Ds, then optimizing Eq. 18 equals to maximizing posterior function:

image

Noticing that  Ws ≥ 0, thus�

12Wscan make sense. Thus optimizing Eq. 18 equals to making  λ0follows the Gaussian distribution  N(λsT ,� 12Ws ). Meanwhile, the bigger  Wsis, the smaller the variance of distribution is, and  λ0should be nearer the corresponding mean  λsT.

In this section, we show the performance of tuning hyper-parameters of DE (D, F, crossover). We use CEC 2017 data set. We aim to find best hyper-parameters con-figure of DE when using on CEC 2017. That is, we should minimize hyper-parameter response function  Z(λ) ≜

image

BOA: Basic bayesian optimization with RBF kernel k(x, y) = ρ2∥x − y∥2is considered and  ρis set to 1 in our experiment. The acquisition function is chosen to be EI (see Section. VI-A for detail). And Bayesian optimization is sensitive to function value, thus we calculate 20 times and get average function value for one fixed hyper-parameter configure. That is  Nλ = 20in our experiments. For eliminating scale problem, logarithm of average function value is used.

WRA: For our algorithm, first we use WRA to learn initial hyper-parameters from meta-train set. And then using PGTA to tune hyper-parameters in meta-test set with learned initialization. In experiments, Meta-train set consists of basic function  f1 − f5, Ns = 20, σ1 = 0.1, β = 0.1, and Metatest set consists of test function  F1 − F30. As we know, test functions are made by basic functions, thus, these two set have some relations but do not intersect.

PGTA: First we initial hyper-parameters by learned initialization, then adding a Gaussian noise, then evaluating it. In experiments, we set  NT = 20which means we will add noise 20 times and get 20 different hyper-parameters, and we only evaluate only once for each hyper-parameter. Therefore, we guarantee the evaluations in BOA and PGTA are the same. Then calculating R to update hyper-parameters. Finally, we will update variance using Eq. 15. Other settings: initial σ = 0.1, α = 0.1.

A. D and F

As mentioned above, D and F are linear case, in this section we will first tune D and F. We test in 10 and 30 dimension. As We minimize Eq. 4, then it should be stochastic optimization problem. Thus we use ’regret’ cost function as measure like [38], that is,  C(λT ) = 1T�Ti=1 ln fbest(Xi; λT ). In this paper, we set maximum tuning steps T = 10. Meanwhile, setting CR = 1, we run BOA and WRA-PGTA for 20 times and calculate median of  C(λT ), T = 1, · · · 10. Then Fig. 4 and Fig. 5 show the median of  C(λT )of WRA-PGTA and BOA. In figures we can see, WRA-PGTA can get better hyper-parameters than BOA in most functions.

B. Crossover

For CR, we test on 10D functions. Same as D and F, we run 20 times and calculate median. Setting D = F = 0.5, we use WRA-PGTA and BOA to tune the CR. Figure. 6 shows the median of  C(λT )of WRA-PGTA and BOA. In figures we see, for CR, the gap of two methods is very small (please notice the y-axis in each figure).

C. Ablation Study

In this section, we will clarify our tricks are effective, for detail: SGD vs momentum;  max(·, 0)in Eq. 13; updating  σ; scale reward and stability of meta-train.

First we use Alg. 2 on test function 1. Figure. 7 shows tuning process by using SGD and SGD with momentum. Here we see that momentum can find good hyper-parameter faster than SGD. Second Figure. 8 shows tuning process of effect of max(·, 0)and updating  σ. Here we find that using  max(·, 0)can make PGTA available (blue line and yellow line). And Using updating  σcan make PGTA more steady. Then we use Alg. 2 on test function 13, whose minimum is 1300 (far from zero). And Figure. 9 shows tuning process by using and without using scale reward. Then we can find by using scale reward (red line), Alg. 2 can get better hyper-parameters than without using scale reward.

Then Figure. 10 shows the D and F and its variance tuning process. From figure we see, variance is first increasing and then decreasing. It means that initial hyper-parameters are not good, thus first increase the searching region then decreasing the searching region as hyper-parameters are good enough. Figure. 11 shows the crossover and its variance tuning process. The trend of crossover and its variance are the same as D and F.

image

Fig. 4. Performance of WRA-PGTA and BOA on tuning D and F in DE algorithm on test functions in CEC 2017 (10D): y-axis is median of  C(λT ) for20 runs, x-axis is tuning steps (iterations).

image

Fig. 5. Performance of WRA-PGTA and BOA on tuning D and F in DE algorithm on test functions in CEC 2017 (30D): y-axis is median of  C(λT ) for20 runs, x-axis is tuning steps (iterations).

image

Fig. 6. Performance of WRA-PGTA and BOA on tuning CR in DE algorithm on test functions in CEC 2017 (10D): y-axis is median of  C(λT )for 20 runs, x-axis is tuning steps (iterations).

image

Fig. 7. Ablation study of momentum SGD.

image

Fig. 8. Ablation study of using updating  σ and using max(·, 0).

At last, we show the stability of meta-train: Figure. 12 shows D and F in 5 epochs (each epoch has 6 iterations) for 5 meta-train functions, that is, every 30 iterations is an epoch and for first epoch 1-6 iterations are for meta-train function 1, 7-12 iterations are for meta-train function 2, ... , 25-30 iterations are for meta-train function 5. From figure we see, the meta-train of hyper-parameters is convergent.

image

Fig. 9. Ablation study of using scale reward (fbesth).

D. Discussion

First, it is critical that finding a good initialization. Actually, human beings always make much effort on finding empirical configure by hand. Thus we use WRA to imitate this process. Second when tuning hyper-parameters by hand, we always try some times near empirical configure, and we will try configure which may have better performance. That is what Eq. 13 demonstrates. Finally, after several trials, we hope it can stay at a small region. Then Eq. 15 helps to do this. In other words, we successfully used machine to substitute human beings in

image

Fig. 10. Tuning process of scaling factor D, F and their variance.

image

Fig. 11. Tuning process of CR and its variance

tuning hyper-parameters this work.

In this paper, we proposed a novel meta learning based tuning method WRA-PGTA to find best hyper-parameters con-figure for EAs. This method can learn initial hyper-parameters from meta train set and fast adapt to meta test set. Experiments show that WRA-PGTA tune hyper-parameters faster and more steady than traditional tuning methods. Ablation study further shows efficiency of tricks used in our algorithms. Actually, WRA-PGTA can tune hyper-parameters which are stable w.r.t. time. There are many EAs with hyper-parameters varying by times. Then we consider using more machine learning tools to handle this problem.

image

Fig. 12. Meta-training process for scaling factor D and F.

[1] M. Pelikan, D. E. Goldberg, and E. Cantupaz, “BOA: the bayesian optimization algorithm,” in Genetic and evolutionary computation conference, pp. 525–532, 1999.

[2] C. E. Rasmussen, “Gaussian processes in machine learning,” in Lecture notes in computer science, pp. 63–71, Springer, 2003.

[3] J. R. Gardner, M. J. Kusner, Z. E. Xu, K. Q. Weinberger, and J. P. Cunningham, “Bayesian optimization with inequality constraints.,” in ICML, pp. 937–945, 2014.

[4] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, “Information- theoretic regret bounds for gaussian process optimization in the bandit setting,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3250–3265, 2012.

[5] N. D. Freitas, A. Smola, and M. Zoghi, “Exponential regret bounds for gaussian process bandits with deterministic observations,” in ICML, 2012.

[6] K. Kawaguchi, L. P. Kaelbling, and T. Lozano-P´erez, “Bayesian opti- mization with exponential convergence,” in NeurIPS, pp. 2809–2817, 2015.

[7] I. Roman, J. Ceberio, A. Mendiburu, and J. A. Lozano, “Bayesian optimization for parameter tuning in evolutionary algorithms,” in CEC, pp. 4839–4845, 2016.

[8] C. Huang, B. Yuan, Y. Li, and X. Yao, “Automatic parameter tuning using bayesian optimization method,” in CEC, pp. 2090–2097, 2019.

[9] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimiza- tion of machine learning algorithms,” in NeurIPS, 2012.

[10] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. M. A. Patwary, Prabhat, and R. P. Adams, “Scalable bayesian optimization using deep neural networks,” in ICML, pp. 2171–2180, 2015.

[11] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas, “Learning to learn by gradient descent by gradient descent,” in NIPS, pp. 3981–3989, 2016.

[12] S. Wang, J. Sun, and Z. Xu, “Hyperadam: A learnable task-adaptive adam for network training,” in AAAI, 2019.

[13] K. Li and J. Malik, “Learning to optimize,” in ICLR, 2017.

[14] Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. De Freitas, “Learning to learn without gradient descent by gradient descent,” in ICML, pp. 748–756, 2017.

[15] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in ICLR, 2017.

[16] P. H. T. Luca Bertinetto, Joao Henriques and A. Vedaldi, “Meta-learning with differentiable closed-form solvers,” in ICLR, 2019.

[17] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, pp. 1126–1135, 2017.

[18] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” arXiv preprint arXiv:1803.02999, 2018.

[19] Y. Balaji, S. Sankaranarayanan, and R. Chellappa, “Metareg: Towards domain generalization using meta-regularization,” in NeurIPS, pp. 1006– 1016, 2018.

[20] Y. Bengio, T. Deleu, N. Rahaman, R. Ke, S. Lachapelle, O. Bilaniuk, A. Goyal, and C. Pal, “A meta-transfer objective for learning to disentangle causal mechanisms,” arXiv:Learning, 2019.

[21] M. Feurer, J. T. Springenberg, and F. Hutter, “Initializing bayesian hyperparameter optimization via meta-learning,” in AAAI, 2015.

[22] J. Kim, S. Kim, and S. Choi, “Learning to warm-start bayesian hyper- parameter optimization,” arXiv preprint arXiv:1710.06219, 2017.

[23] M. Poloczek, J. Wang, and P. I. Frazier, “Warm starting bayesian optimization,” in Winter simulation conference, pp. 770–781, 2016.

[24] M. Feurer, B. Letham, and E. Bakshy, “Scalable meta-learning for bayesian optimization.,” in ICML AutoML Workshop, 2018.

[25] M. Wistuba, N. Schilling, and L. Schmidt-Thieme, “Two-stage transfer surrogate model for automatic hyperparameter optimization,” in Joint European conference on machine learning and knowledge discovery in databases, pp. 199–214, Springer, 2016.

[26] M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming. John Wiley and Sons, Inc., 1994.

[27] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, d. D. G. Van, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wier- stra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” Computer Science, 2013.

[29] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.

[30] J. Galletly, “Evolutionary algorithms in theory and practice,” Complexity, vol. 2, no. 8, pp. 26–27, 1996.

[31] S. Das and P. N. Suganthan, “Differential evolution: A survey of the state-of-the-art,” IEEE Transactions on Evolutionary Computation, vol. 15, no. 1, pp. 4–31, 2011.

[32] K. Arulkumaran, P. Deisenroth, M, M. Brundage, and et al., “Deep rein- forcement learning: A brief survey,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 226–38, 2017.

[33] R. Sutton and A. Barto, Reinforcement Learning:An Introduction. 1998.

[34] B. T. Polyak, “Some methods of speeding up the convergence of iter- ation methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.

[35] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.

[36] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

[37] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.

[38] M. Zinkevich, “Online convex programming and generalized infinitesi- mal gradient ascent,” in ICML, pp. 928–935, 2003.

[39] C. M. Bishop, Pattern recognition and machine learning. springer, 2006.

[40] J. Mockus, “On bayesian methods for seeking the extremum,” in Optimization Techniques IFIP Technical Conference, pp. 400–404, Springer, 1975.

A. Bayesian Optimization

Bayesian optimization is to find a minimizer  x∗of a function f in a bounded space by iteratively querying the function at sampled points. At n-th iteration, we first construct a probabilistic model on D  = {xi, f(xi)}ni=1for estimating function values. Then using an acquisition function a(x) to select next point. We estimate f by surrogate model based on Gaussian process (GP). We assume  f(·) ∼ GP(m(·), k(·, ·)), where  m(·)is mean function and  k(·, ·)is covariance function or kernel. Given D  = {xi, f(xi)}ni=1, GP is simply a joint Gaussian: f(x1:n) ∼ N (m(x1:n), K(x1:n)), where K(x1:n)is a  n × ncovariance matrix with elements of  k(xi, xj)defined as kernel function measuring relation between  xiand  xj. We choose RBF kernel  k(xi, xj) =exp(−ρ∥xi − xj∥2)in this work. Then for a new point  xn+1, we consider joint distribution of previously and newly sampled points:

image

k(xn+1; x1:n) = (k(x1, xn+1), · · · , k(xn, xn+1)). The posterior  f(xn+1|D)satisfies Gaussian distribution [39]) with mean and variance as

image

which is the surrogate model of GP. As we can see, GP estimates function value via given data D based on surrogate model of Eq. (19) and Eq. (20). Then we use acquisition function a(x) to choose which point will be evaluated next. Upper confidence bound (UCB) [4]and expected improvement (EI) [40] have been shown to be efficient. The expected improvement is the most prominent choice for bayesian optimization. Formally, the improvement for x is defined as:

image

where  fbestis the current best function value and Y is a random variable modeling the function value f(x) depending on observation  {x1:n, f(x1:n)}. Then the point with highest expected improvement will be chosen to be next point:

image


Designed for Accessibility and to further Open Science