This paper considers unconstrained continuous global optimization problem:
where f is smooth and non-convex. The study of continuous global optimization can be dated back to 1950s [1]. The outcomes are very fruitful, please see [2] for a basic reference on most aspects of global optimization, [3] for a comprehensive archive of online information, and [4] for practical applications.
Numerical methods for global optimization can be classified into four categories according to their available guarantees, namely, incomplete, asymptotically complete, complete, and rigorous methods [5]. We make no attempt on referencing or reviewing the large amount of literatures. Interested readers please refer to a WWW survey by Hart [6] and Neumaier [3]. Instead, this paper focuses on a sub-category of incomplete method, the two-phase approach [7], [8].
A two-phase optimization approach is composed of a sequence of cycles, each cycle consists of two phases, a minimization phase and an escaping phase. At the minimization phase, a minimization algorithm is used to find a local minimum for a given starting point. The escaping phase aims
HZ, JS and ZX are all with the School of Mathematics and Statistics and National Engineering Laboratory for Big Data Analytics, Xi’an Jiaotong University, Xi’an, China. Corresponding author: Jianyong Sun, email: jy.sun@xjtu.edu.cn
to obtain a good starting point for the next minimization phase so that the point is able to escape from the local minimum.
A. The Minimization Phase
Classical line search iterative optimization algorithms, such as gradient descent, conjugate gradient descent, Newton method, and quasi-Newton methods like DFP and BFGS, etc., have flourished decades since 1940s [9], [10]. These algorithms can be readily used in the minimization phase.
At each iteration, these algorithms usually take the following location update formula:
where k is the iteration index, are the iterates,
is often taken as
where
is the step size and
is the descent direction. It is the chosen of
that largely determines the performance of these algorithms in terms of convergence guarantees and rates.
In these algorithms, is updated by using first-order or second-order derivatives. For examples,
in gradient descent (GD), and
in Newton method where
is the Hessian matrix. These algorithms were usually with mathematical guarantee on their convergence for convex functions. Further, it has been proven that first-order methods such as gradient descent usually converges slowly (with linear convergence rate), while second-order methods such as conjugate gradient and quasi-Newton can be faster (with super linear convergence rate), but their numerical performances could be poor in some cases (e.g. quadratic programming with ill-conditioned Hessian due to poorly chosen initial points).
For a specific optimization problem, it is usually hard to tell which of these algorithms is more appropriate. Further, the no-free-lunch theorem [11] states that “for any algorithm, any elevated performance over one class of problems is offset by performance over another class”. In light of this theorem, efforts have been made on developing optimization algorithms with adaptive descent directions.
The study of combination of various descent directions can be found way back to 1960s. For examples, the Broyden family [12] uses a linear combination of DFP and BFGS updates for the approximation to the inverse Hessian. In the Levenberg-Marquardt (LM) algorithm [13] for nonlinear least square problem, a linear combination of the Hessian and identity matrix with non-negative damping factor is employed to avoid slow convergence in the direction of small gradients. In the accelerated gradient method and recently proposed stochastic optimization algorithms, such as momentum [14], AdaGrad [15], AdaDelta [16], ADAM [17] and such, moments of the first-order and second-order gradients are combined and estimated iteratively to obtain the location update.
Besides these work, only recently the location update is proposed to be adaptively learned by considering it as a parameterized function of appropriate historical information:
where represents the information gathered up to k iterations, including such as iterates, gradients, function criteria, Hessians and so on, and
is the parameter.
Neural networks are used to model in recent literature simply because they are capable of approximating any smooth function. For example, Andrychowicz et al. [18] proposed to model
by long short term memory (LSTM) neural network [19] for differentiable f, in which the input of LSTM includes
and the hidden states of LSTM. Li et al. [20] used neural networks to model the location update for some machine learning tasks such as logistic/linear regression and neural net classifier. Chen et al. [21] proposed to obtain the iterate directly for black-box optimization problems, where the iterate is obtained by LSTM which take previous queries and function evaluations, and hidden states as inputs.
Neural networks used in existing learning to learn approaches are simply used as a block box. The interpretability issue of deep learning is thus inherited. A model-driven method with prior knowledge from hand-crafted classical optimization algorithms is thus much appealing. Model driven deep learning [22], [23] has shown its ability on learning hyper-parameters for a compressed sensing problem of the MRI image analysis, and for stochastic gradient descent methods [24], [25].
B. The Escaping Phase
A few methods, including tunneling [26] and filled function [8], have been proposed to escape from local optimum. The tunneling method was first proposed by Levy and Montalvo [26]. The core idea is to use the zero of an auxiliary function, called tunneling function, as the new starting point for next minimization phase. The filled function method was first proposed by Ge and Qin [8]. The method aims to find a point which falls into the attraction basin of a better than current local minimizer by minimizing an auxiliary function, called the filled function. The tunneling and filled function methods are all based on the construction of auxiliary function, and the auxiliary functions are all built upon the local minimum obtained from previous minimization phase. They are all originally proposed for smooth global optimization.
Existing research on tunneling and filled function is either on developing better auxiliary functions or extending to constrained and non-smooth optimization problems [27], [28], [29]. In general, these methods have similar drawbacks. First, the finding of zero or optimizer of the auxiliary function is itself a hard optimization problem. Second, it is not always guaranteed to find a better starting point when minimizing the auxiliary function [30]. Third, there often exists some hyper-parameters which are critical to the methods’ escaping performances, but are difficult to control [31]. Fourth, some proposed auxiliary functions are built with exponent or logarithm term. This could cause ill-condition problem for the
Fig. 1. Illustration of a finite horizon Markov decision process.
minimization phase [30]. Last but not least, it has been found that though the filled and tunneling function methods have desired theoretical properties, their numerical performance is far from satisfactory [30].
C. Main Contributions
In this paper, we first propose a model-driven learning approach to learn adaptive descent directions for locally convex functions. A local-convergence guaranteed algorithm is then developed based on the learned directions. We further model the escaping phase within the filled function method as a Markov decision process (MDP) and propose two policies, namely a fixed policy and a policy learned by policy gradient, on deciding the new starting point. Combining the learned local algorithm and the escaping policy, a two-phase global optimization algorithm is finally formed.
We prove that the learned local search algorithm is convergent; and we explain the insight of the fixed policy which can has a higher probability to find promising starting points than random sampling. Extensive experiments are carried out to justify the effectiveness of the learned local search algorithm, the two policies and the learned two-phase global optimization algorithm.
The rest of the paper is organized as follows. Section II briefly discusses the reinforcement learning and policy gradient to be used in the escaping phase. Section III presents the model-driven learning to learn approach for convex optimization. The escaping phase is presented in Section IV, in which the fixed escaping policy under the MDP framework is presented in Section IV-B, while the details of the learned policy is presented in Section IV-C. Controlled experimental study is presented in Section V. Section VI concludes the paper and discusses future work.
In reinforcement learning (RL), the learner (agent) chooses to take an action at each time step; the action changes the state of environment; (possibly delayed) feedback (reward) returns as the response of the environment to the learner’s action and affects the learner’s next decision. The learner aims to find an optimal policy so that the actions decided by the policy maximize cumulative rewards along time.
Consider a finite-horizon MDP with continuous state and action space defined by the tuple where
denotes the state space,
the action space,
the initial distribution of the state,
the reward, and T the time horizon, respectively. At each time t, there are
and a transition probability
R where
denotes the transition probability of
conditionally based on
and
. The policy
, where
is the probability of choosing action
when observing current state
with
as the parameter.
As shown in Fig. 1, starting from a state , the agent chooses
; after executing the action, agent arrives at state
. Meanwhile, agent receives a reward
(or
) from the environment. Iteratively, a trajectory
can be obtained. The optimal policy
is to be found by maximizing the expectation of the cumulative reward
:
where the expectation is taken over trajectory where
A variety of reinforcement learning algorithms have been proposed for different scenarios of the state and action spaces, please see [32] for recent advancements. The RL algorithms have succeeded overwhelmingly for playing games such as GO [33], Atari [34] and many others.
We briefly introduce the policy gradient method for continuous state space [35], which will be used in our study. Taking derivative of w.r.t.
discarding unrelated terms, we have
Eq. 6 can be calculated by sampling trajectories in practice:
where denotes action (state) at time t in the ith trajectory,
is the cumulative reward of the ith trajectory. For continuous state and action space, normally assume
where can be any smooth function, like radial basis function, linear function, and even neural networks.
In this section, we first summarize some well-known first-and second-order classical optimization algorithms. Then the proposed model-driven learning to optimize method for locally convex functions is presented.
A. Classical Optimization Methods
In the sequel, denote . The descent direction
at the k-th iteration of some classical methods is of the following form [12]:
where is an approximation to the inverse of the Hessian matrix, and
is a coefficient that varies for different conjugate GDs. For example,
could take
for Crowder-Wolfe conjugate gradient method [12].
The update of also varies for different quasi-Newton methods. In the Huang family,
is updated as follows:
where
The Broyden family is a special case of the Huang family in case , and
.
B. Learning the descent direction: d-Net
We propose to consider the descent direction as a nonlinear function of
with parameter
for the adaptive compu- tation of descent search direction
. Denote
We propose
where I is the identity matrix.
At each iteration, rather than updating directly, we update the multiplication of
and
like in the Huang family [12]:
It can be seen that with different parameter and
settings,
can degenerate to different directions:
• when , the denominator of
is not zero, and
, the update degenerates to conjugate gradient.
• when , and the denominator of
is not zero, and
, the update becomes the preconditioned conjugate gradient.
• when , and
, the update degenerates to the Huang family.
• when , the denominator of
is not zero, and
the update becomes the steepest GD. Based on Eq. 15, a new optimization algorithm, called adaptive gradient descent algorithm (AGD), can be established.
It is summarized in Alg. 1. It is seen that to obtain a new direction by Eq. 15, information from two steps ahead is required as included in . To initiate the computation of new direction, in Alg. 1, first a steep gradient descent step (lines 3- 5) and then a non-linear descent step (lines 7-10) are applied. With these prepared information, AGD iterates (lines 14-18) until the norm of gradient at the solution
is less than a positive number
.
To specify the parameters in the direction update function h, like [18], we unfold AGD into T iterations. Each iteration can be considered as a layer in a neural network. We thus have a ‘deep’ neural network with T layers. The resultant network is called d-Net. Fig. 2 shows the unfolding.
Like normal neural networks, we need to train for its parameters . To learn the parameters, the loss function
is defined as
That is, we expect these parameters are optimal not only to a single function, but to a class of functions F; and to all the criteria along the T iterations.
We hereby choose F to be the Gaussian function family:
There are two reasons to choose the Gaussian function family. First, any is locally convex. That is, let H(x) represents the Hessian matrix of f(x), it is seen that
Fig. 2. The unfolding of the AGD.
Second, it is known that finite mixture Gaussian model can approximate a Riemann integrable function with arbitrary accuracy [36]. Therefore, to learn an optimization algorithm that guarantees convergence to local optima, it is sufficient to choose functions that are locally convex.
Given F, when optimizing , the expectation can be obtained by Monte Carlo approximation with a set of functions sampled from F, that is
where can then be optimized by the steepest GD algorithm.
Note: The contribution of the proposed d-Net can be summarized as follows. First, there is a significant difference between the proposed learning to learn approach with existing methods, such as [18], [21]. In existing methods, LSTM is used as a ‘black-box’ for the determination of descent direction, and the parameters of the used LSTM is shared among the time horizon. Whereas in our approach, the direction is a combination of known and well-studied directions, i.e. a ‘white-box’, which means that our model is interpretable. This is a clear advantage against black-box models.
Second, in classical methods, such as the Broyden and Huang family and LM, descent directions are constructed through a linear combination. On the contrary, the proposed method is nonlinear and subsumes a wide range of classical methods. This may result in better directions.
Further, the combination parameters used in classical methods are considered to be hyper-parameters. They are normally set by trial and error. In the AGD, these parameters are learned from the optimization experiences to a class of functions, so that the directions can adapt to new optimization problem.
C. Group d-Net
To further improve the search ability of d-Net, we employ a group of d-Nets, dubbed as Gd-Net. These d-Nets are connected sequentially, with shared parameters among them. Input of the k-th (k > 1) d-Net is the gradient from -th d-Net. To apply Gd-Net, an initial point is taken as the input, and is brought forward through these d-Nets until the absolute gradient norm is less than a predefined small positive real number.
In the following we show that Gd-Net guarantees convergence to optimum for convex functions. We first prove that AGD is convergent. Theorem 1 summarizes the result. Please see Appendix A for proof.
Theorem 1. Assume is continuous and differentiable and the sublevel set
is bounded. The sequence obtained by AGD with exact line search converges to a stable point.
Since d-Net is the unfolding of AGD, from Theorem 1, it is sure that the iterate sequence obtained by d-Net is non-increasing for any initial with properly learned parameters. Therefore, applying a sequence of d-Net (i.e. Gd-Net) on a bound function f(x) from any initial point
will result in a sequence of non-increasing function values. This ensures that the convergence of the sequence, which indicates that Gd-Net is convergent under the assumption of Theorem 1.
Gd-Net guarantees convergence for locally convex func- tions. To approach global optimality, we present a method to escape from the local optimum once trapped. Our method is based on the filled-function method, and is embedded within the MDP framework.
A. The Escaping Phase in the Filled Function Method
In the escaping phase of the filled function method, a local search method is applied to minimize the filled function for a good starting point for next minimization phase. To apply the local search method, the starting point is set as where
is the local minimizer obtained from previous minimization phase,
is a small constant and d is the search direction.
Many filled functions have been constructed (please see [30] for a survey). One of the popular filled-functions [8] is defined as follows
where a is a hyper-parameter. It is expected that minimizing H(x) can lead to a local minimizer which is away from due to the exist of the exponential term.
Theoretical analysis has been conducted on the filled function methods in terms of its escaping ability [8]. However, the filled function methods have many practical weaknesses yet to overcome.
First, the hyper-parameter a is critical to algorithm performance. Basically speaking, if a is small, it struggles to escape from , otherwise it may miss some local minima. But it is very hard to determine the optimal value of a. There has no theoretical results, neither rule of thumb on how to choose a.
Second, the search direction d is also very important to the algorithmic performance. Different d’s may lead to different local minimizers, and the local minimizers are not necessarily better than . In literature, usually a trial-and-error procedure is applied to find the best direction from a set of pre-fixed
Fig. 3. Red lines show the contour of the three-hump function arrows are the gradients of
. There is a saddle point at (12,15) for
directions, e.g. along the coordinates [8]. This is apparently not effective. To the best of our knowledge, no work has been done in this avenue.
Third, minimizing H(x) itself is hard and may not lead to a local optimum, but a saddle point [8] even when a promising search direction is used. Unfortunately, there is no studies on how to deal with this scenario in literature. Fig. 3 shows a demo about this phenomenon. In the figure, the contour of f(x) is shown in red lines, while the negative gradients of the filled function H(x) are shown in blue arrows. From Fig. 3, it is seen that minimizing H(x) from a local minimizer of f at (4, 13) will lead to the saddle point at (12, 15).
B. The Proposed Escaping Scheme
The goal of an escaping phase is to find a new starting point such that
can escape from the attraction basin of
(the local minimizer obtained from previous minimization phase) if a minimization procedure is applied, where
is the direction and
is called the escaping length in this paper.
Rather than choosing d from a pre-fixed set, we could sample some directions, either randomly or sequentially following certain rules. In this section, we propose an effective way to sample directions, or more precisely speaking ’s.
In our approach, the sampling of is modeled as a finite-horizon MDP. That is, the sampling is viewed as the execution of a policy
: at each time step t, given the current state
, and reward
, an action
, i.e. the increment
, is obtained by the policy. The policy returns
by deciding a search direction
and an escaping length
.
At each time step t, the state is composed of a collection of previously used search directions
and their scores
, where
is a hyper-parameter. Here the score of a search direction measures how promising a direction is in terms of the quality of the new starting point that it can lead to. A new starting point is of high quality if applying local search from it can lead to a better minimizer than current one. The initial state
includes a set of
directions sampled uniformly at random, and their corresponding scores.
In the following, we first define ‘score’, then present the policy on deciding and
, and the transition probability
. Without causing confusion, we omit the subscript in the sequel.
1) Score: Given a search direction d, a local minimizer , define
where is the step size along d.
Theorem 2. If is a point outside the boundary of
’s attraction basin, there is no other local minimizer within
. Then there exists a
such that
obtains its maximum at . And g(t) is monotonically increasing in
, and monotonically decreasing in
.
If we let , then
. This implies that
is actually
along the direction from
pointing to
. This tells whether a direction d can lead to a new minimizer or not. A direction d with a positive
indicates that it could lead to a local minimizer different to present one.
We therefore define the score of a direction , to be the greatest
along d, i.e.
For such d that , we say it is promising.
In the following, we present the policy on finding
(or new starting point). The policy includes two sub-policies. One is to find the new point given a promising direction, i.e. to find the escaping length. The other is to decide the promising direction.
2) Policy on finding the escaping length: First we propose to use a simple filled function as follows:
Here a is called the ‘escaping length controller’ since it controlls how far a solution could escape from the current local optimum. Alg. 2 summarizes the policy proposed to determine the optimal and the new starting point x.
In Alg. 2, given a direction d, the filled function is optimized for N steps (line 3). The sum of the iterates’ function values, denoted as F(a) (line 4), is maximized w.r.t. a by gradient ascent (line 5). The algorithm terminates if a stable point of F(a) is found (
), or the search is out of bound (
). When the search is out of bound, a negative score is set for the direction d (line 8). As a by-product, Alg. 2 also returns the score
of the given direction d.
We prove that can escape from the attraction basin of
and ends up in another attraction basin of a local minimizer
with smaller criterion if
exists. Theorem 3 summarizes the result.
Theorem 3. Suppose that is a point such that
, and there are no other points that are with smaller or equal criterion than
within
. If the learning rate
is sufficiently small, then there exists an
such that
.
According to this theorem, we have the following corollary.
Corollary 1. Suppose that is the solution obtained by optimizing
along d starting from
at the N-th iteration, then
will be in an attraction basin of
, if the basin ever exists.
Theorem 3 can be explained intuitively as follows. Consider pushing a ball down the peak of a mountain with height (it can be regarded as the ball’s gravitational potential energy) along a direction d. The ball will keep moving until it arrives at a point
for some
such that
. For any
, the ball has a positive velocity, i.e.
where
. But the ball has a zero velocity at
, and negative at
, Hence
reaches its maximum in
. The integral is approximated by its discrete sum, i.e. F(a), in Alg. 2.
Further, according to the law of the conservation of energy, the ball will keep moving until at some 0 in which case
. This means that the ball falls into the attraction basin of a smaller criterion than
as shown in Fig. 4(b). Fig. 4(a) shows when a is small, in N iterations, the ball reaches some
but
.
Moreover, if there is no smaller local minimizers in search region, the ball will keep going until it rolls outside the restricted search region bounded by M as shown in Fig. 4(e) which means Alg. 2 fails to find . Fig. 4(c)(d) show the cases when there are more than one local minimizers within the search region.
Once such has been found, the corresponding
will enter an attraction basin of a local minimum with smaller criterion than
. If we cannot find such an
in the direction of d within a distance M to
, we consider that there is no another smaller local minimum along d. If it is the case, d is non-promising. We thus set a negative score for it as shown in line 8 of Alg. 2.
It is seen that the running of line 5 of Alg. 2 requires to compute N gradients of f(x) at each iteration. This causes Alg. 2 time consuming. We hereby propose to accelerate this procedure by fixing a but finding a proper number of iterations. Alg. 3 summarizes the fast policy. Given a direction d, during the search, the learning rate and the escaping length controller a are fixed. At each iteration of Alg. 3, an iterate
is obtained by applying gradient descent over
. The gradient of
over f(x) is computed (line 6).
is computed (line 7). Alg. 3 terminates if there is an i, such that
and
or the search is beyond the bound. It is seen that during the search, at each iteration, we only need to compute the gradient for once, which can significantly reduce the computational cost in comparison with Alg. 2.
Alg. 3 aims to find an integer i such that but
0. The existence of such an i can be illustrated as follows. It is seen that
(please see Eq. 31 in Appendix B). This implies that
. When
, we have
and
so that
and
for any
, and
and
. Thus, there exists an i such that Q(i) > 0 and
.
Corollary 1 proves that if there exists a better local minimum along d, then applying Alg. 2 or Alg. 3, we are able to escape from the local attraction basin of
.
3) Policy on the sampling of promising directions: In the following, we show how to sample directions that are of high probability to be promising. We first present a fixed policy, then propose to learn for an optimal policy by policy gradient.
Alg. 4 summarizes the fixed policy method. In Alg. 4, first a set of directions are sampled uniformly at random (line 1). Their scores are computed by Alg. 2 or Alg. 3. Archives used to store the directions and starting points are initialized (line 2). A direction is sampled by using a linear combination of previous directions with their respective scores as coefficients (line 4). If the sampled direction has a positive score, its score and the obtained starting point are included in the archive. The sets of scores and directions are updated accordingly in a FIFO manner (lines 9-10). The algorithm terminates if the number of sampling exceeds P.
We hope that the developed sampling algorithm is more efficient than that of the random sampling in terms of finding promising direction. denotes the probability of finding a promising direction by using the random sampling,
be the probability by the fixed policy. Then in Appendix C, we will do some explanation why
.
4) The transition: In our MDP model, the probability transition is deterministic. The determination of new starting point depends on the sampling of a new direction
and its score
. New state
is then updated in a FIFO manner. That is, at each time step, the first element
in
is replaced by the newly sampled
.
All the proofs in this section are given in Appendix B.
C. Learning the Escaping Policy by Policy Gradient
In the presented policy, a linear combination of previous directions with their scores as coefficients is applied to sample a new direction. However, this policy is not necessarily optimal.
Fig. 4. Possible scenarios encountered when estimating . (a) shows the case when a is not large enough, while (b) shows when a is appropriate. (c) shows that
reaches a local minimum, but
; (d) shows the case when there are more than one local minimizer. (e) shows when there is no smaller local minimizer within
In this section, we propose to learn an optimal policy by the policy gradient algorithm [35].
The learning is based on the same foregoing MDP framework. The goal is to learn the optimal coefficients for combining previously sampled directions. We assume that at time t, the coefficients are obtained as follows:
where and
is the output of a feed-forward neural network g with parameter
, and
is the coefficients. The current state
is the composition of
and
.
Fig. 5 shows the framework of estimating the coefficients and sampling a new direction at a certain time step. For the next time step, and
are updated
The policy gradient algorithm is used to learn for the neural network g. We assume that
and the policy can be stated as follows:
The reward is defined to be
where is a constant,
is the score of the sampled direction
and
is the indicator function.
Alg. 5 summaries the policy gradient learning procedure for is updated in E epochs. At each epoch, first a sample of trajectories is obtained (lines 3-23). Given
, a trajectory can be sampled as follows. First, a set of
initial directions is randomly generated and their scores are computed by Alg. 3 (lines 6-7). P new directions and their corresponding scores are then obtained (lines 9-22). At each step, the obtained direction
, the policy function
and the reward
are gathered in the current trajectory
(line 20). After the trajectory sampling,
and
are updated in lines 24- 28 and line 29, respectively.
Algorithm 5 Training policy network with policy gradient
Require: a local minimum , an integer P > 0, the number of training epochs E, the number of trajectories
0 and learning rate
.
Ensure: the optimal network parameter .
10: // sample new direction
13: apply Alg. 3 to obtain ;
14: // update the state
15: set by Eq. 23
24: // policy gradient
26: for do
28: end for
29: θθ;
Fig. 5. The framework of the escaping policy on the t-th time step. The black solid line is the policy proposed in section IV-B. The red dash line covers the feed-forward network and its output.
D. Learning to be Global Optimizer
Combining the proposed local search algorithm and the escaping policy, we can form a global optimization algorithm. Alg. 6 summarizes the algorithm, named as LGO. Starting from an initial point x, Gd-Net is firstly applied to obtain a local minimizer (line 2). The escaping policy is applied to sample P new starting points (line 5). Gd-Net is then applied on these points (line 9). The algorithm terminates if the prefixed maximum number of escaping tries (i.e. K) has been reached (line 4), or no new promising directions can be sampled (line 6). If any of these conditions have been met, it is assumed that a global optimum has been found.
Note. We should highlight that our method surpasses some filled function methods in the sense that our method has more chances to escape from local optimum. For example, consider the following filled function [37]:
The existence of the stable point to H usually holds. But
can be a saddle point or a local optimizer. If
is a saddle point, then to escape
, it is only possible by searching along
. However, it is highly unlikely
be contained in the pre-fixed direction set of the traditional filled function methods. This indicates that the corresponding filled function method will fail.
On the other hand, if is a local minimizer of H(x), we can prove that the proposed policy can always find a promising solution. Theorem 4 summarizes the result.
Theorem 4. Suppose there exists an attraction basin of on the domain of H(x), denoted as
, then
, we have
for
.
Proof. We first prove that , s.t.
. This can be done by contradiction. If there is no such t, then
This is because that , we have
, H(x) degenerates to
. Eq. 25 implies that apply gradient descent from x on H(x) will not lead to a point in
. This contradicts our assumption that
.
The existence of t implies by Theorem 2.
In this section, we study the numerical performance of Gd- net, the escaping policies, and LGO.
A. Model-driven Local Search
This section investigates the performance of Gd-Net. In the experiments, 50 d-Net blocks are used. Parameters of these blocks are the same.
Training. d-Net is trained through minimizing the Monte Carlo approximation to the loss functions as defined in Eq. 17, in which a sample of the Gaussian family is used. In the experiments, we use ten 2-d Gaussian functions with positive covariance matrix as the training functions. d-Net is trained on 25 initial points sampled uniformly at random for each training function. At each layer of d-Net, the step size
is obtained by exact line search in [0,1] 1. Gradient descent is used to optimize Eq. 17 with a learning rate 0.1 for 100 epochs. The same training configuration is used in the following. Testing. We use functions sampled from
in 5-d, and
-functions2 in 2-d to test Gd-Net. Note that Gd-Net is trained on 2-d
. By testing on 5-d Gaussian functions, we can see its generalization ability on higher-dimensional functions. The testing on
functions can check the generalization ability of Gd-Net on functions with non-symmetric contour different to Gaussians. Fig. 6 shows the difference between Gaussian and
contour.
1Note that taking is not necessarily the best choice for linesearch. It is rather considered as a rule of thumb. Notice that limiting the search of
in (0,1] could make Gd-Net be scale-variant. We transform
in order to eliminate the scaling problem where
the initial point when testing.
Fig. 6. A demo on the difference between a Gaussian contour and a contour.
Fig. 7. The optimization curve of the learned Gd-Net on a 5-d Gaussian function with various initial points.
Fig. 7 shows the testing result of the learned Gd-Net on optimizing a 5-d Gaussian function with different initial points. The test on a function is shown in Fig. 8. In these figures, first-order and second-order optimization algorithms, including steepest gradient descent, conjugate descent and BFGS, are used for comparison. From the figures, it is clear that Gd-Net requires much less iterations to reach the minimum than the compared algorithms.
Further, we observed that unlike BFGS, where a positive-definite Hessian matrix is a must, Gd-Net can cope with ill-conditioned Hessians. Fig. 9 shows the results on a 2-d Gaussian function with ill-posed Hessian. For an initial point that is far away from a minimizer, its Hessian is nearly singular which implies that the search area is rather flat. From the left plot of Fig. 9, it is seen that Gd-Net gradually decreases, while the other methods fail to make any progress. On the right plot, it is seen that Gd-Net finally progresses out the flat area and the criterion starts decreasing quickly.
B. The fixed escaping policy
In this section, controlled experiments are carried out to justify the ability of the fixed policy. We first consider a low-dimensional non-convex optimization problem with two local minimizers, then a high-dimensional highly non-convex problems with many local minimizers. The fixed policy is compared with random sampling on these test problems.
Fig. 8. The optimization curve of the learned Gd-Net on a 2-d with two different initial points.
Fig. 9. The optimization procedure of Gd-Net (with 5 blocks) on a 5-d Gaussian function with an initial point far away from the optimum. The left plot shows the decreasing curve obtained by the first 4 blocks, while the right shows the curve of the rest block.
1) Mixture of Gaussians functions: Consider the following mixture of Gaussians functions
where and
. The mixture of Gaussian functions have m local minimizers at
’s.
In the experiments, we set m = 2, 3. To test the ability of the escaping scheme, we assume the escaping starts from a local minimizer. We test on dimension n = 2, 3, 5, 8, 10. The other algorithmic parameters are 15, 20, 50, 100, 250 for n = 2, 3, 5, 8, 10, respectively, and
.
Table I shows the average number of samplings used to escape from local optimum and the standard deviation (in brackets) over 500 runs obtained by using the fixed policy and the random sampling method.
From the table, it is observed that the fixed policy requires less samples than that of the random sampling, and the standard deviation is smaller. The p-value obtained by applying the rank sum hypothesis test at 5% significance level is shown in the last column. The hypothesis test suggests that the fixed policy outperforms the random sampling approach significantly (the p-value is less than 0.05).
TABLE I THE NUMBER OF SAMPLINGS USED TO ESCAPE.
2) Deep neural network: The loss function of a deep neural network has many local optimizers. We take the training of a deep neural network for image classification on CIFAR-10 as an example. For CIFAR-10, an 8-layer convolution neural network similar to Le-Net [38], with 2520-d parameters, is applied. The cross entropy is used as the loss function.
The number of local minimizers found by a method is used as the metric of comparison. Given a maximal number of attempts P, a larger number of local minimizers indicates a higher probability of escaping local minimum, and hence a better performance. For CIFAR-10, ADAM [17] with mini-batch stochastic gradient is applied in the minimization phase.
Note that existing filled functions often involve f(x). This usually makes the application of mini-batch stochastic gradient method difficult if f(x) is not sum of sub-functions. Instead, the auxiliary function used in this paper does not involve f(x). Fig. 10 shows the scores (cf. Eq. 21) against the distance to current local optimum with different mini-batch sizes. From the figure, it is seen that with different batch-size, the scores exhibit similar behavior. This shows the applicability of the proposed escaping method to stochastic-based local search algorithms. In the experiment, the parameters to apply Alg. 2 is set as
and
.
In the following, the effective samplings3 in 1000 samplings are used to compare the proposed escaping policy against the random sampling. The obtained promising directions with different thresholds in 500 runs are summarized in Table II. It is clear that the proposed escaping policy is able to find more samples with positive scores than that of the random sampling.
TABLE II THE NUMBER OF EFFECTIVE SAMPLINGS IN 1000 SAMPLINGS.
C. The fixed policy vs. the learned policy
In this section we show the effectiveness of the learned policy. The policy function in Eq. 8 is set as a 3 layer network with sigmoid as hidden layer activation function, and a fully-connected output layer with linear activation function.
Fig. 10. The curves of the score against the step size w.r.t. two mini-batches when training a convolution neural network for CIFAR-10.
Training. A single Gaussian mixture function with two local minimizers is used for training the policy network in 2-d and 5-d, respectively. The other parameters are set and E = 30 for 2 (5)-d. The number of hidden layer units is 5 and 200 for 2-d and 5-d, respectively.
Testing. To test the learned policy, we also use the Gaussian mixture functions with two local minimizers in 2-d and 5-d. Here we set . Table III shows the average number of samplings used to escape from local optimum and the standard deviation (in brackets) over 500 runs obtained by using the fixed policy, the learned policy and the random sampling. Detailed configurations of the functions used for training can be found in Appendix D.
TABLE III THE AVERAGE NUMBER OF SAMPLINGS USED TO FIND THE PROMISING DIRECTION IN DIFFERENT SETTINGS.
From the table, it is seen clearly that the learned policy requires less samplings to reach new local optimum. To observe the behaviors of the compared escaping policies better, Fig. 11 shows the histograms of the number of effective samplings for a 2-D function. From the figure, we see that the fixed policy is mostly likely to escape the current local optimum in one sampling, but it also is highly possible to require more samplings. That is, the number of effective samplings by the fixed policy follows a heavy-tail distribution. For the learned policy, the effective sampling numbers are mostly concentrated in the first 7 samplings. This shows that the learned policy is more robust than the other policies, which can also be confirmed in Table III by the standard deviations. We may
Fig. 11. The histogram of the number of effective samplings for a 2-D Gaussian mixture function by the fixed policy, the learned policy and the random sampling policy.
Fig. 12. The running procedure of LGO, in which GD, BFGS and Gd-Net are used as local search, while the learned policy is used to escape from local minimum (represented in pink dotted line).
thus conclude that the learned policy is more efficient than the fixed policy and random sampling.
D. The global search ability
In this section, we study the global search ability of LGO in comparison with the filled function method proposed in [8]. Three examples, including the three hump function, robust regression and neural network classifier, are used as benchmarks. 1) Three-hump function: The function is defined as
It has three local minimizers at , and
. The global minimizer is at
. In our test, the algorithm parameters are set as
0.2 and N = 20. We run L
GO 20 times with different initial points. Fig. 12 shows the averaged optimization process of L
GO. From the figure, it is seen that L
GO is able to reach the local minimizer one by one. It is also seen that Gd-Net performs better than BFGS and steepest descent.
Table IV shows the number of effective samplings when using the fixed policy, random sampling and learned policy as escaping scheme. It is seen that the filled function method has failed due to the existence of the saddle point as shown in Fig. 3.
TABLE IV THE NUMBER OF SAMPLINGS USED TO FIND THE PROMISING DIRECTIONS FOR THE THREE-HUMP FUNCTION.
Fig. 13. The contour of the robust regression function with , respectively.
2) Robust regression: For the robust linear regression problem [20], a popular choice of the loss function is the GemanMcClure estimator, which can be written as follows:
where represent the weights and biases, respectively.
and
is the feature vector and label of the i-th instance and
is a constant that modulates the shape of the loss function.
The landscape of the robust regression problem can be systematically controlled. Specifically, we can decide the number of local minimizers, their locations and criteria readily. Note that given {w, b}, the training data can be created by
Different {w, b} indicates different local minimum.
In our experiments, we randomly sample 50 points of in
, and divide them to two sets
and
. For each set
, give a
, apply
, a training set
can be obtained. Combining them, we obtain the whole data set
. Given this training set, it is known that the objective function has two obvious local minimizers at
and
and lots of other local minimizers. Please see Fig. 13 for contour of the robust regression function with
and
. There are two main local minimizers at
and
with
), and many other local minimizers.
Fig. 14 shows the optimization curve of the robust regression function, in which Gd-Net and GD are compared. We notice that BFGS is not convergent in this case since landscape here is vary flat. From the figure, we can see that for robust regression function, Gd-Net also performs better than GD. Table V shows the average numbers of effective samplings
Fig. 14. The optimization procedure of LGO and the steepest gradient with the learned policy on robust regression.
Fig. 15. The optimization curve of LGO and ADAM on training the classification network.
obtained by the compared policies. From the table it is clear that the learned policy performs the best, while the filled function method needs much more times.
TABLE V THE NUMBER OF SAMPLINGS USED TO FIND THE PROMISING DIRECTIONS FOR ROBUST REGRESSION FUNCTION.
3) neural network classifier: We construct a small network with one hidden layer for a classification problem in 2-d [20]. The number of hidden layer is one, and the total dimension of network is 5. The goal is to classify into two classes, where
. The cross entropy is used as the loss function. ADAM is compared with L
GO. In ADAM, the learning rate is 0.001, and the hyper-parameters for momentum estimation are 0.9 and 0.9.
Fig. 15 shows the optimization curve. From the figure, we see that LGO performs better than ADAM. It can escape from the local optimum, and reach a better optimum successfully.
This paper proposed a two-phase global optimization algo- rithm for smooth non-convex function. In the minimization phase, a local optimization algorithm, called Gd-Net, was obtained by the model-driven learning approach. The method was established by learning the parameters of a non-linear combination of different descent directions through deep neural network training on a class of Gaussian family function. In the escaping phase, a fixed escaping policy was first developed based on the modeling of the escaping phase as an MDP. We further proposed to learn the escaping policy by policy gradient.
A series of experiments have been carried out. First, controlled experimental results showed that Gd-Net performs better than classical algorithms such as steepest gradient descent, conjugate descent and BFGS on locally convex functions. The generalization ability of the learned algorithm was also verified on higher dimensional functions and on functions with contour different to the Gaussian family function. Second, experimental results showed that the fixed policy was more able to find promising solutions than random sampling, while the learned policy performed better than the fixed policy. Third, the proposed two-phase global algorithm, LGO, showed its effectiveness on a benchmark function and two machine learning problems.
In the future, we plan to work on the following avenues. First, since the Hessian matrix is used in Gd-Net, it is thus not readily applicable to high-dimensional functions. Research on learning to learn approach for high-dimensional functions is appealing. Second, we found that learning the escaping policy is particularly difficult for high-dimensional functions. It is thus necessary to develop a better learning approach. Third, the two-phase approach is not the only way for global optimization. We intend to develop learning to learn approaches based on other global methods, such as branch and bound [39], and for other types of optimization problems such as non-smooth, non-convex and non-derivative.
[1] L. Dixon and G. Szeg¨o, Towards Global Optimization. New York: Elsevier, 1975.
[2] R. Horst and P. Pardalos, Eds., Handbook of Global Optimization. Dordrecht: Kluwer, 1995.
[3] (2019). [Online]. Available: https://www.mat.univie.ac.at/neum/glopt. html
[4] J. Pinter, “Continuous global optimization: Applications,” in Encyclopedia of Optimization, C. Floudas and P. Pardalos, Eds. Boston, M.A.: Springer, 2008.
[5] A. Neumaier, “Convexification and global optimization in continuous global optimization and constraint satisfaction,” in Acta Numerica, A. Iserles, Ed. Cambridge University press, 2004.
[6] P. Gary, W. Hart, L. Painton, C. Phillips, M. Trahan, and J. Wagner, “A survey of global optimization methods,” 1997.
[7] A. Levy and S. G´omez, “The tunneling method applied to global optimization,” in Numerical Optimization, P. Boggs, R. Byrd, and R. Schnabel, Eds. SIAM, 1985.
[8] R. P. Ge and Y. F. Qin, “A class of filled functions for finding global minimizers of a function of several variables,” Journal of Optimization Theory and Applications, vol. 54, no. 2, pp. 241–252, 1987.
[9] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
[10] R. Fletcher and C. M. Reeves, “Function minimization by conjugate gradients,” Computer Journal, vol. 7, no. 2, pp. 149–154, 1964.
[11] D. H. Wolpert and W. G. Macready, “No free lunch theorems for optimization,” IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67–82, 1997.
[12] W. Sun and Y. Yuan, Optimization theory and methods: nonlinear programming. Springer Science & Business Media, 2006, vol. 1.
[13] D. W. Marquardt, “An algorithm for least-squares estimation of non- linear parameters,” Journal of the society for Industrial and Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.
[14] B. T. Polyak, “Some methods of speeding up the convergence of iter- ation methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
[15] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[16] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
[17] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
[18] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas, “Learning to learn by gradient descent by gradient descent,” in NIPS, 2016, pp. 3981–3989.
[19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[20] K. Li and J. Malik, “Learning to optimize,” in ICLR, 2017.
[21] Y. Chen, Hoffman, M. W, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, and N. de Freitas, “Learning to learn without gradient descent by gradient descent,” ICML, 2017.
[22] Z. Xu and J. Sun, “Model-driven deep-learning,” National Science Review, vol. v.5, no. 1, pp. 26–28, 2018.
[23] J. Sun, H. Li, Z. Xu et al., “Deep admm-net for compressive sensing mri,” in NIPS, 2016, pp. 10–18.
[24] S. Wang, J. Sun, and Z. Xu, “Hyperadam: A learnable task-adaptive adam for network training,” in AAAI, 2019.
[25] K. Lv, S. Jiang, and J. Li, “Learning gradient descent: Better general- ization and longer horizons,” in ICML, 2017, pp. 2247–2255.
[26] A. Levy and A. Montalvo, “The tunneling algorithm for the global minimization of functions,” SIAM Journal on Scientific and Statistical Computing, 1985.
[27] Y. Xu, Y. Zhang, and S. Wang, “A modified tunneling function method for non-smooth global optimization and its application in artificial neural network,” Applied Mathematical Modelling, vol. 39, pp. 6348–6450, 2015.
[28] H. Lin, Y. Wang, and L. Fan, “A filled function method with one parameter for unconstrained global optimization,” Applied Mathematical Modelling, vol. 218, pp. 3776–3785, 2011.
[29] Y. Zhang, L. Zhang, and Y. Xu, “New filled functions for non-smooth global optimization,” Applied Mathematical Modelling, vol. 33, no. 7, pp. 3114–3129, 2009.
[30] L. Zhang, C. Ng, D. Li, and W. Tian, “A new filled function method for global optimization,” Journal of Global optimization, vol. 28, no. 1, pp. 17–43, 2004.
[31] S. Ma, Y. Yang, and H. Liu, “A parameter-free filled function for uncon- strained global optimization,” Applied Mathematics and Computation, vol. 215, no. 10, pp. 3610–3619, 2010.
[32] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep reinforcement learning: A brief survey,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017.
[33] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
[34] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wier- stra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[35] R. Sutton and A. Barto, Reinforcement Learning:An Introduction, 1998.
[36] R. Wilson, “Multiresolution gaussian mixture models: Theory and applications,” in IEEE International Conference on Pattern Recognition. Citeseer, 2000.
[37] Y. Liang, L. Zhang, M. Li, and B. Han, “A filled function method for global optimization,” Journal of Computational and Applied Mathematics, vol. 205, no. 1, pp. 16–31, 2007.
[38] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[39] E. L. Lawler and D. E. Wood, “Branch-and-bound methods: A survey,” Operations Research, vol. 14, no. 4, pp. 699–719, 1966.
The proof of Theorem 1 can be found below.
, and
. With exact linear search, we have
, therefore
It is clear that if
, otherwise a
can be chosen to make
diagonally dominant, which means
. Therefore, we can always make sure
, i.e.
is a descent direction, and
Since f(x) is bounded, there exists such that
.
This section gives details of the proofs for the theorems in Section III. The proof to Theorem 2 is shown below.
Proof. According to assumption (2), g(t) is convex in . As
, then
in
. Therefore, g(t) is monotonically increasing in
, and monotonically decreasing in
. Since
, then
. This implies that
is continuous in [0, T].
Since and
, then there exists a
such that
. Further,
is unique since it is assumed that there is no other local minima between
and
. Then we have
in
, and
in
.
To prove Theorem 3, we first prove Lemma 5.
Lemma 5. Suppose F(a) to be the function defined in Alg. 2. For fixed a and N, we have:
where gtd).
Proof. Since are all on the line
with t > 0. Therefore, each
can be written as
where and
. Further, we have
or equivalently,
We thus have:
Since is continuous in
, it is Riemann integrable. Thus, for a fixed N, we have
This finishes the proof.
In the sequel, we define
where . Here G(a) is just the distance between
and
along d. Obviously, G(a) is a polynomial function of a, and it is monotonically increasing.
Based on Lemma 5, Lemma 6 can be established.
Lemma 6. Suppose that is a point such that
, and there are no other local minimizer within
. If
is sufficiently small, then there exists an
such that
.
Proof. Since , and
is smooth, there exists a
such that
.
Let’s consider two cases. First, let , then there is an
s.t.
according to the intermediate value theorem. As
is sufficiently small, we have:
s.t.
Note that we can choose s.t.
Let ϵg
dt and ϵ
, then
s.t.
ϵ
< ϵ
F
> ϵ
.
Similarly, if let , then there is
s.t.
, as
is sufficiently small, we have:
s.t.
Note that
Let ϵg
dt and ϵ
, then
s.t. |F
F
< ϵ
.
In summary, we have and
, according to the intermediate value theorem, there exists an
such that
.
If there are L local minimizers in
whose criteria are bigger than
, and a local minimizer
with smaller criterion outside
. Denote
, we have L + 1 local maximizers
. Since
, we can set
such that
. Substituting
to
in the proof, we can prove Theorem 3.
In the following, we will explain why . In Alg. 4, the main idea is using negative linear combination and adding a noise to make algorithm robustly. Now we will explain the insight of ’negative linear combination’. We first assume that there are two local minimizers. Without loss of generality, suppose that we are at a local minimum
, and there exists a local minima
(
). Then
has a neighborhood region
, which satisfies
, then
denotes the center of circumscribed sphere of
. Then
is called the central direction in the sequel. Further, we define the ray
. We have the following Lemma 7.
Lemma 7. Given an initial sample of directions and scores ), using negative linear combination of
, it is of higher probability to obtain
than that of the random sampling.
Proof. In the following, we first prove the theorem in case n = 2. It is then generalized to n > 2.
In case and n = 2, suppose at some time step, we have two linearly independent directions
and
with negative scores. Let
be the confined search space. The search space can then be divided into four regions
and
. Particularly,
. Suppose that
has a neighborhood region
, which satisfies
, and the radius of the circumscribed sphere of
is
. The boundary of the circumscribed sphere and
can form a cone
. By assumption, the lines
has no interaction with
(otherwise we have found a direction that will lead to the attraction basin of
, i.e.
For each , take
and
such that
and
, respectively. Then the boundary of
and
form a cone
, where
, then
, the probability of finding
in
by the negative linear combination, can be computed as follows:
/
where is the direction got by negative linear combination. Notice that
where
is the measure of
, respectively. Denote
the probability of finding
in
by random sampling, then
can be represented by
and the measure of
. That is,
. As
does not interact with
, thus
. Then we have:
where is the measure of
and
, respectively. The last inequality holds because
.If
for i = 1, 2, then
is covered by
. Since the region covered by
in
has a larger measure than
is thus the best region for sampling.
In case n > 2, we have n directions with negative scores . If set
as the subspace S = {d =
, since S has a zero measure in
, the proof degenerates into the n = 2 case.
Furthermore, we will present why :
We illustrate by using in Fig. 16 in n = 2. In Fig. 16,
is a direction with negative score. If we want to create
which is a promising direction, then
must be between
and
as shown in Fig. 17.
To define and
, let
is the cone made by the boundary of
and
where
for any t > 0 and d, and
is defined in Eq. 32. We further define
and
and
.
and
are considered as the ray from
to the boundary of R. Similarly to the definition of
, we define
and
.If the distance between
and
is larger, then the distance between
and
must be larger as shown in Fig. 17. This implies the probability that
is promising is higher. When the distance between
and
is larger than
, i.e.
, the probability is the maximum since there is no d such that
and
. Therefore, it is the best to use the opposite direction of
since the point by interacting
and
is the furthest to
.
Fig. 16. Illustration of Lemma 7 in 2-D case. In the figure, is to be found in
is the radius of the circumscribed sphere of the attraction basin of
form a cone
the boundary of
forms a cone
. It is clear that sampling a direction in
is the best choice.
Similarly for , the best direction should be
. Taking both
and
into consideration, a direction is promising only if its interaction point with
is the furthest to both
and
. It is thus the best to sample a direction in the region spanned by
and
, i.e.
.
For n > 2, we have n directions with negative scores. Given the n directions, we can construct a spanned space B = {d = . Depending on the signs of
’s, we have
sub- regions
. We take
be the region with all negative
’s.
Similar to the analysis in n = 2, for each , the point
is the furthest to
. A direction is promising only if its interaction point with
is the furthest to all
’s. Therefore,
is the best region for sampling among the
regions.
A direction is sampled with equal probability in in random sampling. On the contrary, using negative linear combination is sampling in
. Therefore, we have
.
If f(x) has two local minima, we have explained . In case f(x) has 3 or more local minimizers, the sampling procedure can be done as follows. Assuming we have sampled N directions,
, from which at least one local minimizer
cannot be reached. It is not wise to sample within the cones induced by local minimizers we have visited. Instead, the negative rewards associated with these directions should be used as the linear combination for sampling directions for
. Therefore, this combination is guaranteed to be more efficient to sample promising directions for
than random sampling.
To train (test) the learned policy, the Gaussian mixture functions are used (cf. Eq. 26). And we use for 2-D problem. When
Fig. 17. Demonstration of promising direction and optimal direction. (a) shows when is close to each other,
are close too. (b) shows when the distance between
is bigger than
, the distance between
reaches the maximum.
testing, ,
for 5-D problem. When training, the following settings with different means and covariances, are applied in Table III.
• order 1-4, problem is in 2-D, ,
respectively;
• order 5, problem is in 5-D, Σ, Σ
0]
[5, 5, 5, 5, 5]
;
• order 6, problem is in 5-D, Σ, Σ
0]
[4, 4, 5, 5, 5]
;
• order 7, problem is in 5-D, Σ, Σ
0]
[3, 3, 5, 5, 5]
;
• order 8, problem is in 5-D, Σ, Σ
0]
[5, 5, 5, 5, 5]
;
• order 9, problem is in 5-D, Σ, Σ
0]
[3, 3, 5, 5, 5]
;