36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2003.04521","publisher":"arxiv","paperJSON":{"title":"Learning to be Global Optimizer","paperID":"2003.04521","avgLineHeight":11.95,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"The advancement of artificial intelligence has cast a new light on the development of optimization algorithms. This paper proposes to ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"learn ","element":"span"},{"style":{"fontWeight":"bold"},"text":"$3c","element":"span"}],[{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Index Terms","element":"span"},{"style":{"fontWeight":"bold"},"text":"—two-phase global optimization, learning to learn, model-driven deep learning, reinforcement learning, Markov Decision Process","element":"span"}]]},{"heading":"I. INTRODUCTION","paragraphs":[[{"text":"This paper considers the unconstrained continuous global optimization problem:","element":"span"}],[{"style":{"width":"57%"},"width":580,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is smooth and non-convex. The study of continuous global optimization dates back to the 1950s ","element":"span"},{"href":"#id-0","referenceIndex":1,"text":"[1]","element":"a"},{"text":". 
This research has been very fruitful; see ","element":"span"},{"href":"#id-1","referenceIndex":2,"text":"[2] ","element":"a"},{"text":"for a basic reference on most aspects of global optimization, ","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"[3] ","element":"a"},{"text":"for a comprehensive archive of online information, and ","element":"span"},{"href":"#id-3","referenceIndex":4,"text":"[4] ","element":"a"},{"text":"for practical applications.","element":"span"}],[{"text":"Numerical methods for global optimization can be classified into four categories according to their available guarantees, namely, incomplete, asymptotically complete, complete, and rigorous methods ","element":"span"},{"href":"#id-4","referenceIndex":5,"text":"[5]","element":"a"},{"text":". We make no attempt to reference or review this large body of literature. Interested readers may refer to a WWW survey by Hart ","element":"span"},{"href":"#id-5","referenceIndex":6,"text":"[6] ","element":"a"},{"text":"and Neumaier ","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"[3]","element":"a"},{"text":". Instead, this paper focuses on a sub-category of incomplete methods, the two-phase approach ","element":"span"},{"href":"#id-6","referenceIndex":7,"text":"[7]","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"[8]","element":"a"},{"text":".","element":"span"}],[{"text":"A two-phase optimization approach is composed of a sequence of cycles; each cycle consists of two phases, a minimization phase and an escaping phase. In the minimization phase, a minimization algorithm is used to find a local minimum from a given starting point. The escaping phase aims to obtain a good starting point for the next minimization phase so that the point is able to escape from the local minimum.","element":"span"}],[{"text":"HZ, JS and ZX are all with the School of Mathematics and Statistics and National Engineering Laboratory for Big Data Analytics, Xi’an Jiaotong University, Xi’an, China. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Corresponding author: Jianyong Sun, email: jy.sun@xjtu.edu.cn","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"A. The Minimization Phase","element":"span"}],[{"text":"Classical line search iterative optimization algorithms, such as gradient descent, conjugate gradient descent, Newton’s method, and quasi-Newton methods like DFP and BFGS, have flourished for decades since the 1940s ","element":"span"},{"href":"#id-8","referenceIndex":9,"text":"[9]","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":10,"text":"[10]","element":"a"},{"text":". These algorithms can be readily used in the minimization phase.","element":"span"}],[{"text":"At each iteration, these algorithms usually use the following location update formula:","element":"span"}],[{"style":{"width":"63%"},"width":640,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"is the iteration index, ","element":"span"},{"style":{"height":10.79},"width":140.22,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-2.png","element":"img","alt":" xk+1, xk","inline":true,"padRight":true},{"text":"are the iterates, ","element":"span"},{"style":{"height":13.99},"width":50.21,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-3.png","element":"img","alt":" ∆k","inline":true,"padRight":true},{"text":"is often taken as ","element":"span"},{"style":{"height":13.19},"width":109.54,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-4.png","element":"img","alt":" αk · dk","inline":true,"padRight":true},{"text":"where 
","element":"span"},{"style":{"height":9.19},"width":42.49,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-5.png","element":"img","alt":" αk","inline":true,"padRight":true},{"text":"is the step size and ","element":"span"},{"style":{"height":13.19},"width":37.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-6.png","element":"img","alt":" dk","inline":true,"padRight":true},{"text":"is the descent direction. It is the choice of ","element":"span"},{"style":{"height":13.19},"width":37.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-7.png","element":"img","alt":" dk","inline":true,"padRight":true},{"text":"that largely determines the performance of these algorithms in terms of convergence guarantees and rates.","element":"span"}],[{"text":"In these algorithms, ","element":"span"},{"style":{"height":13.19},"width":37.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-8.png","element":"img","alt":" dk","inline":true,"padRight":true},{"text":"is updated by using first-order or second-order derivatives. For example, ","element":"span"},{"style":{"height":16},"width":274.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-9.png","element":"img","alt":" dk = −∇f(xk)","inline":true,"padRight":true},{"text":"in gradient descent (GD), and ","element":"span"},{"style":{"height":17.38},"width":375.07,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-10.png","element":"img","alt":" −[∇2f(xk)]−1∇f(xk)","inline":true,"padRight":true},{"text":"in Newton’s method where ","element":"span"},{"style":{"height":17.38},"width":148.76,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/0-11.png","element":"img","alt":" ∇2f(xk)","inline":true,"padRight":true},{"text":"is the Hessian matrix. 
These algorithms usually come with mathematical guarantees of convergence for convex functions. Further, it has been proven that first-order methods such as gradient descent usually converge slowly (with a linear convergence rate), while second-order methods such as conjugate gradient and quasi-Newton can be faster (with a superlinear convergence rate), but their numerical performance can be poor in some cases (e.g. quadratic programming with an ill-conditioned Hessian due to poorly chosen initial points).","element":"span"}],[{"text":"For a specific optimization problem, it is usually hard to tell which of these algorithms is more appropriate. Further, the no-free-lunch theorem ","element":"span"},{"href":"#id-10","referenceIndex":11,"text":"[11] ","element":"a"},{"text":"states that “for any algorithm, any elevated performance over one class of problems is offset by performance over another class”. In light of this theorem, efforts have been made to develop optimization algorithms with adaptive descent directions.","element":"span"}],[{"text":"The study of combining various descent directions dates back to the 1960s. For example, the Broyden family ","element":"span"},{"href":"#id-11","referenceIndex":12,"text":"[12] ","element":"a"},{"text":"uses a linear combination of DFP and BFGS updates for the approximation to the inverse Hessian. In the Levenberg-Marquardt (LM) algorithm ","element":"span"},{"href":"#id-12","referenceIndex":13,"text":"[13] ","element":"a"},{"text":"for nonlinear least-squares problems, a linear combination of the Hessian and the identity matrix with a non-negative damping factor is employed to avoid slow convergence in the direction of small gradients. 
In the accelerated gradient method and recently proposed stochastic optimization algorithms, such as momentum ","element":"span"},{"href":"#id-13","referenceIndex":14,"text":"[14]","element":"a"},{"text":", AdaGrad ","element":"span"},{"href":"#id-14","referenceIndex":15,"text":"[15]","element":"a"},{"text":", AdaDelta ","element":"span"},{"href":"#id-15","referenceIndex":16,"text":"[16]","element":"a"},{"text":", ADAM ","element":"span"},{"href":"#id-16","referenceIndex":17,"text":"[17] ","element":"a"},{"text":"and the like, moments of the first-order and second-order gradients are combined and estimated iteratively to obtain the location update.","element":"span"}],[{"text":"Beyond these works, it has only recently been proposed that the location update ","element":"span"},{"style":{"height":13.99},"width":50.22,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-0.png","element":"img","alt":" ∆k","inline":true,"padRight":true},{"text":"be adaptively ","element":"span"},{"style":{"fontStyle":"italic"},"text":"learned ","element":"span"},{"text":"by considering it as a parameterized function of appropriate historical information:","element":"span"}],[{"style":{"width":"62%"},"width":629,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.19},"width":41.44,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-2.png","element":"img","alt":" Sk","inline":true,"padRight":true},{"text":"represents the information gathered over the first ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"iterations, such as iterates, gradients, function criteria, Hessians and so on, and ","element":"span"},{"style":{"height":13.19},"width":35.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-3.png","element":"img","alt":" 
θk","inline":true,"padRight":true},{"text":"is the parameter.","element":"span"}],[{"text":"Neural networks are used to model ","element":"span"},{"style":{"height":16},"width":152,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-4.png","element":"img","alt":" g(Sk; θk)","inline":true,"padRight":true},{"text":"in recent literature simply because they are capable of approximating any smooth function. For example, Andrychowicz et al. ","element":"span"},{"href":"#id-17","referenceIndex":18,"text":"[18] ","element":"a"},{"text":"proposed to model ","element":"span"},{"style":{"height":13.19},"width":37.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-5.png","element":"img","alt":" dk","inline":true,"padRight":true},{"text":"with a long short-term memory (LSTM) neural network ","element":"span"},{"href":"#id-18","referenceIndex":19,"text":"[19] ","element":"a"},{"text":"for differentiable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", in which the input of the LSTM includes ","element":"span"},{"style":{"height":16},"width":130.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-6.png","element":"img","alt":" ∇f(xk)","inline":true,"padRight":true},{"text":"and the hidden states of the LSTM. Li et al. ","element":"span"},{"href":"#id-19","referenceIndex":20,"text":"[20] ","element":"a"},{"text":"used neural networks to model the location update for some machine learning tasks such as logistic/linear regression and neural net classifiers. Chen et al. 
","element":"span"},{"href":"#id-20","referenceIndex":21,"text":"[21] ","element":"a"},{"text":"proposed to obtain the iterate directly for black-box optimization problems, where the iterate is produced by an LSTM which takes previous queries, function evaluations, and hidden states as inputs.","element":"span"}],[{"text":"In existing learning-to-learn approaches, neural networks are simply used as a black box. The interpretability issue of deep learning is thus inherited. A model-driven method with prior knowledge from hand-crafted classical optimization algorithms is thus very appealing. Model-driven deep learning ","element":"span"},{"href":"#id-21","referenceIndex":22,"text":"[22]","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":23,"text":"[23] ","element":"a"},{"text":"has shown its ability to learn hyper-parameters for a compressed sensing problem in MRI image analysis, and for stochastic gradient descent methods ","element":"span"},{"href":"#id-23","referenceIndex":24,"text":"[24]","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":25,"text":"[25]","element":"a"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"B. The Escaping Phase","element":"span"}],[{"text":"A few methods, including tunneling ","element":"span"},{"href":"#id-25","referenceIndex":26,"text":"[26] ","element":"a"},{"text":"and filled function ","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"[8]","element":"a"},{"text":", have been proposed to escape from local optima. The tunneling method was first proposed by Levy and Montalvo ","element":"span"},{"href":"#id-25","referenceIndex":26,"text":"[26]","element":"a"},{"text":". The core idea is to use the zero of an auxiliary function, called the tunneling function, as the new starting point for the next minimization phase. 
The filled function method was first proposed by Ge and Qin ","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"[8]","element":"a"},{"text":". The method aims to find a point that falls into the attraction basin of a local minimizer better than the current one by minimizing an auxiliary function, called the filled function. The tunneling and filled function methods are both based on the construction of an auxiliary function, and the auxiliary functions are all built upon the local minimum obtained from the previous minimization phase. Both were originally proposed for smooth global optimization.","element":"span"}],[{"text":"Existing research on tunneling and filled function methods focuses either on developing better auxiliary functions or on extending them to constrained and non-smooth optimization problems ","element":"span"},{"href":"#id-26","referenceIndex":27,"text":"[27]","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":28,"text":"[28]","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":29,"text":"[29]","element":"a"},{"text":". In general, these methods have similar drawbacks. First, finding a zero or an optimizer of the auxiliary function is itself a hard optimization problem. Second, minimizing the auxiliary function is not always guaranteed to yield a better starting point ","element":"span"},{"href":"#id-29","referenceIndex":30,"text":"[30]","element":"a"},{"text":". Third, there often exist hyper-parameters which are critical to the methods’ escaping performance but are difficult to control ","element":"span"},{"href":"#id-30","referenceIndex":31,"text":"[31]","element":"a"},{"text":". Fourth, some proposed auxiliary functions are built with exponential or logarithmic terms. 
This could cause ill-conditioning problems for the minimization phase ","element":"span"},{"href":"#id-29","referenceIndex":30,"text":"[30]","element":"a"},{"text":". Last but not least, it has been found that although the filled function and tunneling methods have desirable theoretical properties, their numerical performance is far from satisfactory ","element":"span"},{"href":"#id-29","referenceIndex":30,"text":"[30]","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"100%"},"width":1005,"height":191,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-7.png","element":"img"}],[{"id":"id-36","text":"Fig. 1. Illustration of a finite horizon Markov decision process.","element":"figcaption","subtype":"caption"}],[{"style":{"fontStyle":"italic"},"text":"C. Main Contributions","element":"span"}],[{"text":"In this paper, we first propose a model-driven learning approach to learn adaptive descent directions for locally convex functions. An algorithm with guaranteed local convergence is then developed based on the learned directions. We further model the escaping phase within the filled function method as a Markov decision process (MDP) and propose two policies, namely a fixed policy and a policy learned by policy gradient, for deciding the new starting point. Combining the learned local algorithm and the escaping policy, a two-phase global optimization algorithm is finally formed.","element":"span"}],[{"text":"We prove that the learned local search algorithm is convergent, and we explain the insight behind the fixed policy, which can have a higher probability of finding promising starting points than random sampling. Extensive experiments are carried out to justify the effectiveness of the learned local search algorithm, the two policies and the learned two-phase global optimization algorithm.","element":"span"}],[{"text":"The rest of the paper is organized as follows. 
Section ","element":"span"},{"text":"II ","element":"span"},{"text":"briefly discusses reinforcement learning and the policy gradient method to be used in the escaping phase. Section ","element":"span"},{"text":"III ","element":"span"},{"text":"presents the model-driven learning-to-learn approach for convex optimization. The escaping phase is presented in Section ","element":"span"},{"href":"#id-31","text":"IV, ","element":"a"},{"text":"in which the fixed escaping policy under the MDP framework is presented in Section ","element":"span"},{"href":"#id-32","text":"IV-B, ","element":"a"},{"text":"while the details of the learned policy are presented in Section ","element":"span"},{"href":"#id-33","text":"IV-C. ","element":"a"},{"text":"A controlled experimental study is presented in Section ","element":"span"},{"href":"#id-34","text":"V. ","element":"a"},{"text":"Section ","element":"span"},{"href":"#id-35","text":"VI ","element":"a"},{"text":"concludes the paper and discusses future work.","element":"span"}]]},{"heading":"II. BRIEF INTRODUCTION OF REINFORCEMENT LEARNING","paragraphs":[[{"text":"In reinforcement learning (RL), the learner (agent) chooses an action at each time step; the action changes the state of the environment; (possibly delayed) feedback (reward) is returned as the response of the environment to the learner’s action and affects the learner’s next decision. 
The learner aims to find an optimal policy so that the actions decided by the policy maximize cumulative rewards along time.","element":"span"}],[{"text":"Consider a finite-horizon MDP with continuous state and action space defined by the tuple ","element":"span"},{"style":{"height":16},"width":328.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-8.png","element":"img","alt":" (S, A, µ0, p, r, π, T)","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":14.18},"width":141.36,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-9.png","element":"img","alt":"S ∈ RD","inline":true,"padRight":true},{"text":"denotes the state space, ","element":"span"},{"style":{"height":14.18},"width":137.06,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-10.png","element":"img","alt":" A ∈ Rd","inline":true,"padRight":true},{"text":"the action space, ","element":"span"},{"style":{"height":10},"width":40.01,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-11.png","element":"img","alt":"µ0","inline":true,"padRight":true},{"text":"the initial distribution of the state, ","element":"span"},{"style":{"height":11.2},"width":174.53,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-12.png","element":"img","alt":" r : S → R","inline":true,"padRight":true},{"text":"the reward, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"the time horizon, respectively. 
At each time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", there are ","element":"span"},{"style":{"height":14},"width":248.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-13.png","element":"img","alt":"st ∈ S, at ∈ A","inline":true,"padRight":true},{"text":"and a transition probability ","element":"span"},{"style":{"height":14.8},"width":286.81,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-14.png","element":"img","alt":" p : S × A × S →","inline":true,"padRight":true},{"text":"R ","element":"span"},{"text":"where ","element":"span"},{"style":{"height":16},"width":221.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/1-15.png","element":"img","alt":" p(st+1|at, st)","inline":true,"padRight":true},{"text":"denotes the transition probability of ","element":"span"},{"style":{"height":10.79},"width":71.18,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-0.png","element":"img","alt":"st+1","inline":true,"padRight":true},{"text":"conditionally based on ","element":"span"},{"style":{"height":9.19},"width":30.68,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-1.png","element":"img","alt":" st","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.19},"width":33.06,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-2.png","element":"img","alt":" at","inline":true},{"text":". 
The policy ","element":"span"},{"style":{"height":10.8},"width":138.92,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-3.png","element":"img","alt":" π : S ×","inline":true},{"style":{"height":16},"width":368.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-4.png","element":"img","alt":"A×{0, 1, · · · T} → R","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":172.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-5.png","element":"img","alt":" π(at|st; θ)","inline":true,"padRight":true},{"text":"is the probability of choosing action ","element":"span"},{"style":{"height":9.19},"width":33.07,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-6.png","element":"img","alt":" at","inline":true,"padRight":true},{"text":"when observing current state ","element":"span"},{"style":{"height":9.19},"width":30.68,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-7.png","element":"img","alt":" st","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-8.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"as the parameter.","element":"span"}],[{"text":"As shown in Fig. 
","element":"span"},{"href":"#id-36","text":"1, ","element":"a"},{"text":"starting from a state ","element":"span"},{"style":{"height":10},"width":129.71,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-9.png","element":"img","alt":" s0 ∼ µ0","inline":true},{"text":", the agent chooses ","element":"span"},{"style":{"height":16},"width":288.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-10.png","element":"img","alt":" a0 ∼ π(a0|s0, θ)","inline":true},{"text":"; after executing the action, agent arrives at state ","element":"span"},{"style":{"height":16},"width":288.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-11.png","element":"img","alt":" s1 ∼ p(s1|a0, s0)","inline":true},{"text":". Meanwhile, agent receives a reward ","element":"span"},{"style":{"height":16},"width":87.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-12.png","element":"img","alt":" r(s1)","inline":true,"padRight":true},{"text":"(or ","element":"span"},{"style":{"height":9.19},"width":33.98,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-13.png","element":"img","alt":" r1","inline":true},{"text":") from the environment. Iteratively, a trajectory ","element":"span"},{"style":{"height":16},"width":755.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-14.png","element":"img","alt":" τ = {s0, a0, r1, s1, a1, r2, · · · , aT −1, sT , rT }","inline":true,"padRight":true},{"text":"can be obtained. 
The optimal policy ","element":"span"},{"style":{"height":10.98},"width":40.15,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-15.png","element":"img","alt":" π∗","inline":true,"padRight":true},{"text":"is to be found by maximizing the expectation of the cumulative reward ","element":"span"},{"style":{"height":16},"width":138.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-16.png","element":"img","alt":" R(τ) =","inline":true},{"style":{"height":28.8},"width":312.44,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-17.png","element":"img","alt":"[∑T −1t=0 γtr(st+1)]","inline":true},{"text":":","element":"span"}],[{"style":{"width":"90%"},"width":906,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-18.png","element":"img"}],[{"text":"where the expectation is taken over trajectory ","element":"span"},{"style":{"height":16},"width":205.27,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-19.png","element":"img","alt":" τ ∼ q(τ; θ)","inline":true,"padRight":true},{"text":"where","element":"span"}],[{"style":{"width":"87%"},"width":876,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-20.png","element":"img"}],[{"text":"A variety of reinforcement learning algorithms have been proposed for different scenarios of the state and action spaces; see ","element":"span"},{"href":"#id-37","referenceIndex":32,"text":"[32] ","element":"a"},{"text":"for recent advancements. 
RL algorithms have been highly successful at playing games such as Go ","element":"span"},{"href":"#id-38","referenceIndex":33,"text":"[33]","element":"a"},{"text":", Atari ","element":"span"},{"href":"#id-39","referenceIndex":34,"text":"[34] ","element":"a"},{"text":"and many others.","element":"span"}],[{"text":"We briefly introduce the policy gradient method for continuous state space ","element":"span"},{"href":"#id-40","referenceIndex":35,"text":"[35]","element":"a"},{"text":", which will be used in our study. Taking the derivative of ","element":"span"},{"style":{"height":16},"width":82.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-21.png","element":"img","alt":" U(θ)","inline":true,"padRight":true},{"text":"w.r.t. ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-22.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and discarding unrelated terms, we have","element":"span"}],[{"id":"id-41","style":{"width":"94%"},"width":947,"height":326,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-23.png","element":"img"}],[{"text":"Eq. 
","element":"span"},{"href":"#id-41","text":"6 ","element":"a"},{"text":"can be calculated by sampling trajectories ","element":"span"},{"style":{"height":10},"width":174.92,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-24.png","element":"img","alt":" τ1, · · · , τN","inline":true,"padRight":true},{"text":"in practice:","element":"span"}],[{"style":{"width":"93%"},"width":934,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-25.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":20.68},"width":147.59,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-26.png","element":"img","alt":" a(i)t (s(i)t )","inline":true,"padRight":true},{"text":"denotes action (state) at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"th ","element":"span"},{"text":"trajectory, ","element":"span"},{"style":{"height":16.98},"width":97.26,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-27.png","element":"img","alt":" R(τ i)","inline":true,"padRight":true},{"text":"is the cumulative reward of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"th trajectory. 
For continuous state and action spaces, one normally assumes","element":"span"}],[{"id":"id-80","style":{"width":"89%"},"width":896,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-28.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14},"width":24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-29.png","element":"img","alt":" φ","inline":true,"padRight":true},{"text":"can be any smooth function, such as a radial basis function, a linear function, or even a neural network.","element":"span"}]]},{"heading":"III. MODEL-DRIVEN LEARNING TO LEARN FOR LOCAL SEARCH","paragraphs":[[{"text":"In this section, we first summarize some well-known first- and second-order classical optimization algorithms. We then present the proposed model-driven learning to optimize method for locally convex functions.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"A. Classical Optimization Methods","element":"span"}],[{"text":"In the sequel, denote ","element":"span"},{"style":{"height":16},"width":611.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-30.png","element":"img","alt":" gk = ∇f(xk), sk = xk+1 − xk, yk =","inline":true},{"style":{"height":10.79},"width":167.81,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-31.png","element":"img","alt":"gk+1 − gk","inline":true},{"text":". 
The descent direction ","element":"span"},{"style":{"height":13.19},"width":37.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-32.png","element":"img","alt":" dk","inline":true,"padRight":true},{"text":"at the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th iteration of some classical methods takes the following form ","element":"span"},{"href":"#id-11","referenceIndex":12,"text":"[12]","element":"a"},{"text":":","element":"span"}],[{"style":{"width":"83%"},"width":842,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-33.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.19},"width":50.13,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-34.png","element":"img","alt":" Hk","inline":true,"padRight":true},{"text":"is an approximation to the inverse of the Hessian matrix, and ","element":"span"},{"style":{"height":9.19},"width":42.5,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-35.png","element":"img","alt":" αk","inline":true,"padRight":true},{"text":"is a coefficient that varies across conjugate gradient methods. 
For example, ","element":"span"},{"style":{"height":9.19},"width":42.49,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-36.png","element":"img","alt":" αk","inline":true,"padRight":true},{"text":"could take ","element":"span"},{"style":{"height":17.22},"width":299.54,"height":43.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-37.png","element":"img","alt":" g⊺kyk−1/d⊺k−1yk−1","inline":true,"padRight":true},{"text":"for ","element":"span"},{"text":"Crowder-Wolfe conjugate gradient method ","element":"span"},{"href":"#id-11","referenceIndex":12,"text":"[12]","element":"a"},{"text":".","element":"span"}],[{"text":"The update of ","element":"span"},{"style":{"height":13.19},"width":50.13,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-38.png","element":"img","alt":" Hk","inline":true,"padRight":true},{"text":"also varies for different quasi-Newton methods. In the Huang family, ","element":"span"},{"style":{"height":13.19},"width":50.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-39.png","element":"img","alt":" Hk","inline":true,"padRight":true},{"text":"is updated as follows:","element":"span"}],[{"style":{"width":"85%"},"width":856,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-40.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"83%"},"width":843,"height":164,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-41.png","element":"img"}],[{"text":"The Broyden family is a special case of the Huang family in case ","element":"span"},{"style":{"height":14},"width":93.75,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-42.png","element":"img","alt":" ρ = 1","inline":true},{"text":", and 
","element":"span"},{"style":{"height":9.19},"width":160.92,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-43.png","element":"img","alt":" a12 = a21","inline":true},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"B. Learning the descent direction: d-Net","element":"span"}],[{"text":"We propose to consider the descent direction ","element":"span"},{"style":{"height":13.19},"width":37.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-44.png","element":"img","alt":" dk","inline":true,"padRight":true},{"text":"as a nonlinear function of ","element":"span"},{"style":{"height":16},"width":564.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-45.png","element":"img","alt":" Sk = {gk, gk−1, sk−1, sk−2, yk−2}","inline":true,"padRight":true},{"text":"with parameter ","element":"span"},{"style":{"height":17.9},"width":436.91,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-46.png","element":"img","alt":" θk = {w1k, w2k, w3k, w4k, βk}","inline":true,"padRight":true},{"text":"for the adaptive compu- ","element":"span"},{"text":"tation of descent search direction ","element":"span"},{"style":{"height":16},"width":248.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-47.png","element":"img","alt":" dk = h(Sk; θk)","inline":true},{"text":". 
Denote","element":"span"}],[{"style":{"width":"83%"},"width":840,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-48.png","element":"img"}],[{"text":"We propose","element":"span"}],[{"id":"id-42","style":{"width":"90%"},"width":912,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-49.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"I ","element":"span"},{"text":"is the identity matrix.","element":"span"}],[{"text":"At each iteration, rather than updating ","element":"span"},{"style":{"height":13.19},"width":91.64,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-50.png","element":"img","alt":" Hk−1","inline":true,"padRight":true},{"text":"directly, we update the multiplication of ","element":"span"},{"style":{"height":13.19},"width":88.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-51.png","element":"img","alt":" Rk−1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":91.65,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-52.png","element":"img","alt":" Hk−1","inline":true,"padRight":true},{"text":"like in the Huang family ","element":"span"},{"href":"#id-11","referenceIndex":12,"text":"[12]","element":"a"},{"text":":","element":"span"}],[{"style":{"width":"97%"},"width":975,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-53.png","element":"img"}],[{"text":"It can be seen that with different parameter ","element":"span"},{"style":{"height":17.5},"width":267.84,"height":43.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-54.png","element":"img","alt":" wik, i = 1, · · · , 4","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":14.4},"width":39.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-55.png","element":"img","alt":" βk","inline":true,"padRight":true},{"text":"settings, ","element":"span"},{"style":{"height":13.19},"width":37.74,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-56.png","element":"img","alt":" dk","inline":true,"padRight":true},{"text":"can degenerate to different directions:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"when ","element":"span"},{"style":{"height":17.9},"width":391.87,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-57.png","element":"img","alt":" w1k, w2k, w3k, w4k ∈ {0, 1}","inline":true},{"text":", the denominator of ","element":"span"},{"style":{"height":13.19},"width":47.26,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-58.png","element":"img","alt":" Rk","inline":true,"padRight":true},{"text":"is ","element":"span"},{"text":"not zero, and ","element":"span"},{"style":{"height":14.4},"width":115.3,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-59.png","element":"img","alt":" βk = 0","inline":true},{"text":", the update degenerates to conjugate gradient.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"when ","element":"span"},{"style":{"height":17.9},"width":391.88,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-60.png","element":"img","alt":" w1k, w2k, w3k, w4k ∈ {0, 1}","inline":true},{"text":", and the denominator of ","element":"span"},{"style":{"height":13.19},"width":47.26,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-61.png","element":"img","alt":"Rk","inline":true,"padRight":true},{"text":"is not zero, and 
","element":"span"},{"style":{"height":14.4},"width":115.3,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-62.png","element":"img","alt":" βk = 1","inline":true},{"text":", the update becomes the preconditioned conjugate gradient.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"when ","element":"span"},{"style":{"height":17.9},"width":538.01,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-63.png","element":"img","alt":" w1k = 1, w2k = 1, w3k = 1, w4k = 1","inline":true},{"text":", and ","element":"span"},{"style":{"height":14.4},"width":115.3,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-64.png","element":"img","alt":" βk = 1","inline":true},{"text":", the ","element":"span"},{"text":"update degenerates to the Huang family.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"when ","element":"span"},{"style":{"height":17.9},"width":260.2,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-65.png","element":"img","alt":" w1k = 0, w2k = 0","inline":true},{"text":", the denominator of ","element":"span"},{"style":{"height":13.19},"width":47.26,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-66.png","element":"img","alt":" Rk","inline":true,"padRight":true},{"text":"is not zero, ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":14.4},"width":115.3,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/2-67.png","element":"img","alt":" βk = 0","inline":true,"padRight":true},{"text":"the update becomes the steepest GD. Based on Eq. ","element":"span"},{"href":"#id-42","text":"15, ","element":"a"},{"text":"a new optimization algorithm, called adaptive gradient descent algorithm (AGD), can be established.","element":"span"}],[{"text":"It is summarized in Alg. 
","element":"span"},{"href":"#id-43","text":"1. ","element":"a"},{"text":"It is seen that to obtain a new direction by Eq. ","element":"span"},{"href":"#id-42","text":"15, ","element":"a"},{"text":"information from two steps ahead is required as included in ","element":"span"},{"style":{"height":13.19},"width":41.44,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-0.png","element":"img","alt":" Sk","inline":true},{"text":". To initiate the computation of new direction, in Alg. ","element":"span"},{"href":"#id-43","text":"1, ","element":"a"},{"text":"first a steep gradient descent step (lines ","element":"span"},{"href":"#id-43","text":"3- ","element":"a"},{"href":"#id-43","text":"5) ","element":"a"},{"text":"and then a non-linear descent step (lines ","element":"span"},{"href":"#id-43","text":"7-","element":"a"},{"href":"#id-43","text":"10) ","element":"a"},{"text":"are applied. With these prepared information, AGD iterates (lines ","element":"span"},{"href":"#id-43","text":"14-","element":"a"},{"href":"#id-43","text":"18) ","element":"a"},{"text":"until the norm of gradient at the solution ","element":"span"},{"style":{"height":9.19},"width":39.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-1.png","element":"img","alt":" xk","inline":true,"padRight":true},{"text":"is less than a positive number ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-2.png","element":"img","alt":" ϵ","inline":true},{"text":".","element":"span"}],[{"id":"id-43","style":{"width":"100%"},"width":1005,"height":1087,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-3.png","element":"img"}],[{"text":"To specify the parameters ","element":"span"},{"style":{"height":13.19},"width":35.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-4.png","element":"img","alt":" 
θk","inline":true,"padRight":true},{"text":"in the direction update function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":", like ","element":"span"},{"href":"#id-17","referenceIndex":18,"text":"[18]","element":"a"},{"text":", we unfold AGD into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"iterations. Each iteration can be considered as a layer in a neural network. We thus have a ‘deep’ neural network with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"layers. The resultant network is called ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d-Net","element":"span"},{"text":". Fig. ","element":"span"},{"href":"#id-44","text":"2 ","element":"a"},{"text":"shows the unfolding.","element":"span"}],[{"text":"Like normal neural networks, we need to train for its parameters ","element":"span"},{"style":{"height":16},"width":305.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-5.png","element":"img","alt":" θ = {θ1, · · · , θT }","inline":true},{"text":". 
To learn the parameters, the loss function ","element":"span"},{"style":{"height":16},"width":67.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-6.png","element":"img","alt":" ℓ(θ)","inline":true,"padRight":true},{"text":"is defined as","element":"span"}],[{"id":"id-68","style":{"width":"93%"},"width":943,"height":256,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-7.png","element":"img"}],[{"text":"That is, we expect these parameters are optimal not only to a single function, but to a class of functions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"; and to all the criteria along the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"iterations.","element":"span"}],[{"text":"We hereby choose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"to be the Gaussian function family:","element":"span"}],[{"style":{"width":"95%"},"width":962,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-8.png","element":"img"}],[{"text":"There are two reasons to choose the Gaussian function family. First, any ","element":"span"},{"style":{"height":14.79},"width":154.52,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-9.png","element":"img","alt":" f ∈ FG","inline":true,"padRight":true},{"text":"is locally convex. 
That is, letting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"denote the Hessian matrix of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":")","element":"span"},{"text":", it is seen that","element":"span"}],[{"style":{"width":"82%"},"width":824,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-10.png","element":"img"}],[{"style":{"width":"98%"},"width":993,"height":484,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-11.png","element":"img"}],[{"id":"id-44","text":"Fig. 2. The unfolding of the AGD.","element":"figcaption","subtype":"caption"}],[{"text":"Second, it is known that a finite Gaussian mixture model can approximate a Riemann integrable function with arbitrary accuracy ","element":"span"},{"href":"#id-45","referenceIndex":36,"text":"[36]","element":"a"},{"text":". 
Therefore, to learn an optimization algorithm that guarantees convergence to local optima, it is sufficient to choose functions that are locally convex.","element":"span"}],[{"text":"Given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":", when optimizing ","element":"span"},{"style":{"height":16},"width":67.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-12.png","element":"img","alt":" ℓ(θ)","inline":true},{"text":", the expectation can be estimated by Monte Carlo approximation with a set of functions sampled from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":", that is","element":"span"}],[{"style":{"width":"76%"},"width":766,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-13.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":208.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-14.png","element":"img","alt":" fi ∼ F. ℓ(θ)","inline":true,"padRight":true},{"text":"can then be optimized by the steepest GD algorithm.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Note: ","element":"span"},{"text":"The contributions of the proposed d-Net can be summarized as follows. First, there is a significant difference between the proposed learning-to-learn approach and existing methods, such as ","element":"span"},{"href":"#id-17","referenceIndex":18,"text":"[18]","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":21,"text":"[21]","element":"a"},{"text":". In existing methods, an LSTM is used as a ‘black box’ to determine the descent direction, and the parameters of the LSTM are shared across the time horizon. In our approach, by contrast, the direction is a combination of known and well-studied directions, i.e. a ‘white box’, which means that our model is interpretable. 
This is a clear advantage over black-box models.","element":"span"}],[{"text":"Second, in classical methods, such as the Broyden and Huang families and LM, descent directions are constructed through a linear combination. In contrast, the proposed method is nonlinear and subsumes a wide range of classical methods. This may result in better directions.","element":"span"}],[{"text":"Further, the combination parameters used in classical methods are considered to be hyper-parameters. They are normally set by trial and error. In the AGD, these parameters are learned from optimization experience on a class of functions, so that the directions can adapt to new optimization problems.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"C. Group d-Net","element":"span"}],[{"text":"To further improve the search ability of d-Net, we employ a group of d-Nets, dubbed Gd-Net. These d-Nets are connected sequentially, with shared parameters among them. The input of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th (","element":"span"},{"style":{"fontStyle":"italic"},"text":"k > ","element":"span"},{"text":"1","element":"span"},{"text":") d-Net is the gradient from the ","element":"span"},{"style":{"height":16},"width":122.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/3-15.png","element":"img","alt":" (k − 1)","inline":true},{"text":"-th d-Net. To apply Gd-Net, an initial point is taken as the input and is passed forward through these d-Nets until the gradient norm is less than a predefined small positive real number.","element":"span"}],[{"text":"In the following, we show that Gd-Net guarantees convergence to an optimum for convex functions. We first prove that AGD is convergent. Theorem ","element":"span"},{"href":"#id-46","text":"1 ","element":"a"},{"text":"summarizes the result. 
Please see Appendix A for the proof.","element":"span"}],[{"id":"id-46","style":{"fontWeight":"bold"},"text":"Theorem 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume ","element":"span"},{"style":{"height":14},"width":281.39,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-0.png","element":"img","alt":" f : Rn → R","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is continuous and differentiable and the sublevel set","element":"span"}],[{"style":{"width":"82%"},"width":828,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"is bounded. The sequence ","element":"span"},{"style":{"height":16},"width":326.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-2.png","element":"img","alt":" {xk, k = 1, 2, · · · }","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"obtained by AGD with exact line search converges to a stable point.","element":"span"}],[{"text":"Since d-Net is the unfolding of AGD, it follows from Theorem ","element":"span"},{"href":"#id-46","text":"1, ","element":"a"},{"text":"that the sequence of function values obtained by d-Net is non-increasing for any initial ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-3.png","element":"img","alt":" x0","inline":true,"padRight":true},{"text":"with properly learned parameters. Therefore, applying a sequence of d-Nets (i.e. 
Gd-Net) on a bounded function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"from any initial point ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-4.png","element":"img","alt":" x0","inline":true,"padRight":true},{"text":"will result in a sequence of non-increasing function values. This ensures the convergence of the sequence, which indicates that Gd-Net is convergent under the assumption of Theorem ","element":"span"},{"href":"#id-46","text":"1.","element":"a"}]]},{"heading":"IV. ESCAPING FROM LOCAL OPTIMUM","paragraphs":[[{"id":"id-31","text":"Gd-Net guarantees convergence for locally convex ","element":"span"},{"text":"functions. To approach global optimality, we present a method to escape from a local optimum once trapped. Our method is based on the filled-function method, and is embedded within the MDP framework.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"A. The Escaping Phase in the Filled Function Method","element":"span"}],[{"text":"In the escaping phase of the filled function method, a local search method is applied to minimize the filled function in order to obtain a good starting point for the next minimization phase. 
To apply the local search method, the starting point is set as ","element":"span"},{"style":{"height":13.99},"width":134.8,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-5.png","element":"img","alt":" x0+δ0d","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-6.png","element":"img","alt":"x0","inline":true,"padRight":true},{"text":"is the local minimizer obtained from previous minimization phase, ","element":"span"},{"style":{"height":13.99},"width":33.71,"height":34.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-7.png","element":"img","alt":" δ0","inline":true,"padRight":true},{"text":"is a small constant and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"is the search direction.","element":"span"}],[{"text":"Many filled functions have been constructed (please see ","element":"span"},{"href":"#id-29","referenceIndex":30,"text":"[30] ","element":"a"},{"text":"for a survey). One of the popular filled-functions ","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"[8] ","element":"a"},{"text":"is defined as follows","element":"span"}],[{"style":{"width":"85%"},"width":858,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is a hyper-parameter. 
It is expected that minimizing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"can lead to a local minimizer which is away from ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-9.png","element":"img","alt":" x0","inline":true,"padRight":true},{"text":"due to the presence of the exponential term.","element":"span"}],[{"text":"Theoretical analysis has been conducted on the filled function methods in terms of their escaping ability ","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"[8]","element":"a"},{"text":". However, the filled function methods have many practical weaknesses yet to be overcome.","element":"span"}],[{"text":"First, the hyper-parameter ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is critical to algorithm performance. Roughly speaking, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is small, the method struggles to escape from ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-10.png","element":"img","alt":" x0","inline":true},{"text":"; otherwise it may miss some local minima. It is, however, very hard to determine the optimal value of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":". There are no theoretical results, nor any rule of thumb, on how to choose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":".","element":"span"}],[{"text":"Second, the search direction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"is also very important to the algorithmic performance. 
Different ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"’s may lead to different local minimizers, and the local minimizers are not necessarily better than ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-11.png","element":"img","alt":" x0","inline":true},{"text":". In the literature, a trial-and-error procedure is usually applied to find the best direction from a set of pre-fixed","element":"span"}],[{"style":{"width":"76%"},"width":769,"height":611,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-12.png","element":"img"}],[{"text":"Fig. 3. Red lines show the contours of the three-hump function ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":203.02,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-13.png","element":"img","alt":" f(x1, x2), blue","inline":true,"padRight":true},{"text":"arrows are the gradients of ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":169.2,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-14.png","element":"img","alt":" −H(x1, x2)","inline":true},{"text":". There is a saddle point at (12, 15) for ","element":"figcaption","subtype":"caption"},{"id":"id-47","style":{"height":12.8},"width":188.53,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-15.png","element":"img","alt":" −H(x1, x2) .","inline":true}],[{"text":"directions, e.g. along the coordinate axes ","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"[8]","element":"a"},{"text":". This is clearly not effective. 
To the best of our knowledge, no work has been done on this issue.","element":"span"}],[{"text":"Third, minimizing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"itself is hard and may lead not to a local optimum but to a saddle point ","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"[8] ","element":"a"},{"text":"even when a promising search direction is used. Unfortunately, there are no studies in the literature on how to deal with this scenario. Fig. ","element":"span"},{"href":"#id-47","text":"3 ","element":"a"},{"text":"illustrates this phenomenon. In the figure, the contours of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"are shown as red lines, while the negative gradients of the filled function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"are shown as blue arrows. From Fig. 
","element":"span"},{"href":"#id-47","text":"3, ","element":"a"},{"text":"it is seen that minimizing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"from a local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"at ","element":"span"},{"text":"(4","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"13) ","element":"span"},{"text":"will lead to the saddle point at ","element":"span"},{"text":"(12","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"15)","element":"span"},{"text":".","element":"span"}],[{"id":"id-32","style":{"fontStyle":"italic"},"text":"B. The Proposed Escaping Scheme","element":"span"}],[{"text":"The goal of an escaping phase is to find a new starting point ","element":"span"},{"style":{"height":14.98},"width":301.16,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-16.png","element":"img","alt":" xnew = xold + ∆x","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":10.98},"width":68.41,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-17.png","element":"img","alt":" xnew","inline":true,"padRight":true},{"text":"can escape from the attraction basin of ","element":"span"},{"style":{"height":13.78},"width":58.48,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-18.png","element":"img","alt":" xold","inline":true,"padRight":true},{"text":"(the local minimizer obtained from previous minimization phase) if a minimization procedure is applied, where ","element":"span"},{"style":{"height":14},"width":192.38,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-19.png","element":"img","alt":" ∆x = 
δd, d","inline":true,"padRight":true},{"text":"is the direction and ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-20.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"is called the escaping length in this paper.","element":"span"}],[{"text":"Rather than choosing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"from a pre-fixed set, we could sample some directions, either randomly or sequentially following certain rules. In this section, we propose an effective way to sample directions, or more precisely speaking ","element":"span"},{"style":{"height":11.6},"width":56.21,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-21.png","element":"img","alt":" ∆x","inline":true},{"text":"’s.","element":"span"}],[{"text":"In our approach, the sampling of ","element":"span"},{"style":{"height":11.6},"width":56.21,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-22.png","element":"img","alt":" ∆x","inline":true,"padRight":true},{"text":"is modeled as a finite-horizon MDP. 
That is, the sampling is viewed as the execution of a policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-23.png","element":"img","alt":" π","inline":true},{"text":": at each time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", given the current state ","element":"span"},{"style":{"height":9.19},"width":30.68,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-24.png","element":"img","alt":" st","inline":true},{"text":", and reward ","element":"span"},{"style":{"height":10.79},"width":70.48,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-25.png","element":"img","alt":" rt+1","inline":true},{"text":", an action ","element":"span"},{"style":{"height":9.19},"width":33.06,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-26.png","element":"img","alt":" at","inline":true},{"text":", i.e. the increment ","element":"span"},{"style":{"height":11.6},"width":56.21,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-27.png","element":"img","alt":" ∆x","inline":true},{"text":", is obtained by the policy. 
The policy returns ","element":"span"},{"style":{"height":11.6},"width":56.21,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-28.png","element":"img","alt":" ∆x","inline":true,"padRight":true},{"text":"by deciding a search direction ","element":"span"},{"style":{"height":13.19},"width":32.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-29.png","element":"img","alt":" dt","inline":true,"padRight":true},{"text":"and an escaping length ","element":"span"},{"style":{"height":13.99},"width":29.71,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-30.png","element":"img","alt":" δt","inline":true},{"text":".","element":"span"}],[{"text":"At each time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", the state ","element":"span"},{"style":{"height":9.19},"width":30.68,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-31.png","element":"img","alt":" st","inline":true,"padRight":true},{"text":"is composed of a collection of previously used search directions ","element":"span"},{"style":{"height":16},"width":360.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-32.png","element":"img","alt":" dt = {d1, · · · , dN0}","inline":true,"padRight":true},{"text":"and their scores ","element":"span"},{"style":{"height":16},"width":341.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-33.png","element":"img","alt":" ut = {u1, · · · , uN0}","inline":true},{"text":", where ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-34.png","element":"img","alt":" N0","inline":true,"padRight":true},{"text":"is a hyper-parameter. 
Here the score of a search direction measures how promising a direction is in terms of the quality of the new starting point that it can lead to. A new starting point is of high quality if applying local search from it can lead to a better minimizer than the current one. The initial state ","element":"span"},{"style":{"height":9.19},"width":34.68,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/4-35.png","element":"img","alt":" s0","inline":true,"padRight":true},{"text":"includes a set of ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-0.png","element":"img","alt":" N0","inline":true,"padRight":true},{"text":"directions sampled uniformly at random, and their corresponding scores.","element":"span"}],[{"text":"In the following, we first define ‘score’, then present the policy on deciding ","element":"span"},{"style":{"height":9.19},"width":33.06,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-1.png","element":"img","alt":" at","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.99},"width":29.71,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-2.png","element":"img","alt":" δt","inline":true},{"text":", and the transition probability ","element":"span"},{"style":{"height":16},"width":221.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-3.png","element":"img","alt":"p(st+1|at, st)","inline":true},{"text":". 
Without causing confusion, we omit the subscript in the sequel.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1) Score: ","element":"span"},{"text":"Given a search direction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":", a local minimizer ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-4.png","element":"img","alt":" x0","inline":true},{"text":", define","element":"span"}],[{"style":{"width":"72%"},"width":727,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.79},"width":141.34,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-6.png","element":"img","alt":" t ∈ R++","inline":true,"padRight":true},{"text":"is the step size along ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"100%"},"width":1005,"height":611,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-7.png","element":"img"}],[{"id":"id-67","style":{"fontWeight":"bold"},"text":"Theorem 2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"height":13.19},"width":220.46,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-8.png","element":"img","alt":" x′ = x0+T·d","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a point outside the boundary of ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-9.png","element":"img","alt":"x0","inline":true},{"style":{"fontStyle":"italic"},"text":"’s attraction basin, there is no other local minimizer within ","element":"span"},{"style":{"height":16},"width":671.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-10.png","element":"img","alt":"B0 = {x ∈ Rn|∥x − x0∥2 ≤ ∥x′ − x0∥2}","inline":true},{"style":{"fontStyle":"italic"},"text":". Then there exists a ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-11.png","element":"img","alt":"ξ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that","element":"span"}],[{"style":{"width":"49%"},"width":492,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"obtains its maximum at ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-13.png","element":"img","alt":" ξ","inline":true},{"style":{"fontStyle":"italic"},"text":". 
And ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is monotonically increasing in ","element":"span"},{"style":{"height":16},"width":83.97,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-14.png","element":"img","alt":" [0, ξ)","inline":true},{"style":{"fontStyle":"italic"},"text":", and monotonically decreasing in ","element":"span"},{"style":{"height":16},"width":92.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-15.png","element":"img","alt":" (ξ, T]","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"If we let ","element":"span"},{"style":{"height":13.19},"width":230.82,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-16.png","element":"img","alt":" d = x′ − x0","inline":true},{"text":", then ","element":"span"},{"style":{"height":16},"width":436.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-17.png","element":"img","alt":" g′(t) = ∇f(x0 + t(x′ −","inline":true},{"style":{"height":16},"width":419.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-18.png","element":"img","alt":"x0))⊺(x′ − x0) ≜ −ud(t)","inline":true},{"text":". 
This implies that ","element":"span"},{"style":{"height":16},"width":87.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-19.png","element":"img","alt":" ud(t)","inline":true,"padRight":true},{"text":"is actually ","element":"span"},{"style":{"height":16},"width":108.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-20.png","element":"img","alt":"−g′(t)","inline":true,"padRight":true},{"text":"along the direction from ","element":"span"},{"style":{"height":6.8},"width":36.78,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-21.png","element":"img","alt":" x′","inline":true,"padRight":true},{"text":"pointing to ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-22.png","element":"img","alt":" x0","inline":true},{"text":". This tells whether a direction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"can lead to a new minimizer or not. 
A direction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"with a positive ","element":"span"},{"style":{"height":16},"width":87.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-23.png","element":"img","alt":" ud(t)","inline":true,"padRight":true},{"text":"indicates that it could lead to a local minimizer different from the present one.","element":"span"}],[{"text":"We therefore define the score of a direction ","element":"span"},{"style":{"height":13.2},"width":82.71,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-24.png","element":"img","alt":" d, ud","inline":true},{"text":", to be the greatest ","element":"span"},{"style":{"height":16},"width":87.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-25.png","element":"img","alt":" ud(t)","inline":true,"padRight":true},{"text":"along ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":", i.e.","element":"span"}],[{"id":"id-78","style":{"width":"64%"},"width":644,"height":69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-26.png","element":"img"}],[{"text":"A direction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"with ","element":"span"},{"style":{"height":13.19},"width":114.53,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-27.png","element":"img","alt":" ud > 0","inline":true},{"text":" is called promising.","element":"span"}],[{"text":"In the following, we present the policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-28.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"on finding 
","element":"span"},{"style":{"height":11.6},"width":56.21,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-29.png","element":"img","alt":" ∆x","inline":true,"padRight":true},{"text":"(or the new starting point). The policy includes two sub-policies. One is to find the new point given a promising direction, i.e. to find the escaping length. The other is to decide the promising direction.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"2) Policy on finding the escaping length: ","element":"span"},{"text":"First we propose to use a simple filled function as follows:","element":"span"}],[{"style":{"width":"68%"},"width":688,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-30.png","element":"img"}],[{"text":"Here ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is called the ‘escaping length controller’ since it controls how far a solution could escape from the current local optimum. Alg. ","element":"span"},{"href":"#id-48","text":"2 ","element":"a"},{"text":"summarizes the policy proposed to determine the optimal ","element":"span"},{"style":{"height":10.98},"width":37.06,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-31.png","element":"img","alt":" a∗","inline":true,"padRight":true},{"text":"and the new starting point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":".","element":"span"}],[{"id":"id-48","style":{"width":"100%"},"width":1005,"height":872,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-32.png","element":"img"}],[{"text":"In Alg. 
","element":"span"},{"href":"#id-48","text":"2, ","element":"a"},{"text":"given a direction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":", the filled function ","element":"span"},{"style":{"height":16},"width":90.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-33.png","element":"img","alt":"˜H(x)","inline":true},{"text":"is optimized for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"steps (line ","element":"span"},{"href":"#id-48","text":"3)","element":"a"},{"text":". The sum of the iterates’ function values, denoted as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"text":"(line ","element":"span"},{"href":"#id-48","text":"4)","element":"a"},{"text":", is maximized w.r.t. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"by gradient ascent (line ","element":"span"},{"href":"#id-48","text":"5)","element":"a"},{"text":". The algorithm terminates if a stationary point of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"text":"is found (","element":"span"},{"style":{"height":16},"width":195.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-34.png","element":"img","alt":"|F ′(a)| ≤ ϵ","inline":true},{"text":"), or the search is out of bound (","element":"span"},{"style":{"height":16},"width":286.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-35.png","element":"img","alt":"∥xN − x0∥ ≥ M","inline":true},{"text":"). 
When the search is out of bound, a negative score is set for the direction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"(line ","element":"span"},{"href":"#id-48","text":"8)","element":"a"},{"text":". As a by-product, Alg. ","element":"span"},{"href":"#id-48","text":"2 ","element":"a"},{"text":"also returns the score ","element":"span"},{"style":{"height":9.19},"width":39.81,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-36.png","element":"img","alt":" ud","inline":true,"padRight":true},{"text":"of the given direction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":".","element":"span"}],[{"text":"We prove that ","element":"span"},{"style":{"height":9.19},"width":49.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-37.png","element":"img","alt":" xN","inline":true,"padRight":true},{"text":"can escape from the attraction basin of ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-38.png","element":"img","alt":"x0","inline":true,"padRight":true},{"text":"and ends up in another attraction basin of a local minimizer ","element":"span"},{"style":{"height":6.8},"width":36.78,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-39.png","element":"img","alt":"x′","inline":true,"padRight":true},{"text":"with smaller criterion ","element":"span"},{"style":{"fontStyle":"italic"},"text":"if ","element":"span"},{"style":{"height":6.8},"width":36.78,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-40.png","element":"img","alt":" x′","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"exists","element":"span"},{"text":". 
Theorem ","element":"span"},{"href":"#id-49","text":"3 ","element":"a"},{"text":"summarizes the result.","element":"span"}],[{"id":"id-49","style":{"fontWeight":"bold"},"text":"Theorem 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that ","element":"span"},{"style":{"height":13.19},"width":227.7,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-41.png","element":"img","alt":" x′ = x0 + Td","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a point such that ","element":"span"},{"style":{"height":16},"width":248.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-42.png","element":"img","alt":"f(x0) ≥ f(x′)","inline":true},{"style":{"fontStyle":"italic"},"text":", and there are no other points with a criterion smaller than or equal to ","element":"span"},{"style":{"height":16},"width":95.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-43.png","element":"img","alt":" f(x0)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"within ","element":"span"},{"style":{"height":13.19},"width":42.17,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-44.png","element":"img","alt":" B0","inline":true},{"style":{"fontStyle":"italic"},"text":". 
If the learning rate ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-45.png","element":"img","alt":" α","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is sufficiently small, then there exists an ","element":"span"},{"style":{"height":10.99},"width":37.06,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-46.png","element":"img","alt":" a∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":16},"width":185.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-47.png","element":"img","alt":"F ′(a∗) = 0","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"According to this theorem, we have the following corollary.","element":"span"}],[{"id":"id-53","style":{"fontWeight":"bold"},"text":"Corollary 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that ","element":"span"},{"style":{"height":9.19},"width":49.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-48.png","element":"img","alt":" xN","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the solution obtained by optimizing ","element":"span"},{"style":{"height":17.38},"width":374.98,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-49.png","element":"img","alt":"˜H(x) = −a∗∥x − x0∥₂²","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"along ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"style":{"fontStyle":"italic"},"text":"starting from ","element":"span"},{"style":{"height":11.59},"width":77.18,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-50.png","element":"img","alt":" x0 
+","inline":true},{"style":{"height":13.99},"width":56.59,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-51.png","element":"img","alt":"δ0d","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"at the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"style":{"fontStyle":"italic"},"text":"-th iteration, then ","element":"span"},{"style":{"height":9.19},"width":49.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-52.png","element":"img","alt":" xN","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"will be in an attraction basin of ","element":"span"},{"style":{"height":6.8},"width":36.78,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-53.png","element":"img","alt":" x′","inline":true},{"style":{"fontStyle":"italic"},"text":", if the basin ever exists.","element":"span"}],[{"text":"Theorem ","element":"span"},{"href":"#id-49","text":"3 ","element":"a"},{"text":"can be explained intuitively as follows. Consider pushing a ball down the peak of a mountain with height ","element":"span"},{"style":{"height":16},"width":126.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-54.png","element":"img","alt":"−f(x1)","inline":true,"padRight":true},{"text":"(it can be regarded as the ball’s gravitational potential energy) along a direction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":". 
The ball will keep moving until it arrives at a point ","element":"span"},{"style":{"height":16.13},"width":244.76,"height":40.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-55.png","element":"img","alt":" ˜x = x0 + ˜td","inline":true,"padRight":true},{"text":"for some ","element":"span"},{"style":{"height":13.74},"width":20.55,"height":34.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-56.png","element":"img","alt":"˜t","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":16},"width":244.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-57.png","element":"img","alt":"f(x1) = f(˜x)","inline":true},{"text":". For any ","element":"span"},{"style":{"height":17.74},"width":175.55,"height":44.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-58.png","element":"img","alt":" t ∈ [δ0, ˜t)","inline":true},{"text":", the ball has a positive velocity, i.e. ","element":"span"},{"style":{"height":16},"width":316.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-59.png","element":"img","alt":" g(t) − g(t1) > 0","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":13.99},"width":151.65,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-60.png","element":"img","alt":" t1 = δ0","inline":true},{"text":". 
But the ball has zero velocity at ","element":"span"},{"style":{"height":13.74},"width":20.56,"height":34.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-61.png","element":"img","alt":"˜t","inline":true},{"text":", and a negative velocity for ","element":"span"},{"style":{"height":14.54},"width":102.72,"height":36.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-62.png","element":"img","alt":" t > ˜t","inline":true},{"text":". Hence ","element":"span"},{"style":{"height":18},"width":418.59,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-63.png","element":"img","alt":"∫(f(x0 + td) − f(x1))dt","inline":true,"padRight":true},{"text":"reaches its maximum in ","element":"span"},{"style":{"height":17.74},"width":74.1,"height":44.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-64.png","element":"img","alt":" [0, ˜t]","inline":true},{"text":". The integral is approximated by its discrete sum, i.e. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":")","element":"span"},{"text":", in Alg. ","element":"span"},{"href":"#id-48","text":"2.","element":"a"}],[{"text":"Further, according to the law of conservation of energy, the ball will keep moving until at some ","element":"span"},{"style":{"height":17.74},"width":371.22,"height":44.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-65.png","element":"img","alt":"˜t, f(x0+td)−f(x1) =","inline":true,"padRight":true},{"text":"0, ","element":"span"},{"text":"in which case ","element":"span"},{"style":{"height":16},"width":167.53,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-66.png","element":"img","alt":" F ′(a) = 0","inline":true},{"text":". 
This means that the ball falls into an attraction basin with a smaller criterion than ","element":"span"},{"style":{"height":16},"width":95.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/5-67.png","element":"img","alt":" f(x1)","inline":true,"padRight":true},{"text":"as shown in Fig. ","element":"span"},{"href":"#id-50","text":"4(","element":"a"},{"text":"b). Fig. ","element":"span"},{"href":"#id-50","text":"4(","element":"a"},{"text":"a) shows that when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is small, in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"iterations, the ball reaches some ","element":"span"},{"style":{"height":12.39},"width":41.39,"height":30.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-0.png","element":"img","alt":" tN","inline":true,"padRight":true},{"text":"but ","element":"span"},{"style":{"height":23.36},"width":394.86,"height":58.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-1.png","element":"img","alt":" F ′(a) ≈ ∫_{t1}^{tN} (g′(t)) > 0","inline":true},{"text":".","element":"span"}],[{"text":"Moreover, if there are no smaller local minimizers in the search region, the ball will keep going until it rolls outside the restricted search region bounded by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"as shown in Fig. ","element":"span"},{"href":"#id-50","text":"4(","element":"a"},{"text":"e), which means that Alg. ","element":"span"},{"href":"#id-48","text":"2 ","element":"a"},{"text":"fails to find ","element":"span"},{"style":{"height":10.98},"width":37.06,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-2.png","element":"img","alt":" a∗","inline":true},{"text":". Fig. 
","element":"span"},{"href":"#id-50","text":"4(","element":"a"},{"text":"c)(d) show the cases where there is more than one local minimizer within the search region.","element":"span"}],[{"text":"Once such an ","element":"span"},{"style":{"height":10.99},"width":37.07,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-3.png","element":"img","alt":" a∗","inline":true,"padRight":true},{"text":"has been found, the corresponding ","element":"span"},{"style":{"height":9.19},"width":49.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-4.png","element":"img","alt":" xN","inline":true,"padRight":true},{"text":"will enter an attraction basin of a local minimum with a smaller criterion than ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-5.png","element":"img","alt":" x0","inline":true},{"text":". If we cannot find such an ","element":"span"},{"style":{"height":10.98},"width":37.07,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-6.png","element":"img","alt":" a∗","inline":true,"padRight":true},{"text":"in the direction of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"within a distance ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":9.19},"width":38.77,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-7.png","element":"img","alt":" x0","inline":true},{"text":", we conclude that there is no other smaller local minimum along ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":". In that case, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"is non-promising. 
We thus set a negative score for it as shown in line ","element":"span"},{"href":"#id-48","text":"8 ","element":"a"},{"text":"of Alg. ","element":"span"},{"href":"#id-48","text":"2.","element":"a"}],[{"text":"Running line ","element":"span"},{"href":"#id-48","text":"5 ","element":"a"},{"text":"of Alg. ","element":"span"},{"href":"#id-48","text":"2 ","element":"a"},{"text":"requires computing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"gradients of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"at each iteration. This makes Alg. ","element":"span"},{"href":"#id-48","text":"2 ","element":"a"},{"text":"time-consuming. We hereby propose to accelerate this procedure by fixing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"and instead finding a suitable number of iterations. Alg. ","element":"span"},{"href":"#id-51","text":"3 ","element":"a"},{"text":"summarizes the fast policy. Given a direction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":", the learning rate ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-8.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"and the escaping length controller ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"are kept fixed during the search. At each iteration of Alg. 
","element":"span"},{"href":"#id-51","text":"3, ","element":"a"},{"text":"an iterate ","element":"span"},{"style":{"height":9.19},"width":33.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-9.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"is obtained by applying gradient descent to ","element":"span"},{"style":{"height":16},"width":90.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-10.png","element":"img","alt":"˜H(x)","inline":true},{"text":". At ","element":"span"},{"style":{"height":9.19},"width":33.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-11.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"the gradient of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"is computed (line ","element":"span"},{"href":"#id-51","text":"6)","element":"a"},{"text":". ","element":"span"},{"style":{"height":22.4},"width":543.83,"height":55.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-12.png","element":"img","alt":"Qi = Σ_{j=1}^{i} ∇f(xj)⊺(xj − xj−1)","inline":true,"padRight":true},{"text":"is computed (line ","element":"span"},{"href":"#id-51","text":"7)","element":"a"},{"text":". Alg. 
","element":"span"},{"href":"#id-51","text":"3 ","element":"a"},{"text":"terminates if there is an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":" such that ","element":"span"},{"style":{"height":14},"width":127.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-13.png","element":"img","alt":" Qi > 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":168.32,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-14.png","element":"img","alt":" Qi−1 < 0","inline":true,"padRight":true},{"text":"or if the search goes beyond the bound. During the search, each iteration requires computing the gradient only once, which can significantly reduce the computational cost in comparison with Alg. ","element":"span"},{"href":"#id-48","text":"2.","element":"a"}],[{"id":"id-51","style":{"width":"100%"},"width":1005,"height":739,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-15.png","element":"img"}],[{"text":"Alg. ","element":"span"},{"href":"#id-51","text":"3 ","element":"a"},{"text":"aims to find an integer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"such that ","element":"span"},{"style":{"height":14},"width":158.7,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-16.png","element":"img","alt":" Qi−1 < 0","inline":true,"padRight":true},{"text":"but ","element":"span"},{"style":{"height":14},"width":86.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-17.png","element":"img","alt":" Qi >","inline":true,"padRight":true},{"text":"0","element":"span"},{"text":". The existence of such an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"can be illustrated as follows. 
It is seen that ","element":"span"},{"style":{"height":16},"width":213.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-18.png","element":"img","alt":" Qi = aF ′(a)","inline":true,"padRight":true},{"text":"(please see Eq. ","element":"span"},{"href":"#id-52","text":"31 ","element":"a"},{"text":"in Appendix B). This implies that ","element":"span"},{"style":{"height":23.36},"width":269.58,"height":58.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-19.png","element":"img","alt":" Qi = ∫_{t1}^{ti} g′(t)dt","inline":true},{"text":". When ","element":"span"},{"style":{"height":11.2},"width":107.68,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-20.png","element":"img","alt":" α → 0","inline":true},{"text":", there exist ","element":"span"},{"style":{"height":12.79},"width":29.73,"height":31.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-21.png","element":"img","alt":" i1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.79},"width":29.73,"height":31.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-22.png","element":"img","alt":" i2","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":16},"width":208.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-23.png","element":"img","alt":" |ti1 − ξ| < ε","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":217.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-24.png","element":"img","alt":" |ti2 − T| < ε","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":11.6},"width":93.66,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-25.png","element":"img","alt":" ε > 0","inline":true},{"text":", and 
","element":"span"},{"style":{"height":16},"width":175.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-26.png","element":"img","alt":" Q(i1) < 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":175.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-27.png","element":"img","alt":" Q(i2) > 0","inline":true},{"text":". Thus, there exists an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"such that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"0 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":16},"width":218,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-28.png","element":"img","alt":" Q(i − 1) < 0","inline":true},{"text":".","element":"span"}],[{"text":"Corollary ","element":"span"},{"href":"#id-53","text":"1 ","element":"a"},{"text":"proves that if there exists a better local minimum ","element":"span"},{"style":{"height":6.8},"width":36.78,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-29.png","element":"img","alt":"x′","inline":true,"padRight":true},{"text":"along ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":", then applying Alg. ","element":"span"},{"href":"#id-48","text":"2 ","element":"a"},{"text":"or Alg. 
","element":"span"},{"href":"#id-51","text":"3, ","element":"a"},{"text":"we are able to escape from the local attraction basin of ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-30.png","element":"img","alt":" x0","inline":true},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"3) Policy on the sampling of promising directions: ","element":"span"},{"text":"In the following, we show how to sample directions that have a high probability of being promising. We first present a fixed policy, and then propose to learn an optimal policy by policy gradient.","element":"span"}],[{"id":"id-54","style":{"width":"100%"},"width":1006,"height":1035,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-31.png","element":"img"}],[{"text":"Alg. ","element":"span"},{"href":"#id-54","text":"4 ","element":"a"},{"text":"summarizes the fixed policy method. In Alg. ","element":"span"},{"href":"#id-54","text":"4, ","element":"a"},{"text":"a set of directions is first sampled uniformly at random (line ","element":"span"},{"href":"#id-54","text":"1)","element":"a"},{"text":". Their scores are computed by Alg. ","element":"span"},{"href":"#id-48","text":"2 ","element":"a"},{"text":"or Alg. ","element":"span"},{"href":"#id-51","text":"3. ","element":"a"},{"text":"The archives used to store the directions and starting points are initialized (line ","element":"span"},{"href":"#id-54","text":"2)","element":"a"},{"text":". A direction is sampled by using a linear combination of previous directions with their respective scores as coefficients (line ","element":"span"},{"href":"#id-54","text":"4)","element":"a"},{"text":". If the sampled direction has a positive score, its score and the obtained starting point are included in the archive. 
The sets of scores and directions are updated accordingly in a FIFO manner (lines ","element":"span"},{"href":"#id-54","text":"9-","element":"a"},{"href":"#id-54","text":"10)","element":"a"},{"text":". The algorithm terminates if the number of samples exceeds ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":".","element":"span"}],[{"text":"We hope that the developed sampling algorithm is more efficient than random sampling at finding promising directions. Let ","element":"span"},{"style":{"height":13.19},"width":40.58,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-32.png","element":"img","alt":" Pr","inline":true,"padRight":true},{"text":"denote the probability of finding a promising direction by random sampling, and let ","element":"span"},{"style":{"height":13.19},"width":39.58,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-33.png","element":"img","alt":" Pc","inline":true,"padRight":true},{"text":"be the corresponding probability under the fixed policy. Appendix C explains why ","element":"span"},{"style":{"height":13.19},"width":135.54,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-34.png","element":"img","alt":" Pc > Pr","inline":true},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"4) The transition: ","element":"span"},{"text":"In our MDP model, the transition probability ","element":"span"},{"style":{"height":16},"width":221.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-35.png","element":"img","alt":" p(st+1|st, at)","inline":true,"padRight":true},{"text":"is deterministic. 
The determination of a new starting point depends on the sampling of a new direction ","element":"span"},{"style":{"height":13.19},"width":32.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-36.png","element":"img","alt":"dt","inline":true,"padRight":true},{"text":"and its score ","element":"span"},{"style":{"height":9.19},"width":34.81,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-37.png","element":"img","alt":" ut","inline":true},{"text":". The new state ","element":"span"},{"style":{"height":10.79},"width":71.18,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-38.png","element":"img","alt":" st+1","inline":true,"padRight":true},{"text":"is then updated in a FIFO manner. That is, at each time step, the first element ","element":"span"},{"style":{"height":16},"width":135.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-39.png","element":"img","alt":" (u1, d1)","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"height":9.19},"width":30.68,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-40.png","element":"img","alt":" st","inline":true,"padRight":true},{"text":"is replaced by the newly sampled ","element":"span"},{"style":{"height":16},"width":128.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/6-41.png","element":"img","alt":" (ut, dt)","inline":true},{"text":".","element":"span"}],[{"text":"All the proofs in this section are given in Appendix B.","element":"span"}],[{"id":"id-33","style":{"fontStyle":"italic"},"text":"C. Learning the Escaping Policy by Policy Gradient","element":"span"}],[{"text":"In the presented policy, a linear combination of previous directions with their scores as coefficients is applied to sample a new direction. 
However, this policy is not necessarily optimal.","element":"span"}],[{"style":{"width":"98%"},"width":2031,"height":364,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-0.png","element":"img"}],[{"text":"Fig. 4. ","element":"figcaption","subtype":"caption"},{"id":"id-50","text":"Possible scenarios encountered when estimating ","element":"figcaption","subtype":"caption"},{"style":{"height":8.9},"width":32.99,"height":22.25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-1.png","element":"img","alt":" a∗","inline":true},{"text":". (a) shows the case when ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"a ","element":"figcaption","subtype":"caption"},{"text":"is not large enough, while (b) shows the case when ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"a ","element":"figcaption","subtype":"caption"},{"text":"is appropriate. (c) shows that ","element":"figcaption","subtype":"caption"},{"style":{"height":10.95},"width":227.31,"height":27.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-2.png","element":"img","alt":" xN = x0 + tNd","inline":true,"padRight":true},{"text":"reaches a local minimum, but ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":526.86,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-3.png","element":"img","alt":" F ′(a) ̸= 0 because g(tN) − g(t1) ̸= 0","inline":true},{"text":"; (d) shows the case when there is more than one local minimizer. 
(e) shows when there is no smaller local minimizer within ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":246.8,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-4.png","element":"img","alt":" ∥xN − x0∥ ≥ M.","inline":true}],[{"text":"In this section, we propose to learn an optimal policy by the policy gradient algorithm ","element":"span"},{"href":"#id-40","referenceIndex":35,"text":"[35]","element":"a"},{"text":".","element":"span"}],[{"text":"The learning is based on the same foregoing MDP framework. The goal is to learn the optimal coefficients for combining previously sampled directions. We assume that at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", the coefficients are obtained as follows:","element":"span"}],[{"id":"id-62","style":{"width":"36%"},"width":364,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":346.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-6.png","element":"img","alt":" ut = [u1, · · · , uN0]⊺","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":448.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-7.png","element":"img","alt":" dt = [d1, · · · , dN0]. 
mt ∈","inline":true},{"style":{"height":13.39},"width":68.02,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-8.png","element":"img","alt":"RN0","inline":true,"padRight":true},{"text":"is the output of a feed-forward neural network ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"text":"with parameter ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-9.png","element":"img","alt":" θ","inline":true},{"text":", and ","element":"span"},{"style":{"height":15.78},"width":175.2,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-10.png","element":"img","alt":" wt ∈ RN0","inline":true,"padRight":true},{"text":"is the vector of coefficients. The current state ","element":"span"},{"style":{"height":9.19},"width":30.68,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-11.png","element":"img","alt":" st","inline":true,"padRight":true},{"text":"is the composition of ","element":"span"},{"style":{"height":9.59},"width":37.46,"height":23.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-12.png","element":"img","alt":" ut","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":37.46,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-13.png","element":"img","alt":" dt","inline":true},{"text":".","element":"span"}],[{"text":"Fig. ","element":"span"},{"href":"#id-55","text":"5 ","element":"a"},{"text":"shows the framework of estimating the coefficients and sampling a new direction at a certain time step. 
For the next time step, ","element":"span"},{"style":{"height":11.19},"width":77.96,"height":27.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-14.png","element":"img","alt":" ut+1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.79},"width":77.97,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-15.png","element":"img","alt":" dt+1","inline":true,"padRight":true},{"text":"are updated","element":"span"}],[{"style":{"width":"87%"},"width":879,"height":181,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-16.png","element":"img"}],[{"text":"The policy gradient algorithm is used to learn ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-17.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"for the neural network ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":". 
We assume that ","element":"span"},{"style":{"height":16.41},"width":277.46,"height":41.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-18.png","element":"img","alt":" φ(st; θ) = w⊺t dt","inline":true,"padRight":true},{"text":"and the ","element":"span"},{"text":"policy can be stated as follows:","element":"span"}],[{"style":{"width":"52%"},"width":523,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-19.png","element":"img"}],[{"text":"The reward is defined to be","element":"span"}],[{"style":{"width":"92%"},"width":928,"height":136,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-20.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.4},"width":137.71,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-21.png","element":"img","alt":" γ = 0.9","inline":true,"padRight":true},{"text":"is a constant, ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-22.png","element":"img","alt":" ˜u","inline":true,"padRight":true},{"text":"is the score of the sampled direction ","element":"span"},{"style":{"height":15.01},"width":27.05,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-23.png","element":"img","alt":"˜d","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":58.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-24.png","element":"img","alt":" I(·)","inline":true,"padRight":true},{"text":"is the indicator function.","element":"span"}],[{"text":"Alg. 
","element":"span"},{"href":"#id-56","text":"5 ","element":"a"},{"text":"summarizes the policy gradient learning procedure for ","element":"span"},{"style":{"height":11.2},"width":62.22,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-25.png","element":"img","alt":"θ. θ","inline":true,"padRight":true},{"text":"is updated in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"epochs. At each epoch, a sample of trajectories is first obtained (lines ","element":"span"},{"href":"#id-57","text":"3-","element":"a"},{"href":"#id-58","text":"23)","element":"a"},{"text":". Given ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-26.png","element":"img","alt":" x0","inline":true},{"text":", a trajectory can be sampled as follows. First, a set of ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-27.png","element":"img","alt":" N0","inline":true,"padRight":true},{"text":"initial directions is randomly generated and their scores are computed by Alg. ","element":"span"},{"href":"#id-51","text":"3 ","element":"a"},{"text":"(lines ","element":"span"},{"href":"#id-57","text":"6-","element":"a"},{"href":"#id-57","text":"7)","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"new directions and their corresponding scores are then obtained (lines ","element":"span"},{"href":"#id-57","text":"9-","element":"a"},{"href":"#id-58","text":"22)","element":"a"},{"text":". 
At each step, the obtained direction ","element":"span"},{"style":{"height":15.01},"width":27.05,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-28.png","element":"img","alt":"˜d","inline":true},{"text":", the policy function ","element":"span"},{"style":{"height":16},"width":125.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-29.png","element":"img","alt":" φ(st; θ)","inline":true,"padRight":true},{"text":"and the reward ","element":"span"},{"style":{"height":10.79},"width":70.48,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-30.png","element":"img","alt":" rt+1","inline":true,"padRight":true},{"text":"are gathered in the current trajectory ","element":"span"},{"style":{"height":13.19},"width":51.29,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-31.png","element":"img","alt":" Tm","inline":true,"padRight":true},{"text":"(line ","element":"span"},{"href":"#id-59","text":"20)","element":"a"},{"text":". 
After the trajectory sampling, ","element":"span"},{"style":{"height":11.6},"width":52.21,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-32.png","element":"img","alt":" ∆θ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":28.7,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-33.png","element":"img","alt":" θl","inline":true,"padRight":true},{"text":"are updated in lines ","element":"span"},{"href":"#id-59","text":"24- ","element":"a"},{"href":"#id-60","text":"28 ","element":"a"},{"text":"and line ","element":"span"},{"href":"#id-61","text":"29, ","element":"a"},{"text":"respectively.","element":"span"}],[{"id":"id-56","style":{"fontWeight":"bold"},"text":"Algorithm 5 ","element":"span"},{"text":"Training policy network with policy gradient","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Require: ","element":"span"},{"text":"a local minimum ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-34.png","element":"img","alt":" x0","inline":true},{"text":", an integer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P > ","element":"span"},{"text":"0","element":"span"},{"text":", the number of training epochs ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E","element":"span"},{"text":", the number of trajectories ","element":"span"},{"style":{"height":13.2},"width":145.9,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-35.png","element":"img","alt":" NT , σ >","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"text":"and learning rate ","element":"span"},{"style":{"height":14.4},"width":97.78,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-36.png","element":"img","alt":" β > 
0","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Ensure: ","element":"span"},{"text":"the optimal network parameter ","element":"span"},{"style":{"height":10.99},"width":35.82,"height":27.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-37.png","element":"img","alt":" θ∗","inline":true},{"text":".","element":"span"}],[{"id":"id-57","style":{"width":"97%"},"width":982,"height":517,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-38.png","element":"img"}],[{"text":"10: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"// sample new direction","element":"span"}],[{"style":{"width":"98%"},"width":994,"height":137,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-39.png","element":"img"}],[{"text":"13: ","element":"span"},{"text":"apply Alg. ","element":"span"},{"href":"#id-51","text":"3 ","element":"a"},{"text":"to obtain ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-40.png","element":"img","alt":" ˜u","inline":true},{"text":";","element":"span"}],[{"text":"14: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"// update the state","element":"span"}],[{"text":"15: ","element":"span"},{"text":"set ","element":"span"},{"style":{"height":10.79},"width":70.48,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-41.png","element":"img","alt":" rt+1","inline":true,"padRight":true},{"text":"by Eq. 
","element":"span"},{"href":"#id-62","text":"23","element":"a"}],[{"id":"id-59","style":{"width":"99%"},"width":995,"height":408,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-42.png","element":"img"}],[{"id":"id-58","text":"24: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"// policy gradient","element":"span"}],[{"style":{"width":"22%"},"width":225,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-43.png","element":"img"}],[{"text":"26: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"for ","element":"span"},{"style":{"height":13.19},"width":196.28,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-44.png","element":"img","alt":" m = 1 : NT","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"do","element":"span"}],[{"style":{"width":"99%"},"width":998,"height":172,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-45.png","element":"img"}],[{"id":"id-60","text":"28: ","element":"span"},{"id":"id-61","style":{"fontWeight":"bold"},"text":"end for","element":"span"}],[{"text":"29: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"θ","element":"span"},{"style":{"height":15.59},"width":243,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-46.png","element":"img","alt":"l+1 = θl + β∆","inline":true},{"style":{"fontStyle":"italic"},"text":"θ","element":"span"},{"text":";","element":"span"}],[{"style":{"width":"100%"},"width":1005,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/7-47.png","element":"img"}],[{"style":{"width":"99%"},"width":999,"height":286,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-0.png","element":"img"}],[{"id":"id-55","text":"Fig. 5. 
The framework of the escaping policy at the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"t","element":"figcaption","subtype":"caption"},{"text":"-th time step. The black solid line is the policy proposed in section ","element":"figcaption","subtype":"caption"},{"href":"#id-32","text":"IV-B. ","element":"a","subtype":"caption"},{"text":"The red dashed line covers the feed-forward network and its output.","element":"figcaption","subtype":"caption"}],[{"style":{"fontStyle":"italic"},"text":"D. Learning to be Global Optimizer","element":"span"}],[{"text":"Combining the proposed local search algorithm and the escaping policy, we can form a global optimization algorithm. Alg. ","element":"span"},{"href":"#id-63","text":"6 ","element":"a"},{"text":"summarizes the algorithm, named L","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-1.png","element":"img","alt":"2","inline":true},{"text":"GO. Starting from an initial point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":", Gd-Net is first applied to obtain a local minimizer (line ","element":"span"},{"href":"#id-63","text":"2)","element":"a"},{"text":". The escaping policy is applied to sample ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"new starting points (line ","element":"span"},{"href":"#id-63","text":"5)","element":"a"},{"text":". Gd-Net is then applied on these points (line ","element":"span"},{"href":"#id-63","text":"9)","element":"a"},{"text":". The algorithm terminates if the preset maximum number of escape attempts (i.e., 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":") has been reached (line ","element":"span"},{"href":"#id-63","text":"4)","element":"a"},{"text":", or no new promising directions can be sampled (line ","element":"span"},{"href":"#id-63","text":"6)","element":"a"},{"text":". If either of these conditions is met, it is assumed that a global optimum has been found.","element":"span"}],[{"id":"id-63","style":{"width":"100%"},"width":1005,"height":1020,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Note. ","element":"span"},{"text":"We highlight that our method improves on some filled function methods in the sense that it has more chances of escaping from a local optimum. For example, consider the following filled function ","element":"span"},{"href":"#id-64","referenceIndex":37,"text":"[37]","element":"a"},{"text":":","element":"span"}],[{"style":{"width":"92%"},"width":932,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-3.png","element":"img"}],[{"text":"A stationary point ","element":"span"},{"style":{"height":9.19},"width":54.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-4.png","element":"img","alt":" xfill","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"usually exists. However, ","element":"span"},{"style":{"height":9.19},"width":54.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-5.png","element":"img","alt":" xfill","inline":true,"padRight":true},{"text":"can be a saddle point or a local optimizer. 
If ","element":"span"},{"style":{"height":9.19},"width":54.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-6.png","element":"img","alt":" xfill","inline":true,"padRight":true},{"text":"is a saddle point, then escaping from ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-7.png","element":"img","alt":" x0","inline":true},{"text":" is possible only by searching along ","element":"span"},{"style":{"height":13.19},"width":257.26,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-8.png","element":"img","alt":" dfill = xfill − x0","inline":true},{"text":". However, it is highly unlikely that ","element":"span"},{"style":{"height":13.19},"width":52.01,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-9.png","element":"img","alt":" dfill","inline":true,"padRight":true},{"text":"is contained in the pre-fixed direction set of traditional filled function methods. This indicates that the corresponding filled function method will fail.","element":"span"}],[{"text":"On the other hand, if ","element":"span"},{"style":{"height":9.19},"width":54.04,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-10.png","element":"img","alt":" xfill","inline":true,"padRight":true},{"text":"is a local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":")","element":"span"},{"text":", we can prove that the proposed policy can always find a promising solution. Theorem ","element":"span"},{"href":"#id-65","text":"4 ","element":"a"},{"text":"summarizes the result.","element":"span"}],[{"id":"id-65","style":{"fontWeight":"bold"},"text":"Theorem 4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose there exists an attraction basin of ","element":"span"},{"style":{"height":11.59},"width":52.48,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-11.png","element":"img","alt":" xfill","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"on the domain of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", denoted as ","element":"span"},{"style":{"height":15.59},"width":59.93,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-12.png","element":"img","alt":" Bfill","inline":true},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":15.59},"width":167.57,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-13.png","element":"img","alt":" ∀ x ∈ Bfill","inline":true},{"style":{"fontStyle":"italic"},"text":", we have ","element":"span"},{"style":{"height":13.19},"width":114.53,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-14.png","element":"img","alt":" ud > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":13.19},"width":184.14,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-15.png","element":"img","alt":" d = x − x0","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We first prove that ","element":"span"},{"style":{"height":14},"width":497.37,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-16.png","element":"img","alt":" ∀x ∈ Bfill, d = x − x0, ∃t ∈ R","inline":true},{"text":", s.t. 
","element":"span"},{"style":{"height":16},"width":363.05,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-17.png","element":"img","alt":"f(x0 + t · d) < f(x0)","inline":true},{"text":". This can be done by contradiction. If there is no such ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", then","element":"span"}],[{"id":"id-66","style":{"width":"72%"},"width":727,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-18.png","element":"img"}],[{"text":"This is because that ","element":"span"},{"style":{"height":14.79},"width":163.48,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-19.png","element":"img","alt":" ∀t ∈ R++","inline":true},{"text":", we have ","element":"span"},{"style":{"height":16},"width":337.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-20.png","element":"img","alt":" f(x0+t·d) ≥ f(x0)","inline":true},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"degenerates to ","element":"span"},{"style":{"height":17.38},"width":178.16,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-21.png","element":"img","alt":" −a∥t · d∥2","inline":true},{"text":". Eq. 
","element":"span"},{"href":"#id-66","text":"25 ","element":"a"},{"text":"implies that apply gradient descent from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"will not lead to a point in ","element":"span"},{"style":{"height":13.19},"width":61.49,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-22.png","element":"img","alt":"Bfill","inline":true},{"text":". This contradicts our assumption that ","element":"span"},{"style":{"height":13.19},"width":132.97,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-23.png","element":"img","alt":" x ∈ Bfill","inline":true},{"text":".","element":"span"}],[{"text":"The existence of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"implies ","element":"span"},{"style":{"height":13.19},"width":114.53,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-24.png","element":"img","alt":" ud > 0","inline":true,"padRight":true},{"text":"by Theorem ","element":"span"},{"href":"#id-67","text":"2.","element":"a"}]]},{"heading":"V. EXPERIMENT RESULTS","paragraphs":[[{"id":"id-34","text":"In this section, we study the numerical performance of Gd- ","element":"span"},{"text":"net, the escaping policies, and L","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-25.png","element":"img","alt":"2","inline":true},{"text":"GO.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"A. Model-driven Local Search","element":"span"}],[{"text":"This section investigates the performance of Gd-Net. In the experiments, 50 d-Net blocks are used. 
Parameters of these blocks are the same.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training. ","element":"span"},{"text":"d-Net is trained by minimizing the Monte Carlo approximation to the loss functions as defined in Eq. ","element":"span"},{"href":"#id-68","text":"17, ","element":"a"},{"text":"in which a sample of the Gaussian family ","element":"span"},{"style":{"height":14.79},"width":47.64,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-26.png","element":"img","alt":" FG","inline":true,"padRight":true},{"text":"is used. In the experiments, we use ten 2-d Gaussian functions with positive-definite covariance matrices as the training functions. d-Net is trained on 25 initial points sampled uniformly at random for each training function. At each layer of d-Net, the step size ","element":"span"},{"style":{"height":9.19},"width":42.49,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-27.png","element":"img","alt":" αk","inline":true,"padRight":true},{"text":"is obtained by exact line search in [0,1] ","element":"span"},{"href":"#id-69","text":"1","element":"a"},{"text":". Gradient descent is used to optimize Eq. ","element":"span"},{"href":"#id-68","text":"17 ","element":"a"},{"text":"with a learning rate of 0.1 for 100 epochs. The same training configuration is used in the experiments that follow. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Testing. 
","element":"span"},{"text":"We use functions sampled from ","element":"span"},{"style":{"height":14.79},"width":47.64,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-28.png","element":"img","alt":" FG","inline":true,"padRight":true},{"text":"in 5-d, and ","element":"span"},{"style":{"height":16.59},"width":40.93,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-29.png","element":"img","alt":" χ2","inline":true},{"text":"-functions","element":"span"},{"href":"#id-70","text":"2 ","element":"a"},{"text":"in 2-d to test Gd-Net. Note that Gd-Net is trained on 2-d ","element":"span"},{"style":{"height":14.79},"width":47.64,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-30.png","element":"img","alt":" FG","inline":true},{"text":". By testing on 5-d Gaussian functions, we can assess its generalization ability on higher-dimensional functions. Testing on ","element":"span"},{"style":{"height":16.58},"width":40.94,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-31.png","element":"img","alt":" χ2","inline":true,"padRight":true},{"text":"functions checks the generalization ability of Gd-Net on functions whose non-symmetric contours differ from those of Gaussians. Fig. 
","element":"span"},{"href":"#id-71","text":"6 ","element":"a"},{"text":"shows the difference between Gaussian and ","element":"span"},{"style":{"height":16.59},"width":40.93,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-32.png","element":"img","alt":"χ2","inline":true,"padRight":true},{"text":"contours.","element":"span"}],[{"text":"1","element":"span"},{"id":"id-69","text":"Note that taking ","element":"span"},{"style":{"height":12.8},"width":160.73,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-33.png","element":"img","alt":" αk ∈ (0, 1]","inline":true,"padRight":true},{"text":"is not necessarily the best choice for line search. It is rather a rule of thumb. Note also that restricting the search for ","element":"span"},{"style":{"height":7.85},"width":37.76,"height":19.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-34.png","element":"img","alt":" αk","inline":true,"padRight":true},{"text":"to (0,1] could make Gd-Net scale-variant. We transform ","element":"span"},{"style":{"height":12.8},"width":277.65,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-35.png","element":"img","alt":"f(x) = f(x)/f(x0)","inline":true,"padRight":true},{"text":"to eliminate the scaling problem, where ","element":"span"},{"style":{"height":10.17},"width":68.59,"height":25.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-36.png","element":"img","alt":" x0 is","inline":true,"padRight":true},{"text":"the initial point when testing.","element":"span"}],[{"id":"id-70","style":{"width":"67%"},"width":674,"height":146,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/8-37.png","element":"img"}],[{"style":{"width":"98%"},"width":993,"height":363,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-0.png","element":"img"}],[{"text":"Fig. 6. 
","element":"figcaption","subtype":"caption"},{"id":"id-71","text":"An illustration of the difference between a Gaussian contour and a ","element":"figcaption","subtype":"caption"},{"style":{"height":13.3},"width":36.03,"height":33.25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-1.png","element":"img","alt":" χ2","inline":true,"padRight":true},{"text":"contour.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"93%"},"width":934,"height":715,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-2.png","element":"img"}],[{"text":"Fig. 7. ","element":"figcaption","subtype":"caption"},{"id":"id-72","text":"The optimization curve of the learned Gd-Net on a 5-d Gaussian ","element":"figcaption","subtype":"caption"},{"text":"function with various initial points.","element":"figcaption","subtype":"caption"}],[{"text":"Fig. ","element":"span"},{"href":"#id-72","text":"7 ","element":"a"},{"text":"shows the test results of the learned Gd-Net when optimizing a 5-d Gaussian function with different initial points. The test on a ","element":"span"},{"style":{"height":16.58},"width":40.93,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-3.png","element":"img","alt":" χ2","inline":true,"padRight":true},{"text":"function is shown in Fig. ","element":"span"},{"href":"#id-73","text":"8. ","element":"a"},{"text":"In these figures, first-order and second-order optimization algorithms, including steepest gradient descent, conjugate descent and BFGS, are used for comparison. From the figures, it is clear that Gd-Net requires far fewer iterations to reach the minimum than the compared algorithms.","element":"span"}],[{"text":"Further, we observed that unlike BFGS, which requires a positive-definite Hessian matrix, Gd-Net can cope with ill-conditioned Hessians. Fig. 
","element":"span"},{"href":"#id-74","text":"9 ","element":"a"},{"text":"shows the results on a 2-d Gaussian function with an ill-conditioned Hessian. For an initial point that is far away from a minimizer, the Hessian is nearly singular, which implies that the search area is rather flat. From the left plot of Fig. ","element":"span"},{"href":"#id-74","text":"9, ","element":"a"},{"text":"it is seen that Gd-Net gradually decreases the criterion, while the other methods fail to make any progress. On the right plot, it is seen that Gd-Net finally moves out of the flat area and the criterion starts decreasing quickly.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"B. The fixed escaping policy","element":"span"}],[{"text":"In this section, controlled experiments are carried out to verify the ability of the fixed policy. We first consider a low-dimensional non-convex optimization problem with two local minimizers, then a high-dimensional, highly non-convex problem with many local minimizers. The fixed policy is compared with random sampling on these test problems.","element":"span"}],[{"style":{"width":"98%"},"width":991,"height":736,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-4.png","element":"img"}],[{"id":"id-73","text":"Fig. 8. The optimization curve of the learned Gd-Net on a 2-d ","element":"figcaption","subtype":"caption"},{"style":{"height":13.3},"width":155.48,"height":33.25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-5.png","element":"img","alt":" χ2 function","inline":true,"padRight":true},{"text":"with two different initial points.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"97%"},"width":975,"height":347,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-6.png","element":"img"}],[{"text":"Fig. 9. 
","element":"figcaption","subtype":"caption"},{"id":"id-74","text":"The optimization procedure of Gd-Net (with 5 blocks) on a 5-d ","element":"figcaption","subtype":"caption"},{"text":"Gaussian function with an initial point far away from the optimum. The left plot shows the decreasing curve obtained by the first 4 blocks, while the right shows the curve of the remaining block.","element":"figcaption","subtype":"caption"}],[{"style":{"fontStyle":"italic"},"text":"1) Mixture of Gaussians functions: ","element":"span"},{"text":"Consider the following mixture of Gaussians functions","element":"span"}],[{"id":"id-96","style":{"width":"92%"},"width":928,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-7.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14},"width":250.74,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-8.png","element":"img","alt":" x ∈ Rn, ci > 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.2},"width":118.9,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-9.png","element":"img","alt":" Σi ⪰ 0","inline":true},{"text":". The mixture of Gaussians function has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"local minimizers at ","element":"span"},{"style":{"height":10},"width":35.01,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-10.png","element":"img","alt":" µi","inline":true},{"text":"’s.","element":"span"}],[{"text":"In the experiments, we set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"= 2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"text":". To test the ability of the escaping scheme, we assume the escaping starts from a local minimizer. 
We test on dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"8","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"10","element":"span"},{"text":". The other algorithmic parameters are ","element":"span"},{"style":{"height":14.4},"width":389.34,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-11.png","element":"img","alt":" N0 = 2, 3, 5, 8, 10, P =","inline":true,"padRight":true},{"text":"15","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"20","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"50","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"100","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"250 ","element":"span"},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"8","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"10","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"respectively, and ","element":"span"},{"style":{"height":6.8},"width":66.27,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-12.png","element":"img","alt":" σ 
=","inline":true},{"style":{"height":14.8},"width":355.42,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/9-13.png","element":"img","alt":"0.1, δ0 = 0.2, N = 20","inline":true},{"text":".","element":"span"}],[{"text":"Table ","element":"span"},{"href":"#id-75","text":"I ","element":"a"},{"text":"shows the average number of samplings used to escape from a local optimum and the standard deviation (in brackets) over 500 runs obtained by using the fixed policy and the random sampling method.","element":"span"}],[{"text":"From the table, it is observed that the fixed policy requires fewer samples than random sampling, and its standard deviation is smaller. The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"-value obtained by applying the rank sum hypothesis test at ","element":"span"},{"text":"5% ","element":"span"},{"text":"significance level is shown in the last column. The hypothesis test suggests that the fixed policy significantly outperforms the random sampling approach (the p-value is less than ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"05","element":"span"},{"text":").","element":"span"}],[{"text":"TABLE I T","element":"figcaption","subtype":"caption"},{"text":"HE NUMBER OF SAMPLINGS USED TO ESCAPE","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"id":"id-75","style":{"width":"75%"},"width":756,"height":298,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/10-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"2) Deep neural network: ","element":"span"},{"text":"The loss function of a deep neural network has many local optimizers. We take the training of a deep neural network for image classification on CIFAR-10 as an example. 
For CIFAR-10, an 8-layer convolutional neural network similar to Le-Net ","element":"span"},{"href":"#id-76","referenceIndex":38,"text":"[38]","element":"a"},{"text":", with 2520-d parameters, is applied. The cross entropy is used as the loss function.","element":"span"}],[{"text":"The number of local minimizers found by a method is used as the metric of comparison. Given a maximal number of attempts ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":", a larger number of local minimizers indicates a higher probability of escaping a local minimum, and hence better performance. For CIFAR-10, ADAM ","element":"span"},{"href":"#id-16","referenceIndex":17,"text":"[17] ","element":"a"},{"text":"with mini-batch stochastic gradient is applied in the minimization phase.","element":"span"}],[{"text":"Note that existing filled functions often involve ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":")","element":"span"},{"text":". This usually makes mini-batch stochastic gradient methods difficult to apply if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"is not a sum of sub-functions. In contrast, the auxiliary function ","element":"span"},{"style":{"height":16},"width":90.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/10-1.png","element":"img","alt":"�H(x)","inline":true,"padRight":true},{"text":"used in this paper does not involve ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":")","element":"span"},{"text":". Fig. 
","element":"span"},{"href":"#id-77","text":"10 ","element":"a"},{"text":"shows the scores (cf. Eq. ","element":"span"},{"href":"#id-78","text":"21) ","element":"a"},{"text":"against the distance to the current local optimum with different mini-batch sizes. From the figure, it is seen that the scores exhibit similar behavior across batch sizes. This shows the applicability of the proposed escaping method to stochastic local search algorithms. In the experiment, the parameters of Alg. ","element":"span"},{"href":"#id-48","text":"2 ","element":"a"},{"text":"are set to ","element":"span"},{"style":{"height":14},"width":784.98,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/10-2.png","element":"img","alt":" N0 = 300, P = 1000, N = 10, σ = 0.01, a = 1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.99},"width":139.72,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/10-3.png","element":"img","alt":"δ0 = 0.5","inline":true},{"text":".","element":"span"}],[{"text":"In the following, the number of effective samplings","element":"span"},{"text":"3 ","element":"span"},{"text":"out of 1000 samplings is used to compare the proposed escaping policy against random sampling. The promising directions obtained with different thresholds in 500 runs are summarized in Table ","element":"span"},{"href":"#id-79","text":"II. 
","element":"a"},{"text":"It is clear that the proposed escaping policy is able to find more samples with positive scores than random sampling.","element":"span"}],[{"text":"TABLE II T","element":"figcaption","subtype":"caption"},{"text":"HE NUMBER OF EFFECTIVE SAMPLINGS IN ","element":"figcaption","subtype":"caption"},{"text":"1000 ","element":"figcaption","subtype":"caption"},{"text":"SAMPLINGS","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"id":"id-79","style":{"width":"101%"},"width":1022,"height":154,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/10-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"C. The fixed policy vs. the learned policy","element":"span"}],[{"text":"In this section, we show the effectiveness of the learned policy. The policy function ","element":"span"},{"style":{"height":16},"width":125.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/10-5.png","element":"img","alt":" φ(st; θ)","inline":true,"padRight":true},{"text":"in Eq. ","element":"span"},{"href":"#id-80","text":"8 ","element":"a"},{"text":"is set as a 3-layer network with sigmoid as the hidden-layer activation function, and a fully-connected output layer with a linear activation function.","element":"span"}],[{"style":{"width":"74%"},"width":746,"height":567,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/10-6.png","element":"img"}],[{"id":"id-77","text":"Fig. 10. The curves of the score against the step size w.r.t. two mini-batches ","element":"figcaption","subtype":"caption"},{"text":"when training a convolutional neural network for CIFAR-10.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Training. ","element":"span"},{"text":"A single Gaussian mixture function with two local minimizers is used for training the policy network in 2-d and 5-d, respectively. 
The other parameters are set to ","element":"span"},{"style":{"height":16},"width":669.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/10-7.png","element":"img","alt":"N0 = 2(5); P = 15(50); NT = 20(50)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"= 30 ","element":"span"},{"text":"for 2 (5)-d. The number of hidden layer units is 5 and 200 for 2-d and 5-d, respectively.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Testing. ","element":"span"},{"text":"To test the learned policy, we also use the Gaussian mixture functions with two local minimizers in 2-d and 5-d. Here we set ","element":"span"},{"style":{"height":14},"width":322.3,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/10-8.png","element":"img","alt":" m = 2, c1 = c2 = 1","inline":true},{"text":". Table ","element":"span"},{"href":"#id-81","text":"III ","element":"a"},{"text":"shows the average number of samplings used to escape from a local optimum and the standard deviation (in brackets) over 500 runs obtained by using the fixed policy, the learned policy and random sampling. Detailed configurations of the functions used for training can be found in Appendix D.","element":"span"}],[{"text":"TABLE III ","element":"figcaption","subtype":"caption"},{"id":"id-81","text":"T","element":"figcaption","subtype":"caption"},{"text":"HE AVERAGE NUMBER OF SAMPLINGS USED TO FIND THE PROMISING DIRECTION IN DIFFERENT SETTINGS","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"width":"80%"},"width":810,"height":405,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/10-9.png","element":"img"}],[{"text":"From the table, it is clear that the learned policy requires fewer samplings to reach a new local optimum. 
To better observe the behavior of the compared escaping policies, Fig. ","element":"span"},{"href":"#id-82","text":"11 ","element":"a"},{"text":"shows the histograms of the number of effective samplings for a 2-D function. From the figure, we see that the fixed policy is most likely to escape the current local optimum in one sampling, but it is also quite likely to require many more samplings. That is, the number of effective samplings by the fixed policy follows a heavy-tailed distribution. For the learned policy, the effective sampling numbers are mostly concentrated in the first 7 samplings. This shows that the learned policy is more robust than the other policies, which can also be confirmed in Table ","element":"span"},{"href":"#id-81","text":"III ","element":"a"},{"text":"by the standard deviations. We may","element":"span"}],[{"style":{"width":"80%"},"width":804,"height":639,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-0.png","element":"img"}],[{"text":"Fig. 11. ","element":"figcaption","subtype":"caption"},{"id":"id-82","text":"The histogram of the number of effective samplings for a 2-D ","element":"figcaption","subtype":"caption"},{"text":"Gaussian mixture function by the fixed policy, the learned policy and the random sampling policy.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"87%"},"width":875,"height":486,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-1.png","element":"img"}],[{"id":"id-83","text":"Fig. 12. 
The running procedure of L","element":"figcaption","subtype":"caption"},{"style":{"height":6.4},"width":15,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-2.png","element":"img","alt":"2","inline":true},{"text":"GO, in which GD, BFGS and Gd-Net are used for local search, while the learned policy is used to escape from local minima (shown as the pink dotted line).","element":"figcaption","subtype":"caption"}],[{"text":"thus conclude that the learned policy is more efficient than the fixed policy and random sampling.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"D. The global search ability","element":"span"}],[{"text":"In this section, we study the global search ability of L","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-3.png","element":"img","alt":"2","inline":true},{"text":"GO in comparison with the filled function method proposed in ","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"[8]","element":"a"},{"text":". Three examples, namely the three-hump function, robust regression and a neural network classifier, are used as benchmarks. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"1) Three-hump function: ","element":"span"},{"text":"The function is defined as","element":"span"}],[{"style":{"width":"70%"},"width":707,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-4.png","element":"img"}],[{"text":"It has three local minimizers at ","element":"span"},{"style":{"height":16},"width":386.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-5.png","element":"img","alt":" [−1.73, −0.87]⊺, [0, 0]⊺","inline":true},{"text":", and ","element":"span"},{"style":{"height":16},"width":199.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-6.png","element":"img","alt":"[1.73, 0.87]⊺","inline":true},{"text":". The global minimizer is at ","element":"span"},{"style":{"height":16},"width":97.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-7.png","element":"img","alt":" [0, 0]⊺","inline":true},{"text":". In our test, the algorithm parameters are set to ","element":"span"},{"style":{"height":14.8},"width":489.74,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-8.png","element":"img","alt":" N0 = 2, P = 15, σ = 0.1, δ =","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"2 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= 20","element":"span"},{"text":". We run L","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-9.png","element":"img","alt":"2","inline":true},{"text":"GO 20 times with different initial points. Fig. 
","element":"span"},{"href":"#id-83","text":"12 ","element":"a"},{"text":"shows the averaged optimization process of L","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-10.png","element":"img","alt":"2","inline":true},{"text":"GO. From the figure, it is seen that L","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-11.png","element":"img","alt":"2","inline":true},{"text":"GO is able to reach the local minimizers one by one. It is also seen that Gd-Net performs better than BFGS and steepest descent.","element":"span"}],[{"text":"Table ","element":"span"},{"href":"#id-84","text":"IV ","element":"a"},{"text":"shows the number of effective samplings when using the fixed policy, random sampling and the learned policy as the escaping scheme. It is seen that the filled function method fails due to the existence of the saddle point shown in Fig. ","element":"span"},{"href":"#id-47","text":"3.","element":"a"}],[{"text":"TABLE IV ","element":"figcaption","subtype":"caption"},{"id":"id-84","text":"T","element":"figcaption","subtype":"caption"},{"text":"HE NUMBER OF SAMPLINGS USED TO FIND THE PROMISING DIRECTIONS FOR THE THREE","element":"figcaption","subtype":"caption"},{"text":"-","element":"figcaption","subtype":"caption"},{"text":"HUMP FUNCTION","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"width":"93%"},"width":938,"height":727,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-12.png","element":"img"}],[{"text":"Fig. 13. 
","element":"figcaption","subtype":"caption"},{"id":"id-85","text":"The contour of the robust regression function with ","element":"figcaption","subtype":"caption"},{"style":{"height":7.77},"width":95.97,"height":19.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-13.png","element":"img","alt":" w1 =","inline":true},{"style":{"height":12.8},"width":542.22,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-14.png","element":"img","alt":"(−8, −8), w2 = (5, 5) and b1 = b2 = 0","inline":true},{"text":", respectively.","element":"figcaption","subtype":"caption"}],[{"style":{"fontStyle":"italic"},"text":"2) Robust regression: ","element":"span"},{"text":"For the robust linear regression problem ","element":"span"},{"href":"#id-19","referenceIndex":20,"text":"[20]","element":"a"},{"text":", a popular choice of the loss function is the Geman-McClure estimator, which can be written as follows:","element":"span"}],[{"style":{"width":"92%"},"width":928,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.58},"width":294.32,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-16.png","element":"img","alt":" w ∈ Rd, b ∈ R","inline":true,"padRight":true},{"text":"represent the weights and bias, respectively. 
","element":"span"},{"style":{"height":15.77},"width":130.53,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-17.png","element":"img","alt":" xi ∈ Rd","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":110.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-18.png","element":"img","alt":" yi ∈ R","inline":true,"padRight":true},{"text":"are the feature vector and label of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th instance, and ","element":"span"},{"style":{"height":11.6},"width":94.95,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-19.png","element":"img","alt":" c ∈ R","inline":true,"padRight":true},{"text":"is a constant that modulates the shape of the loss function.","element":"span"}],[{"text":"The landscape of the robust regression problem can be systematically controlled. Specifically, we can readily decide the number of local minimizers, their locations and their criteria. 
Note that given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"style":{"fontStyle":"italic"},"text":"w, b","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":", the training data can be created by","element":"span"}],[{"style":{"width":"65%"},"width":658,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-20.png","element":"img"}],[{"text":"Different choices of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"style":{"fontStyle":"italic"},"text":"w, b","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"correspond to different local minimizers.","element":"span"}],[{"text":"In our experiments, we randomly sample 50 points ","element":"span"},{"style":{"height":17.19},"width":242.76,"height":42.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-21.png","element":"img","alt":"xj ∼ N(0, I)","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"height":13.38},"width":44.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-22.png","element":"img","alt":" R2","inline":true},{"text":", and divide them into two sets ","element":"span"},{"style":{"height":13.19},"width":97.51,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-23.png","element":"img","alt":" S1 =","inline":true},{"style":{"height":16.79},"width":304.71,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-24.png","element":"img","alt":"{xj, 1 ≤ i ≤ 10}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.79},"width":433.64,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-25.png","element":"img","alt":" S2 = {xj, 11 ≤ i ≤ 50}","inline":true},{"text":". 
For each set ","element":"span"},{"style":{"height":14},"width":215.03,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-26.png","element":"img","alt":" Si, 1 ≤ i ≤ 2","inline":true},{"text":", given ","element":"span"},{"style":{"height":16},"width":129.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-27.png","element":"img","alt":" {wi, bi}","inline":true},{"text":" and applying ","element":"span"},{"style":{"height":17.2},"width":313.17,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-28.png","element":"img","alt":" yj = w⊺i xj +bi +ϵ","inline":true},{"text":", a ","element":"span"},{"text":"training set ","element":"span"},{"style":{"height":13.99},"width":34,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-29.png","element":"img","alt":" Ti","inline":true,"padRight":true},{"text":"can be obtained. Combining them, we obtain the whole data set ","element":"span"},{"style":{"height":13.99},"width":148.9,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-30.png","element":"img","alt":" T = ∪Ti","inline":true},{"text":". Given this training set, it is known that the objective function has two obvious local minimizers at ","element":"span"},{"style":{"height":16},"width":130.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-31.png","element":"img","alt":" (w1, b1)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":130.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-32.png","element":"img","alt":" (w2, b2)","inline":true,"padRight":true},{"text":"and many other local minimizers. See Fig. 
","element":"span"},{"href":"#id-85","text":"13 ","element":"a"},{"text":"for the contour of the robust regression function with ","element":"span"},{"style":{"height":16},"width":487.59,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-33.png","element":"img","alt":" w1 = (−8, −8), w2 = (5, 5)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":218.26,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-34.png","element":"img","alt":" b1 = b2 = 0","inline":true},{"text":". There are two main local minimizers at ","element":"span"},{"style":{"height":16},"width":130.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-35.png","element":"img","alt":" (w1, b1)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":130.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-36.png","element":"img","alt":" (w2, b2)","inline":true,"padRight":true},{"text":"(with ","element":"span"},{"style":{"height":16},"width":361.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/11-37.png","element":"img","alt":"f(w2, b2) < f(w1, b1)","inline":true},{"text":"), and many other local minimizers.","element":"span"}],[{"text":"Fig. ","element":"span"},{"href":"#id-86","text":"14 ","element":"a"},{"text":"shows the optimization curve of the robust regression function, in which Gd-Net and GD are compared. BFGS does not converge in this case because the landscape here is very flat. The figure shows that, for the robust regression function, Gd-Net again performs better than GD. 
Table ","element":"span"},{"href":"#id-87","text":"V ","element":"a"},{"text":"shows the average numbers of effective samplings","element":"span"}],[{"style":{"width":"86%"},"width":866,"height":473,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/12-0.png","element":"img"}],[{"id":"id-86","text":"Fig. 14. The optimization procedure of L","element":"figcaption","subtype":"caption"},{"style":{"height":6.4},"width":15,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/12-1.png","element":"img","alt":"2","inline":true},{"text":"GO and the steepest gradient with the learned policy on robust regression.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"80%"},"width":812,"height":402,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/12-2.png","element":"img"}],[{"text":"Fig. 15. ","element":"figcaption","subtype":"caption"},{"id":"id-88","text":"The optimization curve of L","element":"figcaption","subtype":"caption"},{"style":{"height":6.4},"width":15,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/12-3.png","element":"img","alt":"2","inline":true},{"text":"GO and ADAM on training the classification network.","element":"figcaption","subtype":"caption"}],[{"text":"obtained by the compared policies. 
The table shows that the learned policy performs best, while the filled function method requires many more samplings.","element":"span"}],[{"text":"TABLE V ","element":"figcaption","subtype":"caption"},{"id":"id-87","text":"T","element":"figcaption","subtype":"caption"},{"text":"HE NUMBER OF SAMPLINGS USED TO FIND THE PROMISING DIRECTIONS FOR ROBUST REGRESSION FUNCTION","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"width":"93%"},"width":938,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/12-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"3) Neural network classifier: ","element":"span"},{"text":"We construct a small network with one hidden layer for a classification problem in 2-d ","element":"span"},{"href":"#id-19","referenceIndex":20,"text":"[20]","element":"a"},{"text":". The total dimension of the network is 5. The goal is to classify ","element":"span"},{"style":{"height":16},"width":313.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/12-5.png","element":"img","alt":" Si = {xi + ε|ε ∼","inline":true},{"style":{"height":17.38},"width":321.54,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/12-6.png","element":"img","alt":"N(0, σ2)}, i = 1, 2","inline":true,"padRight":true},{"text":"into two classes, where ","element":"span"},{"style":{"height":16.98},"width":248.17,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/12-7.png","element":"img","alt":" x1 ≠ x2 ∈ R2","inline":true},{"text":". The cross entropy is used as the loss function. ADAM is compared with L","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/12-8.png","element":"img","alt":"2","inline":true},{"text":"GO. 
In ADAM, the learning rate is 0.001, and the hyper-parameters for momentum estimation are ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"9 ","element":"span"},{"text":"and ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"9","element":"span"},{"text":".","element":"span"}],[{"text":"Fig. ","element":"span"},{"href":"#id-88","text":"15 ","element":"a"},{"text":"shows the optimization curve. From the figure, we see that L","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/12-9.png","element":"img","alt":"2","inline":true},{"text":"GO performs better than ADAM. It escapes from the local optimum and reaches a better one.","element":"span"}]]},{"heading":"VI. CONCLUSION AND FUTURE WORK","paragraphs":[[{"id":"id-35","text":"This paper proposed a two-phase global optimization algorithm ","element":"span"},{"text":"for smooth non-convex functions. In the minimization phase, a local optimization algorithm, called Gd-Net, was obtained by the model-driven learning approach. The method was established by learning the parameters of a non-linear combination of different descent directions through deep neural network training on a class of Gaussian family functions. In the escaping phase, a fixed escaping policy was first developed by modeling the escaping phase as an MDP. We further proposed to learn the escaping policy by policy gradient.","element":"span"}],[{"text":"A series of experiments has been carried out. First, controlled experimental results showed that Gd-Net performs better than classical algorithms such as steepest descent, conjugate gradient and BFGS on locally convex functions. 
The generalization ability of the learned algorithm was also verified on higher-dimensional functions and on functions whose contours differ from those of the Gaussian family. Second, experimental results showed that the fixed policy was better able to find promising solutions than random sampling, while the learned policy performed better than the fixed policy. Third, the proposed two-phase global algorithm, L","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/12-10.png","element":"img","alt":"2","inline":true},{"text":"GO, showed its effectiveness on a benchmark function and two machine learning problems.","element":"span"}],[{"text":"In the future, we plan to pursue the following avenues. First, because Gd-Net uses the Hessian matrix, it is not readily applicable to high-dimensional functions. Research on learning-to-learn approaches for high-dimensional functions is appealing. Second, we found that learning the escaping policy is particularly difficult for high-dimensional functions. It is thus necessary to develop a better learning approach. Third, the two-phase approach is not the only route to global optimization. We intend to develop learning-to-learn approaches based on other global methods, such as branch and bound ","element":"span"},{"href":"#id-89","referenceIndex":39,"text":"[39]","element":"a"},{"text":", and for other types of optimization problems, such as non-smooth, non-convex and derivative-free problems.","element":"span"}]]},{"heading":"REFERENCES","paragraphs":[[{"id":"id-0","text":"[1] L. Dixon and G. Szegö, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Towards Global Optimization","element":"span"},{"text":". ","element":"span"},{"text":"New York: Elsevier, 1975.","element":"span"}],[{"id":"id-1","text":"[2] R. Horst and P. 
Pardalos, Eds., ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Handbook of Global Optimization","element":"span"},{"text":". Dordrecht: Kluwer, 1995.","element":"span"}],[{"id":"id-2","text":"[3] (2019). [Online]. Available: ","element":"span"},{"href":"https://www.mat.univie.ac.at/~neum/glopt.html","text":"https://www.mat.univie.ac.at/~neum/glopt.html","element":"a"}],[{"id":"id-3","text":"[4] J. Pinter, “Continuous global optimization: Applications,” in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Encyclopedia of Optimization","element":"span"},{"text":", C. Floudas and P. Pardalos, Eds. ","element":"span"},{"text":"Boston, MA: Springer, 2008.","element":"span"}],[{"id":"id-4","text":"[5] A. Neumaier, “Convexification and global optimization in continuous ","element":"span"},{"text":"global optimization and constraint satisfaction,” in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Acta Numerica","element":"span"},{"text":", A. Iserles, Ed. ","element":"span"},{"text":"Cambridge University Press, 2004.","element":"span"}],[{"id":"id-5","text":"[6] P. Gary, W. Hart, L. Painton, C. Phillips, M. Trahan, and J. Wagner, “A ","element":"span"},{"text":"survey of global optimization methods,” 1997.","element":"span"}],[{"id":"id-6","text":"[7] A. Levy and S. Gómez, “The tunneling method applied to global ","element":"span"},{"text":"optimization,” in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Numerical Optimization","element":"span"},{"text":", P. Boggs, R. Byrd, and R. Schnabel, Eds. ","element":"span"},{"text":"SIAM, 1985.","element":"span"}],[{"id":"id-7","text":"[8] R. P. Ge and Y. F. 
Qin, “A class of filled functions for finding global ","element":"span"},{"text":"minimizers of a function of several variables,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Optimization Theory and Applications","element":"span"},{"text":", vol. 54, no. 2, pp. 241–252, 1987.","element":"span"}],[{"id":"id-8","text":"[9] S. Boyd and L. Vandenberghe, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Convex Optimization","element":"span"},{"text":". ","element":"span"},{"text":"Cambridge University Press, 2004.","element":"span"}],[{"id":"id-9","text":"[10] R. Fletcher and C. M. Reeves, “Function minimization by conjugate ","element":"span"},{"text":"gradients,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computer Journal","element":"span"},{"text":", vol. 7, no. 2, pp. 149–154, 1964.","element":"span"}],[{"id":"id-10","text":"[11] D. H. Wolpert and W. G. Macready, “No free lunch theorems for ","element":"span"},{"text":"optimization,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Evolutionary Computation","element":"span"},{"text":", vol. 1, no. 1, pp. 67–82, 1997.","element":"span"}],[{"id":"id-11","text":"[12] W. Sun and Y. Yuan, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Optimization theory and methods: nonlinear programming","element":"span"},{"text":". ","element":"span"},{"text":"Springer Science & Business Media, 2006, vol. 1.","element":"span"}],[{"id":"id-12","text":"[13] D. W. Marquardt, “An algorithm for least-squares estimation of nonlinear ","element":"span"},{"text":"parameters,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of the Society for Industrial and Applied Mathematics","element":"span"},{"text":", vol. 11, no. 2, pp. 431–441, 1963.","element":"span"}],[{"id":"id-13","text":"[14] B. T. 
Polyak, “Some methods of speeding up the convergence of iteration ","element":"span"},{"text":"methods,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"USSR Computational Mathematics and Mathematical Physics","element":"span"},{"text":", vol. 4, no. 5, pp. 1–17, 1964.","element":"span"}],[{"id":"id-14","text":"[15] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods ","element":"span"},{"text":"for online learning and stochastic optimization,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", vol. 12, no. Jul, pp. 2121–2159, 2011.","element":"span"}],[{"id":"id-15","text":"[16] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1212.5701","element":"span"},{"text":", 2012.","element":"span"}],[{"id":"id-16","text":"[17] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ","element":"span"},{"text":"in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICLR","element":"span"},{"text":", 2015.","element":"span"}],[{"id":"id-17","text":"[18] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, ","element":"span"},{"text":"T. Schaul, B. Shillingford, and N. de Freitas, “Learning to learn by gradient descent by gradient descent,” in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NIPS","element":"span"},{"text":", 2016, pp. 3981–3989.","element":"span"}],[{"id":"id-18","text":"[19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Computation","element":"span"},{"text":", vol. 9, no. 8, pp. 1735–1780, 1997.","element":"span"}],[{"id":"id-19","text":"[20] K. Li and J. Malik, “Learning to optimize,” in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICLR","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-20","text":"[21] Y. 
Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, ","element":"span"},{"text":"and N. de Freitas, “Learning to learn without gradient descent by gradient descent,” in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-21","text":"[22] Z. Xu and J. Sun, “Model-driven deep-learning,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"National Science Review","element":"span"},{"text":", vol. 5, no. 1, pp. 26–28, 2018.","element":"span"}],[{"id":"id-22","text":"[23] J. Sun, H. Li, Z. Xu ","element":"span"},{"style":{"fontStyle":"italic"},"text":"et al.","element":"span"},{"text":", “Deep ADMM-Net for compressive sensing MRI,” in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NIPS","element":"span"},{"text":", 2016, pp. 10–18.","element":"span"}],[{"id":"id-23","text":"[24] S. Wang, J. Sun, and Z. Xu, “HyperAdam: A learnable task-adaptive ","element":"span"},{"text":"Adam for network training,” in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AAAI","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-24","text":"[25] K. Lv, S. Jiang, and J. Li, “Learning gradient descent: Better generalization ","element":"span"},{"text":"and longer horizons,” in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML","element":"span"},{"text":", 2017, pp. 2247–2255.","element":"span"}],[{"id":"id-25","text":"[26] A. Levy and A. Montalvo, “The tunneling algorithm for the global ","element":"span"},{"text":"minimization of functions,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Scientific and Statistical Computing","element":"span"},{"text":", 1985.","element":"span"}],[{"id":"id-26","text":"[27] Y. Xu, Y. Zhang, and S. 
Wang, “A modified tunneling function method ","element":"span"},{"text":"for non-smooth global optimization and its application in artificial neural network,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Applied Mathematical Modelling","element":"span"},{"text":", vol. 39, pp. 6348–6450, 2015.","element":"span"}],[{"id":"id-27","text":"[28] H. Lin, Y. Wang, and L. Fan, “A filled function method with one ","element":"span"},{"text":"parameter for unconstrained global optimization,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Applied Mathematical Modelling","element":"span"},{"text":", vol. 218, pp. 3776–3785, 2011.","element":"span"}],[{"id":"id-28","text":"[29] Y. Zhang, L. Zhang, and Y. Xu, “New filled functions for non-smooth ","element":"span"},{"text":"global optimization,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Applied Mathematical Modelling","element":"span"},{"text":", vol. 33, no. 7, pp. 3114–3129, 2009.","element":"span"}],[{"id":"id-29","text":"[30] L. Zhang, C. Ng, D. Li, and W. Tian, “A new filled function method ","element":"span"},{"text":"for global optimization,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Global Optimization","element":"span"},{"text":", vol. 28, no. 1, pp. 17–43, 2004.","element":"span"}],[{"id":"id-30","text":"[31] S. Ma, Y. Yang, and H. Liu, “A parameter-free filled function for unconstrained ","element":"span"},{"text":"global optimization,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Applied Mathematics and Computation","element":"span"},{"text":", vol. 215, no. 10, pp. 3610–3619, 2010.","element":"span"}],[{"id":"id-37","text":"[32] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, ","element":"span"},{"text":"“Deep reinforcement learning: A brief survey,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Signal Processing Magazine","element":"span"},{"text":", vol. 34, no. 6, pp. 
26–38, 2017.","element":"span"}],[{"id":"id-38","text":"[33] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van ","element":"span"},{"text":"Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot ","element":"span"},{"style":{"fontStyle":"italic"},"text":"et al.","element":"span"},{"text":", “Mastering the game of Go with deep neural networks and tree search,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Nature","element":"span"},{"text":", vol. 529, no. 7587, p. 484, 2016.","element":"span"}],[{"id":"id-39","text":"[34] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, ","element":"span"},{"text":"and M. Riedmiller, “Playing Atari with deep reinforcement learning,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1312.5602","element":"span"},{"text":", 2013.","element":"span"}],[{"id":"id-40","text":"[35] R. Sutton and A. Barto, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement Learning: An Introduction","element":"span"},{"text":", 1998.","element":"span"}],[{"id":"id-45","text":"[36] R. Wilson, “Multiresolution Gaussian mixture models: Theory and ","element":"span"},{"text":"applications,” in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE International Conference on Pattern Recognition","element":"span"},{"text":". Citeseer, 2000.","element":"span"}],[{"id":"id-64","text":"[37] Y. Liang, L. Zhang, M. Li, and B. Han, “A filled function method for ","element":"span"},{"text":"global optimization,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Computational and Applied Mathematics","element":"span"},{"text":", vol. 205, no. 1, pp. 16–31, 2007.","element":"span"}],[{"id":"id-76","text":"[38] Y. LeCun, L. Bottou, Y. Bengio, P. 
Haffner ","element":"span"},{"style":{"fontStyle":"italic"},"text":"et al.","element":"span"},{"text":", “Gradient-based learning applied to document recognition,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE","element":"span"},{"text":", vol. 86, no. 11, pp. 2278–2324, 1998.","element":"span"}],[{"id":"id-89","text":"[39] E. L. Lawler and D. E. Wood, “Branch-and-bound methods: A survey,” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Operations Research","element":"span"},{"text":", vol. 14, no. 4, pp. 699–719, 1966.","element":"span"}]]},{"heading":"APPENDIX A","paragraphs":[[{"text":"The proof of Theorem ","element":"span"},{"href":"#id-46","text":"1 ","element":"a"},{"text":"can be found below.","element":"span"}],[{"style":{"width":"79%"},"width":795,"height":186,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-0.png","element":"img"}],[{"style":{"height":17.9},"width":907.47,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-1.png","element":"img","alt":"�yk−1 = w3kgk − w4kgk−1, �yk−1 = w1kgk − w2kgk−1","inline":true},{"text":", and ","element":"span"},{"style":{"height":16},"width":504.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-2.png","element":"img","alt":"�Hk−1 = βkHk−1 + (1 − βk)I","inline":true},{"text":". 
With exact line search, we have ","element":"span"},{"style":{"height":17.22},"width":192.7,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-3.png","element":"img","alt":" g⊺ksk−1 = 0","inline":true},{"text":", therefore","element":"span"}],[{"style":{"width":"67%"},"width":673,"height":216,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-4.png","element":"img"}],[{"text":"It is clear that ","element":"span"},{"style":{"height":17.22},"width":188.08,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-5.png","element":"img","alt":" −g⊺kdk > 0","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":13.19},"width":169.78,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-6.png","element":"img","alt":" Hk−1 ≻ 0","inline":true},{"text":"; otherwise, a ","element":"span"},{"style":{"height":14.4},"width":118.4,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-7.png","element":"img","alt":" βk > 0","inline":true,"padRight":true},{"text":"can be chosen to make ","element":"span"},{"style":{"height":16},"width":396.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-8.png","element":"img","alt":" (βkHk−1 + (1 − βk)I)","inline":true,"padRight":true},{"text":"diagonally dominant, which implies ","element":"span"},{"style":{"height":17.22},"width":573.3,"height":43.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-9.png","element":"img","alt":" g⊺k(βkHk−1 + (1 − βk)I)gk > 0","inline":true},{"text":". ","element":"span"},{"text":"Therefore, we can always ensure that ","element":"span"},{"style":{"height":17.22},"width":173.85,"height":43.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-10.png","element":"img","alt":" g⊺kdk < 0","inline":true},{"text":", i.e. 
","element":"span"},{"style":{"height":13.19},"width":37.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-11.png","element":"img","alt":" dk","inline":true,"padRight":true},{"text":"is ","element":"span"},{"text":"a descent direction, and","element":"span"}],[{"style":{"width":"48%"},"width":489,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-12.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"is ","element":"span"},{"text":"bounded, ","element":"span"},{"text":"there ","element":"span"},{"text":"exists ","element":"span"},{"style":{"height":16},"width":96.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-13.png","element":"img","alt":"f(x∗)","inline":true,"padRight":true},{"text":"such ","element":"span"},{"text":"that ","element":"span"},{"style":{"height":16},"width":391.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-14.png","element":"img","alt":"limk→∞ f(xk) = f(x⋆)","inline":true},{"text":".","element":"span"}]]},{"heading":"APPENDIX B","paragraphs":[[{"text":"This section gives details of the proofs for the theorems in Section ","element":"span"},{"text":"III. ","element":"span"},{"text":"The proof to Theorem ","element":"span"},{"href":"#id-67","text":"2 ","element":"a"},{"text":"is shown below.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"According to assumption (2), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"is convex in ","element":"span"},{"style":{"height":16},"width":107,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-15.png","element":"img","alt":" [−δ, δ]","inline":true},{"text":". As ","element":"span"},{"style":{"height":22.32},"width":614.66,"height":55.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-16.png","element":"img","alt":" g′(0) = f ′(x)��x=x0 = 0, g′′(0) > 0","inline":true},{"text":", then ","element":"span"},{"style":{"height":16},"width":164.53,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-17.png","element":"img","alt":" g′(t) > 0","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"height":16},"width":83.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-18.png","element":"img","alt":"(0, δ]","inline":true},{"text":". Therefore, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"is monotonically increasing in ","element":"span"},{"style":{"height":16},"width":78.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-19.png","element":"img","alt":" [0, δ]","inline":true},{"text":", and monotonically decreasing in ","element":"span"},{"style":{"height":16},"width":152.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-20.png","element":"img","alt":" [T −δ, T]","inline":true},{"text":". 
Since ","element":"span"},{"style":{"height":17.38},"width":257.46,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-21.png","element":"img","alt":" f(x) ∈ C2(Rn)","inline":true},{"text":", then ","element":"span"},{"style":{"height":17.38},"width":253.61,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-22.png","element":"img","alt":" g(t) ∈ C2[0, T]","inline":true},{"text":". This implies that ","element":"span"},{"style":{"height":16},"width":77.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-23.png","element":"img","alt":" g′(t)","inline":true,"padRight":true},{"text":"is continuous in ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", T","element":"span"},{"text":"]","element":"span"},{"text":".","element":"span"}],[{"text":"Since ","element":"span"},{"style":{"height":16},"width":162.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-24.png","element":"img","alt":" g′(δ) > 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":242.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-25.png","element":"img","alt":" g′(T − δ) < 0","inline":true},{"text":", then there exists a ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-26.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":16},"width":160.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-27.png","element":"img","alt":" g′(ξ) = 0","inline":true},{"text":". 
Further, ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-28.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"is unique since it is assumed that there are no other local minima between ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-29.png","element":"img","alt":" x0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-30.png","element":"img","alt":" x1","inline":true},{"text":". Then we have ","element":"span"},{"style":{"height":16},"width":150.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-31.png","element":"img","alt":" g′(t) > 0","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"height":16},"width":83.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-32.png","element":"img","alt":" [0, ξ)","inline":true},{"text":", and ","element":"span"},{"style":{"height":16},"width":150.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-33.png","element":"img","alt":" g′(t) < 0","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"height":16},"width":92.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-34.png","element":"img","alt":" (ξ, T]","inline":true},{"text":".","element":"span"}],[{"text":"To prove Theorem ","element":"span"},{"href":"#id-49","text":"3, ","element":"a"},{"text":"we first prove Lemma ","element":"span"},{"href":"#id-90","text":"5.","element":"a"}],[{"id":"id-90","style":{"fontWeight":"bold"},"text":"Lemma 5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be the function defined in Alg. ","element":"span"},{"href":"#id-48","style":{"fontStyle":"italic"},"text":"2. ","element":"a"},{"style":{"fontStyle":"italic"},"text":"For fixed ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"style":{"fontStyle":"italic"},"text":", we have:","element":"span"}],[{"style":{"width":"72%"},"width":725,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-35.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"style":{"height":16},"width":218.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-36.png","element":"img","alt":"(t) = f(x0 +","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"td","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Since ","element":"span"},{"style":{"height":16},"width":228.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-37.png","element":"img","alt":" {x1, · · · , xN}","inline":true,"padRight":true},{"text":"are all on the line ","element":"span"},{"style":{"height":13.19},"width":128.56,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-38.png","element":"img","alt":" x0 + td","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > ","element":"span"},{"text":"0","element":"span"},{"text":", each ","element":"span"},{"style":{"height":14},"width":229.68,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-39.png","element":"img","alt":" xi, 1 ≤ i ≤ N","inline":true,"padRight":true},{"text":"can be written as","element":"span"}],[{"style":{"width":"22%"},"width":227,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-40.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.99},"width":194.12,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-41.png","element":"img","alt":" t1 = δ0 > 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.39},"width":265.33,"height":30.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-42.png","element":"img","alt":" t1 < t2 · · · < tN","inline":true},{"text":". 
Further, we have","element":"span"}],[{"style":{"width":"86%"},"width":866,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-43.png","element":"img"}],[{"text":"or equivalently,","element":"span"}],[{"style":{"width":"91%"},"width":917,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/13-44.png","element":"img"}],[{"text":"We thus have:","element":"span"}],[{"id":"id-52","style":{"width":"93%"},"width":943,"height":796,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-0.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":16},"width":77.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-1.png","element":"img","alt":" g′(t)","inline":true,"padRight":true},{"text":"is continuous in ","element":"span"},{"style":{"height":16},"width":116.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-2.png","element":"img","alt":" [t1, tN]","inline":true},{"text":", it is Riemann integrable. Thus, for a fixed ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":", we have","element":"span"}],[{"style":{"width":"65%"},"width":655,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-3.png","element":"img"}],[{"text":"This finishes the proof.","element":"span"}],[{"text":"In the sequel, we define","element":"span"}],[{"style":{"width":"67%"},"width":678,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":174.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-5.png","element":"img","alt":" a ∈ [0, ∞)","inline":true},{"text":". 
Here ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"text":"is just the distance between ","element":"span"},{"style":{"height":9.19},"width":49.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-6.png","element":"img","alt":" xN","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-7.png","element":"img","alt":" x1","inline":true,"padRight":true},{"text":"along ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":". Obviously, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"text":"is a polynomial function of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":", and it is monotonically increasing.","element":"span"}],[{"text":"Based on Lemma ","element":"span"},{"href":"#id-90","text":"5, ","element":"a"},{"text":"Lemma ","element":"span"},{"href":"#id-91","text":"6 ","element":"a"},{"text":"can be established.","element":"span"}],[{"id":"id-91","style":{"fontWeight":"bold"},"text":"Lemma 6. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose that ","element":"span"},{"style":{"height":13.19},"width":236.44,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-8.png","element":"img","alt":" x′ = x0 + Td","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a point such that ","element":"span"},{"style":{"height":16},"width":237.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-9.png","element":"img","alt":"f(x0) ≥ f(x′)","inline":true},{"style":{"fontStyle":"italic"},"text":", and there is no other local minimizer within ","element":"span"},{"style":{"height":13.19},"width":42.18,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-10.png","element":"img","alt":"B0","inline":true},{"style":{"fontStyle":"italic"},"text":". If ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-11.png","element":"img","alt":" α","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is sufficiently small, then there exists an ","element":"span"},{"style":{"height":10.99},"width":37.06,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-12.png","element":"img","alt":" a∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":16},"width":185.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-13.png","element":"img","alt":"F ′(a∗) = 0","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Since ","element":"span"},{"style":{"height":16},"width":237.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-14.png","element":"img","alt":" f(x′) ≤ f(x0)","inline":true},{"text":", and ","element":"span"},{"style":{"height":16},"width":286.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-15.png","element":"img","alt":" g(t) = f(x0+td)","inline":true,"padRight":true},{"text":"is smooth, there exists a ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-16.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":18.7},"width":395.91,"height":46.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-17.png","element":"img","alt":" ξ = arg maxt∈[t1,T ] g(t)","inline":true},{"text":".","element":"span"}],[{"text":"Let’s consider two cases. First, let ","element":"span"},{"style":{"height":14},"width":202.14,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-18.png","element":"img","alt":" D1 = ξ − t1","inline":true},{"text":", then there is an ","element":"span"},{"style":{"height":9.19},"width":37.06,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-19.png","element":"img","alt":" a1","inline":true,"padRight":true},{"text":"s.t. ","element":"span"},{"style":{"height":16},"width":204.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-20.png","element":"img","alt":" G(a1) = D1","inline":true,"padRight":true},{"text":"according to the intermediate value theorem. 
As ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-21.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"is sufficiently small, we have: ","element":"span"},{"style":{"height":14},"width":204.25,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-22.png","element":"img","alt":" ∀ε > 0, ∃ α","inline":true,"padRight":true},{"text":"s.t.","element":"span"}],[{"style":{"width":"50%"},"width":507,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-23.png","element":"img"}],[{"text":"Note that we can choose ","element":"span"},{"style":{"height":13.99},"width":33.71,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-24.png","element":"img","alt":" δ0","inline":true,"padRight":true},{"text":"s.t.","element":"span"}],[{"style":{"width":"71%"},"width":715,"height":200,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-25.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ϵ","element":"span"},{"style":{"height":23.36},"width":180.66,"height":58.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-26.png","element":"img","alt":"0 = 1a1∫ tNt1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"style":{"height":16},"width":57.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-27.png","element":"img","alt":"′(t)","inline":true},{"style":{"fontStyle":"italic"},"text":"dt ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ϵ ","element":"span"},{"style":{"height":16},"width":116.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-28.png","element":"img","alt":" = ϵ0/2","inline":true},{"text":", then 
","element":"span"},{"style":{"height":13.19},"width":73.9,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-29.png","element":"img","alt":" ∃ α0","inline":true,"padRight":true},{"text":"s.t. ","element":"span"},{"style":{"height":16},"width":156.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-30.png","element":"img","alt":" |F ′(a1)−","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"ϵ","element":"span"},{"style":{"height":16},"width":28.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-31.png","element":"img","alt":"0|","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"< ϵ","element":"span"},{"style":{"height":16},"width":122.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-32.png","element":"img","alt":"0/2 ⇒","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"height":16},"width":81.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-33.png","element":"img","alt":"′(a1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"> ϵ","element":"span"},{"style":{"height":16},"width":130.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-34.png","element":"img","alt":"0/2 > 0","inline":true},{"text":".","element":"span"}],[{"text":"Similarly, if we let ","element":"span"},{"style":{"height":13.19},"width":209.45,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-35.png","element":"img","alt":" D2 = T − t1","inline":true},{"text":", then there is ","element":"span"},{"style":{"height":9.19},"width":37.06,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-36.png","element":"img","alt":" a2","inline":true,"padRight":true},{"text":"s.t. 
","element":"span"},{"style":{"height":16},"width":143.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-37.png","element":"img","alt":" G(a2) =","inline":true},{"style":{"height":13.19},"width":48.99,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-38.png","element":"img","alt":"D2","inline":true},{"text":", as ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-39.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"is sufficiently small, we have: ","element":"span"},{"style":{"height":14},"width":207.53,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-40.png","element":"img","alt":" ∀ε > 0, ∃ α","inline":true,"padRight":true},{"text":"s.t.","element":"span"}],[{"style":{"width":"50%"},"width":507,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-41.png","element":"img"}],[{"text":"Note that","element":"span"}],[{"style":{"width":"89%"},"width":899,"height":200,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-42.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ϵ","element":"span"},{"style":{"height":23.36},"width":208.92,"height":58.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-43.png","element":"img","alt":"1 = 1a2∫ tNt1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"style":{"height":16},"width":57.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-44.png","element":"img","alt":"′(t)","inline":true},{"style":{"fontStyle":"italic"},"text":"dt ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ϵ 
","element":"span"},{"style":{"height":16},"width":161.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-45.png","element":"img","alt":" = −ϵ1/2","inline":true},{"text":", then ","element":"span"},{"style":{"height":13.19},"width":85.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-46.png","element":"img","alt":" ∃ α0","inline":true,"padRight":true},{"text":"s.t. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"height":16},"width":398,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-47.png","element":"img","alt":"′(a2) − ϵ1| < −ϵ1/2 ⇒","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"height":16},"width":81.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-48.png","element":"img","alt":"′(a2)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"< ϵ","element":"span"},{"style":{"height":16},"width":130.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-49.png","element":"img","alt":"1/2 < 0","inline":true},{"text":".","element":"span"}],[{"text":"In summary, we have ","element":"span"},{"style":{"height":16},"width":222.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-50.png","element":"img","alt":" F ′(a1) > 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":222.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-51.png","element":"img","alt":" F ′(a2) < 0","inline":true},{"text":". By the intermediate value theorem, there exists an 
","element":"span"},{"style":{"height":10.99},"width":37.06,"height":27.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-52.png","element":"img","alt":"a∗","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":16},"width":185.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-53.png","element":"img","alt":" F ′(a∗) = 0","inline":true},{"text":".","element":"span"}],[{"text":"If there are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"local minimizers ","element":"span"},{"style":{"height":16.98},"width":180.62,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-54.png","element":"img","alt":" x1, · · · , xL","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"height":13.19},"width":42.18,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-55.png","element":"img","alt":" B0","inline":true,"padRight":true},{"text":"whose criteria are bigger than ","element":"span"},{"style":{"height":16},"width":95.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-56.png","element":"img","alt":" f(x0)","inline":true},{"text":", and a local minimizer ","element":"span"},{"style":{"height":6.8},"width":36.78,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-57.png","element":"img","alt":"x′","inline":true,"padRight":true},{"text":"with smaller criterion outside ","element":"span"},{"style":{"height":13.19},"width":42.18,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-58.png","element":"img","alt":" B0","inline":true},{"text":". 
Denote ","element":"span"},{"style":{"height":14},"width":147.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-59.png","element":"img","alt":" fmin =","inline":true},{"style":{"height":17.78},"width":319.98,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-60.png","element":"img","alt":"mini=1,...,L{f(xi)}","inline":true,"padRight":true},{"text":", we have ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"local maximizers ","element":"span"},{"style":{"height":14},"width":80.94,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-61.png","element":"img","alt":" ξ1 <","inline":true},{"style":{"height":14.79},"width":191.68,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-62.png","element":"img","alt":"· · · < ξL+1","inline":true},{"text":". Since ","element":"span"},{"style":{"height":16},"width":250.13,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-63.png","element":"img","alt":" f(x′) < f(x0)","inline":true},{"text":", we can set ","element":"span"},{"style":{"height":13.99},"width":33.71,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-64.png","element":"img","alt":" δ0","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":16},"width":331.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-65.png","element":"img","alt":"fmin > f(x0 + δ0d)","inline":true},{"text":". 
Substituting ","element":"span"},{"style":{"height":14.79},"width":79.76,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-66.png","element":"img","alt":" ξL+1","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-67.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"in the proof, we can prove Theorem ","element":"span"},{"href":"#id-49","text":"3.","element":"a"}]]},{"heading":"APPENDIX C","paragraphs":[[{"text":"In the following, we explain why ","element":"span"},{"style":{"height":13.19},"width":136.75,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-68.png","element":"img","alt":" Pc > Pr","inline":true},{"text":". In Alg. ","element":"span"},{"href":"#id-54","text":"4, ","element":"a"},{"text":"the main idea is to use a negative linear combination and to add noise to make the algorithm robust. We now explain the insight behind the ’negative linear combination’. We first assume that there are two local minimizers. Without loss of generality, suppose that we are at a local minimum ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-69.png","element":"img","alt":" x0","inline":true},{"text":", and there exists a local minimum ","element":"span"},{"style":{"height":6.8},"width":36.78,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-70.png","element":"img","alt":" x′","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16},"width":237.83,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-71.png","element":"img","alt":"f(x′) < f(x0)","inline":true},{"text":"). 
Then ","element":"span"},{"style":{"height":6.8},"width":36.78,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-72.png","element":"img","alt":" x′","inline":true,"padRight":true},{"text":"has a neighborhood region ","element":"span"},{"style":{"height":13.19},"width":58.33,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-73.png","element":"img","alt":" Rx′","inline":true},{"text":", which satisfies ","element":"span"},{"style":{"height":16},"width":432.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-74.png","element":"img","alt":" f(x) < f(x0), ∀x ∈ Rx′","inline":true},{"text":". Let ","element":"span"},{"style":{"height":6.8},"width":45.95,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-75.png","element":"img","alt":"x′′","inline":true,"padRight":true},{"text":"denote the center of the circumscribed sphere of ","element":"span"},{"style":{"height":13.19},"width":58.33,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-76.png","element":"img","alt":" Rx′","inline":true},{"text":". Then ","element":"span"},{"style":{"height":13.37},"width":244.82,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-77.png","element":"img","alt":"d∗ ≜ x′′ − x0","inline":true,"padRight":true},{"text":"is called the central direction in the sequel. Further, we define the ray ","element":"span"},{"style":{"height":14},"width":329.06,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-78.png","element":"img","alt":" ℓd = x0 + td, t > 0","inline":true},{"text":". We have the following Lemma ","element":"span"},{"href":"#id-92","text":"7.","element":"a"}],[{"id":"id-92","style":{"fontWeight":"bold"},"text":"Lemma 7. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given an initial sample of directions and scores ","element":"span"},{"style":{"height":16},"width":608.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-79.png","element":"img","alt":"{(d1, u1) · · · , (dN0, uN0)} (N0 ≤ n","inline":true},{"style":{"fontStyle":"italic"},"text":"), using negative linear combination of ","element":"span"},{"style":{"height":14.78},"width":193.8,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-80.png","element":"img","alt":" d1, · · · , dN0","inline":true},{"style":{"fontStyle":"italic"},"text":", one obtains ","element":"span"},{"style":{"height":10.99},"width":36.74,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-81.png","element":"img","alt":" d∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with higher probability than by random sampling.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"In the following, we first prove the lemma in the case ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 2","element":"span"},{"text":". 
It is then generalized to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n > ","element":"span"},{"text":"2","element":"span"},{"text":".","element":"span"}],[{"text":"In case ","element":"span"},{"style":{"height":13.19},"width":137.33,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-82.png","element":"img","alt":" N0 = 2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 2","element":"span"},{"text":", suppose at some time step, we have two linearly independent directions ","element":"span"},{"style":{"height":13.19},"width":36.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-83.png","element":"img","alt":" d1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":36.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-84.png","element":"img","alt":" d2","inline":true,"padRight":true},{"text":"with negative scores. Let ","element":"span"},{"style":{"height":16},"width":427.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-85.png","element":"img","alt":" Ω = {x : ∥x−x0∥2 ≤ M}","inline":true,"padRight":true},{"text":"be the confined search space. The search space can then be divided into four regions ","element":"span"},{"style":{"height":16.59},"width":183.86,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-86.png","element":"img","alt":" B1, B2, B3","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.39},"width":64.55,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-87.png","element":"img","alt":" B∗4","inline":true},{"text":". 
Particularly, ","element":"span"},{"style":{"height":16},"width":315.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-88.png","element":"img","alt":" B∗ = {d = α1d1 +","inline":true},{"style":{"height":16},"width":381,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-89.png","element":"img","alt":"α2d2, α1 < 0, α2 < 0}","inline":true},{"text":". Suppose that ","element":"span"},{"style":{"height":6.8},"width":36.78,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-90.png","element":"img","alt":" x′","inline":true,"padRight":true},{"text":"has a neighborhood region ","element":"span"},{"style":{"height":13.19},"width":58.33,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-91.png","element":"img","alt":" Rx′","inline":true},{"text":", which satisfies ","element":"span"},{"style":{"height":16},"width":401.27,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-92.png","element":"img","alt":" f(x) < f(x0), ∀x ∈ Rx′","inline":true},{"text":", and the radius of the circumscribed sphere of ","element":"span"},{"style":{"height":13.19},"width":58.33,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-93.png","element":"img","alt":" Rx′","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":9.19},"width":33.98,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-94.png","element":"img","alt":" r0","inline":true},{"text":". 
The boundary of the circumscribed sphere and ","element":"span"},{"style":{"height":9.19},"width":38.77,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-95.png","element":"img","alt":" x0","inline":true,"padRight":true},{"text":"form a cone ","element":"span"},{"style":{"height":10.98},"width":47.33,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-96.png","element":"img","alt":" C∗","inline":true},{"text":". By assumption, the lines ","element":"span"},{"style":{"height":14.78},"width":191.06,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-97.png","element":"img","alt":" ℓdi, i = 1, 2","inline":true,"padRight":true},{"text":"have no intersection with ","element":"span"},{"style":{"height":10.99},"width":47.33,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-98.png","element":"img","alt":" C∗","inline":true,"padRight":true},{"text":"(otherwise we have found a direction that will lead to the attraction basin of ","element":"span"},{"style":{"height":16},"width":49.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-99.png","element":"img","alt":" x′)","inline":true},{"text":", i.e.","element":"span"}],[{"style":{"width":"55%"},"width":562,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/14-100.png","element":"img"}],[{"text":"For each ","element":"span"},{"style":{"height":13.19},"width":31.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-0.png","element":"img","alt":" di","inline":true},{"text":", take ","element":"span"},{"style":{"height":12.98},"width":33.78,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-1.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":10.98},"width":38.78,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-2.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":17.03},"width":249.95,"height":42.58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-3.png","element":"img","alt":" xi = ∂Ω ∩ ℓdi","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":241.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-4.png","element":"img","alt":"x∗ = ∂Ω ∩ ℓd∗","inline":true},{"text":", respectively. Then the boundary of ","element":"span"},{"style":{"height":16.99},"width":153.34,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-5.png","element":"img","alt":" B(xi, r1)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-6.png","element":"img","alt":" x0","inline":true,"padRight":true},{"text":"form a cone ","element":"span"},{"style":{"height":12.99},"width":42.33,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-7.png","element":"img","alt":" Ci","inline":true},{"text":", where","element":"span"}],[{"id":"id-95","style":{"width":"100%"},"width":1005,"height":379,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-8.png","element":"img"}],[{"style":{"height":28.8},"width":281.63,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-9.png","element":"img","alt":"x|x ∈ Ω, x /∈ �C�","inline":true},{"text":", then ","element":"span"},{"style":{"height":17.22},"width":39.58,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-10.png","element":"img","alt":"˜Pc","inline":true},{"text":", the 
probability of finding ","element":"span"},{"style":{"height":10.98},"width":36.74,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-11.png","element":"img","alt":" d∗","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-12.png","element":"img","alt":"Ω","inline":true,"padRight":true},{"text":"by negative linear combination, can be computed as follows:","element":"span"}],[{"style":{"height":19.01},"width":772.88,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-13.png","element":"img","alt":"˜Pc = P{ ˜d = d∗; x′′ ∈ B∗} + P{ ˜d = d∗; x′′","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"style":{"height":16},"width":108.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-14.png","element":"img","alt":"∈ B∗}","inline":true},{"style":{"height":19.01},"width":621.66,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-15.png","element":"img","alt":"= P{x′′ ∈ B∗}P{ ˜d = d∗|x′′ ∈ B∗}","inline":true}],[{"text":"where ","element":"span"},{"style":{"height":15.01},"width":27.05,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-16.png","element":"img","alt":"˜d","inline":true,"padRight":true},{"text":"is the direction obtained by negative linear combination. 
Notice that ","element":"span"},{"style":{"height":22.72},"width":479.58,"height":56.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-17.png","element":"img","alt":" P{ ˜d = d∗|x′′ ∈ B∗} = Vd∗/VB∗","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":14},"width":139.89,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-18.png","element":"img","alt":" Vd∗, VB∗","inline":true,"padRight":true},{"text":"are ","element":"span"},{"text":"the measures of ","element":"span"},{"style":{"height":14.19},"width":105,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-19.png","element":"img","alt":" d∗, B∗","inline":true},{"text":", respectively. Denote by ","element":"span"},{"style":{"height":17.22},"width":40.58,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-20.png","element":"img","alt":"˜Pr","inline":true,"padRight":true},{"text":"the probability of finding ","element":"span"},{"style":{"height":10.98},"width":36.74,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-21.png","element":"img","alt":" d∗","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-22.png","element":"img","alt":" Ω","inline":true,"padRight":true},{"text":"by random sampling, then ","element":"span"},{"style":{"height":13.19},"width":54.84,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-23.png","element":"img","alt":" Vd∗","inline":true,"padRight":true},{"text":"can be represented by ","element":"span"},{"style":{"height":17.22},"width":40.58,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-24.png","element":"img","alt":"˜Pr","inline":true,"padRight":true},{"text":"and the measure of 
","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-25.png","element":"img","alt":" Ω","inline":true},{"text":". That is, ","element":"span"},{"style":{"height":17.22},"width":212.56,"height":43.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-26.png","element":"img","alt":" Vd∗ = ˜Pr·VΩ","inline":true},{"text":". As ","element":"span"},{"style":{"height":10.98},"width":36.74,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-27.png","element":"img","alt":" d∗","inline":true,"padRight":true},{"text":"does not interact with ","element":"span"},{"style":{"height":14.83},"width":31,"height":37.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-28.png","element":"img","alt":"˜C","inline":true},{"text":", thus ","element":"span"},{"style":{"height":18.83},"width":122.85,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-29.png","element":"img","alt":" x′′ /∈ ˜C","inline":true},{"text":". Then we have:","element":"span"}],[{"style":{"width":"60%"},"width":605,"height":135,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-30.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.19},"width":111.94,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-31.png","element":"img","alt":" VΩ, V�Ω","inline":true,"padRight":true},{"text":"is the measure of ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-32.png","element":"img","alt":" Ω","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-33.png","element":"img","alt":"�Ω","inline":true},{"text":", respectively. 
The last inequality holds because ","element":"span"},{"style":{"height":11.6},"width":110.91,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-34.png","element":"img","alt":"˜Ω ⊂ Ω","inline":true},{"text":". If ","element":"span"},{"style":{"height":17.03},"width":227.58,"height":42.58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-35.png","element":"img","alt":" B∗ ∩ Ci ≠ ∅","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", then ","element":"span"},{"style":{"height":12.99},"width":43.23,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-36.png","element":"img","alt":" Bi","inline":true,"padRight":true},{"text":"is covered by ","element":"span"},{"style":{"height":12.99},"width":42.33,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-37.png","element":"img","alt":" Ci","inline":true},{"text":". 
Since the region covered by ","element":"span"},{"style":{"height":16.98},"width":215.56,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-38.png","element":"img","alt":" Ci (i = 1, 2)","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"height":13.38},"width":48.23,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-39.png","element":"img","alt":" B3","inline":true,"padRight":true},{"text":"has a larger measure than ","element":"span"},{"style":{"height":13.38},"width":122.68,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-40.png","element":"img","alt":" B∗, B∗","inline":true,"padRight":true},{"text":"is thus the best region for sampling.","element":"span"}],[{"text":"In case ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"2","element":"span"},{"text":", we have ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"directions with negative scores ","element":"span"},{"style":{"height":14},"width":174.56,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-41.png","element":"img","alt":" d1, · · · , dn","inline":true},{"text":". 
If we set ","element":"span"},{"style":{"height":13.19},"width":36.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-42.png","element":"img","alt":" d2","inline":true,"padRight":true},{"text":"as the subspace ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= ","element":"span"},{"style":{"height":17.6},"width":365.21,"height":43.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-43.png","element":"img","alt":"∑ni=2 αidi, αi > 0}","inline":true},{"text":", since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"has zero measure in ","element":"span"},{"style":{"height":10.8},"width":48.78,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-44.png","element":"img","alt":" Rn","inline":true},{"text":", ","element":"span"},{"text":"the proof reduces to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 2 ","element":"span"},{"text":"case.","element":"span"}],[{"text":"Furthermore, we explain why ","element":"span"},{"style":{"height":13.19},"width":135.54,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-45.png","element":"img","alt":" Pc > Pr","inline":true},{"text":":","element":"span"}],[{"text":"We illustrate using ","element":"span"},{"style":{"height":13.39},"width":47.33,"height":33.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-46.png","element":"img","alt":" C2","inline":true,"padRight":true},{"text":"in Fig. ","element":"span"},{"href":"#id-93","text":"16 ","element":"a"},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 2","element":"span"},{"text":". 
In Fig. ","element":"span"},{"href":"#id-93","text":"16, ","element":"a"},{"style":{"height":13.19},"width":36.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-47.png","element":"img","alt":"d2","inline":true,"padRight":true},{"text":"is a direction with negative score. If we want to create ","element":"span"},{"style":{"height":15.01},"width":27.04,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-48.png","element":"img","alt":"˜d","inline":true,"padRight":true},{"text":"which is a promising direction, then ","element":"span"},{"style":{"height":10.98},"width":36.74,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-49.png","element":"img","alt":" d∗","inline":true,"padRight":true},{"text":"must be between ","element":"span"},{"style":{"height":13.19},"width":61.75,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-50.png","element":"img","alt":" dlow","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.59},"width":48.68,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-51.png","element":"img","alt":" dup","inline":true,"padRight":true},{"text":"as shown in Fig. 
","element":"span"},{"href":"#id-94","text":"17.","element":"a"}],[{"text":"To define ","element":"span"},{"style":{"height":13.19},"width":61.75,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-52.png","element":"img","alt":" dlow","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.59},"width":48.69,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-53.png","element":"img","alt":" dup","inline":true},{"text":", let ","element":"span"},{"style":{"height":13.19},"width":45.48,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-54.png","element":"img","alt":" Cd","inline":true,"padRight":true},{"text":"is the cone made by the boundary of ","element":"span"},{"style":{"height":16},"width":158.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-55.png","element":"img","alt":" B(xd, r1)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-56.png","element":"img","alt":" x0","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16},"width":234.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-57.png","element":"img","alt":" xd = ∂Ω � ℓd","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > ","element":"span"},{"text":"0 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":", and ","element":"span"},{"style":{"height":9.19},"width":33.98,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-58.png","element":"img","alt":" r1","inline":true,"padRight":true},{"text":"is defined in Eq. ","element":"span"},{"href":"#id-95","text":"32. 
","element":"a"},{"text":"We further define ","element":"span"},{"style":{"height":19.97},"width":281.14,"height":49.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-59.png","element":"img","alt":"D ˜d = {d| ˜d ∈ Cd","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.85},"width":242.24,"height":42.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-60.png","element":"img","alt":" Cd� ℓd2 = ∅}","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.96},"width":295.94,"height":42.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-61.png","element":"img","alt":" R = {ld|d ∈ D ˜d}","inline":true},{"text":". ","element":"span"},{"style":{"height":13.19},"width":61.75,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-62.png","element":"img","alt":"dlow","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.59},"width":48.69,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-63.png","element":"img","alt":" dup","inline":true,"padRight":true},{"text":"are considered as the ray from ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-64.png","element":"img","alt":" x0","inline":true,"padRight":true},{"text":"to the boundary of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":". 
Similarly to the definition of ","element":"span"},{"style":{"height":9.19},"width":39.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-65.png","element":"img","alt":" xd","inline":true},{"text":", we define ","element":"span"},{"style":{"height":17.54},"width":520.44,"height":43.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-66.png","element":"img","alt":"x ˜d = ∂Ω ∩ ℓ ˜d, xdup = ∂Ω ∩ ℓdup","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":286.71,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-67.png","element":"img","alt":" xdlow = ∂Ω ∩ ℓdlow","inline":true},{"text":". If the distance between ","element":"span"},{"style":{"height":13.38},"width":38.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-68.png","element":"img","alt":" x2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.76},"width":44.28,"height":29.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-69.png","element":"img","alt":" x ˜d","inline":true,"padRight":true},{"text":"is larger, then the distance between ","element":"span"},{"style":{"height":12.38},"width":59.32,"height":30.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-70.png","element":"img","alt":" xdup","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.78},"width":68.36,"height":26.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-71.png","element":"img","alt":" xdlow","inline":true,"padRight":true},{"text":"must also be larger, as shown in Fig. ","element":"span"},{"href":"#id-94","text":"17. 
","element":"a"},{"text":"This implies the probability that ","element":"span"},{"style":{"height":15},"width":27.05,"height":37.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-72.png","element":"img","alt":"˜d","inline":true,"padRight":true},{"text":"is promising is higher. When the distance between ","element":"span"},{"style":{"height":13.38},"width":38.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-73.png","element":"img","alt":" x2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.76},"width":44.27,"height":29.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-74.png","element":"img","alt":" x ˜d","inline":true,"padRight":true},{"text":"is larger than ","element":"span"},{"style":{"height":9.19},"width":33.98,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-75.png","element":"img","alt":" r1","inline":true},{"text":", i.e. ","element":"span"},{"style":{"height":19.01},"width":130.64,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-76.png","element":"img","alt":"˜d /∈ C2","inline":true},{"text":", the probability is the maximum since there is no ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"such that ","element":"span"},{"style":{"height":16.84},"width":217.83,"height":42.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-77.png","element":"img","alt":"Cd� ℓd2 ̸= ∅","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.4},"width":115.33,"height":43.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-78.png","element":"img","alt":" ˜d ∈ Cd","inline":true},{"text":". 
Therefore, it is best to use the opposite direction of ","element":"span"},{"style":{"height":13.19},"width":36.74,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-79.png","element":"img","alt":" d2","inline":true,"padRight":true},{"text":"since the intersection point of ","element":"span"},{"style":{"height":13.19},"width":67.74,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-80.png","element":"img","alt":" −d2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-81.png","element":"img","alt":" Ω","inline":true,"padRight":true},{"text":"is the furthest from ","element":"span"},{"style":{"height":9.19},"width":38.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-82.png","element":"img","alt":" x2","inline":true},{"text":".","element":"span"}],[{"style":{"width":"100%"},"width":1005,"height":823,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-83.png","element":"img"}],[{"text":"Fig. 16. Illustration of Lemma ","element":"figcaption","subtype":"caption"},{"href":"#id-92","text":"7 ","element":"a","subtype":"caption"},{"text":"in the 2-D case. In the figure, ","element":"figcaption","subtype":"caption"},{"style":{"height":8.9},"width":32.43,"height":22.25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-84.png","element":"img","alt":" d∗ ","inline":true,"padRight":true},{"text":"is to be found in ","element":"figcaption","subtype":"caption"},{"style":{"height":10.97},"width":77.68,"height":27.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-85.png","element":"img","alt":" Ω. 
r0","inline":true,"padRight":true},{"id":"id-93","text":"is the radius of the circumscribed sphere of the attraction basin ","element":"figcaption","subtype":"caption"},{"text":"of ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":241.9,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-86.png","element":"img","alt":" x′. x0, B(x′′, r0)","inline":true,"padRight":true},{"text":"form a cone ","element":"figcaption","subtype":"caption"},{"style":{"height":14.54},"width":526.25,"height":36.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-87.png","element":"img","alt":" C∗. For each di, take xi ∈ ∂Ω � di,","inline":true,"padRight":true},{"text":"the boundary of ","element":"figcaption","subtype":"caption"},{"style":{"height":14.5},"width":241.84,"height":36.25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-88.png","element":"img","alt":" B(xi, r1) and x0","inline":true,"padRight":true},{"text":"forms a cone ","element":"figcaption","subtype":"caption"},{"style":{"height":14.54},"width":322.14,"height":36.34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-89.png","element":"img","alt":" Ci. Then Ci �{x|x =","inline":true},{"style":{"height":13.6},"width":436.44,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-90.png","element":"img","alt":"x0 + td∗, x ∈ Ω, t > 0} = ∅, ∀i","inline":true},{"text":". 
It is clear that sampling a direction in ","element":"figcaption","subtype":"caption"},{"style":{"height":8.9},"width":42.12,"height":22.25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-91.png","element":"img","alt":" B∗","inline":true,"padRight":true},{"text":"is the best choice.","element":"figcaption","subtype":"caption"}],[{"text":"Similarly for ","element":"span"},{"style":{"height":13.38},"width":47.33,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-92.png","element":"img","alt":" C1","inline":true},{"text":", the best direction should be ","element":"span"},{"style":{"height":13.19},"width":67.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-93.png","element":"img","alt":" −d1","inline":true},{"text":". Taking both ","element":"span"},{"style":{"height":13.19},"width":36.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-94.png","element":"img","alt":" d1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":36.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-95.png","element":"img","alt":" d2","inline":true,"padRight":true},{"text":"into consideration, a direction is promising only if its intersection point with ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-96.png","element":"img","alt":" Ω","inline":true,"padRight":true},{"text":"is the furthest from both ","element":"span"},{"style":{"height":13.38},"width":38.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-97.png","element":"img","alt":" x1","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":13.39},"width":38.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-98.png","element":"img","alt":" x2","inline":true},{"text":". It is thus the best to sample a direction in the region spanned by ","element":"span"},{"style":{"height":13.19},"width":67.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-99.png","element":"img","alt":" −d1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":67.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-100.png","element":"img","alt":" −d2","inline":true},{"text":", i.e. ","element":"span"},{"style":{"height":10.99},"width":48.23,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-101.png","element":"img","alt":" B∗","inline":true},{"text":".","element":"span"}],[{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n > ","element":"span"},{"text":"2","element":"span"},{"text":", we have ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"directions with negative scores. Given the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"directions, we can construct a spanned space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= ","element":"span"},{"style":{"height":20.4},"width":195.09,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-102.png","element":"img","alt":"�Ni=1 αidi}","inline":true},{"text":". 
Depending on the signs of ","element":"span"},{"style":{"height":9.19},"width":36.49,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-103.png","element":"img","alt":" αi","inline":true},{"text":"’s, we have ","element":"span"},{"style":{"height":13.39},"width":46.92,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-104.png","element":"img","alt":" 2N","inline":true,"padRight":true},{"text":"sub-","element":"span"},{"text":"regions ","element":"span"},{"style":{"height":16.58},"width":292.12,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-105.png","element":"img","alt":" Bi, i = 1, · · · , 2N","inline":true},{"text":". We take ","element":"span"},{"style":{"height":10.98},"width":48.23,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-106.png","element":"img","alt":" B∗","inline":true,"padRight":true},{"text":"to be the region with all negative ","element":"span"},{"style":{"height":9.19},"width":36.49,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-107.png","element":"img","alt":" αi","inline":true},{"text":"’s.","element":"span"}],[{"text":"Similar to the analysis in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 2","element":"span"},{"text":", for each ","element":"span"},{"style":{"height":13.19},"width":31.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-108.png","element":"img","alt":" di","inline":true},{"text":", the point ","element":"span"},{"style":{"height":16},"width":144.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-109.png","element":"img","alt":"Ω ∩ ℓ−di","inline":true,"padRight":true},{"text":"is the furthest from 
","element":"span"},{"style":{"height":12.98},"width":33.78,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-110.png","element":"img","alt":" xi","inline":true},{"text":". A direction is promising only if its interaction point with ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-111.png","element":"img","alt":" Ω","inline":true,"padRight":true},{"text":"is the furthest to all ","element":"span"},{"style":{"height":12.98},"width":33.77,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-112.png","element":"img","alt":" xi","inline":true},{"text":"’s. Therefore, ","element":"span"},{"style":{"height":10.98},"width":48.22,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-113.png","element":"img","alt":"B∗","inline":true,"padRight":true},{"text":"is the best region for sampling among the ","element":"span"},{"style":{"height":13.38},"width":46.92,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-114.png","element":"img","alt":" 2N","inline":true,"padRight":true},{"text":"regions.","element":"span"}],[{"text":"A direction is sampled with equal probability in ","element":"span"},{"style":{"height":10.8},"width":29,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-115.png","element":"img","alt":" Ω","inline":true,"padRight":true},{"text":"in random sampling. On the contrary, using negative linear combination is sampling in ","element":"span"},{"style":{"height":10.98},"width":48.23,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-116.png","element":"img","alt":" B∗","inline":true},{"text":". 
Therefore, we have ","element":"span"},{"style":{"height":13.19},"width":135.54,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-117.png","element":"img","alt":" Pc > Pr","inline":true},{"text":".","element":"span"}],[{"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"has two local minima, we have explained why ","element":"span"},{"style":{"height":13.19},"width":135.54,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-118.png","element":"img","alt":" Pc > Pr","inline":true},{"text":". In case ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"has 3 or more local minimizers, the sampling procedure proceeds as follows. Assume we have sampled ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"directions, ","element":"span"},{"style":{"height":17.53},"width":125.6,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-119.png","element":"img","alt":" {di}Ni=1","inline":true},{"text":", from which at least one local minimizer ","element":"span"},{"style":{"height":9.19},"width":61.77,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-120.png","element":"img","alt":"xlast","inline":true,"padRight":true},{"text":"cannot be reached. It is not wise to sample within the cones induced by local minimizers we have visited. 
Instead, the negative rewards associated with these directions should be used as the coefficients of the linear combination when sampling directions for ","element":"span"},{"style":{"height":9.19},"width":61.77,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-121.png","element":"img","alt":"xlast","inline":true},{"text":". This combination is therefore guaranteed to be more efficient at sampling promising directions for ","element":"span"},{"style":{"height":9.19},"width":61.77,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-122.png","element":"img","alt":" xlast","inline":true,"padRight":true},{"text":"than random sampling.","element":"span"}]]},{"heading":"APPENDIX D","paragraphs":[[{"text":"To train (test) the learned policy, Gaussian mixture functions are used (cf. Eq. ","element":"span"},{"href":"#id-96","text":"26)","element":"a"},{"text":". We use ","element":"span"},{"style":{"height":16},"width":381.21,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-123.png","element":"img","alt":" Σ1 = diag{1, 1}, Σ2 =","inline":true},{"style":{"height":16},"width":615.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/15-124.png","element":"img","alt":"diag{1, 1}, µ1 = [0, 0]⊺; µ2 = [5, 5]⊺.","inline":true,"padRight":true},{"text":"for the 2-D problem. When","element":"span"}],[{"style":{"width":"70%"},"width":1448,"height":740,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-0.png","element":"img"}],[{"id":"id-94","text":"Fig. 17. Demonstration of promising direction and optimal direction. 
(a) shows that when ","element":"figcaption","subtype":"caption"},{"style":{"height":15.49},"width":141.2,"height":38.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-1.png","element":"img","alt":" x2 and x ˜d ","inline":true,"padRight":true},{"text":"are close to each other, ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":162.2,"height":31.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-2.png","element":"img","alt":" dup and dlow","inline":true,"padRight":true},{"text":"are close too. (b) shows that ","element":"figcaption","subtype":"caption"},{"text":"when the distance between ","element":"figcaption","subtype":"caption"},{"style":{"height":15.49},"width":143.12,"height":38.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-3.png","element":"img","alt":" x2 and x ˜d ","inline":true,"padRight":true},{"text":"exceeds ","element":"figcaption","subtype":"caption"},{"style":{"height":7.37},"width":30.29,"height":18.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-4.png","element":"img","alt":" r1","inline":true},{"text":", the distance between ","element":"figcaption","subtype":"caption"},{"style":{"height":14.64},"width":190.12,"height":36.61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-5.png","element":"img","alt":" xdup and xdlow ","inline":true,"padRight":true},{"text":"reaches its maximum.","element":"figcaption","subtype":"caption"}],[{"text":"testing, ","element":"span"},{"style":{"height":16},"width":856.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-6.png","element":"img","alt":" Σ1 = diag{1, 1, 1, 8, 8}, Σ2 = diag{1, 1, 1, 8, 8}","inline":true},{"text":", ","element":"span"},{"style":{"height":16},"width":851.71,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-7.png","element":"img","alt":"µ1 = [0, 0, 0, 0, 0]⊺, µ2 = [−5, −5, −5, 
−5, −5]⊺","inline":true,"padRight":true},{"text":"for the 5-D problem. When training, the following settings, with different means and covariances, are applied in Table ","element":"span"},{"href":"#id-81","text":"III.","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"order 1-4, problem is in 2-D, ","element":"span"},{"style":{"height":16},"width":329.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-8.png","element":"img","alt":" Σ1 = diag{1, 8}","inline":true},{"text":", ","element":"span"},{"style":{"height":16},"width":924.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-9.png","element":"img","alt":"Σ2 = diag{1, 3}, µ1 = [0, 0]⊺; µ2 =","inline":true},{"style":{"height":16},"width":491.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-10.png","element":"img","alt":"[7, 7]⊺, [5, 7]⊺, [3, 7]⊺, [4, 7]⊺","inline":true,"padRight":true},{"text":"respectively;","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"order 5, problem is in 5-D, ","element":"span"},{"text":"Σ","element":"span"},{"style":{"height":16},"width":383.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-11.png","element":"img","alt":"1 = diag{1, 1, 1, 8, 8}","inline":true},{"text":", ","element":"span"},{"text":"Σ","element":"span"},{"style":{"height":16},"width":704.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-12.png","element":"img","alt":"2 = diag{1, 1, 1, 8, 8}, µ1 = [0, 0, 0, 0,","inline":true,"padRight":true},{"text":"0]","element":"span"},{"style":{"height":13.78},"width":152.88,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-13.png","element":"img","alt":"⊺, µ2 =","inline":true,"padRight":true},{"text":"[5","element":"span"},{"style":{"fontStyle":"italic"},"text":", 
","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5]","element":"span"},{"style":{"height":7.2},"width":18,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-14.png","element":"img","alt":"⊺","inline":true},{"text":";","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"order 6, problem is in 5-D, ","element":"span"},{"text":"Σ","element":"span"},{"style":{"height":16},"width":383.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-15.png","element":"img","alt":"1 = diag{1, 1, 1, 8, 8}","inline":true},{"text":", ","element":"span"},{"text":"Σ","element":"span"},{"style":{"height":16},"width":704.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-16.png","element":"img","alt":"2 = diag{1, 1, 1, 8, 8}, µ1 = [0, 0, 0, 0,","inline":true,"padRight":true},{"text":"0]","element":"span"},{"style":{"height":13.79},"width":152.88,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-17.png","element":"img","alt":"⊺, µ2 =","inline":true,"padRight":true},{"text":"[4","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"4","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", 
","element":"span"},{"text":"5]","element":"span"},{"style":{"height":7.2},"width":18,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-18.png","element":"img","alt":"⊺","inline":true},{"text":";","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"order 7, problem is in 5-D, ","element":"span"},{"text":"Σ","element":"span"},{"style":{"height":16},"width":383.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-19.png","element":"img","alt":"1 = diag{1, 1, 1, 8, 8}","inline":true},{"text":", ","element":"span"},{"text":"Σ","element":"span"},{"style":{"height":16},"width":704.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-20.png","element":"img","alt":"2 = diag{1, 1, 1, 8, 8}, µ1 = [0, 0, 0, 0,","inline":true,"padRight":true},{"text":"0]","element":"span"},{"style":{"height":13.78},"width":152.88,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-21.png","element":"img","alt":"⊺, µ2 =","inline":true,"padRight":true},{"text":"[3","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5]","element":"span"},{"style":{"height":7.2},"width":18,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-22.png","element":"img","alt":"⊺","inline":true},{"text":";","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"order 8, problem is in 5-D, 
","element":"span"},{"text":"Σ","element":"span"},{"style":{"height":16},"width":383.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-23.png","element":"img","alt":"1 = diag{1, 1, 1, 1, 1}","inline":true},{"text":", ","element":"span"},{"text":"Σ","element":"span"},{"style":{"height":16},"width":704.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-24.png","element":"img","alt":"2 = diag{1, 1, 1, 8, 8}, µ1 = [0, 0, 0, 0,","inline":true,"padRight":true},{"text":"0]","element":"span"},{"style":{"height":13.79},"width":152.88,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-25.png","element":"img","alt":"⊺, µ2 =","inline":true,"padRight":true},{"text":"[5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5]","element":"span"},{"style":{"height":7.2},"width":18,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-26.png","element":"img","alt":"⊺","inline":true},{"text":";","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"order 9, problem is in 5-D, ","element":"span"},{"text":"Σ","element":"span"},{"style":{"height":16},"width":383.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-27.png","element":"img","alt":"1 = diag{1, 1, 1, 1, 1}","inline":true},{"text":", ","element":"span"},{"text":"Σ","element":"span"},{"style":{"height":16},"width":704.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-28.png","element":"img","alt":"2 = diag{1, 1, 1, 8, 8}, µ1 = [0, 0, 0, 
0,","inline":true,"padRight":true},{"text":"0]","element":"span"},{"style":{"height":13.78},"width":152.88,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-29.png","element":"img","alt":"⊺, µ2 =","inline":true,"padRight":true},{"text":"[3","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5]","element":"span"},{"style":{"height":7.2},"width":18,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2003.04521/images/16-30.png","element":"img","alt":"⊺","inline":true},{"text":";","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]