36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"1711.08172","publisher":"arxiv","paperJSON":{"title":"Run-and-Inspect Method for Nonconvex Optimization and Global Optimality Bounds for R-Local Minimizers","paperID":"1711.08172","avgLineHeight":11.44,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Many optimization algorithms converge to stationary points. When the underlying problem is nonconvex, they may get trapped at local minimizers and occasionally stagnate near saddle points. We propose the Run-and-Inspect Method, which adds an “inspect” phase to existing algorithms that helps escape from non-global stationary points. The inspection samples a set of points in a radius ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"around the current point. When a sample point yields a sufficient decrease in the objective, we resume an existing algorithm from that point. If no sufficient decrease is found, the current point is called an approximate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer. We show that an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer is globally optimal, up to a specific error depending on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":", if the objective function can be implicitly decomposed into a smooth convex function plus a restricted function that is possibly nonconvex, nonsmooth. Therefore, for such nonconvex objective functions, verifying global optimality is fundamentally easier. For high-dimensional problems, we introduce blockwise inspections to overcome the curse of dimensionality while still maintaining optimality bounds up to a factor equal to the number of blocks. 
Our method performs well on a set of artificial and realistic nonconvex problems by coupling with gradient descent, coordinate descent, EM, and prox-linear algorithms.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Keywords ","element":"span"},{"text":"R-local minimizer, Run-and-Inspect Method, nonconvex optimization, global minimum, global optimality","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Mathematics Subject Classification (2000) ","element":"span"},{"text":"90C26 ","element":"span"},{"style":{"height":4.8},"width":11,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/0-0.png","element":"img","alt":" ·","inline":true,"padRight":true},{"text":"90C30 ","element":"span"},{"style":{"height":4.8},"width":11,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/0-1.png","element":"img","alt":" ·","inline":true,"padRight":true},{"text":"49M30 ","element":"span"},{"style":{"height":4.8},"width":11,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/0-2.png","element":"img","alt":" ·","inline":true,"padRight":true},{"text":"65K05","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"This paper introduces and analyzes ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizers ","element":"span"},{"text":"in a class of nonconvex optimization and develops a Run-and-Inspect Method to find them. 
Consider a possibly nonconvex minimization problem:","element":"span"}],[{"id":"id-26","style":{"width":"64%"},"width":1136,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/0-3.png","element":"img"}],[{"text":"where the variable ","element":"span"},{"style":{"height":12.03},"width":119.54,"height":30.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/0-4.png","element":"img","alt":" x ∈ Rn","inline":true,"padRight":true},{"text":"can be decomposed into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"blocks ","element":"span"},{"style":{"height":13.2},"width":260.56,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/0-5.png","element":"img","alt":" x1, ..., xs, s ≥ 1","inline":true},{"text":". We assume ","element":"span"},{"style":{"height":13.04},"width":141.15,"height":32.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/0-6.png","element":"img","alt":" xi ∈ Rni","inline":true},{"text":". 
We call a point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/0-7.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer for some ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R > ","element":"span"},{"text":"0 ","element":"span"},{"text":"if it attains the minimum of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"within the ball with center ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/0-8.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"and radius ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":".","element":"span"}],[{"text":"This work of Y. Chen is supported in part by Tsinghua Xuetang Mathematics Program and Top Open Program for his short-term visit to UCLA. The work of Y. Sun and W. Yin is supported in part by NSF grant DMS-1720237 and ONR grant N000141712162.","element":"span"}],[{"style":{"width":"65%"},"width":1157,"height":258,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/0-9.png","element":"img"}],[{"text":"In nonconvex minimization, it is relatively cheap to find a local minimizer but difficult to obtain a global minimizer. For a given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R > ","element":"span"},{"text":"0","element":"span"},{"text":", the difficulty of finding an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer lies between those two. 
Informally, they have the following relationships: for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R > ","element":"span"},{"text":"0","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"64%"},"width":1133,"height":209,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-0.png","element":"img"}],[{"text":"We are interested in nonconvex problems for which the last “","element":"span"},{"style":{"height":12.4},"width":30,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-1.png","element":"img","alt":"⊇","inline":true},{"text":"” holds with “=,” indicating that any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer (for a sufficiently large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":") is global. This is possible, for example, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"is the sum of a quadratic function and a sinusoidal oscillation:","element":"span"}],[{"id":"id-0","style":{"width":"67%"},"width":1196,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.2},"width":98.32,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-3.png","element":"img","alt":" x ∈ R","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.2},"width":130.84,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-4.png","element":"img","alt":" a, b ∈ R","inline":true},{"text":". 
The range of oscillation is specified by amplitude ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"and frequency ","element":"span"},{"style":{"height":18.94},"width":17,"height":47.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-5.png","element":"img","alt":"b2","inline":true},{"text":". We use ","element":"span"},{"style":{"height":18.54},"width":66.24,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-6.png","element":"img","alt":"− 12b","inline":true,"padRight":true},{"text":"to shift its phase so that the minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"is ","element":"span"},{"style":{"height":11.63},"width":112.82,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-7.png","element":"img","alt":" x∗ = 0","inline":true},{"text":". We also add ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"to level the minimal objective ","element":"span"},{"text":"at ","element":"span"},{"style":{"height":15.63},"width":173.36,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-8.png","element":"img","alt":" F(x∗) = 0","inline":true},{"text":".","element":"span"}],[{"text":"An example of (","element":"span"},{"href":"#id-0","text":"2","element":"a"},{"text":") with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"3 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"= 3 ","element":"span"},{"text":"is depicted in Figure 
","element":"span"},{"href":"#id-1","text":"1","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"36%"},"width":640,"height":523,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-9.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Fig. 1 ","element":"figcaption","subtype":"caption"},{"id":"id-1","style":{"fontStyle":"italic"},"text":"F","element":"figcaption","subtype":"caption"},{"text":"(","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"x","element":"figcaption","subtype":"caption"},{"text":") ","element":"figcaption","subtype":"caption"},{"text":"in (","element":"figcaption","subtype":"caption"},{"href":"#id-0","text":"2","element":"a","subtype":"caption"},{"text":") with ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"a ","element":"figcaption","subtype":"caption"},{"text":"= 0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"3","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":", b ","element":"figcaption","subtype":"caption"},{"text":"= 3","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"Observe that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"has many local minimizers, and its only global minimizer is ","element":"span"},{"style":{"height":11.63},"width":114.38,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-10.png","element":"img","alt":" x∗ = 0","inline":true},{"text":". 
Near each local minimizer ","element":"span"},{"style":{"height":9.2},"width":22.38,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-11.png","element":"img","alt":" ¯x","inline":true},{"text":", we look for an escape point ","element":"span"},{"style":{"height":15.6},"width":318.85,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-12.png","element":"img","alt":" x ∈ [¯x − R, ¯x + R]","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":15.6},"width":208.78,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-13.png","element":"img","alt":" f(x) < f(¯x)","inline":true},{"text":". We claim that by taking ","element":"span"},{"style":{"height":18.54},"width":301.43,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-14.png","element":"img","alt":" R ≥ min{2√a, 2b}","inline":true},{"text":", such an escape point exists for every local minimizer ","element":"span"},{"style":{"height":9.2},"width":22.38,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-15.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"except ","element":"span"},{"style":{"height":11.63},"width":112.99,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-16.png","element":"img","alt":" ¯x = x∗","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider minimizing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-0","text":"2","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
If ","element":"span"},{"style":{"height":18.54},"width":301.44,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-17.png","element":"img","alt":" R ≥ min{2√a, 2b}","inline":true},{"style":{"fontStyle":"italic"},"text":", then the only point ","element":"span"},{"style":{"height":9.2},"width":22.38,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-18.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"that satisfies ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the condition","element":"span"}],[{"id":"id-2","style":{"width":"68%"},"width":1215,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-19.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"is the global minimizer ","element":"span"},{"style":{"height":11.63},"width":112.83,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-20.png","element":"img","alt":" x∗ = 0","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof ","element":"span"},{"text":"Suppose ","element":"span"},{"style":{"height":14},"width":93.9,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-21.png","element":"img","alt":" ¯x ̸= 0","inline":true},{"text":". Without loss of generality we can further assume ","element":"span"},{"style":{"height":10.8},"width":93.94,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-22.png","element":"img","alt":" ¯x > 0","inline":true},{"text":". 
Recall the global minimizer is ","element":"span"},{"style":{"height":11.63},"width":112.83,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-23.png","element":"img","alt":" x∗ = 0","inline":true},{"text":".","element":"span"}],[{"text":"i) If ","element":"span"},{"style":{"height":15.2},"width":146.82,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-24.png","element":"img","alt":" ¯x ≤ 2√a","inline":true},{"text":", then ","element":"span"},{"style":{"height":15.63},"width":324.28,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-25.png","element":"img","alt":" x∗ ∈ [¯x − R, ¯x + R]","inline":true,"padRight":true},{"text":"gives ","element":"span"},{"style":{"height":15.6},"width":154.44,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-26.png","element":"img","alt":" F(¯x) = 0","inline":true},{"text":", so ","element":"span"},{"style":{"height":9.2},"width":22.38,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-27.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"is the global minimizer. Otherwise, we have ","element":"span"},{"style":{"height":15.69},"width":348.06,"height":39.22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-28.png","element":"img","alt":"F(¯x − 2√a) < F(¯x).","inline":true,"padRight":true},{"text":"Indeed,","element":"span"}],[{"style":{"width":"64%"},"width":1144,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/1-29.png","element":"img"}],[{"style":{"width":"100%"},"width":1768,"height":271,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-0.png","element":"img"}],[{"text":"This leads to the contradiction similar to part i). 
","element":"span"},{"style":{"height":0},"width":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-1.png","element":"img","alt":"⊓⊔","inline":true}],[{"text":"Proposition ","element":"span"},{"href":"#id-2","text":"1 ","element":"a"},{"text":"indicates that we can find ","element":"span"},{"style":{"height":11.63},"width":39.06,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-2.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"text":"of this problem by locating an approximate local minimizer ","element":"span"},{"style":{"height":14.03},"width":40.06,"height":35.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-3.png","element":"img","alt":"¯xk","inline":true,"padRight":true},{"text":"(using a proper algorithm) and then inspecting a small region near ","element":"span"},{"style":{"height":14.03},"width":40.06,"height":35.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-4.png","element":"img","alt":" ¯xk","inline":true,"padRight":true},{"text":"(e.g., by sampling a set of points). 
Once the inspection finds a point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"such that ","element":"span"},{"style":{"height":18.03},"width":227.06,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-5.png","element":"img","alt":" f(x) < f(¯xk)","inline":true},{"text":", resume the algorithm from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and let it find the next approximate local minimizer ","element":"span"},{"style":{"height":14.03},"width":81.9,"height":35.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-6.png","element":"img","alt":" ¯xk+1","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":18.03},"width":263.42,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-7.png","element":"img","alt":" f(¯xk+1) ≤ f(x)","inline":true},{"text":". Alternate such running and inspection steps until, at a local minimizer ","element":"span"},{"style":{"height":14.03},"width":50.05,"height":35.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-8.png","element":"img","alt":" ¯xK","inline":true},{"text":", the inspection fails to find a better point nearby. Then, ","element":"span"},{"style":{"height":14.03},"width":50.05,"height":35.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-9.png","element":"img","alt":" ¯xK","inline":true,"padRight":true},{"text":"must be an approximate global solution. 
We call this procedure the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Run-and-Inspect Method","element":"span"},{"text":".","element":"span"}],[{"text":"The coupling of “run” and “inspect” is simple and flexible because, no matter which point the “run” phase generates, be it a saddle point, local minimizer, or global minimizer, the “inspect” phase will either improve upon it or verify its optimality. Because saddle points are easier to escape from than non-global local minimizers, we hereafter ignore saddle points in our discussion. Related saddle-point-avoiding algorithms are reviewed below along with other literature.","element":"span"}],[{"text":"Sample-based inspection works in low dimensions. However, it suffers from the curse of dimensionality, as the number of sample points increases exponentially with the dimension. For high-dimensional problems, the cost will be prohibitive. To address this issue, we define the blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizer and break the inspection into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"blocks of low dimensions: ","element":"span"},{"style":{"height":18.03},"width":346.47,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-10.png","element":"img","alt":" x = [xT1 xT2 · · · xTs ]T","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":13.04},"width":141.15,"height":32.6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-11.png","element":"img","alt":" xi ∈ Rni","inline":true},{"text":". 
We call a point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-12.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizer","element":"span"},{"text":", where ","element":"span"},{"style":{"height":18.03},"width":444.04,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-13.png","element":"img","alt":" R = [R1 R2 · · · Rs]T > 0","inline":true},{"text":", if it satisfies","element":"span"}],[{"style":{"width":"80%"},"width":1424,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-14.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, R","element":"span"},{"text":") ","element":"span"},{"text":"is a closed ball with center ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and radius ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":". To locate a blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizer, the inspection is applied to every block while fixing the others. 
Its cost grows linearly in the number of blocks when the size of every block is fixed.","element":"span"}],[{"text":"This paper studies ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local and blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizers and develops their global optimality bounds for a class of functions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"that can be written as the sum of a smooth, strongly convex function and a restricted nonconvex function. (Our analysis assumes a property weaker than strong convexity.) Roughly speaking, the landscape of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"is convex at a coarse level, and it can have many local minima. (Arguably, if the landscape of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"is overall nonconvex, minimizing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"is fundamentally difficult.)","element":"span"}],[{"text":"This decomposition is implicit and only used to prove bounds. Our Run-and-Inspect Method, which does ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"text":"use the decomposition, can still provably find a solution that has a bounded distance to a global minimizer and an objective value whose gap to the global minimum is bounded. Both bounds can be zero with a finite ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":".","element":"span"}],[{"text":"The radius ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"affects theoretical bounds, solution quality, and inspection cost. 
If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"is very small, the inspections will be cheap, but the solution returned by our method will be less likely to be global. On the other hand, an excessively large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"leads to expensive inspection and is unnecessary since the goal of inspection is to escape local minima rather than decrease the objective. Theoretically, Theorem ","element":"span"},{"href":"#id-3","text":"3 ","element":"a"},{"text":"indicates a proper choice ","element":"span"},{"style":{"height":18},"width":212.13,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-15.png","element":"img","alt":" R = 2√(β/L)","inline":true},{"text":", where ","element":"span"},{"style":{"height":13.6},"width":68.19,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-16.png","element":"img","alt":" β, L","inline":true,"padRight":true},{"text":"are parameters of the functions in the implicit decomposition. Furthermore, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"is larger than a certain threshold given in Theorem ","element":"span"},{"href":"#id-4","text":"4","element":"a"},{"text":", then the point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-17.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"returned by our method must be a global minimizer. 
However, as this value and this threshold are associated with the implicit decomposition, they are typically unavailable to the user.","element":"span"}],[{"text":"One can imagine that a good practical choice of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"would be the radius of the global-minimum valley, assuming this valley is larger than all other local-minimum valleys. This choice is hard to guess, too. Another choice of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"is roughly inversely proportional to ","element":"span"},{"style":{"height":15.6},"width":94.88,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-18.png","element":"img","alt":" ∥∇f∥","inline":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is the smooth convex component in the implicit decomposition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":". It is possible to estimate ","element":"span"},{"style":{"height":15.6},"width":94.88,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-19.png","element":"img","alt":" ∥∇f∥","inline":true,"padRight":true},{"text":"using the maximum over an area of ","element":"span"},{"style":{"height":15.6},"width":102.13,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-20.png","element":"img","alt":" ∥∇F∥","inline":true},{"text":", which unfortunately itself requires a sampling radius. (","element":"span"},{"style":{"height":15.6},"width":102.13,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/2-21.png","element":"img","alt":"∥∇F∥","inline":true,"padRight":true},{"text":"is zero at any local minimizer, so its local value is useless.) 
However, this result indicates that local minimizers that are far from the global minimizer are easier to escape from.","element":"span"}],[{"text":"We empirically observe that it is both fast and reliable to use a large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"and sample the ball ","element":"span"},{"style":{"height":15.6},"width":132.1,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/3-0.png","element":"img","alt":" B(¯x, R)","inline":true,"padRight":true},{"text":"outside-in, for example, to sample on a set of rings of radius ","element":"span"},{"style":{"height":13.6},"width":520.39,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/3-1.png","element":"img","alt":" R, R − ∆R, R − 2∆R, . . . > 0","inline":true},{"text":". In most cases, a point on the first couple of rings is quickly found, and we escape to that point. The smallest ring is almost never sampled except when ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/3-2.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"is already an (approximate) global minimizer. Although the final inspection around a global minimizer is generally unavoidable, global minimizers in problems such as compressed sensing and matrix decomposition can be identified without inspection because they have the desired structure or attained a lower bound to the objective value. Anyway, it appears that choosing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"is ad hoc but not difficult. 
Throughout our numerical experiments, we use ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"and obtain excellent results consistently.","element":"span"}],[{"text":"The exposition of this paper is limited to deterministic methods though it is possible to apply stochastic techniques. We can undoubtedly adopt stochastic approximation in the “run” phase when, for example, the objective function has a large-sum structure. Also, if the problem has a coordinate-friendly structure [","element":"span"},{"href":"#id-5","text":"16","element":"a"},{"text":"], we can randomly choose a coordinate, or a block of coordinates, to update each time. Another direction worth pursuing is stochastic sampling during the “inspect” phase. These stochastic techniques are attractive in specific settings, but we focus on non-stochastic techniques and global guarantees in this paper.","element":"span"}],[{"text":"1.1 Related work","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1.1.1 No spurious local minimum","element":"span"}],[{"text":"For certain nonconvex problems, a local minimum is always global or good enough. Examples include tensor decomposition [","element":"span"},{"href":"#id-6","text":"6","element":"a"},{"text":"], matrix completion [","element":"span"},{"href":"#id-7","text":"7","element":"a"},{"text":"], phase retrieval [","element":"span"},{"href":"#id-8","text":"22","element":"a"},{"text":"], and dictionary learning [","element":"span"},{"href":"#id-8","text":"21","element":"a"},{"text":"] under proper assumptions. When those assumptions are violated to a moderate amount, spurious local minima may appear and be possibly easy to escape. 
We will inspect them in our future work.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1.1.2 First-order methods, derivative-free method, and trust-region method","element":"span"}],[{"text":"For nonconvex optimization, there has been recent work on first-order methods that can guarantee convergence to a stationary point. Examples include the block coordinate update method [","element":"span"},{"href":"#id-8","text":"25","element":"a"},{"text":"], ADMM for nonconvex optimization [","element":"span"},{"href":"#id-8","text":"23","element":"a"},{"text":"], the accelerated gradient algorithm [","element":"span"},{"href":"#id-9","text":"8","element":"a"},{"text":"], the stochastic variance reduction method [","element":"span"},{"href":"#id-10","text":"18","element":"a"},{"text":"], and so on.","element":"span"}],[{"text":"Because the “inspect” phase of our method uses a radius, it is seemingly related to the trust-region method [","element":"span"},{"href":"#id-11","text":"3","element":"a"},{"text":",","element":"span"},{"href":"#id-12","text":"12","element":"a"},{"text":"] and derivative-free method [","element":"span"},{"href":"#id-13","text":"4","element":"a"},{"text":"], both of which also use a radius at each step. 
However, the latter methods are not specifically designed to escape from a non-global local minimizer.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1.1.3 Avoiding saddle points","element":"span"}],[{"text":"A recent line of work aims to avoid saddle points and converge to an ","element":"span"},{"style":{"height":0},"width":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/3-3.png","element":"img","alt":" ϵ","inline":true},{"text":"-second-order stationary point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/3-4.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"that satisfies","element":"span"}],[{"id":"id-14","style":{"width":"71%"},"width":1259,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/3-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":9.6},"width":20,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/3-6.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"is the Lipschitz constant of ","element":"span"},{"style":{"height":17.63},"width":134.64,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/3-7.png","element":"img","alt":" ∇2F(x)","inline":true},{"text":". 
Their assumption is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"strict saddle ","element":"span"},{"text":"property, that is, a point satisfying (","element":"span"},{"href":"#id-14","text":"5","element":"a"},{"text":") for some ","element":"span"},{"style":{"height":13.2},"width":91.92,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/3-8.png","element":"img","alt":" ρ > 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":87.65,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/3-9.png","element":"img","alt":" ϵ > 0","inline":true,"padRight":true},{"text":"must be an approximate local minimizer. On the algorithmic side, there are second-order algorithms [","element":"span"},{"href":"#id-15","text":"13","element":"a"},{"text":",","element":"span"},{"href":"#id-16","text":"15","element":"a"},{"text":"] and first-order stochastic methods [","element":"span"},{"href":"#id-6","text":"6","element":"a"},{"text":",","element":"span"},{"href":"#id-17","text":"9","element":"a"},{"text":",","element":"span"},{"href":"#id-18","text":"14","element":"a"},{"text":"] that can escape saddle points. The second-order algorithms use Hessian information and thus are more expensive at each iteration in high dimensions. Our method can also avoid saddle points.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1.1.4 Simulated annealing","element":"span"}],[{"text":"Simulated annealing (SA) [","element":"span"},{"href":"#id-19","text":"11","element":"a"},{"text":"] is a classical method in global optimization that can be interpreted through thermodynamic principles. 
SA uses a Markov chain with a stationary distribution ","element":"span"},{"style":{"height":17.75},"width":148.47,"height":44.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/4-0.png","element":"img","alt":" ∼ e^{−F(x)/T}","inline":true,"padRight":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is the temperature parameter. By decreasing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", the distribution tends to concentrate on the global minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":")","element":"span"},{"text":". However, it is difficult to know exactly when it converges, and the convergence rate can be extremely slow.","element":"span"}],[{"text":"SA can also be viewed as a method that samples the Gibbs distribution using Markov chain Monte Carlo (MCMC). Hence, we can apply SA in the “inspection” of our method. SA will generate more samples in preferred areas that are more likely to contain a better point, which, once found, stops the inspection. Because of the hit-and-run nature of our inspection, we do not need to wait for the SA dynamics to converge.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1.1.5 Flat minima in the training of neural networks","element":"span"}],[{"text":"Training a (deep) neural network involves nonconvex optimization. We do not necessarily need to find a global minimizer. A local minimizer will suffice if it generalizes well to data not used in training. 
There are many recent attempts [","element":"span"},{"href":"#id-20","text":"1","element":"a"},{"text":",","element":"span"},{"href":"#id-21","text":"2","element":"a"},{"text":",","element":"span"},{"href":"#id-22","text":"19","element":"a"},{"text":"] that investigate the optimization landscapes and propose methods to find local minima sitting in “rather flat valleys.”","element":"span"}],[{"text":"Paper [","element":"span"},{"href":"#id-20","text":"1","element":"a"},{"text":"] uses entropy-SGD iteration to favor flatter minima. It can be seen as a PDE-based smoothing technique [","element":"span"},{"href":"#id-21","text":"2","element":"a"},{"text":"], which shows that the optimization landscape becomes flatter after smoothing. It makes the theoretical analysis easier and provides explanations for many interesting phenomena in deep neural networks. But, as [","element":"span"},{"href":"#id-8","text":"24","element":"a"},{"text":"] has suggested, a better non-local quantity is required to go further.","element":"span"}],[{"text":"1.2 Notation","element":"span"}],[{"text":"Throughout the paper, ","element":"span"},{"style":{"height":15.6},"width":73.31,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/4-1.png","element":"img","alt":" ∥ · ∥","inline":true,"padRight":true},{"text":"denotes the Euclidean norm. Boldface lower-case letters (e.g., ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":") denote vectors. However, when a vector is a block in a larger vector, it is represented with a lower-case letter with a subscript, e.g., ","element":"span"},{"style":{"height":8.21},"width":34.06,"height":20.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/4-2.png","element":"img","alt":" xi","inline":true},{"text":".","element":"span"}],[{"text":"1.3 Organization","element":"span"}],[{"text":"The rest of this paper is organized as follows. 
Section ","element":"span"},{"text":"2 ","element":"span"},{"text":"presents the main analysis of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local and blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizers, and then introduces the Run-and-Inspect Method. Section ","element":"span"},{"text":"3 ","element":"span"},{"text":"presents numerical results of our Run-and-Inspect Method. Finally, Section ","element":"span"},{"text":"4 ","element":"span"},{"text":"concludes this paper.","element":"span"}]]},{"heading":"2 Main Results","paragraphs":[[{"text":"In sections ","element":"span"},{"href":"#id-23","text":"2.1","element":"a"},{"text":"–","element":"span"},{"href":"#id-24","text":"2.3","element":"a"},{"text":", we develop theoretical guarantees for our ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local and blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizers for a class of nonconvex problems. Then, in section ","element":"span"},{"href":"#id-25","text":"2.4","element":"a"},{"text":", we design algorithms to find those minimizers.","element":"span"}],[{"id":"id-23","text":"2.1 Global optimality bounds","element":"span"}],[{"text":"In this section, we investigate an approach toward deriving error bounds for a point with certain properties.","element":"span"}],[{"text":"Consider problem (","element":"span"},{"href":"#id-26","text":"1","element":"a"},{"text":"), and let ","element":"span"},{"style":{"height":11.63},"width":40.64,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/4-3.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"text":"denote one of its global minimizers. A global minimizer has many nice properties. Finding a global minimizer is equivalent to finding a point satisfying all these properties. 
Clearly, it is easier to develop algorithms that aim at finding a point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/4-4.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"satisfying only some of those properties. An example is that when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"is everywhere differentiable, ","element":"span"},{"style":{"height":15.63},"width":207.38,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/4-5.png","element":"img","alt":" ∇F(x∗) = 0","inline":true,"padRight":true},{"text":"is a necessary optimality condition. So, many first-order algorithms that produce a sequence ","element":"span"},{"style":{"height":14.03},"width":41.64,"height":35.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/4-6.png","element":"img","alt":" xk","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":18.03},"width":352.59,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/4-7.png","element":"img","alt":" ∥∇F(xk)∥ → 0 may","inline":true,"padRight":true},{"text":"converge to a global minimizer. 
Below, we focus on choosing the properties of ","element":"span"},{"style":{"height":11.63},"width":40.65,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-0.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"text":"so that a point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-1.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"satisfying the same properties will enjoy bounds on ","element":"span"},{"style":{"height":15.63},"width":241.89,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-2.png","element":"img","alt":" F(¯x) − F(x∗)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.63},"width":160.45,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-3.png","element":"img","alt":" ∥¯x − x∗∥","inline":true},{"text":". Of course, proper assumptions on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"are needed, which we will make as we proceed.","element":"span"}],[{"text":"Let us use ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Q ","element":"span"},{"text":"to represent a certain set of properties of ","element":"span"},{"style":{"height":11.63},"width":40.64,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-4.png","element":"img","alt":" x∗","inline":true},{"text":", and define","element":"span"}],[{"style":{"width":"66%"},"width":1170,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-5.png","element":"img"}],[{"text":"which includes ","element":"span"},{"style":{"height":11.63},"width":40.64,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-6.png","element":"img","alt":" x∗","inline":true},{"text":". 
For any point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-7.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"that also belongs to the set, we have","element":"span"}],[{"style":{"width":"35%"},"width":620,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-8.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"21%"},"width":376,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-9.png","element":"img"}],[{"text":"where diam(","element":"span"},{"style":{"height":14.46},"width":51.77,"height":36.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-10.png","element":"img","alt":"SQ","inline":true},{"text":") stands for the diameter of the set ","element":"span"},{"style":{"height":14.46},"width":51.77,"height":36.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-11.png","element":"img","alt":" SQ","inline":true},{"text":". 
Hence, the problem of constructing an error bound reduces to analyzing the set ","element":"span"},{"style":{"height":14.46},"width":51.77,"height":36.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-12.png","element":"img","alt":" SQ","inline":true,"padRight":true},{"text":"under certain assumptions on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"96%"},"width":1706,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-13.png","element":"img"}],[{"style":{"height":15.6},"width":244.1,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-14.png","element":"img","alt":"∥∇F(x)∥ ≤ δ","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":12},"width":109.3,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-15.png","element":"img","alt":" δ > 0","inline":true},{"text":". This choice is admissible since ","element":"span"},{"style":{"height":15.63},"width":353.09,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-16.png","element":"img","alt":" ∥∇F(x∗)∥ = 0 ≤ δ","inline":true},{"text":". For this choice, we have","element":"span"}],[{"style":{"width":"32%"},"width":581,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-17.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-27","style":{"width":"99%"},"width":1764,"height":268,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-18.png","element":"img"}],[{"text":"We use the term “implicit” because this decomposition is only used for analysis, not required by our Run-and-Inspect Method. 
Define the sets of the global minimizers of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"as, respectively,","element":"span"}],[{"id":"id-35","style":{"width":"28%"},"width":502,"height":139,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-19.png","element":"img"}],[{"text":"Below we make three assumptions on (","element":"span"},{"href":"#id-27","text":"7","element":"a"},{"text":"). The first and third assumptions are used throughout this section. Only some of our results require the second assumption.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is differentiable, and ","element":"span"},{"style":{"height":15.6},"width":109.2,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-20.png","element":"img","alt":" ∇f(x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-Lipschitz continuous.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfies the Polyak-Łojasiewicz (PL) inequality [","element":"span"},{"href":"#id-28","style":{"fontStyle":"italic"},"text":"17","element":"a"},{"style":{"fontStyle":"italic"},"text":"] with 
","element":"span"},{"style":{"height":13.2},"width":95.32,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-21.png","element":"img","alt":" µ > 0","inline":true},{"style":{"fontStyle":"italic"},"text":":","element":"span"}],[{"style":{"width":"74%"},"width":1325,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-22.png","element":"img"}],[{"text":"Given a point ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":", we define its projection","element":"span"}],[{"id":"id-29","style":{"width":"24%"},"width":428,"height":76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-23.png","element":"img"}],[{"text":"Then, the PL inequality (","element":"span"},{"href":"#id-29","text":"8","element":"a"},{"text":") yields the quadratic growth (QG) condition [","element":"span"},{"href":"#id-30","text":"10","element":"a"},{"text":"]:","element":"span"}],[{"id":"id-31","style":{"width":"77%"},"width":1362,"height":67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-24.png","element":"img"}],[{"text":"Clearly, (","element":"span"},{"href":"#id-29","text":"8","element":"a"},{"text":") and (","element":"span"},{"href":"#id-31","text":"9","element":"a"},{"text":") together imply","element":"span"}],[{"id":"id-32","style":{"width":"60%"},"width":1071,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/5-25.png","element":"img"}],[{"text":"Assumption ","element":"span"},{"href":"#id-29","text":"2 ","element":"a"},{"text":"ensures that the gradient of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"bounds its objective error.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 3 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfies ","element":"span"},{"style":{"height":15.6},"width":516.11,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-0.png","element":"img","alt":" |r(x) − r(y)| ≤ α∥x − y∥ + 2β","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in which ","element":"span"},{"style":{"height":13.6},"width":138.04,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-1.png","element":"img","alt":" α, β ≥ 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are constants.","element":"span"}],[{"text":"Assumption ","element":"span"},{"href":"#id-32","text":"3 ","element":"a"},{"text":"implies that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"is overall ","element":"span"},{"style":{"height":6.4},"width":25,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-2.png","element":"img","alt":" α","inline":true},{"text":"-Lipschitz continuous with additional oscillations up to ","element":"span"},{"style":{"height":13.6},"width":42.46,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-3.png","element":"img","alt":" 2β","inline":true},{"text":". In the implicit decomposition (","element":"span"},{"href":"#id-27","text":"7","element":"a"},{"text":"), though ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"can cause ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"to have non-global local minimizers, its impact on the overall landscape of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is limited. 
For example, the ","element":"span"},{"style":{"height":16.26},"width":250.89,"height":40.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-4.png","element":"img","alt":" ℓ_p^p (0 < p < 1)","inline":true,"padRight":true},{"text":"penalty in compressed sensing is ","element":"span"},{"text":"used to induce sparsity of solutions. It is nonconvex and satisfies our assumption","element":"span"}],[{"style":{"width":"52%"},"width":935,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-5.png","element":"img"}],[{"text":"In fact, many sparsity-inducing penalties satisfy Assumption ","element":"span"},{"href":"#id-32","text":"3","element":"a"},{"text":". Many of them are sharp near ","element":"span"},{"text":"0 ","element":"span"},{"text":"and thus not Lipschitz there. In Assumption ","element":"span"},{"href":"#id-32","text":"3","element":"a"},{"text":", ","element":"span"},{"style":{"height":13.6},"width":23,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-6.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"models their variation near ","element":"span"},{"text":"0 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":6.4},"width":25,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-7.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"controls their increase elsewhere.","element":"span"}],[{"text":"In section ","element":"span"},{"href":"#id-33","text":"2.2","element":"a"},{"text":", we will show that every ","element":"span"},{"style":{"height":14.43},"width":131.39,"height":36.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-8.png","element":"img","alt":" x∗ ∈ χ∗","inline":true,"padRight":true},{"text":"satisfies 
","element":"span"},{"style":{"height":15.63},"width":237.05,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-9.png","element":"img","alt":" ∥∇f(x∗)∥ ≤ δ","inline":true,"padRight":true},{"text":"for a universal ","element":"span"},{"style":{"height":11.2},"width":18,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-10.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"that depends on ","element":"span"},{"style":{"height":13.6},"width":110.51,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-11.png","element":"img","alt":" α, β, L","inline":true},{"text":". So, we choose the condition","element":"span"}],[{"id":"id-34","style":{"width":"59%"},"width":1047,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-12.png","element":"img"}],[{"text":"To derive the error bound, we introduce yet another assumption:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Assumption 4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The set ","element":"span"},{"style":{"height":16.98},"width":42.26,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-13.png","element":"img","alt":" χ∗f","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is bounded. 
That is, there exists ","element":"span"},{"style":{"height":12.8},"width":122.55,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-14.png","element":"img","alt":" M ≥ 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that, for any ","element":"span"},{"style":{"height":16.98},"width":164.01,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-15.png","element":"img","alt":" x, y ∈ χ∗f","inline":true},{"style":{"fontStyle":"italic"},"text":", we ","element":"span"},{"style":{"fontStyle":"italic"},"text":"have ","element":"span"},{"style":{"height":15.6},"width":227.25,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-16.png","element":"img","alt":" ∥x − y∥ ≤ M","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"has a unique global minimizer, we have ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= 0 ","element":"span"},{"text":"in Assumption ","element":"span"},{"href":"#id-34","text":"4","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Take Assumptions ","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"href":"#id-29","style":{"fontStyle":"italic"},"text":"2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-32","style":{"fontStyle":"italic"},"text":"3","element":"a"},{"style":{"fontStyle":"italic"},"text":", and assume that all points in ","element":"span"},{"style":{"height":14.43},"width":41.26,"height":36.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-17.png","element":"img","alt":" 
χ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"have property ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Q ","element":"span"},{"style":{"fontStyle":"italic"},"text":"in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-34","text":"11","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":". Then, the following properties hold for every ","element":"span"},{"style":{"height":14.46},"width":122.98,"height":36.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-18.png","element":"img","alt":" ¯x ∈ SQ","inline":true},{"text":":","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. ","element":"span"},{"style":{"height":23.39},"width":368.41,"height":58.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-19.png","element":"img","alt":" F(¯x) − F∗ ≤ δ²/(2µ) + 2β","inline":true},{"style":{"fontStyle":"italic"},"text":", if ","element":"span"},{"style":{"height":10},"width":96.81,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-20.png","element":"img","alt":" α = 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in Assumption ","element":"span"},{"href":"#id-32","style":{"fontStyle":"italic"},"text":"3","element":"a"},{"style":{"fontStyle":"italic"},"text":";","element":"span"}],[{"id":"id-37","style":{"fontStyle":"italic"},"text":"2. 
","element":"span"},{"style":{"height":21.74},"width":316.66,"height":54.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-21.png","element":"img","alt":" d(¯x, χ∗) ≤ 2δµ + M","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-32","style":{"height":23.39},"width":555.55,"height":58.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-22.png","element":"img","alt":" F(¯x) − F ∗ ≤ δ2+2αδµ + αM + 2β","inline":true},{"style":{"fontStyle":"italic"},"text":", if ","element":"span"},{"style":{"height":12.4},"width":96.84,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-23.png","element":"img","alt":" α ≥ 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and Assumption ","element":"span"},{"href":"#id-34","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof ","element":"span"},{"text":"To show part 1, we have","element":"span"}],[{"style":{"width":"69%"},"width":1230,"height":171,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-24.png","element":"img"}],[{"text":"Part 2: Since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"satisfies the PL inequality (","element":"span"},{"href":"#id-29","text":"8","element":"a"},{"text":") and ","element":"span"},{"style":{"height":14.46},"width":122.97,"height":36.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-25.png","element":"img","alt":" ¯x ∈ SQ","inline":true},{"text":", we have","element":"span"}],[{"style":{"width":"27%"},"width":493,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-26.png","element":"img"}],[{"text":"By choosing an 
","element":"span"},{"style":{"height":14.43},"width":131.38,"height":36.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-27.png","element":"img","alt":" x∗ ∈ χ∗","inline":true,"padRight":true},{"text":"and noticing ","element":"span"},{"style":{"height":15.69},"width":141.89,"height":39.22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-28.png","element":"img","alt":" x∗ ∈ SQ","inline":true},{"text":", we also have ","element":"span"},{"style":{"height":21.74},"width":232.15,"height":54.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-29.png","element":"img","alt":" d(x∗, χ∗f) ≤ δµ","inline":true,"padRight":true},{"text":"and thus","element":"span"}],[{"style":{"width":"45%"},"width":807,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-30.png","element":"img"}],[{"text":"Below we let ","element":"span"},{"style":{"height":11.4},"width":45.64,"height":28.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-31.png","element":"img","alt":" ¯xP","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.82},"width":45.64,"height":37.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-32.png","element":"img","alt":" x∗P","inline":true},{"text":", respectively, denote the projections of ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-33.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.63},"width":40.64,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-34.png","element":"img","alt":" x∗","inline":true,"padRight":true},{"text":"onto the set 
","element":"span"},{"style":{"height":16.98},"width":42.26,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-35.png","element":"img","alt":" χ∗f","inline":true},{"text":". Since ","element":"span"},{"style":{"height":15.6},"width":141.39,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-36.png","element":"img","alt":" f(¯xP) =","inline":true},{"style":{"height":15.63},"width":100.46,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-37.png","element":"img","alt":"f(x∗P)","inline":true},{"text":", we obtain","element":"span"}],[{"style":{"width":"77%"},"width":1367,"height":260,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-38.png","element":"img"}],[{"text":"In the theorem above, we have constructed global optimality bounds for ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/6-39.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"obeying ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Q","element":"span"},{"text":". In the next two subsections, we show that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizers, which include global minimizers, do obey ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Q ","element":"span"},{"text":"under mild conditions. Hence, the bounds apply to any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer.","element":"span"}],[{"id":"id-33","text":"2.2 R-local minimizers","element":"span"}],[{"text":"In this section, we define and analyze ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizers. We discuss its blockwise version in section ","element":"span"},{"href":"#id-24","text":"2.3","element":"a"},{"text":". 
Throughout this subsection, we assume that ","element":"span"},{"style":{"height":15.6},"width":179.05,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-0.png","element":"img","alt":" R ∈ (0, ∞]","inline":true},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", R","element":"span"},{"text":") ","element":"span"},{"text":"is a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"closed ","element":"span"},{"text":"ball centered at ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"with radius ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-1.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is called a standard ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"if it satisfies","element":"span"}],[{"style":{"width":"60%"},"width":1071,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-2.png","element":"img"}],[{"text":"Obviously an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer is a local minimizer, and when 
","element":"span"},{"style":{"height":10.4},"width":127.1,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-3.png","element":"img","alt":" R = ∞","inline":true},{"text":", it is a global minimizer. Conversely, a global minimizer is always an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer. We first bound the gradient of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"at an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer so that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Q ","element":"span"},{"text":"in (","element":"span"},{"href":"#id-34","text":"11","element":"a"},{"text":") is satisfied.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 3 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose, in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-27","text":"7","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfy Assumptions ","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-32","style":{"fontStyle":"italic"},"text":"3","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
Then, a point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-4.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"obeys ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Q ","element":"span"},{"style":{"fontStyle":"italic"},"text":"in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-34","text":"11","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":11.2},"width":18,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-5.png","element":"img","alt":" δ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"given in the following two cases:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"1. ","element":"span"},{"style":{"height":11.2},"width":95.56,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-6.png","element":"img","alt":" δ = α","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is differentiable with ","element":"span"},{"style":{"height":12.4},"width":96.84,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-7.png","element":"img","alt":" α ≥ 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":13.6},"width":95.75,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-8.png","element":"img","alt":" β = 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-32","text":"3","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and 
","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-9.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a stationary point of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"; ","element":"span"},{"id":"id-3","style":{"fontStyle":"italic"},"text":"2. ","element":"span"},{"style":{"height":19.9},"width":420.12,"height":49.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-10.png","element":"img","alt":" δ = α + max{ 4βR , 2√βL}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-11.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a standard ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"id":"id-36","style":{"width":"99%"},"width":1764,"height":436,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-12.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"text":"is due to the assumption on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"and that ","element":"span"},{"style":{"height":15.6},"width":109.2,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-13.png","element":"img","alt":" ∇f(x)","inline":true,"padRight":true},{"text":"is 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-Lipschitz continuous; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":") ","element":"span"},{"text":"is because, as ","element":"span"},{"style":{"height":15.6},"width":123.06,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-14.png","element":"img","alt":" ∥x−¯x∥","inline":true,"padRight":true},{"text":"is fixed, the minimum is attained with ","element":"span"},{"style":{"height":23.28},"width":316.98,"height":58.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-15.png","element":"img","alt":"x−¯x∥x−¯x∥ = − ∇f(x)∥∇f(x)∥","inline":true},{"text":". If ","element":"span"},{"style":{"height":15.6},"width":284.05,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-16.png","element":"img","alt":" ∥∇f(x)∥ ≤ α, Q","inline":true,"padRight":true},{"text":"is immediately satisfied. ","element":"span"},{"text":"Now assume ","element":"span"},{"style":{"height":15.6},"width":226.07,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-17.png","element":"img","alt":" ∥∇f(x)∥ > α","inline":true},{"text":". To simplify (","element":"span"},{"href":"#id-36","text":"13","element":"a"},{"text":"), we only need to minimize a quadratic function of ","element":"span"},{"style":{"height":15.6},"width":135.32,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-18.png","element":"img","alt":" ∥x − ¯x∥","inline":true,"padRight":true},{"text":"over ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", R","element":"span"},{"text":"]","element":"span"},{"text":". 
Hence, the objective equals","element":"span"}],[{"style":{"width":"75%"},"width":1341,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-19.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"height":20.48},"width":258.83,"height":51.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-20.png","element":"img","alt":" R ≤ ∥∇f(¯x)∥−αL","inline":true,"padRight":true},{"text":", from ","element":"span"},{"style":{"height":18.94},"width":570.97,"height":47.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-21.png","element":"img","alt":" 2β + (α − ∥∇f(¯x)∥)R + L2 R2 ≥ 0","inline":true},{"text":", we get","element":"span"}],[{"style":{"width":"53%"},"width":940,"height":172,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-22.png","element":"img"}],[{"text":"Otherwise, from ","element":"span"},{"style":{"height":22.13},"width":384.53,"height":55.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-23.png","element":"img","alt":" 2β − (∥∇f(¯x)∥−α)22L ≥ 0","inline":true},{"text":", we get","element":"span"}],[{"id":"id-39","style":{"width":"21%"},"width":383,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-24.png","element":"img"}],[{"text":"Combining both cases, we have ","element":"span"},{"style":{"height":19.9},"width":619.02,"height":49.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-25.png","element":"img","alt":" ∥∇f(¯x)∥ ≤ α + max{ 4βR , 2√βL} = δ","inline":true,"padRight":true},{"text":"and thus ","element":"span"},{"style":{"height":13.2},"width":421.78,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-26.png","element":"img","alt":" Q. ⊓⊔","inline":true}],[{"text":"The next result is a consequence of part 2 of the theorem above. 
It presents the values of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"that ensure the escape from a non-global local minimizer. In addition, more distant local minimizers ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"are easier to escape in the sense that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"is roughly inversely proportional to ","element":"span"},{"style":{"height":15.6},"width":148.79,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-27.png","element":"img","alt":" ∥∇f(x)∥","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Corollary 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be a local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16.22},"width":360.25,"height":40.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-28.png","element":"img","alt":" ∥∇f(x)∥ > α+2√βL","inline":true},{"style":{"fontStyle":"italic"},"text":". 
As long as either ","element":"span"},{"style":{"height":22.69},"width":258.83,"height":56.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-29.png","element":"img","alt":" R > 4β/(∥∇f(x)∥−α)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"or ","element":"span"},{"style":{"height":18},"width":209.24,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-30.png","element":"img","alt":" R ≥ 2√(β/L)","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists ","element":"span"},{"style":{"height":15.6},"width":203.93,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/7-31.png","element":"img","alt":" y ∈ B(x, R)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"y","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< F","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof ","element":"span"},{"text":"Assume that the result does ","element":"span"},{"style":{"fontStyle":"italic"},"text":"not ","element":"span"},{"text":"hold. Then, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"is an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":". 
Applying part 2 of Theorem ","element":"span"},{"href":"#id-3","text":"3","element":"a"},{"text":", we get ","element":"span"},{"style":{"height":19.9},"width":664.77,"height":49.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-0.png","element":"img","alt":" ∥∇F(x)∥ ≤ δ = α + max{4β/R, 2√(βL)}","inline":true},{"text":". Combining this with the assumption ","element":"span"},{"style":{"height":16.22},"width":375.41,"height":40.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-1.png","element":"img","alt":"∥∇f(x)∥ > α + 2√(βL)","inline":true},{"text":", we obtain","element":"span"}],[{"style":{"width":"99%"},"width":1765,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-2.png","element":"img"}],[{"text":"We can further increase ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"to ensure that any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-3.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"is a global minimizer.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"href":"#id-29","style":{"fontStyle":"italic"},"text":"2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-32","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and 
","element":"span"},{"style":{"height":18},"width":209.25,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-4.png","element":"img","alt":" R ≥ 2√(β/L)","inline":true},{"style":{"fontStyle":"italic"},"text":", we have ","element":"span"},{"style":{"height":23.1},"width":426.95,"height":57.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-5.png","element":"img","alt":" d(¯x, χ∗) ≤ 2(α+2√(βL))/µ + M","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for any","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizer ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-6.png","element":"img","alt":" ¯x","inline":true},{"style":{"fontStyle":"italic"},"text":". Therefore, if ","element":"span"},{"style":{"height":23.1},"width":330.24,"height":57.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-7.png","element":"img","alt":" R ≥ 2(α+2√(βL))/µ + M","inline":true},{"style":{"fontStyle":"italic"},"text":", any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizer ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-8.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a global minimizer.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof ","element":"span"},{"text":"According to Theorem ","element":"span"},{"href":"#id-37","text":"2","element":"a"},{"text":", part 2, and Theorem ","element":"span"},{"href":"#id-3","text":"3","element":"a"},{"text":", part 
2,","element":"span"}],[{"id":"id-4","style":{"width":"38%"},"width":678,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-9.png","element":"img"}],[{"text":"where, for the latter inequality, we have used ","element":"span"},{"style":{"height":18},"width":219.19,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-10.png","element":"img","alt":" R ≥ 2√(β/L)","inline":true,"padRight":true},{"text":"and thus ","element":"span"},{"style":{"height":16.22},"width":487.99,"height":40.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-11.png","element":"img","alt":" max{4β/R, 2√(βL)} = 2√(βL)","inline":true},{"text":". By convex analysis on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", we have ","element":"span"},{"style":{"height":13.6},"width":104.44,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-12.png","element":"img","alt":" µ ≤ L","inline":true},{"text":". Using it with ","element":"span"},{"style":{"height":12.4},"width":98.97,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-13.png","element":"img","alt":" α ≥ 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.8},"width":115.68,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-14.png","element":"img","alt":" M ≥ 0","inline":true},{"text":", we further get ","element":"span"},{"style":{"height":23.1},"width":292.06,"height":57.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-15.png","element":"img","alt":" 2(α+2√(βL))/µ + M ≥","inline":true}],[{"style":{"height":18},"width":536.38,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-16.png","element":"img","alt":"4√(βL)/µ ≥ 4√(βL)/L ≥ 2√(β/L)","inline":true},{"text":". 
Therefore, if ","element":"span"},{"style":{"height":23.1},"width":338.38,"height":57.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-17.png","element":"img","alt":" R ≥ 2 α+2√βLµ + M","inline":true},{"text":", then there exists ","element":"span"},{"style":{"height":14.43},"width":137.19,"height":36.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-18.png","element":"img","alt":" x∗ ∈ χ∗","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":15.63},"width":222.23,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-19.png","element":"img","alt":"x∗ ∈ B(¯x, R)","inline":true},{"text":". Being an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer means ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-20.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"satisfies ","element":"span"},{"style":{"height":15.63},"width":238.98,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-21.png","element":"img","alt":" F(¯x) ≤ F(x∗)","inline":true},{"text":", so ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-22.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"is a global minimizer. 
","element":"span"},{"style":{"height":0},"width":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-23.png","element":"img","alt":"⊓⊔","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Remark 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Since the decomposition ","element":"span"},{"text":"(","element":"span"},{"href":"#id-27","text":"7","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is implicit, the constants in our analysis are difficult to estimate in practice. However, if we have a rough estimate of the distance between the global minimizer and its nearby local minimizers, then this distance appears to be a good empirical choice for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"id":"id-24","text":"2.3 Blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizers","element":"span"}],[{"text":"In this section, we focus on problem (","element":"span"},{"href":"#id-26","text":"1","element":"a"},{"text":"). This blockwise structure of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"motivates us to consider block-wise algorithms. Suppose ","element":"span"},{"style":{"height":12.03},"width":133.84,"height":30.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-24.png","element":"img","alt":" R ∈ Rs","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":367.9,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-25.png","element":"img","alt":" R = (R1, ..., Rs) ≥ 0","inline":true},{"text":". 
When we fix all blocks but ","element":"span"},{"style":{"height":8.21},"width":34.06,"height":20.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-26.png","element":"img","alt":" xi","inline":true},{"text":", we write ","element":"span"},{"style":{"height":15.6},"width":310.04,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-27.png","element":"img","alt":"F(¯x1, ..., xi, ..., ¯xs)","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":15.6},"width":176.49,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-28.png","element":"img","alt":" F(xi, ¯x−i)","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-29.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is called a blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"if it satisfies","element":"span"}],[{"id":"id-38","style":{"width":"99%"},"width":1766,"height":305,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-30.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 5 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfy Assumptions 
","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-32","style":{"fontStyle":"italic"},"text":"3","element":"a"},{"style":{"fontStyle":"italic"},"text":". If ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-31.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":14.46},"width":131.59,"height":36.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-32.png","element":"img","alt":" ¯x ∈ SQ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"(i.e., the property ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Q ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is met) for ","element":"span"},{"style":{"height":19.79},"width":365.63,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-33.png","element":"img","alt":" δ = ∥v∥ := (∑ |vi²|)^(1/2)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":21.66},"width":457.5,"height":54.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-34.png","element":"img","alt":" vi := α + max{4β/Ri, 2√(βL)}","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":12.8},"width":154.52,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-35.png","element":"img","alt":"1 ≤ i ≤ 
s","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"height":13.6},"width":143.91,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-36.png","element":"img","alt":"Proof ¯xi","inline":true,"padRight":true},{"text":"is an ","element":"span"},{"style":{"height":12.21},"width":41.44,"height":30.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-37.png","element":"img","alt":" Ri","inline":true},{"text":"-local minimizer of ","element":"span"},{"style":{"height":15.6},"width":176.5,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-38.png","element":"img","alt":" F(xi, ¯x−i)","inline":true},{"text":". Since ","element":"span"},{"style":{"height":15.6},"width":602.16,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-39.png","element":"img","alt":" F(xi, ¯x−i) = f(xi, ¯x−i)+r(xi, ¯x−i)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":169.25,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-40.png","element":"img","alt":" f(xi, ¯x−i)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.6},"width":164.88,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-41.png","element":"img","alt":" r(xi, ¯x−i)","inline":true,"padRight":true},{"text":"satisfy Assumption ","element":"span"},{"href":"#id-35","text":"1 ","element":"a"},{"text":"and Assumption ","element":"span"},{"href":"#id-32","text":"3","element":"a"},{"text":", Theorem ","element":"span"},{"href":"#id-3","text":"3 ","element":"a"},{"text":"shows that ","element":"span"},{"style":{"height":15.6},"width":381.85,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-42.png","element":"img","alt":" ∥∇if(¯xi, ¯x−i)∥ ≤ α 
+","inline":true},{"style":{"height":21.66},"width":372.55,"height":54.15,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-43.png","element":"img","alt":"max{4β/Ri, 2√(βL)} = vi.","inline":true,"padRight":true},{"text":"Hence ","element":"span"},{"style":{"height":15.6},"width":1259.78,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-44.png","element":"img","alt":" ∥∇f(¯x)∥ ≤ ∥v∥. ⊓⊔","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Remark 2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"We can also obtain a simplified version of Theorem ","element":"span"},{"href":"#id-38","style":{"fontStyle":"italic"},"text":"5","element":"a"},{"style":{"fontStyle":"italic"},"text":", which is","element":"span"}],[{"style":{"width":"48%"},"width":850,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-45.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"The main difference between the standard and blockwise estimates is the extra factor ","element":"span"},{"style":{"height":15.2},"width":50.42,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/8-46.png","element":"img","alt":"√s","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in the latter.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 3 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Since we can set ","element":"span"},{"style":{"height":10.4},"width":120.58,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-0.png","element":"img","alt":" R = ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", our results apply to Nash equilibrium points.","element":"span"}],[{"text":"Generalizing Corollary ","element":"span"},{"href":"#id-39","text":"1","element":"a"},{"text":", the following result provides estimates of 
","element":"span"},{"style":{"height":12.21},"width":41.44,"height":30.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-1.png","element":"img","alt":" Ri","inline":true,"padRight":true},{"text":"for escaping from non-global local minimizers. The estimates are smaller when ","element":"span"},{"style":{"height":13.2},"width":68.23,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-2.png","element":"img","alt":" ∇if","inline":true,"padRight":true},{"text":"are larger.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Corollary 2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be a local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16.22},"width":488.97,"height":40.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-3.png","element":"img","alt":" ∥∇if(xi, x−i)∥ > α + 2√βL","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
As long as ","element":"span"},{"style":{"height":22.69},"width":357.7,"height":56.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-4.png","element":"img","alt":"Ri > 4β/(∥∇if(xi, x−i)∥ − α)","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists ","element":"span"},{"style":{"height":15.6},"width":225.92,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-5.png","element":"img","alt":" y ∈ B(xi, Ri)","inline":true},{"style":{"fontStyle":"italic"},"text":", such that ","element":"span"},{"style":{"height":15.6},"width":389.66,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-6.png","element":"img","alt":" F(y, x−i) < F(xi, x−i)","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"The theorem below, which follows from Theorems ","element":"span"},{"href":"#id-37","text":"2 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-38","text":"5","element":"a"},{"text":", bounds the distance of an ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizer to the set of global minimizers. We do not have a vector ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R ","element":"span"},{"text":"to ensure the global optimality of ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-7.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"due to the blockwise limitation. 
Of course, after reaching ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-8.png","element":"img","alt":" ¯x","inline":true},{"text":", if we switch to standard (non-blockwise) inspection to obtain a standard ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer, we can apply Theorem ","element":"span"},{"href":"#id-4","text":"4","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 6 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfy Assumptions ","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":"–","element":"span"},{"href":"#id-32","style":{"fontStyle":"italic"},"text":"3","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
If ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-9.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":", then","element":"span"}],[{"style":{"width":"48%"},"width":856,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-10.png","element":"img"}],[{"id":"id-25","text":"2.4 Run-and-Inspect Method","element":"span"}],[{"text":"In this section, we introduce our Run-and-Inspect Method. The “run” phase can use any algorithm that monotonically converges to an approximate stationary point. When the algorithm stops at either an approximate local minimizer or a saddle point, our method starts its “inspection” phase, which either moves to a strictly better point or verifies that the current point is an approximate (blockwise) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"2.4.1 Approximate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizers","element":"span"}],[{"text":"We define ","element":"span"},{"style":{"fontStyle":"italic"},"text":"approximate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizers. 
Since an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer is a special case of a blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizer, we only deal with the latter. Let ","element":"span"},{"style":{"height":18.03},"width":281.91,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-11.png","element":"img","alt":" x = [xT1 · · · xTs ]T","inline":true,"padRight":true},{"text":". A point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-12.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"is called a blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizer of ","element":"span"},{"style":{"height":18.03},"width":469.21,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-13.png","element":"img","alt":" F up to η = [η1 · · · ηs]T ≥ 0","inline":true,"padRight":true},{"text":"if it satisfies","element":"span"}],[{"style":{"width":"50%"},"width":891,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-14.png","element":"img"}],[{"text":"when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"= 1","element":"span"},{"text":", we say ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-15.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"is an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"up to 
","element":"span"},{"style":{"height":9.6},"width":20,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-16.png","element":"img","alt":" η","inline":true},{"text":". It is easy to modify the proof of Theorem ","element":"span"},{"href":"#id-3","text":"3 ","element":"a"},{"text":"to get:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 7 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfy Assumptions ","element":"span"},{"href":"#id-35","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-32","style":{"fontStyle":"italic"},"text":"3","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
Then ","element":"span"},{"style":{"height":14.46},"width":133.91,"height":36.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-17.png","element":"img","alt":" ¯x ∈ SQ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-18.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"up to ","element":"span"},{"style":{"height":9.6},"width":20,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-19.png","element":"img","alt":" η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":19.79},"width":348.38,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-20.png","element":"img","alt":" δ ≥ ∥v∥ := (Σi |vi|²)^(1/2)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":21.66},"width":820.01,"height":54.15,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-21.png","element":"img","alt":" vi = α + max{ (4β + 2ηi)/Ri , √((4β + 2ηi)L) }, 1 ≤ i ≤ s","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"Whenever the condition ","element":"span"},{"style":{"height":14.46},"width":126.7,"height":36.14,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-22.png","element":"img","alt":" ¯x ∈ SQ","inline":true,"padRight":true},{"text":"holds, our previous results for blockwise 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizers are applicable.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"2.4.2 Algorithms","element":"span"}],[{"text":"Now we present two algorithms based on our Run-and-Inspect Method. Suppose that we have implemented an algorithm and it returns a point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-23.png","element":"img","alt":" ¯x","inline":true},{"text":". For simplicity, let ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Alg ","element":"span"},{"text":"denote this algorithm. To verify the global optimality of ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-24.png","element":"img","alt":" ¯x","inline":true},{"text":", we seek to inspect ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"around ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-25.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"by sampling some points. Since a global search is too costly, the inspection is limited to a ball centered at ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-26.png","element":"img","alt":" ¯x","inline":true},{"text":", and for high-dimensional problems, further limited to lower-dimensional balls.","element":"span"}],[{"text":"The inspection strategy is to sample some points in the ball around the current point and stop as soon as either a better point is found or the last point has been checked. 
By “better”, we mean the objective value decreases by at least a constant amount ","element":"span"},{"style":{"height":10.8},"width":93.51,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-27.png","element":"img","alt":" ν > 0","inline":true},{"text":". We call ","element":"span"},{"style":{"height":6.4},"width":21,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-28.png","element":"img","alt":" ν","inline":true,"padRight":true},{"text":"the descent threshold. If a better point is found, we resume ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Alg ","element":"span"},{"text":"at that point. If no better point is found, the current point is an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local or blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"up to ","element":"span"},{"style":{"height":9.6},"width":20,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-29.png","element":"img","alt":" η","inline":true},{"text":", where ","element":"span"},{"style":{"height":9.6},"width":20,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/9-30.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"depends on the density of sample points and the Lipschitz constant of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"in the ball.","element":"span"}],[{"id":"id-40","style":{"width":"100%"},"width":1768,"height":556,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-0.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Alg ","element":"span"},{"text":"is a descent method, i.e., 
","element":"span"},{"style":{"height":18.03},"width":260.18,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-1.png","element":"img","alt":" F(¯xk) ≤ F(xk)","inline":true},{"text":", Algorithm ","element":"span"},{"href":"#id-40","text":"1 ","element":"a"},{"text":"will stop and output a point ","element":"span"},{"style":{"height":14.48},"width":55.71,"height":36.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-2.png","element":"img","alt":" ¯xk∗","inline":true,"padRight":true},{"text":"within finitely many iterations: ","element":"span"},{"style":{"height":20.53},"width":263.6,"height":51.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-3.png","element":"img","alt":" k∗ ≤ (F(x0) − F∗)/ν,","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":11.63},"width":47.25,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-4.png","element":"img","alt":" F∗","inline":true,"padRight":true},{"text":"is the global minimum of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":".","element":"span"}],[{"text":"The sampling step is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hit-and-run","element":"span"},{"text":"; that is, points are sampled only when they are used, and the sampling stops as soon as a better point is obtained (or all points have been used). The method of sampling and the number of sample points can vary across iterations and depend on the problem structure. 
In general, sampling points from the outside toward the inside is more efficient.","element":"span"}],[{"text":"Here, we analyze a simple approach in which sufficiently many well-distributed points are sampled to ensure that ","element":"span"},{"style":{"height":14.48},"width":55.71,"height":36.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-5.png","element":"img","alt":" ¯xk∗","inline":true,"padRight":true},{"text":"is an approximate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer.","element":"span"}],[{"id":"id-42","style":{"fontWeight":"bold"},"text":"Theorem 8 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"height":13.03},"width":27,"height":32.57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-6.png","element":"img","alt":"¯L","inline":true},{"style":{"fontStyle":"italic"},"text":"-Lipschitz continuous","element":"span"},{"style":{"height":7.2},"width":17,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-7.png","element":"img","alt":"1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in the ball ","element":"span"},{"style":{"height":15.6},"width":132.1,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-8.png","element":"img","alt":" B(¯x, R)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and the set of sample points ","element":"span"},{"style":{"height":15.6},"width":340.29,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-9.png","element":"img","alt":"S = {y1, y2, ..., 
ym}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"has density ","element":"span"},{"style":{"height":9.2},"width":21.75,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-10.png","element":"img","alt":" ¯r","inline":true},{"style":{"fontStyle":"italic"},"text":", that is,","element":"span"}],[{"style":{"width":"27%"},"width":492,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-11.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":12.21},"width":117.71,"height":30.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-12.png","element":"img","alt":" y0 = ¯x","inline":true},{"style":{"fontStyle":"italic"},"text":". If","element":"span"}],[{"style":{"width":"34%"},"width":615,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"then the point ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-14.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"up to ","element":"span"},{"style":{"height":17.03},"width":379.76,"height":42.57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-15.png","element":"img","alt":" η = ν + (¯L + α)¯r + 2β","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"width":"99%"},"width":1765,"height":504,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/10-16.png","element":"img"}],[{"text":"When the 
dimension of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"is high, it is impractical to inspect over a high-dimensional ball. This motivates us to extend Algorithm ","element":"span"},{"href":"#id-40","text":"1 ","element":"a"},{"text":"to its blockwise version.","element":"span"}],[{"id":"id-41","style":{"width":"100%"},"width":1768,"height":627,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-0.png","element":"img"}],[{"text":"Algorithm ","element":"span"},{"href":"#id-41","text":"2 ","element":"a"},{"text":"samples points in a block while keeping other block variables fixed. This algorithm ends with an approximate blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizer.","element":"span"}],[{"id":"id-43","style":{"fontWeight":"bold"},"text":"Theorem 9 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"style":{"height":15.6},"width":169.24,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-1.png","element":"img","alt":" f(xi, ¯x−i)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"height":14.84},"width":38.47,"height":37.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-2.png","element":"img","alt":"¯Li","inline":true},{"style":{"fontStyle":"italic"},"text":"-Lipschitz continuous in the ball ","element":"span"},{"style":{"height":15.6},"width":157.85,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-3.png","element":"img","alt":" B(¯xi, Ri)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":12.8},"width":163.64,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-4.png","element":"img","alt":" 1 ≤ i ≤ 
s","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and the set of sample points ","element":"span"},{"style":{"height":15.6},"width":572.89,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-5.png","element":"img","alt":" Si = {zi1, zi2, . . . , zimi}, 1 ≤ i ≤ s","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"has blockwise-density ","element":"span"},{"style":{"height":9.2},"width":21.75,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-6.png","element":"img","alt":" ¯r","inline":true},{"style":{"fontStyle":"italic"},"text":", that is,","element":"span"}],[{"style":{"width":"45%"},"width":797,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":11.01},"width":134.03,"height":27.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-8.png","element":"img","alt":" zi0 = ¯xi","inline":true},{"style":{"fontStyle":"italic"},"text":". 
If","element":"span"}],[{"id":"id-44","style":{"width":"53%"},"width":948,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"then ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-10.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a blockwise ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":"-local minimizer of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"style":{"fontStyle":"italic"},"text":"up to ","element":"span"},{"style":{"height":18.03},"width":278.28,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-11.png","element":"img","alt":" η = [η1, . . . , ηs]T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":17.03},"width":406.01,"height":42.57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-12.png","element":"img","alt":" ηi = ν + (¯Li + α)¯r + 2β","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"The proof is similar to that of Theorem ","element":"span"},{"href":"#id-42","text":"8","element":"a"},{"text":". 
The next proposition states that inspection around a point with sufficiently large partial gradient of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"ensures a sufficient descent.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition 10 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume that we sample points in the ball ","element":"span"},{"style":{"height":15.6},"width":157.85,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-13.png","element":"img","alt":" B(¯xi, Ri)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with density ","element":"span"},{"style":{"height":19.2},"width":109.74,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-14.png","element":"img","alt":" ¯r ≤ Ri2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". If ","element":"span"},{"style":{"height":15.6},"width":295.31,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-15.png","element":"img","alt":" ∥∇if(¯xi, ¯x−i)∥ ≥","inline":true}],[{"style":{"height":19.89},"width":302.04,"height":49.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-16.png","element":"img","alt":"2Li¯r + 3α + 2β+ν¯r","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", then there exists at least one sampled point ","element":"span"},{"style":{"height":8.21},"width":30.09,"height":20.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-17.png","element":"img","alt":" zi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"which satisfies","element":"span"}],[{"style":{"width":"99%"},"width":1764,"height":569,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-18.png","element":"img"}],[{"text":"Therefore 
","element":"span"},{"style":{"height":14.84},"width":38.48,"height":37.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-19.png","element":"img","alt":"¯Li","inline":true,"padRight":true},{"text":"in Theorem ","element":"span"},{"href":"#id-43","text":"9 ","element":"a"},{"text":"can be bounded by ","element":"span"},{"style":{"height":19.89},"width":437.38,"height":49.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-20.png","element":"img","alt":"92Li¯r + 3α + 2β+ν¯r + LiRi","inline":true},{"text":". And we can set","element":"span"}],[{"style":{"width":"36%"},"width":638,"height":77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-21.png","element":"img"}],[{"text":"This bound is not tight when ","element":"span"},{"style":{"height":12.21},"width":41.44,"height":30.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/11-22.png","element":"img","alt":" Ri","inline":true,"padRight":true},{"text":"is very large.","element":"span"}],[{"text":"2.5 Complexity analysis","element":"span"}],[{"text":"Since Algorithm ","element":"span"},{"href":"#id-41","text":"2 ","element":"a"},{"text":"generalizes Algorithm ","element":"span"},{"href":"#id-40","text":"1 ","element":"a"},{"text":"to multiple blocks, we analyze the complexity of the former. There are quite many parameters that affect the complexity results. 
In this analysis, we focus on the dimension of the space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"and different options to set the dimension ","element":"span"},{"style":{"height":10.4},"width":34.16,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-0.png","element":"img","alt":" d′","inline":true,"padRight":true},{"text":"of each block, assuming that all blocks have the same dimension ","element":"span"},{"style":{"height":12.8},"width":120.96,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-1.png","element":"img","alt":" d′ ≤ d","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.4},"width":34.16,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-2.png","element":"img","alt":" d′","inline":true,"padRight":true},{"text":"evenly divides ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":", thus, creating exactly ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"= ","element":"span"},{"style":{"height":15.6},"width":73.78,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-3.png","element":"img","alt":"d/d′","inline":true,"padRight":true},{"text":"blocks of variables. We assume that the smoothness and strong-convexity parameters of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"are ","element":"span"},{"style":{"height":13.6},"width":225.62,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-4.png","element":"img","alt":"Li = L ≥ µ","inline":true},{"text":", respectively. 
Of course, ","element":"span"},{"style":{"height":13.6},"width":67.76,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-5.png","element":"img","alt":" L, µ","inline":true,"padRight":true},{"text":"affect the complexity significantly though not as much as the dimensions (unless ","element":"span"},{"style":{"height":13.6},"width":67.76,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-6.png","element":"img","alt":" L, µ","inline":true,"padRight":true},{"text":"themselves depend on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"). Assume the function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"satisfies Assumption ","element":"span"},{"href":"#id-32","text":"3 ","element":"a"},{"text":"with parameters ","element":"span"},{"style":{"height":15.6},"width":205.55,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-7.png","element":"img","alt":" α, β ∈ [0, 1)","inline":true},{"text":". The ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local min tolerance ","element":"span"},{"style":{"height":9.6},"width":31.43,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-8.png","element":"img","alt":" ηi","inline":true,"padRight":true},{"text":"of each block ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is tied to ","element":"span"},{"style":{"height":13.6},"width":120.72,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-9.png","element":"img","alt":" β, ν, ¯r","inline":true,"padRight":true},{"text":"(density of sample points), and ","element":"span"},{"style":{"height":12.21},"width":41.44,"height":30.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-10.png","element":"img","alt":" Ri","inline":true},{"text":". 
Based on Theorem ","element":"span"},{"href":"#id-43","text":"9 ","element":"a"},{"text":"and Proposition ","element":"span"},{"href":"#id-44","text":"10 ","element":"a"},{"text":"and using free parameters ","element":"span"},{"style":{"height":12.4},"width":120.78,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-11.png","element":"img","alt":" cν, c¯r, t","inline":true,"padRight":true},{"text":"that will be tuned later, we set ","element":"span"},{"style":{"height":13.6},"width":268.52,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-12.png","element":"img","alt":" ν := cνβ, ¯r :=","inline":true}],[{"id":"id-45","style":{"width":"99%"},"width":1766,"height":339,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-13.png","element":"img"}],[{"text":"late that our Algorithm ","element":"span"},{"href":"#id-41","text":"2 ","element":"a"},{"text":"using ","element":"span"},{"style":{"height":12.21},"width":134.76,"height":30.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-14.png","element":"img","alt":" Ri = t¯r","inline":true,"padRight":true},{"text":"will return an approximate ","element":"span"},{"style":{"fontWeight":"bold"},"text":"R","element":"span"},{"text":"-local minimizer ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-15.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"satisfying the global error bound:","element":"span"}],[{"id":"id-46","style":{"width":"78%"},"width":1379,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-16.png","element":"img"}],[{"text":"where 
","element":"span"},{"style":{"height":19.89},"width":389.18,"height":49.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-17.png","element":"img","alt":" c2 = 1 + 2 αδ + 2µβδ2 ≤ 4","inline":true},{"text":".","element":"span"},{"text":"For simplicity, we set ","element":"span"},{"style":{"height":13.2},"width":305.06,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-18.png","element":"img","alt":" c¯r = cν = 1, t = 6","inline":true,"padRight":true},{"text":"and assume ","element":"span"},{"style":{"height":15.2},"width":160.16,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-19.png","element":"img","alt":" α < √βL","inline":true},{"text":". By (","element":"span"},{"href":"#id-45","text":"16","element":"a"},{"text":"), (","element":"span"},{"href":"#id-46","text":"17","element":"a"},{"text":"), we will get","element":"span"}],[{"style":{"width":"68%"},"width":1219,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-20.png","element":"img"}],[{"text":"Next, we take three steps to calculate the complexity of Algorithm ","element":"span"},{"href":"#id-41","text":"2","element":"a"},{"text":":","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"– ","element":"span"},{"text":"Since ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Alg ","element":"span"},{"text":"is a descent algorithm and each inspection decreases the objective error by at least ","element":"span"},{"style":{"height":13.6},"width":133.62,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-21.png","element":"img","alt":" ν = cνβ","inline":true},{"text":", with initial point ","element":"span"},{"style":{"height":13.63},"width":39.05,"height":34.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-22.png","element":"img","alt":" x0","inline":true},{"text":", we need at most 
","element":"span"},{"style":{"height":27.6},"width":288.11,"height":69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-23.png","element":"img","alt":" O�F (x0)−F (x∗)cνβ �","inline":true},{"text":"inspections or loops in Algorithm ","element":"span"},{"href":"#id-41","text":"2","element":"a"},{"text":". Under our assumption ","element":"span"},{"style":{"height":12.08},"width":108.96,"height":30.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-24.png","element":"img","alt":" cν = 1","inline":true},{"text":", the number of inspections is ","element":"span"},{"style":{"height":27.6},"width":288.11,"height":69,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-25.png","element":"img","alt":" O�F (x0)−F (x∗)β �","inline":true}],[{"style":{"width":"98%"},"width":1749,"height":558,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-26.png","element":"img"}],[{"text":"Using our choice ","element":"span"},{"style":{"height":10.4},"width":117.39,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-27.png","element":"img","alt":" R = t¯r","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.54},"width":285.89,"height":46.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-28.png","element":"img","alt":"12 log� πe2�≈ 0.54","inline":true},{"text":", we get","element":"span"}],[{"style":{"width":"27%"},"width":484,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/12-29.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 6 ","element":"span"},{"text":"under our assumption.","element":"span"}],[{"text":"The complexity of Algorithm ","element":"span"},{"href":"#id-41","text":"2 ","element":"a"},{"text":"is the product of the number of loops and the complexity of each 
loop:","element":"span"}],[{"style":{"width":"98%"},"width":1750,"height":414,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/13-0.png","element":"img"}],[{"text":"proportional in the dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":". Except, if the function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"is very nice with ","element":"span"},{"style":{"height":22.02},"width":393.76,"height":55.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/13-1.png","element":"img","alt":" α = O( 1√d), β = O( 1d)","inline":true},{"text":", ","element":"span"},{"text":"then the relative accuracy is still good at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O ","element":"span"},{"text":"(1)","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"– ","element":"span"},{"text":"In general, we can choose ","element":"span"},{"style":{"height":15.63},"width":184.21,"height":39.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/13-2.png","element":"img","alt":" d′ = Θ(dv)","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":15.6},"width":154.15,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/13-3.png","element":"img","alt":" v ∈ (0, 1)","inline":true},{"text":", where the choice of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v ","element":"span"},{"text":"controls the tradeoff between the accuracy and complexity.","element":"span"}]]},{"heading":"3 Numerical experiments","paragraphs":[[{"text":"In this section, we apply our Run-and-Inspect Method to a set of nonconvex problems. 
We admit that it is difficult to apply our theoretical results because the implicit decomposition ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"+","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f, r ","element":"span"},{"text":"satisfying their assumptions is not known. Nonetheless, the encouraging experimental results below demonstrate the effectiveness of our Run-and-Inspect Method on nonconvex problems even though they may not have the decomposition.","element":"span"}],[{"text":"3.1 Test example: Ackley’s function","element":"span"}],[{"text":"The Ackley function is widely used for testing optimization algorithms, and in ","element":"span"},{"style":{"height":13.63},"width":44.34,"height":34.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/13-4.png","element":"img","alt":" R2","inline":true},{"text":", it has the form","element":"span"}],[{"style":{"width":"58%"},"width":1039,"height":619,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/13-5.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Fig. 2 ","element":"figcaption","subtype":"caption"},{"text":"Landscape of Ackley’s function in ","element":"figcaption","subtype":"caption"},{"style":{"height":10.9},"width":48.63,"height":27.25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/13-6.png","element":"img","alt":" R2.","inline":true}],[{"text":"The function is symmetric, and its oscillation is regular. 
To make it less peculiar, we modify it to an asymmetric function:","element":"span"}],[{"id":"id-47","style":{"width":"79%"},"width":1407,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/13-7.png","element":"img"}],[{"text":"The function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"in (","element":"span"},{"href":"#id-47","text":"20","element":"a"},{"text":") has many local minimizers, which are irregularly distributed. If we simply use the gradient descent (GD) method without a good initial guess, it will converge to a nearby local","element":"span"}],[{"style":{"width":"100%"},"width":1768,"height":527,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/14-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Fig. 3 ","element":"figcaption","subtype":"caption"},{"text":"Landscape and contour of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"F ","element":"figcaption","subtype":"caption"},{"text":"in (","element":"figcaption","subtype":"caption"},{"href":"#id-47","text":"20","element":"a","subtype":"caption"},{"text":").","element":"figcaption","subtype":"caption"}],[{"text":"minimizer. To escape from local minimizers, we conduct our Run-and-Inspect Method according to Algorithms ","element":"span"},{"href":"#id-40","text":"1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-41","text":"2","element":"a"},{"text":". We sample points starting from the outside of the ball and moving inward. 
The radius ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"is set as ","element":"span"},{"text":"1 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":62.45,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/14-1.png","element":"img","alt":" ∆R","inline":true,"padRight":true},{"text":"as ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"2","element":"span"},{"text":". ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Alg ","element":"span"},{"text":"is GD and block-coordinate descent (BCD), and we apply two-dimensional inspection and blockwise one-dimensional inspection to them, respectively. The step size of GD and BCD is ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"40","element":"span"},{"text":". The results are shown in Figures ","element":"span"},{"href":"#id-48","text":"4 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-49","text":"5","element":"a"},{"text":", respectively. Note that the “run” and “inspect” phases can be decoupled, so a blockwise inspection can be used with either standard descent or blockwise descent algorithms.","element":"span"}],[{"style":{"width":"70%"},"width":1248,"height":485,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/14-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Fig. 4 ","element":"figcaption","subtype":"caption"},{"id":"id-48","text":"GD iteration with 2D inspection","element":"figcaption","subtype":"caption"}],[{"style":{"width":"70%"},"width":1248,"height":484,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/14-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Fig. 
5 ","element":"figcaption","subtype":"caption"},{"id":"id-49","text":"BCD iteration with blockwise 1D inspection","element":"figcaption","subtype":"caption"}],[{"text":"From the figures, we can observe that blockwise inspection, which is much cheaper than standard inspection, is good at jumping out of the valleys of local minimizers. Also, the inspection usually succeeds very quickly at the large initial value of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":", so it is swift. These observations guide our design of inspection. Although smaller values of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"are sufficient to escape from local minimizers, especially those that are far away from the global minimizer, we empirically use a rather large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"and, to limit the number of sampled points, a relatively large ","element":"span"},{"style":{"height":10.8},"width":62.44,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-0.png","element":"img","alt":" ∆R","inline":true,"padRight":true},{"text":"as well.","element":"span"}],[{"text":"When an iterate is already (near) a global minimizer, there is no better point for inspection to find, so the final inspection will go through all sample points in ","element":"span"},{"style":{"height":15.6},"width":132.1,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-1.png","element":"img","alt":" B(¯x, R)","inline":true},{"text":", taking a long time to complete, unlike the rapid early inspections. In most applications, however, this seems unnecessary. 
If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"is smooth and strongly convex near the global minimizer ","element":"span"},{"style":{"height":11.63},"width":40.64,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-2.png","element":"img","alt":" x∗","inline":true},{"text":", we can theoretically eliminate spurious local minimizers in ","element":"span"},{"style":{"height":15.6},"width":143.79,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-3.png","element":"img","alt":" B(¯x, R′)","inline":true,"padRight":true},{"text":"and thus search only in the smaller region ","element":"span"},{"style":{"height":15.6},"width":310.66,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-4.png","element":"img","alt":" B(¯x, R) \\ B(¯x, R′)","inline":true},{"text":". Because the function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"can be nonsmooth in our assumption, we do not have ","element":"span"},{"style":{"height":11.2},"width":113.27,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-5.png","element":"img","alt":" R′ > 0","inline":true},{"text":". However, our future work will explore more types of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":". It is also worth mentioning that, in some applications, global minimizers can be recognized, for example, based on their having the desired structures, achieving the minimal objective values, or attaining certain lower bounds. 
If so, the final inspection can be completely avoided.","element":"span"}],[{"text":"3.2 K-means clustering","element":"span"}],[{"text":"Consider applying ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-means clustering to a set of data ","element":"span"},{"style":{"height":18.03},"width":234.45,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-6.png","element":"img","alt":" {xi}ni=1 ⊂ Rd","inline":true},{"text":". We assume there are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"clusters ","element":"span"},{"style":{"height":18.03},"width":124.39,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-7.png","element":"img","alt":"{zi}Ki=1","inline":true,"padRight":true},{"text":"and have the variables ","element":"span"},{"style":{"height":18.03},"width":373.95,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-8.png","element":"img","alt":" z = [z1, ...zK] ∈ Rd×K","inline":true},{"text":". The problem to solve is","element":"span"}],[{"style":{"width":"37%"},"width":669,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-9.png","element":"img"}],[{"text":"A classical algorithm is the Expectation Maximization (EM) method, but it is susceptible to local minimizers. We add inspections to EM to improve its results.","element":"span"}],[{"text":"We test the problems in [","element":"span"},{"href":"#id-8","text":"27","element":"a"},{"text":"]. The first problem has synthetic Gaussian data in ","element":"span"},{"style":{"height":13.63},"width":44.34,"height":34.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-10.png","element":"img","alt":" R2","inline":true},{"text":". 
A total of 4000 synthetic data points are generated according to four multivariate Gaussian distributions with 1000 points from each, so there are four clusters. Their means and covariance matrices are:","element":"span"}],[{"style":{"width":"68%"},"width":1207,"height":200,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-11.png","element":"img"}],[{"text":"The EM algorithm is an iteration that alternates between labeling each data point (by associating it to the nearest cluster center) and adjusting the locations of the centers. When the labels stop updating, we start an inspection. In the above problem, the dimension of ","element":"span"},{"style":{"height":8.2},"width":30.1,"height":20.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-12.png","element":"img","alt":" zi","inline":true,"padRight":true},{"text":"is two, and we apply a 2D inspection on ","element":"span"},{"style":{"height":8.21},"width":30.1,"height":20.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-13.png","element":"img","alt":" zi","inline":true,"padRight":true},{"text":"one after another with radius ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"= 10","element":"span"},{"text":", step size ","element":"span"},{"style":{"height":10.8},"width":136.92,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-14.png","element":"img","alt":" ∆R = 2","inline":true},{"text":", and angle step size ","element":"span"},{"style":{"height":15.6},"width":189.08,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-15.png","element":"img","alt":" ∆θ = π/10","inline":true},{"text":". 
The descent threshold is ","element":"span"},{"style":{"height":10.4},"width":123.77,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-16.png","element":"img","alt":" ν = 0.1","inline":true},{"text":".","element":"span"}],[{"text":"The results are presented in Figure ","element":"span"},{"href":"#id-50","text":"6","element":"a"},{"text":". We can see that the EM algorithm stops at a local minimizer but, with the help of inspection, it escapes from the local minimizer and reaches the global minimizer. This escape occurs at the first sample point in the ","element":"span"},{"text":"3","element":"span"},{"text":"rd block at radius ","element":"span"},{"text":"10 ","element":"span"},{"text":"and angle ","element":"span"},{"style":{"height":15.6},"width":173.17,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-17.png","element":"img","alt":" θ = 7π/10","inline":true},{"text":". Since the inspection succeeds on the perimeter ","element":"span"},{"text":"of the search ball, it is rapid.","element":"span"}],[{"text":"We also consider the Iris dataset","element":"span"},{"style":{"height":7.2},"width":17,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-18.png","element":"img","alt":"2","inline":true},{"text":", which contains 150 4-D data samples from 3 clusters. We compare the performance of the EM algorithm with and without inspection over 500 runs with their initial centers randomly selected from the data samples. We inspect the 4-D variables one after another. Rather than sampling the 4-D polar coordinates, which needs three angular axes, we only inspect two-dimensional balls. 
That is, for center ","element":"span"},{"style":{"height":11.68},"width":30.32,"height":29.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-19.png","element":"img","alt":" i0","inline":true,"padRight":true},{"text":"and radius ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":", the inspections sample the following points ","element":"span"},{"style":{"height":9.97},"width":42.91,"height":24.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-20.png","element":"img","alt":" zi0","inline":true,"padRight":true},{"text":"that have only two angular variables ","element":"span"},{"style":{"height":13.2},"width":88.91,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-21.png","element":"img","alt":" θ1, θ2","inline":true},{"text":":","element":"span"}],[{"style":{"width":"73%"},"width":1308,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/15-22.png","element":"img"}],[{"style":{"width":"100%"},"width":1768,"height":541,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/16-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Fig. 6 ","element":"figcaption","subtype":"caption"},{"id":"id-50","text":"Synthetic Gaussian data with 4 clusters. Left: clustering result; Right: objective value in the iteration","element":"figcaption","subtype":"caption"}],[{"text":"Such inspections are very cheap yet still effective. Similar lower-dimensional inspections should be used with high-dimensional problems. 
We choose ","element":"span"},{"style":{"height":15.6},"width":602.28,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/16-1.png","element":"img","alt":" R = 3, ∆R = 1, ∆θ1 = ∆θ2 = π/10","inline":true},{"text":", and a descent threshold ","element":"span"},{"style":{"height":13.63},"width":154.94,"height":34.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/16-2.png","element":"img","alt":"ν = 10−3","inline":true},{"text":". The results are shown in Figures ","element":"span"},{"href":"#id-51","text":"7 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-52","text":"8","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"44%"},"width":793,"height":651,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/16-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Fig. 7 ","element":"figcaption","subtype":"caption"},{"id":"id-51","text":"Histogram of the final objective values in the 500 experiments","element":"figcaption","subtype":"caption"}],[{"style":{"width":"70%"},"width":1252,"height":485,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/16-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Fig. 8 ","element":"figcaption","subtype":"caption"},{"id":"id-52","text":"Left: 3-D distribution of Iris data and clustering result in one trial; Right: objective value in the iteration of this ","element":"figcaption","subtype":"caption"},{"text":"trial.","element":"figcaption","subtype":"caption"}],[{"text":"Among the 500 runs, EM gets stuck at a high objective value of 0.48 in 109 runs. With the help of inspection, it manages to locate the optimal objective value around 0.263 every time. 
The average radius-at-escape during the inspections is 2, and the average number of inspections is merely 1.","element":"span"}],[{"text":"3.3 Nonconvex robust linear regression","element":"span"}],[{"text":"In linear regression, we are given a linear model","element":"span"}],[{"style":{"width":"13%"},"width":240,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-0.png","element":"img"}],[{"text":"and the data points ","element":"span"},{"style":{"height":15.6},"width":776.41,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-1.png","element":"img","alt":" (x1, y1), (x2, y2), . . . , (xn, yn), yi ∈ R, xi ∈ Rn","inline":true},{"text":". When there are outliers in the data, robustness is necessary for the regression model. Here we consider Tukey’s bisquare loss, which is bounded, nonconvex and defined as:","element":"span"}],[{"style":{"width":"41%"},"width":739,"height":163,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-2.png","element":"img"}],[{"text":"The empirical loss function based on ","element":"span"},{"style":{"height":9.6},"width":20,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-3.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"is","element":"span"}],[{"style":{"width":"26%"},"width":464,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-4.png","element":"img"}],[{"text":"A commonly used algorithm for this problem is the Iteratively Reweighted Least Squares (IRLS) algorithm [","element":"span"},{"href":"#id-53","text":"5","element":"a"},{"text":"], which may get stuck at a local minimizer. Our Run-and-Inspect Method can help IRLS escape from local minimizers and converge to a global minimizer. 
Our test uses the model","element":"span"}],[{"style":{"width":"13%"},"width":234,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":6.8},"width":18,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-6.png","element":"img","alt":" ε","inline":true,"padRight":true},{"text":"is noise. We generate ","element":"span"},{"style":{"height":16},"width":749.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-7.png","element":"img","alt":" xi ∼ N(0, 1), εi ∼ N(0, 0.5), i = 1, 2, . . . , 20","inline":true},{"text":". We also create 20% outliers by adding extra noise generated from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5)","element":"span"},{"text":". And we use Algorithm ","element":"span"},{"href":"#id-40","text":"1 ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":16.43},"width":442.03,"height":41.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-8.png","element":"img","alt":" R = 5, dR = 0.5, ν = 10−3","inline":true},{"text":". For Tukey’s function, ","element":"span"},{"style":{"height":8.08},"width":34.56,"height":20.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-9.png","element":"img","alt":" r0","inline":true,"padRight":true},{"text":"is set to be 4.685. The results are shown in Figure ","element":"span"},{"href":"#id-54","text":"9","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"71%"},"width":1266,"height":489,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-10.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Fig. 
9 ","element":"figcaption","subtype":"caption"},{"id":"id-54","text":"The left picture displays the contour of the empirical loss ","element":"figcaption","subtype":"caption"},{"style":{"height":12.8},"width":57.31,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/17-11.png","element":"img","alt":" l(β)","inline":true,"padRight":true},{"text":"and the path of iterates. Starting from the initial point, IRLS converges to a shallow local minimum. With the help of inspection, it escapes and then converges to the global minimum. The right picture shows the linear model obtained by IRLS with (red) and without (magenta) inspection.","element":"figcaption","subtype":"caption"}],[{"id":"id-58","text":"3.4 Nonconvex compressed sensing","element":"span"}],[{"text":"Given a matrix ","element":"span"},{"style":{"height":16.43},"width":333.7,"height":41.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-0.png","element":"img","alt":" A ∈ Rm×n (m < n)","inline":true,"padRight":true},{"text":"and a sparse signal ","element":"span"},{"style":{"height":12.03},"width":119.54,"height":30.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-1.png","element":"img","alt":" x ∈ Rn","inline":true},{"text":", we observe a vector","element":"span"}],[{"id":"id-55","style":{"width":"7%"},"width":136,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-2.png","element":"img"}],[{"text":"The problem of compressed sensing aims to recover ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"approximately. 
Besides ","element":"span"},{"style":{"height":7.2},"width":33.07,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-3.png","element":"img","alt":" ℓ0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":7.2},"width":33.06,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-4.png","element":"img","alt":" ℓ1","inline":true},{"text":"-norm, ","element":"span"},{"style":{"height":15.6},"width":181.88,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-5.png","element":"img","alt":" ℓp(0 < p <","inline":true,"padRight":true},{"text":"1) ","element":"span"},{"text":"quasi-norm is often used to induce sparse solutions. Below we use ","element":"span"},{"style":{"height":13.91},"width":33.85,"height":34.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-6.png","element":"img","alt":" ℓ 12","inline":true,"padRight":true},{"text":"and try to solve the problem","element":"span"}],[{"style":{"width":"35%"},"width":622,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-7.png","element":"img"}],[{"text":"by cyclic coordinate update. 
At iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", it updates the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"th coordinate, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= mod(","element":"span"},{"style":{"fontStyle":"italic"},"text":"k, n","element":"span"},{"text":") + 1","element":"span"},{"text":", via","element":"span"}],[{"style":{"width":"76%"},"width":1350,"height":183,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-8.png","element":"img"}],[{"text":"It has been proved in [","element":"span"},{"href":"#id-8","text":"26","element":"a"},{"text":"] that (","element":"span"},{"href":"#id-55","text":"21","element":"a"},{"text":") has a closed-form solution. Define","element":"span"}],[{"style":{"width":"68%"},"width":1202,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-9.png","element":"img"}],[{"text":"Then","element":"span"}],[{"style":{"width":"25%"},"width":454,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-10.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.63},"width":194.14,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-11.png","element":"img","alt":" µ = ∥Aj∥2","inline":true},{"text":". In our experiments, we choose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"= 25","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"50","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"100 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 2","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":". 
The elements of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"are generated from ","element":"span"},{"style":{"height":22.02},"width":160.01,"height":55.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-12.png","element":"img","alt":" U(0, 1√m)","inline":true,"padRight":true},{"text":"i.i.d. The vector ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"has 10% nonzeros with their values generated from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"8) ","element":"span"},{"text":"i.i.d. Set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":". Here, we apply coordinate descent with inspection (CDI) and compare it with standard coordinate descent (CD) and the half thresholding algorithm (","element":"span"},{"style":{"fontStyle":"italic"},"text":"half ","element":"span"},{"text":") [","element":"span"},{"href":"#id-8","text":"26","element":"a"},{"text":"]. For every pair of ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"m, n","element":"span"},{"text":")","element":"span"},{"text":", we choose the parameter ","element":"span"},{"style":{"height":10.8},"width":144.3,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-13.png","element":"img","alt":" λ = 0.05","inline":true,"padRight":true},{"text":"and run 100 experiments. 
When the iterates stagnate at a local minimizer ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-14.png","element":"img","alt":" ¯x","inline":true},{"text":", we perform a blockwise inspection with each block consisting of two coordinates. Checking all pairs of two coordinates is expensive and not necessary since ","element":"span"},{"style":{"height":9.41},"width":24,"height":23.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-15.png","element":"img","alt":" ¯x","inline":true,"padRight":true},{"text":"is sparse. We improve the efficiency by pairing only ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, j ","element":"span"},{"text":"where ","element":"span"},{"style":{"height":14.61},"width":235.81,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-16.png","element":"img","alt":" xi ̸= 0, xj = 0","inline":true},{"text":". Similar to previous experiments, we sample points from the outer of the 2D ball toward the inner. We choose ","element":"span"},{"style":{"height":13.6},"width":353.16,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-17.png","element":"img","alt":" R = 0.5, ∆R = 0.05","inline":true},{"text":". The results are presented in Table ","element":"span"},{"href":"#id-56","text":"1 ","element":"a"},{"text":"and Figure ","element":"span"},{"href":"#id-57","text":"10","element":"a"},{"text":". CDI shows a significant improvement over its competitors.","element":"span"}],[{"text":"3.5 Nonconvex Sparse Logistic Regression","element":"span"}],[{"text":"Logistic regression is a widely-used model for classification. 
Usually we are given a set of training data ","element":"span"},{"style":{"height":18.43},"width":262.86,"height":46.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-18.png","element":"img","alt":"{(x(i), y(i))}Ni=1","inline":true},{"text":", where ","element":"span"},{"style":{"height":15.23},"width":172.76,"height":38.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-19.png","element":"img","alt":" x(i) ∈ Rd","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.43},"width":220.96,"height":46.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-20.png","element":"img","alt":" y(i) ∈ {0, 1}","inline":true},{"text":". The label ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"is assumed to satisfy the following ","element":"span"},{"text":"conditional distribution:","element":"span"}],[{"style":{"width":"99%"},"width":1764,"height":396,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/18-21.png","element":"img"}],[{"style":{"width":"100%"},"width":1768,"height":425,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/19-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Table 1 ","element":"figcaption","subtype":"caption"},{"id":"id-56","text":"Statistics of 100 compressed sensing problems solved by three ","element":"figcaption","subtype":"caption"},{"style":{"height":17.4},"width":206.22,"height":43.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/19-1.png","element":"img","alt":" ℓ 12 algorithms","inline":true}],[{"text":"1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"half ","element":"span"},{"text":": iterative half thresholding; CD: coordinate descent; CDI: CD with inspection.","element":"span"}],[{"text":"2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is the ratio of correctly identified nonzeros to true nonzeros, averaged over the 100 tests (100% is impossible due to noise and model error); ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"is the number of tests with all true nonzeros identified; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"is the number of tests in which the returned points yield lower objective values than that of the true signal (only model error, no algorithm error). Higher ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a, b, c ","element":"span"},{"text":"are better.","element":"span"}],[{"text":"3. “ave obj” is the average of the objective values; lower is better.","element":"span"}],[{"style":{"width":"46%"},"width":816,"height":646,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/19-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Fig. 10 ","element":"figcaption","subtype":"caption"},{"id":"id-57","text":"Comparison of the true signal ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"x ","element":"figcaption","subtype":"caption"},{"text":"and signals recovered from ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"half","element":"figcaption","subtype":"caption"},{"text":", CD, CDI.","element":"figcaption","subtype":"caption"}],[{"text":"In one experiment, CDI recovered all positions of nonzeros of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":", while CD failed to recover ","element":"span"},{"style":{"height":11.6},"width":295.89,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/19-3.png","element":"img","alt":" x116, x134. 
The half","inline":true,"padRight":true},{"text":"algorithm just got stuck at a local minimizer far from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":".","element":"span"}],[{"text":"which is convex and differentiable. When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"is relatively small, we need variable selection to avoid over-fitting. In this test, we use the minimax concave penalty (MCP) [","element":"span"},{"href":"#id-8","text":"28","element":"a"},{"text":"]:","element":"span"}],[{"style":{"width":"67%"},"width":1191,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/19-4.png","element":"img"}],[{"text":"The penalty ","element":"span"},{"style":{"height":20.18},"width":93.5,"height":50.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/19-5.png","element":"img","alt":" pMCPλ,γ","inline":true,"padRight":true},{"text":"is proximable with","element":"span"}],[{"style":{"width":"34%"},"width":613,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/19-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.6},"width":439.35,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/19-7.png","element":"img","alt":" Sλ(z) = (|z| − λ)+ sign(z)","inline":true},{"text":". We apply the prox-linear (PL) algorithm to solve this problem. When it nearly converges, inspection is then applied. 
We design our experiments according to [","element":"span"},{"href":"#id-8","text":"20","element":"a"},{"text":"]: we consider ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= 50 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= 200 ","element":"span"},{"text":"and","element":"span"}],[{"text":"assume the true ","element":"span"},{"style":{"height":11.63},"width":36.3,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-0.png","element":"img","alt":" θ∗","inline":true,"padRight":true},{"text":"has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"non-zero entries. In the training procedure, we generate the data i.i.d. from a standard Gaussian distribution, and we randomly choose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"non-zero elements, drawn i.i.d. from a standard Gaussian distribution, to form ","element":"span"},{"style":{"height":11.63},"width":36.29,"height":29.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-1.png","element":"img","alt":" θ∗","inline":true},{"text":". The labels are generated by ","element":"span"},{"style":{"height":18.03},"width":338.16,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-2.png","element":"img","alt":" y = 1(xT θ + w ≥ 0)","inline":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"w ","element":"span"},{"text":"is sampled according to the Gaussian distribution ","element":"span"},{"style":{"height":17.63},"width":158.29,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-3.png","element":"img","alt":" N(0, ϵ2I)","inline":true},{"text":". 
We use PL iteration with and without inspection to recover ","element":"span"},{"style":{"height":10.4},"width":18,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-4.png","element":"img","alt":" θ","inline":true},{"text":". After that, we generate 1000 random test data points to compute the test error of the ","element":"span"},{"style":{"height":10.4},"width":18,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-5.png","element":"img","alt":" θ","inline":true},{"text":". We set the parameter ","element":"span"},{"style":{"height":14},"width":584.96,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-6.png","element":"img","alt":" β = 1.5 − 0.06 × K, λ = 1, γ = 5","inline":true,"padRight":true},{"text":"and the step size 0.5 for PL iteration. For each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":0},"width":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-7.png","element":"img","alt":" ϵ","inline":true},{"text":", we run 100 experiments and calculate the mean and variance of the results. The inspection parameters are ","element":"span"},{"style":{"height":13.6},"width":257.91,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-8.png","element":"img","alt":" R = 5, ∆R = 1","inline":true},{"text":", and ","element":"span"},{"style":{"height":15.6},"width":186.14,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-9.png","element":"img","alt":" ∆θ = π/10","inline":true},{"text":". The sample points in inspections are similar to those in section ","element":"span"},{"href":"#id-58","text":"3.4","element":"a"},{"text":". The results are presented in Table ","element":"span"},{"href":"#id-59","text":"2","element":"a"},{"text":". 
The objective values and test errors of PLI, the PL algorithm with inspection, are significantly better than those of the native PL algorithm. On the other hand, its cost is 3 to 6 times as high.","element":"span"}],[{"style":{"width":"72%"},"width":1286,"height":540,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-10.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Table 2 ","element":"figcaption","subtype":"caption"},{"id":"id-59","text":"Sparse logistic regression results of 100 tests. PL is the prox-linear algorithm. PLI is the PL algorithm with ","element":"figcaption","subtype":"caption"},{"text":"inspection. “var” is variance.","element":"figcaption","subtype":"caption"}],[{"text":"We plot the convergence history of the objective values in one trial and the recovered ","element":"span"},{"style":{"height":10.4},"width":18,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-11.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"in Figure ","element":"span"},{"href":"#id-60","text":"11","element":"a"},{"text":". The inspection clearly helps learn a better ","element":"span"},{"style":{"height":10.4},"width":18,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-12.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"by reaching a smaller objective value.","element":"span"}],[{"style":{"width":"83%"},"width":1470,"height":565,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/20-13.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Fig. 
11 ","element":"figcaption","subtype":"caption"},{"id":"id-60","text":"Sparse logistic regression result in one trial.","element":"figcaption","subtype":"caption"}]]},{"heading":"4 Conclusions","paragraphs":[[{"text":"In this paper, we have proposed a simple and efficient method for nonconvex optimization, based on our analysis of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizers. The method applies local inspections to escape from local minimizers or to verify that the current point is an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"-local minimizer. For a function that can be implicitly decomposed into a smooth, strongly convex function plus a restricted nonconvex function, our method returns an (approximate) global minimizer. Although some of the tested problems may not possess the assumed decomposition, numerical experiments support the effectiveness of the proposed method.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"References","element":"span"}],[{"id":"id-20","text":"1. Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y.: Entropy-SGD: Biasing gradient descent into wide valleys. ","element":"span"},{"text":"arXiv preprint arXiv:1611.01838 (2016)","element":"span"}],[{"id":"id-21","text":"2. Chaudhari, P., Oberman, A., Osher, S., Soatto, S., Carlier, G.: Deep relaxation: Partial differential equations for ","element":"span"},{"text":"optimizing deep neural networks. arXiv preprint arXiv:1704.04932 (2017)","element":"span"}],[{"id":"id-11","text":"3. Conn, A.R., Gould, N.I., Toint, P.L.: Trust region methods. SIAM (2000)","element":"span"}],[{"id":"id-13","text":"4. Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. No. 8 in MPS-SIAM series ","element":"span"},{"text":"on optimization. 
Society for Industrial and Applied Mathematics / Mathematical Programming Society, Philadelphia (2009)","element":"span"}],[{"id":"id-53","text":"5. Fox, J.: An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks, Calif (2002) ","element":"span"},{"id":"id-6","text":"6. Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points—online stochastic gradient for tensor ","element":"span"},{"text":"decomposition. In: Conference on Learning Theory, pp. 797–842 (2015)","element":"span"}],[{"id":"id-7","text":"7. Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Advances in Neural Information ","element":"span"},{"text":"Processing Systems, pp. 2973–2981 (2016)","element":"span"}],[{"id":"id-9","text":"8. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. ","element":"span"},{"text":"Mathematical Programming ","element":"span"},{"style":{"fontWeight":"bold"},"text":"156","element":"span"},{"text":"(1-2), 59–99 (2016)","element":"span"}],[{"id":"id-17","text":"9. Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., Jordan, M.I.: How to escape saddle points efficiently. arXiv preprint ","element":"span"},{"text":"arXiv:1703.00887 (2017)","element":"span"}],[{"id":"id-30","style":{"width":"99%"},"width":1764,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/21-0.png","element":"img"}],[{"id":"id-19","text":"11. Kirkpatrick, S., Gelatt Jr, C.D., Vecchi, M.P.: Optimization by simulated annealing. In: Spin Glass Theory and ","element":"span"},{"text":"Beyond: An Introduction to the Replica Method and Its Applications, pp. 339–348. World Scientific (1987)","element":"span"}],[{"id":"id-12","text":"12. Martínez, J.M., Raydan, M.: Cubic-regularization counterpart of a variable-norm trust-region method for ","element":"span"},{"text":"unconstrained minimization. 
Journal of Global Optimization ","element":"span"},{"style":{"fontWeight":"bold"},"text":"68","element":"span"},{"text":"(2), 367–385 (2017)","element":"span"}],[{"id":"id-15","text":"13. Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. ","element":"span"},{"text":"Mathematical Programming ","element":"span"},{"style":{"fontWeight":"bold"},"text":"108","element":"span"},{"text":"(1), 177–205 (2006)","element":"span"}],[{"id":"id-18","text":"14. Panageas, I., Piliouras, G.: Gradient descent converges to minimizers: The case of non-isolated critical points. CoRR, ","element":"span"},{"text":"abs/1605.00405 (2016)","element":"span"}],[{"id":"id-16","text":"15. Pascanu, R., Dauphin, Y.N., Ganguli, S., Bengio, Y.: On the saddle point problem for non-convex optimization. ","element":"span"},{"text":"arXiv preprint arXiv:1405.4604 (2014)","element":"span"}],[{"id":"id-5","text":"16. Peng, Z., Wu, T., Xu, Y., Yan, M., Yin, W.: Coordinate friendly structures, algorithms and applications. Annals of ","element":"span"},{"text":"Mathematical Sciences and Applications ","element":"span"},{"style":{"fontWeight":"bold"},"text":"1","element":"span"},{"text":"(1), 57–119 (2016)","element":"span"}],[{"id":"id-28","text":"17. Polyak, B.T.: Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi ","element":"span"},{"text":"Fiziki ","element":"span"},{"style":{"fontWeight":"bold"},"text":"3","element":"span"},{"text":"(4), 643–653 (1963)","element":"span"}],[{"id":"id-10","text":"18. Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.: Stochastic variance reduction for nonconvex optimization. In: ","element":"span"},{"text":"International conference on machine learning, pp. 314–323 (2016)","element":"span"}],[{"id":"id-22","text":"19. Sagun, L., Bottou, L., LeCun, Y.: Singularity of the Hessian in deep learning. 
arXiv preprint arXiv:1611.07476 ","element":"span"},{"text":"(2016)","element":"span"}],[{"id":"id-8","style":{"width":"99%"},"width":1766,"height":656,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1711.08172/images/21-1.png","element":"img"}]]}],"_version":"3.3.2"},"paperNode":"$28:props:children:props:children:0:props:product"}]]