Finding Approximate Local Minima Faster than Gradient Descent

2016·Arxiv

Abstract

Abstract

We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which scales linearly in the underlying dimension and the number of training examples. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.

1 Introduction

Finding a global minimizer of a non-convex optimization problem is NP-hard. Thus, the standard goal of efficient non-convex optimization algorithms is instead to find a local minimum. This problem has become increasingly important as the state-of-the-art in machine learning is attained by non-convex models, many of which are variants of deep neural networks. Experiments in [10, 11, 21] suggest that fast convergence to a local minimum is sufficient for training neural nets, while convergence to critical points (points with vanishing gradients) is not. Theoretical works have also affirmed the same phenomenon for other machine learning problems (see [5, 6, 18, 19] and the references therein).

In this paper we give a provable linear-time algorithm for finding an approximate local minimum in smooth non-convex optimization. It applies to a general setting of machine learning optimization, and in particular to the optimization problem of training deep neural networks. Furthermore, the running time bound of our algorithm is the fastest known even for the more lenient task of computing a point with vanishing gradient (called a critical point), for a wide range of parameters.

Formally, the problem of unconstrained mathematical optimization is stated in general terms as that of finding the minimum value that a function attains over Euclidean space, i.e.

If f is convex, the above formulation is convex optimization and is solvable in (randomized) polynomial time even if only a valuation oracle to f is provided. A crucial property of convex functions is that “local optimality implies global optimality”, allowing for greedy algorithms to reach the global optimum efficiently. Unfortunately, this is no longer the case if f is nonconvex; indeed, even a degree four polynomial can be NP-hard to optimize [23], or even just to check whether a point is not a local minimum [25]. Thus, for non-convex optimization one has to settle for the more modest goal of reaching approximate local optimality efficiently.

Note that a particular interest to machine learning is the optimization of functions of the finite-sum form

Such functions arise when minimizing loss over a training set, where each example i in the set corresponds to one loss function in the summation.

We say that the function f is second-order smooth if it has Lipschitz continuous gradient and Lipschitz continuous Hessian. We say that a point -approximate local minimum if it satisfies (following the tradition of [28]):

where denotes the Euclidean norm of a vector. We say that a point -critical point if it satisfies the gradient condition above, but not necessarily the second-order condition. Critical points include saddle points in addition to local minima. We remark that -approximate local minima (even with function value. However, if we assume in addition the function satisfies the (robust) strict-saddle property [15, 24] (see Section 2 for the precise definition), then an -approximate local minimum is guaranteed to be close to a local minimum for sufficiently small

Our main theorem below states the time required for the proposed algorithm FastCubic to find an -approximate local minimum for second-order smooth functions.

The full statement of Theorem 1 can be found in Section 2.

Hessian-vector products can be computed in linear time —meaning O(nd)— for many machine learning problems such as generalized linear models and training neural networks [1, 29]. We explain this more generally in Appendix A. Therefore,

Another important aspect of our algorithm is that even in terms of just reaching an point, i.e. a point that satisfies without any second-order guarantee, FastCubic is faster than all previous results (see Table 1 for a comparison).

The fastest methods to find critical points for a smooth non-convex function are gradient descent and its extensions, jointly known as first-order methods. These methods are extremely efficient in terms of per-iteration complexity; however, they necessarily suffer from a 1convergence rate [27], to the best of our knowledge, in previous results only higher-order methods seem capable of breaking this 1bottleneck [28]. For certain ranges of parameters, our FastCubic finds local minima even faster than first-order methods, even though they only find critical points. This is depicted in Table 1.

Table 1: Comparison of known methods.

1.1 Related work

Methods that Provably Reach Critical Points. Recall that only a gradient oracle is needed to reach a critical point. The most commonly used algorithm in practice for training non-convex learning machines such as deep neural networks is stochastic gradient descent (SGD), also known as stochastic approximation [30] and its derivatives. Some practical enhancements widely used in practice are based on Nesterov’s acceleration [26] and adaptive regularization [12]. The variance reduction technique, introduced in [32], was extremely successful in convex optimization, but only recently there was a non-convex counterpart with theoretical benefits introduced [2].

Methods that Provably Reach Local Minima. The recent work of Ge et al. [17] showed that a noise-injected version of SGD in fact converges to local minima instead of critical points, as long as the underlying non-convex function is strict-saddle. Their theoretical running time is a large polynomial in the dimension and not competitive with our method (see Table 1).

The work of Lee et al. [24] shows that gradient descent, starting from a random point, almost surely converges to a local minimum of a strict-saddle function. The rates of convergence and precise step-sizes that are required are, however, yet unknown.

If second-order information (i.e., the Hessian oracle) is provided, the cubic-regularization method of Nesterov and Polyak [28] converges in ) iterations. However, each iteration of Nesterov- Polyak requires solving a cubic function which, in general, takes time super-linear in the input representation.

One natural direction is to apply an approximate trust region solver, such as the linear-time solver of [22], to approximately solve the cubic regularization subroutine of Nesterov-Polyak. However, the approximation needed by a naive calculation makes this approach even slower than vanilla gradient descent. Our main challenge is to obtain approximate second-order local-minima and simultaneously improve upon gradient descent.

Independently of this paper and concurrently[7] develop an accelerated gradient descent method that achieves the same running time for finding an approximate local minimum as in our paper. Remarkably, the same running time is obtained via a very different technique.

1.2 Our Techniques

Our algorithm is based on the cubic regularization method of Nesterov and Polyak [8, 9, 28]. At a high level, cubic regularization states that if we can minimize a cubic function

exactly, where is the second-order smoothness of the function f, then we can iteratively perform updates , and this algorithm converges to an -approximate local minimum in ) iterations. Unfortunately, solving this cubic minimization problem exactly, to the best of our knowledge, requires a running time of where is the matrix multiplication constant. Getting around this requires five observations.

The first observation is that, minimizing m(h) up to a constant multiplicative approximation (plus a few other constraints) is sufficient for showing an iteration complexity of proof techniques to show this observation are based on extending Nesterov and Polyak.

The second observation is that the minimizer ) must be of the form 0 is some constant satisfying is the smallest eigenvector of H and denotes the pseudo-inverse of a matrix. This can be viewed as moving in a mixture direction between choosing , and choosing h to follow a shifted Newton’s direction Intuitively, we wish to reduce both the computation of (to Hessian-vector products.

The first task of computing (can be slow, and even if is strictly positive-definite, computing it has a complexity depending on the (possibly huge) condition number of is that it suffices to pick some so both (1) the condition number of is small and (2) the vectors (are close. This relies on the structure of m(h).

The second task of computing v has a complexity depending on 1is the target additive error [13, 14]. The fourth observation is that the choice suffices for the outer loop of cubic regularization to make sufficient progress. This reduces the complexity to compute v.

Finally, finding the correct value itself is as hard as minimizing design an iterative scheme that makes only logarithmic number of guesses on . This procedure either finds the correct one (via binary search), or finds an approximate one, , but satisfying (being sufficiently close.

Putting all the observations together, and balancing all the parameters, we can obtain a cubic minimization subroutine (see FastCubicMin in Algorithm 2) that runs in time

2 Preliminaries and Main Theorem

We use to denote the Euclidean norm of a vector and the spectral norm of a matrix. For a symmetric matrix M we denote by ) respectively the maximum and minimum eigenvalues of M. We denote by is positive semidefinite (PSD). For a PSD matrix M, we denote by its pseudo-inverse if M is not strictly positive definite.

We make the following Lipschitz continuity assumptions for the gradient and Hessian of the target function f. Namely, there exist 0 such that

Definition 2.1. We assume the following complexity parameters on the access to f(x):

Definition 2.2. We say that f is of finite-sum form if for each . In this case, we define to be the time complexity to computearbitrary

Next we define the strict-saddle function for which an -approximate local minimum is almost equivalent to a local minimum [15, 24].

Definition 2.3 (strict saddle)is twice differentiable. For say -strict saddle if every satisfies at least one of the following three conditions:

We see that if a function is ()-strict saddle, then for -approximate local minimum is -close to some local minimum.

2.1 Main Results

The finite-sum setting captures much of supervised learning, including Neural Networks and Generalized Linear Models. The main theorem which we show in our paper is as follows:

Two Known Subroutines. Our running time of FastCubic relies on the following recent results for approximate matrix inverse and approximate PCA:

Theorem 2.4 (Approximate Matrix Inverse). Suppose matrix for constants . Then, we can compute vector x satisfying

using Accelerated gradient descent (AGD) in iterations, each requiring O(d) time plus the time needed to multiply M with a vector.

Moreover, suppose where each is symmetric and satisfies can be computed in time and vector b, then accelerated SVRG [4, 33] computes a vector x that satisfies equation (2.2) in time

Above, the SVRG based running time shall be used only towards our finite-sum case in Definition 2.2.

Theorem 2.5 (AppxPCA [3, 13, 14])be a symmetric matrix with eigenvalues 1 . With probability at least 1, AppxPCA produces a unit vector w satisfying The total running time is ˜

3 Our Fast Cubic Regularization Algorithm

Recall that the cubic regularization method of Nesterov and Polyak [28] studies the following upper bound on the change in objective value as we move from a point : (it follows simply

from the Taylor series truncated to the third order)

Denote by an arbitrary minimizer of ). We propose in this paper a subroutine FastCubicMin to minimizes ) approximately. Note that FastCubicMin returns two vectors then choose to be either , whichever gives a smaller value for

Before discussing the details of FastCubicMin, let us first state a main theorem for

Above, the first guarantee promises that we are either done (because ) is close to zero), or we obtain a 1/3000 multiplicative approximation to ). Our second guarantee in Theorem 2 promises that when we are done (because ) is close to zero), the output vector roughly similar in Euclidean norm and have a small gradient . Our third guarantee gives the time complexity of FastCubicMin.

Now, our final algorithm FastCubic for finding the -approximate local minimum of f(x) is included in Algorithm 1. It simply iteratively calls FastCubicMin to find an approximate minimizer, and it then stops whenever for some large constant c.

Roadmap. In Section 4 we show why Theorem 2 implies Theorem 1. All the remaining sections are for the purpose of proving Theorem 2. Because our FastCubicMin is very technical, instead of stating what the algorithm is right away, we decide to take a different path. In Section 5, we first state a lemma characterizing “what looks like”. In Section 6, we provide a set of sufficient conditions which “look similar” to the characterization of , and show that as long as these conditions are met, Theorem 2-a and 2-b follow easily. Finally, in Section 7, we state FastCubicMin and explain why it satisfies these sufficient conditions and why it runs in the aforementioned time.

4 Theorem 2 implies Theorem 1

In this section, we show that Theorem 2 implies Theorem 1. It relies on the following lemma (proved in Appendix B) regarding the sufficient condition for us to reach an -approximate local minimum.

is an approximate minimizer of satisfying

then we have that

Proof of Theorem 1 from Theorem 2. When FastCubic terminates, we have

fore, it satisfies according to Theorem 2-a. Combining this with Theorem 2-b and Corollary 4.1, we conclude that in the last iteration of FastCubic, our output satisfies . This finishes the proof with respect to the accuracy conditions.

As for the running time, in every iteration except for the last one, . Therefore by (3.1), we must have decreased the objective by at least Ωround, and this cannot happen for more than iterations. The final running time of FastCubic follows from this bound together with Theorem 2-c.

Therefore, in the rest of the paper it suffices to study FastCubicMin and prove Theorem 2.

5 Characterization Lemma of the Minimizer h∗

For notational simplicity in this and the subsequent sections we focus on the following problem:

Recall from the previous section that we have denoted by an arbitrary minimizer of m(h). We have the following lemma which characterizes : (a variant of this lemma has appeared in [8], and we prove it in the appendix for the sake of completeness)

is a minimizer of m(h) if and only if there exists

The objective value in this case is given by

The following corollary comes from Lemma 5.1 and its proof:

in Lemma 5.1 is unique, and for every satisfying have

In the above characterization, we have a crude upper bound on

defined in Lemma 5.1.

and therefore Corollary 5.2.

6 Suﬃcient Conditions for Theorem 2-a and 2-b

Without worrying about the design of FastCubicMin at this moment, let us first state a set of sufficient conditions under which the assumptions in Theorem 2-a can be satisfied.

Main Lemma 1. Consider an algorithm that outputs a real unit vector . Additionally, suppose numbers satisfying the following conditions:

Moreover, suppose that the outputs (satisfy one of the following two cases:

Let us compare such sufficient conditions to the characterization Lemma 5.1.

• In Case 1, up to a very small error ˜, we have essentially found a vector v that satisfies Therefore, this v should be close to for obvious reason. (This is the simple case.)

• In Case 2, we have only found a vector v that satisfies In this case, we also compute an approximate lowest eigenvector additive 1accuracy (see case 2-c). We will make sure that, as long as the conditions in 2-a hold, then either will be an approximate minimizer for

Proof of Main Lemma 1. We first consider Case 1. According to Corollary 5.2, if ˜minimizer of m(h). The following claim extends this argument to the setting when ˜

satisfy Case 1 and ˜satisfies (6.1), then

From the above lemma it follows that either satisfies the conditions of the theorem.

We now consider Case 2, and in this case we make the following two claims:

Claim 6.3.

Lemma 1 now follows from the two claims because we can output the vector which has the lowest value of ) amongst the two choices This satisfies either 3000

The missing proofs of the three claims are deferred to Appendix D.

The next main lemma shows that, under the same sufficient conditions as Main Lemma 1, we also have that Theorem 2-b holds. (Its proof is contained in Appendix E.)

Main Lemma 2. In the same setting as Main Lemma 1, suppose output vector v satisfies the following conditions:

7 Main Algorithms for Theorem 2

We are now ready to state our main algorithm FastCubicMin and sketch why it satisfies the sufficient conditions in Main Lemma 1. As described in Algorithm 2, our algorithm starts with a very large choice and decreases it gradually. At each iteration i, it computes an approximate inverse v satisfying with respect to the current . Then there are three cases, depending on whether is approximately equal to, larger than, or smaller than 2. At a high level, if it is “equal”, then we have met Case 1 in Main Lemma 1; if it is “larger”, then we can binary search the correct value of in the interval []; and if it is “smaller”, then we need to compute an approximate eigenvector and carefully choose the next point

We state our main lemma below regarding the correctness and running time of FastCubicMin.

Main Lemma 3. FastCubicMin in Algorithm 2 outputs a real a unit vector satisfying one of the two sufficient conditions in Main Lemma 1. We also have that the procedure can be implemented in a total running time of

Here ˜O hides logarithmic factors in

We prove the correctness half of Main Lemma 3, and defer its running time analysis to Appendix G.

7.1 Correctness Half of Main Lemma 3

We will now establish the correctness of our algorithm. We first observe that the BinarySearch subroutine returns () that satisfies Case 1 of Main Lemma 1.

Fact 7.1. BinarySearch outputs a pair

Proof. The latter is guaranteed by line 3 in BinarySearch, and the former is implied by the latter because

We also establish the following invariants regarding the values . (Proof in Appendix F.)

Lemma 7.2. The following statements hold for all i until FastCubicMin terminates

Moreover when FastCubicMin terminates at Line 20 we have

We now prove the output (satisfies the sufficient conditions of Main Lemma 1.

Correctness Proof of Main Lemma 3. We carefully verify these sufficient conditions:

• Lemma 7.2 implies • from Lemma 7.2 implies . It is now immediate that the choice of ˜on Line 2 satisfies the Condition (6.1) in the assumption of Main Lemma 1.

• Since ˜it follows that (0 which proves Condition (6.2) in Main Lemma 1.

• We now verify Case 1 and 2 in the assumption of Main Lemma 1. At the beginning of the algorithm, our choice ensures (using Proposition 5.3) that Let us now consider the various places where the algorithm outputs:

– If FastCubicMin terminates at Line 7, then we have and additionally

– If FastCubicMin terminates at Line 9, then Obviously, we must have 1 in this case because . Therefore, Line 10 must have been reached at the previous iteration, so it implies Together, these two imply that we can call ). Owing to Fact 7.1, the subroutine outputs a pair () satisfying the Case 1 requirement of Main Lemma 1.

In sum, we have verified that all the assumptions of Main Lemma 1 hold.

Final Proof of Theorem 2. Theorem 2 is a direct corollary of our main lemmas. Main Lemma 3 ensures that the assumptions of Main Lemma 1 and Main Lemma 2 both hold. Now, using the special choice of , Theorem 2-a immediately comes from Main Lemma 1; Theorem 2-b immediately comes from Main Lemma 2; and Theorem 2-c immediately comes from Main Lemma 3. This finishes the proof of Theorem 2.

Acknowledgements

We thank Ben Recht for very helpful suggestions and corrections to a previous version. Z. Allen-Zhu is supported by an NSF Grant, no. CCF-1412958, and a Microsoft Research Grant, no. 0518584. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF or Microsoft.

Appendix

A Computing Hessian-Vector Product in Linear Time

In this section we sketch the intuition regarding why Hessian-vector products can be computed in linear time in many interesting (especially machine learning) problems. We start by showing that the gradient can be computed in linear time. The algorithm is often referred to as back-propagation, which dates back to Werbos’s PhD thesis [35], and has been popularized by Rumelharte et al. [31] for training neural networks.

Claim A.1 (back-propagation, informally stated). Suppose a real-valued function be evaluated by a differentiable circuit of size N. Then, the gradient can be computed in time O(N + d) (using a circuit of size

The claim follows from simple induction and chain-rule, and is left to the readers. In the training of neural networks, often the size of circuits that computes the objective f is proportional to (or equal to) the number of parameters d. Thus the gradient can be computed in time O(d) using a circuit of size d.

Next, we consider computing be a function from . Then, we see that if suffices to compute the gradient of g, since

We observe that g(x) can be evaluated in linear time using circuit of size O(d) since we’ve shown Thus, using Claim A.1 again on function we conclude that ) can also be computed in linear time.

B Proof of Lemma B.1 and Corollary 4.1

, it satisfies

Proof of Lemma B.1. Let us denote by ) in this proof. We begin by proving the first order condition. Note that we have

Recall is a minimizer of argmin ). The characterization result in Lemma 5.1 shows H +

They imply

where x uses (B.1) and y uses (B.2).

We compute the norm of the gradient at a point

where z follows from the Lipschitz continuity on the Hessian (2.1). This proves the first conclusion of the lemma.

As for the second-order condition, we first note that for all , by the Lipschitz continuity on the Hessian (2.1), we have . However, this implies

because if two matrices , then it must satisfyas well. We consider two cases: if 0, then we have

Otherwise, we consider the case where be the normalized eigenvector corresponding to ), and define

We calculate ) as follows:

where uses the assumption that 0. Since by definition ), we can deduce from inequality (B.7) that

Now we put together inequalities (B.5) and (B.8), and obtain

Combining (B.6) and (B.9) we finish the proof of Lemma B.1.

is an approximate minimizer of satisfying

then we have that

Proof of Corollary 4.1. First of all, our assumption that , along with inequality

(B.3), tells us that . This, together with our assumption on Since we also assume , we have from Lemma B.1 that

C Proof of Lemma 5.1 and Corollary 5.2

We begin by proving a few lemmas that characterize the system of equations.

Lemma C.1. Consider the following system of equations/inequalities in variables

The following statements hold for any solution (of the above system:

0 ensures that for any solution Furthermore, for any ), the corresponding h is uniquely defined by is invertible. If indeed ), then we have that the equation (has a solution. This implies that g has no component in the null space of , or equivalently that it has no component in the eigenspace corresponding to ). We also have that every solution of (is necessarily of the form

for some in the lowest eigenspace of H.

We will now prove the uniqueness of by contradiction. Consider two distinct values of that satisfy the system (C.1). If both ) we get that

Now note that is a strictly decreasing function over the domain and is strictly increasing over the same domain. Therefore the above two equations cannot be satisfied for two distinct ) which is a contradiction. Suppose now without loss of generality that ). Then we have that the corresponding solution is of the form

for some in the lowest eigenspace of H and g has no component in the lowest eigenspace of H. It follows that ). By a similar argument as in the first case, we can now see that the following conditions,

cannot both be satisfied for ), giving us a contradiction. This finishes the proof of Lemma C.1.

be a solution of the system (C.1). Then we have that

Proof of Lemma C.2. By the definition of the system (C.1), any solution to the system should be such that there exists some

Equality x follows because (follows because and

is a minimizer of m(h) if and only if there exists

The objective value in this case is given by

Proof of Lemma 5.1. We first compute that

For the forward direction, suppose is a minimizer of . Then, the necessary conditions 0 can be written as

Note that if 0. Thus, we only need to focus on 0 for every if that satisfies

Since are not orthogonal, there exists can be done by squaring both sides and solving the linear system in .) Squaring both sides we have

where x and y follow from (C.2) and (C.3), respectively. Since is a minimizer of m(h), we immediately have

and we conclude that (

For the backward direction, we will make use Lemma C.1 and Lemma C.2. First we note that the function m(h) is continuous and bounded from below, and there exists at least one minimizer . Suppose now there exists a and a corresponding such that () is a solution to the system C.1. The backward direction requires us to prove that must be a minimizer of m(h). By Lemma C.1 we get the following two cases.

We prove the backward direction by showing that the conditions in Equation C.2 determine the minimizer up to its norm. To this end we will use Lemma C.1 and Lemma C.2.

First we note that the function m(h) is continuous, bounded from below, and tends to +when , so there exists at least one minimizer

Suppose now there exists a and a corresponding such that () is a solution to the system (C.1). The backward direction requires us to prove that must be a minimizer of m(h). By Lemma C.1 we get the following two cases.

• If ) is the only solution to the system (C.1). By the proof of the forward direction we see that any minimizer of m(h) must satisfy system (C.1) and therefore must be the minimizer.

• If above is not the case, then be any minimizer of m(h). Lemma C.1 and the proof of the forward direction ensures that () also satisfies the system (C.1). By Lemma C.2 we get ) and therefore is a minimizer too.

is unique, and for every satisfying

Proof of Corollary 5.2. The uniqueness of follows from Lemma C.1. To prove the second part we first make some observations about the function

defined on the domain ). Note that p(y) is continuous and strictly increasing over the domain and

The corollary requires us to show that

We begin by showing the first equivalence. To see the backward direction note that if ), by the characterization of in Lemma C.1 we have that ) is a strictly increasing function. For the forward direction note that since p(y) is continuous and strictly increasing we see that the range of the function contains [0 there must exist a by the characterization in Lemma C.1 finishes the proof.

Now we will prove that . To see the forward direction note that if then 0 which contradicts the fact that p(y) is strictly increasing. For the backward direction we consider two cases. Firstly if ) the conclusion follows similarly by the monotonicity of then by Lemma C.1, we have that g has no component in the lowest eigenspace of H and therefore if we extend ) by defining

we get that p(y) is increasing in the domain ). Now from the characterization of the solution in Lemma C.1 we can see that 0 and therefore by monotonicity 0. This finishes the proof.

D Proof of Main Lemma 1

D.1 Proof of Claim 6.1

satisfy Case 1 and ˜satisfies (6.1), then

Proof of Claim 6.1. Note that by the conditions of the theorem we have that (and

according to Corollary 5.2 we must have

This also implies (using our assumption on ˜

Next, consider the value m(v)

We bound the two parts on the right hand side of (D.2) separately. The first part

Note that (0 by Equations (D.1) and (6.2). The second part of (D.2) can be bounded as follows

Above, inequality (owing to Proposition 5.3) and our assumption on ˜from (6.1). Putting these together we get that

D.2 Proofs for Claims 6.2 and 6.3

For notational simplicity, let us rotate the space into the basis in the eigenspace of H; let the i-th dimension correspond to the i-th largest eigenvalue denote the i-th coordinate of g in this basis.

Lemma 5.1 implies

where we denote by

We begin with a few auxiliary claims.

Claim D.1.

Proof of Claim D.1. We compute that

Above, x uses (D.5), and y follows because we have in the assumption and have in the assumption of Case 2 of Main Lemma 1.

Let us now consider the value of the vector . We have that

Above, x is because our assumption and assumption together imply follows from

Now, recall that the sign of is chosen so is non-positive, and therefore by our

Putting inequalities (D.8) and (D.7) together finishes the proof of Claim D.1.

We also show the following lemma, the proof of which can be seen from inequality (D.3), as part of the proof of Claim 6.1 above.

with ˜satisfying condition (6.1) then we have that

Claim D.3.

Proof of Claim D.3. We have that

Above, x is due to Lemma D.2; y uses our condition on v which gives z uses our condition (6.1) on ˜

We now bound . For this purpose first we note that if

Therefore, the sum

(Note that we have 0.) This finishes the proof of Claim D.3.

Proof of Claim 6.2. We derive that

Above, x uses equation (D.4), inequality y follows because we have in the assump- tion and have in the assumption of Case 2 of Main Lemma 1; inequality z uses Claim D.3 and Claim D.1; and inequality { uses (D.8). This finishes the proof of Claim 6.2.

Claim 6.3.

Proof of Claim 6.3. This time we lower bound slightly differently:

where x comes from the second to last inequality from (D.7) and y comes from using our assumption in Case 2 of Main Lemma 1.

Putting these together we get that

Above, x comes from (D.4), y uses Claim D.3, lower bound (D.10) and

E Proof of Main Lemma 2

Main Lemma 2. In the same setting as Main Lemma 1, suppose output vector v satisfies the following conditions:

Proof of Main Lemma 2. Let’s first note that from the value given in Lemma 5.1,

If Case 1 occurs, we have

Above, inequalities x and y both use the assumptions of Case 1; inequality z uses the fact that ] which again follows from the assumptions of Case 1 (see (D.1)); inequality { uses from Lemma 5.1 as well as our assumption (6.1) on ˜

As for the quantity , we bound it as follows

(D.1)); inequality ; and inequality } uses (E.1) together with our assumption (6.1) on ˜

If Case 2 occurs, we have

Above, inequalities x and y both use the assumptions of Case 2; inequality 1from our assumption of Case 2 as well as which comes from Lemma 5.1; inequality from Lemma 5.1 as well as our assumption (6.1) on ˜

The quantity can be bounded in an analogous manner as Case 1:

Above, inequality x uses our assumption (6.1) on ˜; inequality which appeared in (E.2); inequality z uses (E.1).

F Proof of Lemma 7.2

Lemma 7.2. The following statements hold for all i until FastCubicMin terminates

Moreover when FastCubicMin terminates at Line 20 we have

Proof of Lemma 7.2. The lemma follows via induction.

To see (a) and (b) at the base case together ensure

Suppose now for some 0 properties (a) and (b) hold. It is easy to check that and thus we have . This implies property (a) at iteration i + 1 also hold. We now proceed to show property (c) at iteration i and property (b) at iteration i + 1.

Recall that the algorithm ensures

and by the definition of ˜w we have

Now, since from the inductive assumption, it follows from the choice of ˆ

Inverting this chain of inequalities, we have

From this we derive the following implications:

If Condition (F.4) happens, our algorithm FastCubicMin outputs on Line 20; in such a case (F.4) implies our desired inequality . If Condition (F.5) happens, our choice ˜

Therefore, we conclude that property (c) at iteration i holds and property (b) at iteration i + 1 hold because . This finishes the proof of Lemma 7.2.

G Proof of Main Lemma 3: Running Time Half

Having proven the correctness of the algorithm, we now aim to bound the overall running time of FastCubicMin, completing the proof of Main Lemma 3. We prove in Appendix H the following lemma:

iterations.

Since in our FastCubicMin algorithm, it satisfies (see Lemma 7.2), taken together with our choice of ˜

Claim G.2. Each invocation of iterations.

Claim G.3. FastCubicMin ends in at most outer loops.

Proof. According to Lemma 7.2 we have ) so the quantity ) decreases by a constant factor per iteration (except possibly in which case we shall terminate in one more iteration). On one hand, we have began with . On the other hand, we always have according to Lemma 7.2. Therefore, the total number of outer loops is at most

G.1 Matrix Inverse

Since the key component of the running time is the computation of (for different vectors b we will first bound the condition number of the matrix (via the following lemma

Claim G.4. Through out the execution of FastCubicMin and BinarySearch whenever we compute (for some vector b it satisfies

Proof of Claim G.4. We first focus on Line 5 and Line 11 of FastCubicMin. There are two cases. If , then according to we can bound 3 because the left hand side is the largest when , then by Lemma 7.2 we know This implies

We now focus on Line 3 of BinarySearch. We claim that all values iterated over BinarySearch also satisfy (because the values according to Lemma 7.2). Therefore, the same case analysis (with respect to ) also gives

Claim G.5. Line 5 of FastCubicMin and Line 3 of BinarySearch runs in time ˜

Proof. Whenever we compute (for some vector v it satisfies ; therefore to find v satisfying it suffices to find . This costs a total running time ˜)) according to Theorem 2.4.

Therefore by Theorem 2.4, every time we need to multiply a vector , the time required to approximately solve such a system is ). We will state our running time with respect to as it is the dominant operation in the algorithm.

G.2 Power Method

We now bound the running time of Power Method in Line 11 of FastCubicMin. It is a folklore (cf. [3, Appendix A]) that getting any constant multiplicative approximation to the leading eigenvector of any PSD matrix requires only O(log d) iterations, each computing Mb for some vector b. In our case, we have so we cannot compute Mb exactly. Fortunately, folklore results on inaccurate power method suggests that, as long as each Mb is computed to a very good accuracy such as ˜, then we can still get a constant multiplicative approximate leading eigenvector that satisfies Line 11 of FastCubicMin. Ignoring all the details (which are quite standard and can be found for instance in [3, Appendix A]), we claim that

Claim G.6. Line 11 of FastCubicMin runs in time ˜

G.3 Lowest Eigenvector

We will now focus on the running time for the computation of the lowest eigenvector of the Hessian which is required in Line 18. We recall Theorem 2.5 from Section 2 which uses Shift and Invert to compute the largest eigenvalue of a matrix.

Since we are concerned with the lowest eigenvector of H and by assumption we can equivalently compute the largest eigenvector of which satisfies 0 Note that computing Mv is of the same time complexity as computing Hv. By setting in Theorem 2.5 and running AppxPCA, we obtain a unit vector w such that

Above, 1. Rearranging the terms we obtain desired. In sum,

Claim G.7. The approximate lowest eigenvector computation on Line 18 runs in time ˜

G.4 Putting It All Together

Running-Time Proof of Main Lemma 3. Putting together our bounds in Claim G.2 and Claim G.2 which bound the number of iterations, as well as our bounds in Claim G.6, Claim G.5, and Claim G.7 for power method, matrix inverse, and lowest eigenvectors, we conclude that our total running time of FastCubicMin is at most ˜contains factors polylogarithmic in

By putting together our choice of ˜in Line 2 as well as the running time of either accelerated gradient descent or accelerated SVRG from Theorem 2.4 into formula ), we finish the proof of the running time part for Main Lemma 3.

H Proof of Lemma G.1

iterations.

Proof of Lemma G.1. We first note that in all iterations of BinarySearch it always satisfies

This is true at the beginning. In each of the follow-up iterations, if we have set must satisfy but this implies according to triangle inequality and ; similarly, if we have set then it must satisfy

and therefore

Now, we notice that only increases and decreases through the execution of the algorithm. Therefore by the choice of ˆ

A completely analogous argument also shows that

which means BinarySearch will stop in this iteration. In sum, we have concluded that there will be no more than iterations.

Acknowledgements

We thank Ben Recht for helpful suggestions and corrections to a previous version.

References

[1] Naman Agarwal, Brian Bullins, and Elad Hazan. Second order stochastic optimization for machine learning in linear time. arXiv preprint arXiv:1602.03943, 2016.

[2] Zeyuan Allen-Zhu and Elad Hazan. Variance Reduction for Faster Non-Convex Optimization. In ICML, 2016.

[3] Zeyuan Allen-Zhu and Yuanzhi Li. Even Faster SVD Decomposition Yet Without Agonizing Pain. In NIPS, 2016.

[4] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for Non-Strongly-Convex or Sum-of-Non- Convex Objectives. In ICML, 2016.

[5] Afonso S Bandeira, Nicolas Boumal, and Vladislav Voroninski. On the low-rank approach for semidefinite programs arising in synchronization and community detection. arXiv preprint arXiv:1602.04426, 2016.

[6] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global Optimality of Local Search for Low Rank Matrix Recovery. ArXiv e-prints, May 2016.

[7] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization. arXiv preprint 1611.00756, 2016.

[8] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation meth- ods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.

[9] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation meth- ods for unconstrained optimization. part ii: worst-case function-and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011.

[10] Anna Choromanska, Mikael Henaff, Michael Mathieu, G´erard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

[11] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.

[12] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[13] Dan Garber and Elad Hazan. Fast and simple PCA via convex optimization. ArXiv e-prints, September 2015.

[14] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. In ICML, 2016.

[15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. arXiv:1503.02101, 2015.

[16] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of the 28th Annual Conference on Learning Theory, COLT 2015, 2015.

[17] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 797–842, 2015.

[18] Rong Ge, Jason Lee, and Tengyu Ma. Matrix Completion has No Spurious Local Minimum. ArXiv e-prints, May 2016.

[19] Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions, 2016.

[20] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, pages 1–26, feb 2015.

[21] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. ArXiv e-prints, December 2014.

[22] Elad Hazan and Tomer Koren. A linear-time algorithm for trust region problems. Mathematical Programming, pages 1–19, 2015.

[23] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are np-hard. J. ACM, 60(6):45, 2013.

[24] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 1246–1257, 2016.

[25] Katta G Murty and Santosh N Kabadi. Some np-complete problems in quadratic and nonlinear programming. Mathematical programming, 39(2):117–129, 1987.

[26] Yurii Nesterov. A method of solving a convex programming problem with convergence rate , volume 269, pages 543–547, 1983.

[27] Yurii Nesterov. Introductory Lectures on Convex Programming Volume: A Basic course, volume I. Kluwer Academic Publishers, 2004.

[28] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[29] Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147– 160, 1994.

[30] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.

[31] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.

[32] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, pages 1–45, 2013. Preliminary version appeared in NIPS 2012.

[33] Shai Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In ICML, 2016.

[34] Jonathan Richard Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.

[35] Paul Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. 1974.

designed for accessibility and to further open science