Can We Find Near-Approximately-Stationary Points of Nonsmooth Nonconvex Functions?

2020·Arxiv

Abstract

Abstract

It is well-known that given a bounded, smooth nonconvex function, standard gradient-based methods can find -stationary points (where the gradient norm is less than iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. Moreover, as recently pointed out in Zhang et al. [2020], it is generally impossible to provide finite-time guarantees for finding an -stationary point of nonsmooth functions. Perhaps the most natural relaxation of this is to find points which are near such -stationary points. In this paper, we show that even this relaxed goal is hard to obtain in general, given only black-box access to the function values and gradients. We also discuss the pros and cons of alternative approaches.

1 Introduction

We consider optimization problems associated with functions is globally Lipschitz and bounded from below, but otherwise satisfies no special structure – in particular, it is not necessarily convex, and not necessarily differentiable everywhere. Clearly, in high dimensions, and for sufficiently complex , it is generally impossible to efficiently find a global minimum. However, if we relax our goal to finding (approximate) stationary points of , then the nonconvexity is no longer an issue. In particular, it is known that if – namely, differentiable and with a Lipschitz gradient – then for any , simple gradient-based algorithms can find , using only gradient computations, independent of the dimension (see for example Nesterov [2012], Jin et al. [2017], Carmon et al. [2019]).

Unfortunately, many optimization problems of interest are inherently not smooth. For example, when training modern neural networks, involving max operations and rectified linear units, the associated optimization problem is virtually always nonconvex as well nonsmooth. Thus, the positive results above, which crucially rely on smoothness, are inapplicable. Although there are positive results even for nonconvex nonsmooth functions, they tend to be either purely asymptotic in nature (e.g., Bena¨ım et al. [2005], Kiwiel [2007], Davis et al. [2018], Majewski et al. [2018]), or require additional structure which many problems of interest lack, such as weak convexity or some separation between nonconvex and nonsmooth components (e.g., Duchi and Ruan [2018], Davis and Drusvyatskiy [2019], Drusvyatskiy and Paquette [2019], Bolte et al. [2018], Beck and Hallak [2020]). This leads to the interesting question of developing black-box algorithms with non-asymptotic guarantees, for finding stationary points of general nonsmooth nonconvex functions.

In an elegant recent work, Zhang et al. [2020] raise this question, and provide several contributions. First, they point out that in a black-box model, where the algorithm accesses the function only by computing its values and gradients at various points, it is generally impossible in to find an approximately stationary point

Figure 1: The function used in the proof of Proposition 1, for . The origin (which fulfills the definition of a (1, 0)-stationary point) is marked with a red dot. Best viewed in color.

with finitely many queries, simply because the gradient can change abruptly and thus “hide” a stationary point inside some arbitrarily small neighborhood. Instead, they propose the following relaxation (based on the notion of -differential introduced in Goldstein [1977]): Letting denote the gradient set1 of at x, we say that a point -stationary point, if

where convis the convex hull. In words, there exists a convex combination of gradients at a -neighborhood of x, whose norm is at most . The authors then proceed to provide a dimension-free, gradient-based algorithm for finding -stationary points, using queries, as well as study related settings.

Although this constitutes a very useful algorithmic contribution to nonsmooth optimization, it is important to note that a -stationary point x (as defined above) does not imply that -close to an -stationary point of necessarily resembles a stationary point. Intuitively, this is because the convex hull of the gradients might contain a small vector, without any of the gradients being particular small. This is formally demonstrated in the following proposition:

Proposition 1. For any , there exists a differentiable function -Lipschitz on a ball of radius around the origin, and the origin is a -stationary point, yet

Proof. Fixing some , consider the function

(see Fig. 1 for an illustration). This function is differentiable, and its gradient satisfies

First, we note that

which implies that (0, 0) is in the convex hull of the gradients at a distance at most from the origin, hence the origin is a -stationary point. Second, we have that

For any (u, v) of norm at most , we must have , and therefore the above is at most

which implies that the function is -Lipschitz on a ball of radius around the origin. Finally, for any (u, v) of norm at most ) is at least

Remark 1. Although the function in the proof has a constant Lipschitz parameter only close to the origin, it can be easily modified to be globally Lipschitz and bounded, for example by considering the function

which is identical to in a ball of radius around the origin, but decays to 0 for larger x, and can be verified to be globally bounded and Lipschitz independent of

This result suggests that we should drop the convoperator from the definition of -stationarity in Eq. (1), or equivalently, try to find near-approximately-stationary points: Namely, getting -close to a point contains an element with norm at most . This is arguably the most natural way to relax the goal of finding -stationary points, while hopefully still getting meaningful algorithmic guarantees. Unfortunately, we will show in the following section that this already sets the bar too high: For a very large class of gradient-based algorithms (and in fact, all of them under a mild assumption), it is impossible to find near-approximately-stationary point with worst-case finite-time guarantees, for small enough constant . Thus, we cannot strengthen the notion of -stationarity in this manner, and still hope to get similar algorithmic guarantees. In Sec. 3, we further discuss the result and its implications.

2 Hardness of Finding Near-Approximate-Stationary Points

We begin by formalizing the setting in which we prove our hardness result (Subsection 2.1), followed by the main result in Subsection 2.2 and its proof in Subsection 2.3.

2.1 Preliminaries

Generalized Gradients and Stationary points. First, we formalize the notion of gradients and stationary points for nonsmooth functions (which may not be everywhere differentiable). Given a Lipschitz function and a point x in its domain, we let denote the set of generalized gradients (following Clarke [1990] and Zhang et al. [2020]), which is perhaps the most standard extension of the notion of gradients to nonsmooth nonconvex functions. For Lipschitz functions (which are almost everywhere differentiable by Rademacher’s theorem), one simple way to define it is

(namely, the convex hull of all limit points of , over all sequences of differentiable points of which converge to x). With this definition, a (Clarke) stationary point with respect to is a point x satisfying . Also, given some , we say that -stationary point with respect to , if there is some . To make the problem of getting near -stationary points non-trivial, and following Zhang et al. [2020], we will focus on functions that are both globally Lipschitz and bounded from below. In particular, we will assume that is upper bounded by a constant (this is without loss of generality, as 0 can be replaced by any other fixed reference point).

Oracle Complexity. We will study the algorithmic efficiency of the problem using the standard framework of (first-order) oracle complexity [Nemirovsky and Yudin, 1983]: Given a class of Lipschitz and bounded functions F as above, we associate with each , which for any x in the domain of , returns the value and a (generalized) gradient of . We focus on iterative algorithms which can be described via an interaction with such an oracle: At every iteration t, the algorithm chooses an iterate , and receives from the oracle a generalized gradient and value of . The algorithm then uses the values and gradients obtained up to iteration t to choose the point in the next iteration. This framework captures essentially all first-order algorithms for black-box optimization. In this framework, we fix some iteration budget T, and study the properties of the iterates as a function of T and the properties of F.

Algorithmic Families. We will focus on two broad families of algorithms, which together span nearly all algorithms of interest: The first is the class of all deterministic algorithms (denoted as are all oracle-based algorithms where where is chosen deterministically, and for all deterministic function of and the previously observed values and gradients. The second is the class of all linear-span algorithms (denoted as ), which are all deterministic or randomized oracle-based algorithms that initialize at some (which we will take to be 0 without loss of generality), and

with being the gradient provided by the oracle at iteration i, given query point

2.2 Main Result

Our main result is the following:

Theorem 1. There exist a large enough universal constant C > 0 and a small enough universal constant c > 0 such that the following holds: For any algorithm in is a function

Figure 2: Mesh and Contour plot of the function is defined in Lemma 2 (as part of the proof of Thm. 1). The origin is marked with a red dot. Best viewed in color.

• With probability at least over the algorithm’s randomness (or deterministically if the algorithm is deterministic), the iterates produced by the algorithm satisfy

The theorem implies that for a very large family of algorithms, it is impossible to obtain worst-case, finite-time guarantees for finding near-approximately-stationary points of Lipschitz, bounded-from-below functions

Before continuing, we note that the result can be extended to any oracle-based algorithm (randomized or deterministic) under a widely-believed assumption – see remark 5 in the proof for details. We also make two additional remarks:

Remark 2 (More assumptions on The Lipschitz functions used to prove the theorem are based on a simple composition of affine functions, the Euclidean norm function , and the max function. Thus, the result also holds for more specific families of functions considered in the literature, which satisfy additional regularity properties, as long as they contain any Lipschitz function composed as above (for example, Hadamard semi-differentiable functions in Zhang et al. [2020], Whitney-stratifiable functions in Davis et al. [2018], regular functions in Clarke [1990], etc.).

Remark 3 (Strengthening of Proposition 1). The proof of Thm. 1 uses a construction that actually strengthens Proposition 1: It implies that for any smaller than some constants, there is a Lipschitz, bounded-from-below function on , such that the origin is -stationary, yet there are no -stationary points even at a constant distance from the origin. See remark 6 in the proof for details.

The formal proof of the theorem appears in the next subsection, but can be informally described as follows: First, we construct a Lipschitz function on , specified by a small vector w, which resembles the norm function in “most” of , but with a “channel” leading away from a neighborhood of the origin in the direction of w, and reaching a completely flat region (see Fig. 2). We emphasize that the graphical illustration is a bit misleading due to the low dimension: In high dimensions, the “channel” and flat region contain a vanishingly small portion of . This function has the property of having -stationary points only in the flat region far away from the origin, in the direction of w, even though the function appears in most places like the norm function independent of w. As a result, any oracle-based algorithm, that doesn’t happen to hit the vanishingly small region where the function differs from no information about w, and thus cannot determine where the -stationary points lie. As a result, such an algorithm cannot return near-approximately-stationary points.

Unfortunately, the construction described so far does not work as-is, since the algorithm can always query sufficiently close to the origin (closer than roughly ), where the gradients do provide information about w. To prevent this, we compose the function with an algorithm-dependent affine mapping, which doesn’t significantly change the function’s properties, but ensures that the algorithm can never get too close to the (mapped) origin. We show that such an affine mapping must exist, based on standard oracle complexity results, which imply that oracle-based algorithms as above cannot get too close to the minimum of a generic convex quadratic function with a bounded number of queries. Using this mapping, and the useful property that w can be chosen arbitrarily small, leads to our theorem. The full details appear in the following subsection.

Remark 4 (Extension to higher-order algorithms). Our proof approach is quite flexible, in the sense that for any algorithm, we really only need some function which cannot be optimized to arbitrarily high accuracy, composed with a “channel” construction as above. Since functions of this type also exist for algorithms employing higher-order derivatives beyond gradients [Arjevani et al., 2019], it is unlikely that such higher-order algorithms will circumvent our impossibility result.

2.3 Proof of Thm. 1

We begin by stating the following theorem, which follows from well-known results in oracle complexity (see Nesterov [2018], Nemirovsky and Yudin [1983]):

Theorem 2. For any T > 1, any algorithm in and any dimension , there is a vector ) and a positive definite matrix (with minimal and maximal eigenvalues satisfying ), such that the iterates produced by the algorithm when ran on the strictly convex quadratic function

For completeness, we provide a self-contained proof in Appendix A. Basically, the theorem states that for any algorithm in , there is a relatively well-conditioned2 but still “hard” strictly convex quadratic function, whose minimum cannot be detected with accuracy better than

Remark 5 (Extension to any gradient-based algorithm). Up to the constants, a lower bound as in Thm. 2 is widely considered to hold (with high-probability) for all oracle-based algorithms, not just for deterministic or linear-span algorithms (see [Nemirovsky and Yudin, 1983, Simchowitz, 2018]). In that case, our Thm. 1 can be easily extended to apply to all oracle-based algorithms which utilize function values and gradients, since the only point in the proof where we really need to restrict the algorithm class is in Thm. 2. Unfortunately, we are not aware of a result in the literature which quite states this, explicitly and in the required form. For example, there are algorithm-independent lower bounds which rely on non-quadratic functions [Woodworth and Srebro, 2017], or apply to quadratics, but not in a regime where constant as in our case [Simchowitz, 2018].

Given the theorem, our first step will be to reduce it to a hardness result for optimizing convex Lipschitz functions of the form

Lemma 1. For any algorithm in and any dimension , there is a vector ) and a positive definite matrix 1), such that the convex function

satisfies the following:

• -Lipschitz, and

• If we run the algorithm on

Proof. We will start with the second bullet. Fix some algorithm , and assume by contradiction that for any satisfying the conditions in the lemma, the algorithm runs on duces iterates such that (either deterministically if the algorithm is deterministic, or for some realization of its random coin flips, if it is randomized). But then, we argue that given access to gradients and values of , we can use A to specify another algorithm in that runs on and produces points , contradicting Thm. 2. To see why, note that given access to an oracle returning values and gradients of , we can simulate an oracle returning gradients and values of at x via the easily-verified formulaes

(and for is not differentiable, we can just return the value 0 and the generalized gradient 0). We then feed the responses of this simulated oracle to A, and get the resulting . This give us a new algorithm , which is easily verified to be in if the original algorithm

which by the conditions on M, is at most

Next, we define a function with two properties: It is identical to in parts of as we will see later, in “almost” all of ), yet unlike the function , it has no stationary points, or even -stationary points.

Lemma 2. Fix some vector , and define the function

where for any vector -Lipschitz, and has no -stationary points for any

Proof. In the proof, we will drop the w subscript and refer to

Lipschitz, 4-Lipschitz, 2-Lipschitz and 1-Lipschitz, from which it immediately follows that 7 Lipschitz. Thus, it only remains to show that -stationary points.

(or equivalently, exclude ), which are all measure-zero sets in . At any other point, is differentiable and the gradient satisfies

Moreover, at those differentiable points, if

Thus, no differentiable point of g is even 0.99-stationary. It remains to show that even the non-differentiable points of -stationary for any . To do so, we will use the facts that and that if is univariate, (see Clarke [1990]).

• At x = 0, we have

Any element in this set has a norm of triangle inequality. Thus, -stationary for any

• At

For any element in the set (corresponding to some v, u), its inner product with

Thus, any element in the convex hull of this set, which contains , has an inner product of at least is a unit vector, it follows that the norm of any element in is at least 1, so this point is not -stationary for any

• At any x in the set

– If , then by Eq. (5), we must have . But then, for any by Eq. (3) and our assumption that

This implies that . Thus, if there was some norm at most , we get that , which cannot hold if

where is as defined in Lemma 2, actually strengthens Proposition 1 from the introduction: According to the lemma, -Lipschitz and has no -stationary points for . Therefore, it is easily verified that for any -Lipschitz, bounded from below, and any -stationary point is at a distance of at least 1/7 from the origin3. However, we also claim that the origin is a -stationary point for any . To see this, note first that for such , by the Lipschitz property of -neighborhood of the origin. Fix any be any vector of norm orthogonal to w. It is easily verified that , in which case

and therefore

Lemma 3. Fix any algorithm in . Define the function

where are as defined in Lemma 1, is as defined in Lemma 2, and w is a vector of norm

• -Lipschitz, and satisfies

• Any -stationary point

• There exists a choice of w, such that if we run the algorithm on , then with probability at least over the algorithm’s randomness (or deterministically if the algorithm is deterministic), the algorithm’s iterates

Proof. The Lipschitz bound follows from the facts that -Lipschitz, is -Lipschitz, and that -Lipschitz by Lemma 2. Moreover, we clearly have , and by definition of

Combining the two observations establishes the first bullet in the lemma.

As to the second bullet, let ). It is easily verified that if and only if -stationary point for , which implies that -stationary points for any since , it follows that any -stationary points of must be arbitrarily close to the region where is different than , namely where it takes a value of is Lipschitz, it follows that its value is -stationary point as well.

We now turn to establish the third bullet in the lemma. A crucial observation here is that

where is the “hard function” defined in Lemma 14. To see why, note first that by definition of which satisfies the condition in the displayed equation above, we have . On the other hand, since this is a non-negative function, it follows that it also equals , establishing the displayed equation above.

Next, we will show that Eq. (6) also holds over a set of x’s which have a more convenient form. To do so, fix some x which satisfies the opposite condition . Then multiplying

both sides by

For , the condition above is trivially satisfied. For , dividing both sides by (which is between ) and simplifying a bit, we get that

Noting that any x which does not satisfy the condition in Eq. (6) satisfy the condition above, we get that Eq. (6) implies

With this equation in hand, let us first establish the third bullet of the lemma, assuming the algorithm we consider is deterministic. In order to do so, let be the (fixed) iterates produced by the algorithm when ran on , and choose to be any vector orthogonal to possible since the dimension d is larger than T). By Lemma 1, we know that for all in which case we have

Thus, satisfies the condition in Eq. (7), and as a result, . Moreover, using the fact that is bounded away from , it is easily verified that the condition in Eq. (7) also holds for x in a small local neighborhood of , so actually is identical to on these local neigborhoods, implying the same values and gradients at . As a result, if we run the algorithm on rather than , then the iterates produced are identical to as required.

We now turn to establish the third bullet of the lemma, assuming the algorithm is randomized. As before, we let denote the iterates produced by the algorithm when ran on (only that now they are possibly random, based on the algorithm’s random coin flips). The proof idea is roughly the same, but here the iterates may be randomized, so we cannot choose w in some fixed manner. Instead, we will pick w independently and uniformly at random among vectors of norm , and show that for any realization of the algorithm’s random coin flips, with probability at least . This implies that there exists some fixed choice of w, for which with the same high probability over the algorithm’s randomness, as required5. To proceed, we collect two observations:

1. By Lemma 1, we know that for any realization of the algorithm’s random coin flips,

2. If we fix some unit vectors , and pick a unit vector u uniformly at random, then by a union bound and a standard large deviation bound (e.g., Tkocz [2012]), realization of , it follows that for any realization of the algorithm’s random coin flips, with probability at most over the choice of w.

Combining the two observations, we get that for any realization of the algorithm’s coin flips, with probability at least over the choice of w, it holds for all

as well as . Using the same argument as in the deterministic case, it follows from Eq. (7) that coincide in small neighborhoods around , with probability at least . Since the algorithm’s iterates depend only on the local values/gradients returned by the oracle, it follows that for any realization of the algorithm’s coin flips, with probability at least over the choice of w, the iterates are going to be identical, and satisfy

This holds for any realization of the algorithm’s random coin flips, which as discussed earlier, implies the required result.

The theorem is now an immediate corollary of the lemma above: With the specified high probability (or deterministically), , even though all -stationary points (for any ) have a value of -Lipschitz, we get that the distance of any -stationary point must be at least . Simplifying the numerical terms by choosing a large enough constant C and a small enough constant c, the theorem follows.

3 Discussion

Thm. 1 implies that at least with black-box oracle access to the function, it is probably impossible to design algorithms with finite-time guarantees for finding near-approximately-stationary points. This raises the question of what alternative notions of stationarity can we consider when trying to efficiently optimize generic nonconvex nonsmooth functions.

One very appealing notion is the -stationarity of Zhang et al. [2020] that we discussed in the introduction, which comes with clean finite-time guarantees. Our negative result in Thm. 1 provides further motivation to consider it, by showing that a natural strengthening of this notion will not work. However, as we showed in Proposition 1 and remark 3, we need to accept that this stationarity notion can have unexpected behavior, and there exist cases where it will not resemble a stationary point in any intuitive sense.

Another possible direction is to replace the convex hull in the definition of -stationarity by some fixed convex combination of gradients in the -neighborhood of our point. For example, we might define a point -stationary with respect to a function formly distributed in the unit origin-centered ball. For Lipschitz functions (which are almost everywhere differentiable), this operation is well-defined, and is generally equivalent to finding -stationary points of the smoothed function , which can be done efficiently given access to gradients of (see Duchi et al. [2012], Ghadimi and Lan [2013]). However, it is important to note that the gradient Lipschitz parameter of is generally dimension-dependent, and thus we will not get dimension-free guarantees if we simply plug in existing results for smooth functions. Moreover, this notion of -stationarity still does not rule out counter-intuitive behaviors similar to Proposition 1, where a point is -stationary without actually having a near-zero gradient in its -neighborhood.

In a related direction, one might consider finding -stationary points of other smooth approximations of the original function, which have a better behavior. For example, for nonconvex functions, a well-known smoothing operation with dimension-free guarantees is the Lasry-Lions regularization [Lasry and Lions, 1986, Attouch and Aze, 1993], which is closely related to the Moreau-Yosida regularization for smoothing convex functions. However, this operation does not appear to be efficiently computable in general.

Finally, it is important to step back and point out that when considering optimization of nonsmooth nonconvex functions, it is generally difficult to relate stationarity properties to any meaningful local optimality properties, even more so than in the smooth case. For example, consider the simple nonsmooth bivariate function studied in [Warga, 1981, Czarnecki and Rifford, 2006],

which is illustrated in Fig. 3. It can be shown that the origin is a (Clarke) stationary point. However, it is not even approximately-stationary with respect to standard smooth approximations of the function, regardless of how tight we make them. Also, it is clearly not a point we would like our algorithm to converge to, if we actually try to minimize the function. This suggests looking for notions beyond stationarity, for which we can still provide algorithmic guarantees even for nonconvex nonsmooth functions.

Figure 3: Mesh and Contour plot of the function from Eq. (8). Best viewed in color.

References

Yossi Arjevani, Ohad Shamir, and Ron Shiff. Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming, 178(1-2):327–360, 2019.

H´edy Attouch and Dominique Aze. Approximation and regularization of arbitrary functions in hilbert spaces by the lasry-lions method. In Annales de l’Institut Henri Poincare (C) Non Linear Analysis, volume 10, pages 289–312. Elsevier, 1993.

Amir Beck and Nadav Hallak. On the convergence to stationary points of deterministic and randomized feasible descent directions methods. SIAM Journal on Optimization, 30(1):56–79, 2020.

Michel Bena¨ım, Josef Hofbauer, and Sylvain Sorin. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348, 2005.

J´erˆome Bolte, Shoham Sabach, Marc Teboulle, and Yakov Vaisbourd. First order methods beyond convex- ity and lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization, 28(3):2131–2151, 2018.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. Mathematical Programming, pages 1–50, 2019.

Frank H Clarke. Optimization and nonsmooth analysis, volume 5. Siam, 1990.

Marc-Olivier Czarnecki and Ludovic Rifford. Approximation and regularization of lipschitz functions: convergence of the gradients. Transactions of the American Mathematical Society, 358(10):4467–4520, 2006.

Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.

Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D Lee. Stochastic subgradient method converges on tame functions. Foundations of computational mathematics, pages 1–36, 2018.

Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1-2):503–558, 2019.

John C Duchi and Feng Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018.

John C Duchi, Peter L Bartlett, and Martin J Wainwright. Randomized smoothing for stochastic optimiza- tion. SIAM Journal on Optimization, 22(2):674–701, 2012.

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

AA Goldstein. Optimization of lipschitz continuous functions. Mathematical Programming, 13(1):14–22, 1977.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1724–1732. JMLR. org, 2017.

Krzysztof C Kiwiel. Convergence of the gradient sampling algorithm for nonsmooth nonconvex optimiza- tion. SIAM Journal on Optimization, 18(2):379–388, 2007.

Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical programming, 171(1-2):167–215, 2018.

Jean-Michel Lasry and Pierre-Louis Lions. A remark on regularization in hilbert spaces. Israel Journal of Mathematics, 55(3):257–266, 1986.

Szymon Majewski, Bła˙zej Miasojedow, and Eric Moulines. Analysis of nonsmooth stochastic approxima- tion: the differential inclusion approach. arXiv preprint arXiv:1805.01916, 2018.

Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.

Yurii Nesterov. How to make the gradients small. Optima. Mathematical Optimization Society Newsletter, (88):10–11, 2012.

Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.

Max Simchowitz. On the randomized complexity of minimizing a convex quadratic function. arXiv preprint arXiv:1807.09386, 2018.

Tomasz Tkocz. An upper bound for spherical caps. The American Mathematical Monthly, 119(7):606–607, 2012.

Jack Warga. Fat homeomorphisms and unbounded derivate containers. Journal of Mathematical Analysis and Applications, 81(2):545–560, 1981.

Blake Woodworth and Nathan Srebro. Lower bound for randomized first order convex optimization. arXiv preprint arXiv:1709.03594, 2017.

Jingzhao Zhang, Hongzhou Lin, Suvrit Sra, and Ali Jadbabaie. On complexity of finding stationary points of nonsmooth nonconvex functions. arXiv preprint arXiv:2002.04130, 2020.

A Proof of Thm. 2

In the proof, we let bold-faced letters (e.g., x) denote vectors, and denote the i-th coordinate of the vector x. Also, we let denote the standard basis vectors.

Our proof will closely follow the analysis employed in Lan and Zhou [2018, Theorem 3] for a slightly different setting.

Fix an iteration budget T and some dimension be the symmetric tridiagonal matrix defined as

Also, for some constant b to be determined later, define the quadratic function

It is easily verified that this function can be equivalently written as

We first collect a few useful facts about , stated in the following two lemmas:

Lemma 4. . As a result, M is positive definite, and strictly convex and has a unique minimum.

Proof. A is symmetric, and for any

This is non-negative, which establishes that A is a positive semidefinite matrix. Hence, by definition of M, , which implies that M is positive definite. As a result, strictly convex and has a unique minimum. Also, by the displayed equation above,

where we use the fact that This establishes that , and therefore

Lemma 5. The minimum is of the form

Proof. By the previous lemma and the fact that is differentiable, is the unique point satisfying . Thus, it is enough to verify that the formula for stated in the lemma indeed satisfies this equation. Computing the gradient of using the formulation in Eq. (9)), we just need to verify that

which is easily verified to hold for the value of q stated in the lemma. Finally, we have

can be written in the form

With this construction in hand, we now turn to prove the theorem. We will start with the family of linear-span algorithms , using any dimension as the “hard” function on which we will prove a lower bound (note that by the lemmas above and Eq. (10), it satisfies the conditions stated in the theorem).

Consider any algorithm in , and note that by the structure of as specified in Eq. (9), when the algorithm picks its first query , it receives a gradient in span. Because of the linear-span assumption, it means that , which again by Eq. (9) means that the returned gradient is in span. Continuing this process, it is easily seen by induction that

for all t, and in particular, . As a result,

which by Lemma 5 is at least

as stated in the theorem.

We now turn to prove the theorem for deterministic algorithms. This time, we will let the dimension be any . Fixing an algorithm, and letting be orthonormal vectors to be specified shortly, we prove the lower bound for the function

where is the function defined previously in Eq. (9), and U is an orthogonal matrix whose first T rows are , and the rest of the rows are some arbitrary completion of the first T rows to an orthonormal basis. Thus, is equivalent to up to a rotation of the coordinate system specified by U. In particular, using Eq. (10), it follows that

where . Thus, we see that has the form required in the theorem, with a matrix whose spectrum is identical to M, and a minimizer whose norm is the same as (and therefore satisfying the conditions in the theorem).

We now specify how to choose in the function definition, so as to get the lower bound on : Since the algorithm is deterministic, its first query point is known in advance. We therefore choose to be some unit vector orthogonal to . Assuming that are orthogonal to (which we shall justify shortly), we have by Eq. (11) that depend only on , and not on . As the algorithm is deterministic and depends only on the observed values and gradients, this means that even before choosing , we can already simulate its next iteration, and determine the next query point . We now pick to be some unit vector orthogonal to as to . Again by the same considerations, if we assume are orthogonal to have that depend only on , and independent of . So again, we can simulate it and determine the next query point . We continue this process up to iteration T, where we fix orthogonal to (this process is possible as long as the dimension d is at least , as we indeed assume).

As a result of this process, we get that . Also, since also have . Using Lemma 5, we get that for all

which implies that

as required.