b

DiscoverSearch
About
My stuff
Sharp worst-case evaluation complexity bounds for arbitrary-order nonconvex optimization with inexpensive constraints
2018·arXiv
Abstract
Abstract

We provide sharp worst-case evaluation complexity bounds for nonconvex minimization problems with general inexpensive constraints, i.e. problems where the cost of evaluating/enforcing of the (possibly nonconvex or even disconnected) constraints, if any, is negligible compared to that of evaluating the objective function. These bounds unify, extend or improve all known upper and lower complexity bounds for unconstrained and convexly-constrained problems. It is shown that, given an accuracy level  ǫ, a degree of highest available Lipschitz continuous derivatives p and a desired optimality order q between one and p, a conceptual regularization algorithm requires no more than  O(ǫ− p+1p−q+1 )evaluations of the objective function and its derivatives to compute a suitably approximate q-th order minimizer. With an appropriate choice of the regularization, a similar result also holds if the p-th derivative is merely H¨older rather than Lipschitz continuous. We provide an example that shows that the above complexity bound is sharp for unconstrained and a wide class of constrained problems; we also give reasons for the optimality of regularization methods from a worst-case complexity point of view, within a large class of algorithms that use the same derivative information.

Since the seminal paper by Vavasis [21] on the complexity of finding first-order critical points in unconstrained nonlinear optimization was published 25 years ago, the question of the optimal worst-case complexity of optimization methods has been of interest to mathematicians and also, because of its strong connection with deep learning, to computer scientists. Of late, there has been a growing interest in this research field, both for convex and nonconvex problems. This paper focusses on the latter class and follows a now subtantial(1) trend of research where bounds on the worst-case evaluation complexity (or oracle complexity) of obtaining first- and (more rarely) second-order-necessary minimizers(2) for nonlinear nonconvex unconstrained optimization problems [21, 17, 14, 19, 5]. These papers all provide upper evaluation complexity bounds: they show that, to obtain an  ǫ-approximate first-order-necessary minimizer (for unconstrained problem, this is a point at which the gradient of the objective function is less than  ǫ in norm), at most O(ǫ−2) evaluations of the objective function(3) are neededif a model involving first derivatives is used, and  at most O(ǫ−3/2) evaluations are needed if using second derivatives is permitted. This result was extended to convexly-constrained problems in [6]. A broader framework allowing the use of Taylor series of degree p was more recently proposed in [2], in which case the worst-case evaluation complexity bound for  ǫ-first-order-necessary unconstrained minimizer is shown to be  O(ǫ− p+1p), thereby generalizing the previous results for this case. Complexity for obtaining  ǫ-approximate second-order-necessary unconstrained minimizers was considered in [19, 5], where a bound of  O(ǫ−3) evaluations was proved to obtain an  ǫ-second-order-necessary minimizer using a Taylor’s model of degree two, and a bound of  O(ǫ− p+1p−1) evaluations was shown in [8] for the case where a Taylor model of degree p is used. Defining q-th-order-necessary minimizers for q > 2 was considered in [11], where the difficulty of stating and verifying necessary optimality was discussed. In particular, it was concluded in this latter reference that defining and computing  ǫ-approximate q-th-order-necessary minimizers for q > 2 is likely to remain elusive, essentially because of the nonlinearity and lack of continuity of the kernels of the derivatives involved. A more general Taylor-based definition of optimality was introduced instead, which allowed to show an upper bound of  O(ǫ−(q+1)) on evaluation complexity for convexly-constrained problems, in particular improving on the bound of  O(ǫ−9/2) stated in [1] for the case p = q = 3.

The unconstrained and convexly-constrained cases where the assumption of Lipschitz continuity is replaced by the weaker  β-H¨older continuity (β ∈ (0,1]) have also been studied for

q = 1 in [18, 7, 9]. These references show that  at most O(ǫ− p+βp−1+β) evaluations are needed for obtaining an  ǫ-first-order-necessary minimizer.

While upper complexity bounds are important as they provide a handle on the intrinsic difficulty of the considered problem, they do so at the condition of not being overly pessimistic. To address this last point, lower bounds on the evaluation complexity of unconstrained nonconvex optimization problems and methods were derived in [4, 17] and [12], where it was shown that the known upper complexity bounds are sharp (irrespective of problem’s dimension) for most known methods using Taylor’s models of degree one or two. That is to say that there are examples for which the complexity order predicted by the upper bound is actually achieved. More recently, Carmon et al. [3] provided an elaborate construction showing that at least

a multiple of  ǫ− p+1pfunction evaluations may be needed to obtain an  ǫ-first-order-necessary unconstrained minimizer where derivatives of order at most p are used. This result, which matches in order the upper bound of [2], covers a very wide class of potential optimization methods(4) but has the drawback of being only valid for problems whose dimension essentially exceeds the number of iterations needed, which can be very large and quickly grows when  ǫtends to zero.

Contributions. The present paper aims at unifying and generalizing all the above results in a single framework, providing, for problems with inexpensive or no constraints, provably optimal evaluation complexity bounds for arbitrary optimality order, all relevant model degrees and levels of smoothness of the objective function. By “inexpensive constraints”, we mean general set constraints whose enforcement and evaluation(5) cost is negligible compared to the cost of evaluating the objective function. As a consequence, the evaluation complexity for such problems is meaningfully captured by focusing of the number of evaluations of this latter function. This class of minimization problems contains important cases such as bound-constrained problems and convexly-constrained problems (when the projection onto the feasible set is inexpensive), but also allows possibly nonconvex or even disconnected feasible sets.

In order to achieve these objectives, we first revisit the Taylor-based optimality measure of [11] and define (ǫ,δ)-q-th-order-necessary minimizers, a notion extending the standard  ǫ-first-and  ǫ-second-order cases to arbitrary orders. We then present a conceptual regularization algorithm using degree p models and show that this algorithm requires at most  O(ǫ− p+βp−q+β )evaluations of f and its derivatives to find such an (ǫ,δ)-q-th-order-necessary minimizer when the p-th derivative of f is assumed to be  β-H¨older continuous. (If the p-th derivative is assumed to be Lispchitz continuous, the bound becomes  O(ǫ− p+1p−q+1).) This bound matches the best known lower bounds for first- and second-order, and improves on the bound in  O(ǫ−(q+1))given by [11]. We then show that this bound is sharp in order for unconstrained problems with Lipschitz continuous p-th derivative by completing and extending the result of [3] in two ways. The first is to show that the lower worst-case bound of order  ǫ− p+1pevaluations for obtaining a first-order-necessary minimizer using at most p derivatives is also valid for problems of every dimension, and the second is to show that this bound can be generalized to a multiple of ǫ− p+1p−q+1for obtaining a q-th-order-necessary minimizer of any order q. In particular, this result matches in order the upper bound obtained in the first part of the paper and subsumes or improves known lower bounds for first- and second-order-necessary minimizers. While our lower bounds are derived for regularization algorithms applied to unconstrained problems, we also indicate that they may be extended to a much wider class of minimization methods and to a significant class of constrained problems.

The paper is organized as follows. Section 2 introduces the (possibly constrained) minimization problem of interest and the concept of (ǫ,δ)-approximate q-th-order-necessary minimizers. It also presents a variant of the Adaptive Regularization algorithm using degree p Taylor’s models (ARp) whose purpose is to find such minimizers. Section 3 then provides an upper bound on the evaluation complexity for the ARp algorithm to achieve this task. Section 4 then discusses specialization of this result to the case where  ǫ-approximate second-order-necessary minimizers are sought. The complexity upper bound of Section 3 is then proved to be sharp in Section 5 for the Lipschitz-continuous cases where the feasible set contains a ray. Some conclusions are finally presented in Section 6.

Notation. Throughout the paper,  ∥v∥denotes the standard Euclidean norm of a vector v ∈ IRn. For a symmetric tensor  S of order p, S[v1, . . . , vp] is the result of applying S to the vectors  v1, . . . , vp, S[v]p is the result of applying S to p copies of the vector v and

image

(where the second equality results from Theorem 2.1 in [23]) is the associated induced norm for such tensors. If  S1 and S2are tensors,  S1 ⊗ S2is their tensor product and  Sk⊗1 is theproduct of  S1 ktimes with itself. For a real, sufficiently differentiable univariate function f, f (i) denotes its i-th derivative and  f (0) is a synonym for f. For an integer k and a real β ∈ (0,1], we define (k + β)! def= �kℓ=1(β + ℓ) (this coincides with the standard factorial if β= 1). As is usual, we also define 0! = 1. If M is a symmetric matrix,  λmin(M) is itsleft-most eigenvalue. If  αis a real,  ⌈α⌉ and ⌊α⌋denote the smallest integer not smaller than αand the largest integer not exceeding  α, respectively. Finally globminx∈S f(x) denotes the smallest value of  f(x) over x ∈ S.

image

Given  p ≥1, this paper considers the set-constrained optimization problem

image

where we assume that  F ⊆ IRn is closed and nonempty, and where  f ∈ Cp,β(IRn), namely,that:

f is p-times continuously differentiable,

f is bounded below by  flow, and

the p-th derivative tensor of f at x is globally H¨older continuous, that is, there exist constants  L ≥ 0 and β ∈ (0,1] such that, for all  x, y ∈ IRn,

image

Observe that convexity or even connectedness of F is not requested. Observe also that the more usual case of  Lipschitz continuous p-th derivative corresponds to β= 1. We note that our assumption covers the continuous range of objective function’s smoothness from H¨older continuous gradients to Lipschitz continuous p-th derivatives. In what follows, we assume that  β is known.

If  Tp(x, s) is the standard p-th degree Taylor’s expansion of f about x computed for the increment s, that is

image

(2.2) provides crucial approximation bounds, whose proof can be found in the appendix.

image

image

and  j ≤ p,

image

which can be interpreted as the  magnitude of the largest decrease achievable on the Taylor’s

image

shown in [11] that  φδf,j(x) is a proper generalization of well-known unconstrained optimality measures for low orders, in that, for  δ = 1,

image

At variance with other optimality measures,  φδj,f(x) is well-defined for any order  j ≥ 1 andvaries continuously when x varies continuoulsy in F. The role of the “optimality radius”  δin (2.6) merits some discussion. While the choice of  δ= 1 is adequate for retrieving known optimality conditions in the unconstrained case for j = 1, j = 2 provided  ∇1xf(x) = 0, andj = 3 provided additionnaly  ∇2xf(x) is positive semi-definite (as we have just seen),  δ becomesimportant in other cases. Corollary 3.6 in [11] indicates that, when F is convex, q-th-order necessary “path-based” optimality conditions hold if

image

The limit for  δ →0 is necessary to capture the notion of local minimizer for (2.1). However, considering  φδf,j(x) for non-vanishing  δhas substantial advantages from the point of view of optimization: while it may fail to indicate that x is a local minimizer, it does so only by providing a direction leading to values of f below f(x), thereby helping to avoid local but non-global approximate solutions. We refer the reader to [11] for a further discussion, but conclude that considering fixed  δhas strong advantages when solving (2.1).

A special case is when x is an isolated feasible point, that is a point which is the sole intersection between F and any sufficiently small neighbourhood of x. Such a point is clearly a local minimizer, and this is reflected by the fact that  φδf,q(x) = 0 for any f, any q and any sufficiently small  δ.

The main drawback of using  φδf,j(x) is, of course, that its computation requires the global minimization of  Tp(x, d) in the intersection of the ball of radius  δ with F. We are not aware of an easy way to do this in general(6) when n >1, which is why our analysis remains of an essentially theoretical nature, as was the case for [11]. Note however that, albeit potentially very difficult, solving this global minimization problem does not involve calculating the value of f or of any of its derivatives. In that sense, this drawback is thus irrelevant for the worst-case evaluation complexity which solely focuses on these evaluations.

Observe now that, if we were to relax the first-order condition  ∇1xf(x) = 0 for uncon- strained problems to  ∥∇1xf(x)∥ ≤ ǫand, at the same time, relax the second-order condition to��min[0, λmin�∇2xf(x)��� ≤ ǫ, we then deduce that

image

A natural generalization of this observation is to define an (ǫ, δ)-approximate q-th-order-necessary minimizer of f as a point x such that

image

where

image

Because (2.12) is a new way to look at approximate optimality and is crucial for the rest of this paper, it is worthwhile to motivate and discuss it further.

1. When  ǫ= 0, (2.12) implies that the complicated path-based necessary optimality conditions derived in [11] do hold. This results from the fact that these latter conditions merely express that the Taylor’s model of order q cannot decrease close enough to x along any feasible polynomial path emanating from x, which is clearly the case if x is a global minimizer of the same models in the intersection of the feasible set and a ball of radius  δcentered at x. By continuity, these path-based conditions must therefore hold in the limit under (2.12) when  ǫtends to zero. The role of (2.12) as a condition for approximate minimization is thus coherent and consistent with known necessary conditions.

2. Inspired by (2.10), the stronger approximate optimality condition

image

was used in [11] instead of (2.12). Our main reason to prefer (2.12) is the following. Observe that (2.14) implies in particular that  φδf,q(x) ≤ ǫδq, which in turn im- plies, for  δsmall enough for the first-order term to dominate, that  φδf,1(x) ≤ ǫδq. Inthe unconstrained case (for example), this requires  ∥∇1xf(xk)∥ ≤ ǫδq−1, imposing an inordinate level of first-order optimality, much stronger than the standard condition ∥∇1xf(xk)∥ ≤ ǫ. No such difficulty arises with (2.12) because the right-hand side of the condition involves all powers of  δ, which is not the case of the right-hand side of (2.14). Note however that the vital continuity properties of  φδf,q are not affected by the choice of the right-hand side, and are thus inherited by (2.12).

3. For given  δ ∈ (0,1], (2.12) does not imply that  φδf,j(x) ≤ ǫχj(δ) for j ∈ {1, . . . , q − 1},although the violation of this condition tends to zero with  δ(7). This slight blemish can be cured by requiring that  φδf,j(x) ≤ ǫχj(δ) for j ∈ {1, . . . , q}instead of (2.12), but we claim that the benefit of this stronger definition is outweighted by the need to perform q −1 additional constrained global minimizations, and therefore focus our exposition to the case using the simpler (2.12).

In order to further justify (2.12), we now make more explicit the “minimizing guarantees” provided by this approximate optimality condition, by formulating a result analogous to Theorem 3.7 in [11]. This result gives a lower bound on the value of f(x) in the feasible neighbourhood of an (ǫ, δ)-approximate q-th-order-necessary minimizer.

image

In order to find (ǫ, δ)-approximate q-th-order-necessary minimizers, we consider applying a variant of the ARp algorithm to (2.1). This algorithm, described as Algorithm 2.1 on the following page, is of the regularization type in that, at each iterate  xk, a step skis computed which approximately minimizes (in a sense defined below) the model

image

subject to  xk +s ∈ F, where pin an integer such that  p ≥ q and σk ≥ σminis a “regularization parameter”.

A few comments are useful at this stage.

1. Since  σk ≥ σminby (2.22), we have that  mk(s) is bounded below as a function of s and the existence of a constrained global minimizer  s∗k is guaranteed.

2. Step 2 requires, that, for  sk ̸= 0, we also compute  δk.This is easy for orders one and two. If q = 1, the formula for a global minimizer  s∗k is analytic and  δk = 1 is

image

always acceptable. The situation is similar for q = 2, where  s∗k can be assessed using a trust-region method whose radius is  δk= 1 (more details are provided at the end of Section 3). The task is more difficult for higher orders where one may have to rely on the arguments of Lemma 2.5 below, or use different subproblems with decreasing values of  δ. However, none of these computations involve the evaluation of f or its derivatives, and therefore the evaluation complexity bound discussed in this paper is unaffected.

3. That one needs to consider the second case in Step 2 (where no step exists satisfying (2.18) (2.20)) can be seen by examining the following one-dimensional example. Let p = q = 3 and β= 1, and suppose that  δk−1 = 1, Tq(xk, s) = s2 − 2s3 and σk= 4! = 24. Then  mk(s) = s2 − 2s3 + s4 = s2(s − 1)2 and the origin is a global minimizer of the model (and a local minimizer of  Tq(xk, s)) but yet Tq(xk, δ) = −1, yielding that φδk−1f,q (xk) = 1 > ǫχq(1) for ǫ ≤ 1/χq(1) = 47. Thus, Step 1 with  δk−1= 1 has failed to identify that termination was possible. In addition, we see that, at variance with the cases q = 1 and q = 2, a global minimizer of the model (2.16) may not, for  q ≥ 3, be aglobal minimizer of its q-th order Taylor’s expansion in the intersection of F and a ball of arbitrary radius: we may have to restrict this radius (to  δk−1 = 12 in our example) for this important property to hold (see Lemma 2.5 below).

4. If (2.19) holds, the possibly expensive computation of  φδkmk,q(sk) in (2.20) is unnecessary and  δkmay be chosen arbitrarily in (0, 1].

5. We assume the availability of a feasible starting point, which is without loss of generality for inexpensive constraints.

6. Before termination, each successful iteration requires the evaluation of f and its first p derivative tensors, while only the evaluation of f is needed at unsuccessful ones.

7. The mechanism of the algorithm ensures the non-increasing nature of the sequence {f(xk)}k≥0.

Iterations for which  ρk ≥ η1(and hence  xk+1 = xk +sk) are called “successful” and we denote by  Skdef= {0 ≤ j ≤ k | ρj ≥ η1}the index set of all successful iterations between 0 and k. We immediately observe that the total number of iterations (successful or not) can be bounded as a function of the number of successful ones (and include a proof in the appendix).

image

We also verify that the algorithm is well-defined in the sense that either a step  sksatisfying (2.18)(2.20) can always be found, or termination is justified. For unconstrained problems with  q ∈ {1, 2}, the first possibility directly results from the observation that  φδmk,j(sk) (asgiven by (2.7)-(2.9) for  f = mk and j ∈ {1, 2, 3}) can be made suitably small at a global minimizer of the model. The situation is more complicated for other cases. In order to clarify it, we first state a useful technical lemma, whose proof is in the appendix.

image

We now provide reasonable sufficient conditions for a nonzero step  skand an optimality radius  δkto satisfy (2.18)(2.20).

image

Proof. Let s∗k be the global minimizer of the model  mk(s) over all s such that xk+s ∈ F.Since  mk(s∗k) < mk(0), we have that  s∗k ̸= 0. By Taylor’s theorem, we have that, for all d,

image

(2.27) Since  s∗k ̸= 0, we may then choose  δk < ∥s∗k∥such that, for every  d with ∥d∥ ≤ δk,∥s∗k + ξd∥ ≥ 12∥s∗k∥ > 0 and

image

image

where the last inequality follows from (2.13). Continuity of  mkand its derivatives and the inequality  mk(s∗k) < mk(0) then imply that there exists a neighbourhood of  s∗k ̸= 0 suchthat (2.18) holds and

image

for all s in this neighbourhood and all  d with ∥d∥ ≤ δk. This yields that, for all such s with  xk + s ∈ F,

image

As can be seen in the proof of this lemma,  δkmay need to be small if any of the tensors

image

for  ℓ ∈ {1, . . . , p + 1}has a large norm. This may occur in particular if  β and ∥s∗k∥ are bothclose to zero, as is shown by the last term in the left-hand side of (2.28). We also note that (2.20) obviously holds for  sk = s∗k if xk + s∗k is an isolated feasible point. It now remains to verify that it is justified to terminate in Step 2 when no suitable nonzero step can be found.

image

Lemma 2.6 Suppose that the algorithm terminates in Step 2 of iteration  k with xǫ = xk.Then there exists a  δ ∈ (0,1] such that (2.12) holds for  x = xǫ and xǫ is an (ǫ, δ)-approximate qth-order-necessary minimizer.

image

Proof. Given Lemma 2.5, if the algorithm terminates within Step 2, it must be because every global minimizer  s∗k of mk(s) under the constraints  xk + s ∈ Fis such that mk(s∗k) ≥ mk(0). In that case,  s∗k = 0 is one such global minimizer and we have that, for all d,

image

We may now choose  δ ∈ (0,1] small enough to ensure that, for all  d with ∥d∥ ≤ δ,������ p�

image

image

Observe that, in this proof, we could have chosen  δsmall enough to ensure σk(p + β)!∥d∥p+β ≤ ǫχp(δ)

instead of (2.29), yielding  φδf,p(xk) ≤ ǫχp(δ), which is a stronger necessary optimality condition than (2.12). Together, Lemmas 2.5 and 2.6 ensure that Algorithm 2.1 is well-defined.

The proofs of the following two lemmas are very similar to corresponding results in [2] and hence we again defer them to the appendix (but still include them for completeness, as the algorithm has changed).

image

We are now in position to prove the crucial lower bound on the step length.

image

Let the global minimum in the definition of  φδkf,q(xk+1) be achieved at  d with ∥d∥ ≤ δk.Since  φδkf,q(xk+1) >0, we have from (2.6) that

image

Then, successively using (2.6) for  f at xk+1, the triangle inequality, (2.16), (1.1) and (2.25), we deduce that

image

(3.6) Now, since  ∥d∥ ≤ δk, and using (2.6) for  mk at sk,

image

image

The bound given by this lemma is another indication that choosing  θof the order of L (when this is known a priori) makes sense.

We now combine all the above results to deduce an upper bound on the maximum number of successful iterations, from which a final complexity bound immediately follows.

image

where we used (2.21), (3.1) and (2.22). Moreover we deduce from (3.9), (3.3) and (3.2) that

image

until termination. The desired bound on the number of successful iterations follows from combining (3.11). Lemma 2.3 is then invoked to compute the upper bound on the total number of iterations. ✷

In particular, if the p-th derivative of f is assumed to be globally Lipschitz rather than merely H¨older continuous (i.e. if  β= 1), the bound (3.8) on the maximum number of evaluations becomes

image

where

image

This worst-case evaluation bound generalizes known bounds for q = 1 (see [2]) or q = 2 (see[8]) and significantly improve upon the bounds in  O(ǫ−(q+1)) given by [11] for a more stringent termination rule. It also extends the results obtained in [6] for convexly-constrained problems with q = 1 by allowing the significantly broader class of inexpensive constraints.

We also note that it is possible to weaken the assumption that  ∇pxfmust satisfy the H¨older inequality (2.2) for every  x, y ∈ IRn (as required in the beginning of Section 2). The weakest possible smoothness assumption is to require that (2.2) holds only for points belonging to the same segment of the “path of iterates”  ∪k≥0[xk, xk+1] (this is necessary for the proof of Lemma 2.1). As this path joining feasible iterates may be hard to predict a priori, one may instead use the monotonic character of Algorithm 2.1 and require (2.2) to hold for all x, y in the intersection of F with the level set  {x ∈ IRn | f(x) ≤ f(x0)}. Again, it may be hard to determine this set and to ensure that it contains the path of iterates, and one may then resort to requiring (2.2) to hold in the whole of F, which must then be convex to ensure the desired H¨older property on every segment [xk, xk+1].

We now discuss the particular and much-studied case where second-order minimizers are sought for unconstrained problems with Lipschitz continuous Hessians (that is  p ≥ q = 2,F = IRn and β= 1). As we now show, a specialization of Algorithm 2.1 to this case is very close (but not identical) to well-known methods. Let us consider Step 1 first. The computation of  φδk−1f,2 (xk) then reduce to

image

which amounts to solving a standard trust-region subproblem with radius  δk−1(see [13]). Hence verifying (4.1) or testing the more usual approximate second-order criterion

image

have very similar numerical costs (remember that finding the leftmost eigenvalue of the Hessian is the same as finding the global minimizer of the associated Rayleigh quotient). If we now turn to the computation of  skin Step 2, Algorithm 2.1 then computes such a step by attempting to minimize the model

image

as has already been proposed before for general p [2, 8]. Moreover, the failure of (2.12) in Step 1 is enough, when  q ≤2, to guarantee the existence of nonzero global minimizers of Tp(xk, s) and mk(s), and thus to ensure that a nonzero  skis possible. The approximate model minimization is stopped as soon as (2.19) or (2.20) holds, the latter then reducing to checking that

image

for some  δ ∈ (0,1]. For each potential  sk, finding δ ∈ (0,1] requires solving (possibly approximately)

image

While this could be acceptable without affecting the overall evaluation complexity of the algorithm, a simpler alternative is available for q = 2. We may consider terminating the model minimization when either (2.19) holds, or

image

The inequality is guaranteed to hold when  skis close enough to  s∗k, a global minimizer of the model  mk(s), since then  ∇1smk(s∗k) = 0 and ∇2smk(s∗k) is positive semi definite, and then d = 0 provides the global minimizer of the second-order Taylor model of  mk(s) around sk.Verifying (4.5) only requires at most one trust-region calculation for each potential step and ensures (4.4) with  δ= 1, making the choice  δk= 1 acceptable. The cost this technique is comparable to that that proposed in [8] where an eigenvalue computation is required for each potential step. Combining these observations, Algorithm 2.1 then becomes Algorithm 4.1.

If p = q = 2, computing  skin Step 2 amounts to approximately minimizing the now well-known cubic model of [15, 19, 22, 5]. In addition, if  skis the exact global minimizer of this model, the above argument shows that (4.5) automatically holds at  skand checking this inequality by solving a trust-region subproblem is thus unnecessary. The only difference between our proposed algorithm and the more usual cubic regularization (ARC) method with exact global minimization is that the latter would check (4.2) for termination, while the algorithm presented here would instead check (4.1) with  δk−1= 1 by solving a trust-region subproblem. As observed above, both techniques have comparable numerical cost.

The bound (3.12) then ensures that Algorithm 4.1 terminates in at most  O�ǫ− p+1p−1�eval-

uations of f, its gradient and Hessian. This algorithm thus shares(8) the upper complexity bounds stated in [8] for general p with different values of  ǫfpr fisrt- and second-order, and in [19, 5] for p = 2.

image

We now intend to show that the upper bound on evaluation complexity of Theorem 3.4 is tight in terms of the order given for unconstrained and a broad class of constrained problems with Lipschitz continuous p-th derivative (i.e.  β = 1(9)). This objective is attained by defining a variant of the high-degree Hermite interpolation technique developed in [11], and then using this technique to build, for any number p of available derivatives of the objective function and any optimality order q, an unconstrained univariate example of suitably slow convergence (i.e. for which the order in  ǫgiven by (3.12) is achieved). This example is then embedded in higher dimensions to provide general lower bounds.

5.1 High-degree univariate Hermite interpolation

We start by investigating some useful properties of Hermite interpolation. Let us assume that we wish to construct a univariate Hermite interpolant  πof degree 2(p + 1) of the form

image

on the interval [0, s] satisfying the 2(p + 1) conditions

image

where  f (i)0 and f (i)1are given. The values of the coefficients  c0, . . . , cpmay then be obtained by

image

(5.3) where

image

with  Apis the matrix whose (i, j)-th entry is  ai,j, which only depends on p. It was show in [11, Appendix] that  Apis nonsingular. Therefore

image

We therefore deduce that, for any  τ ∈ [0, s] ,

image

The mean-value theorem then implies that, for any 0  ≤ τ2 ≤ τ1 ≤ s and some ξ ∈ [τ2, τ1] ⊆[0, s],

image

This development thus leads us to the following conclusion.

image

Observe that (5.5) is identical to (2.5) when  β = 1 and n= 1. This means that the conditions of Theorem 5.1 automatically hold if the interpolation data  {f (j)i }is itself extracted from a function having a Lipschitz continuous p-th derivative.

Applying the above results to several interpolation intervals then yields the existence of a smooth Hermite interpolant.

image

Proof. We first use Theorem 5.1 to define a Hermite interpolant  πk(s) of the form (5.1) on each interval [xk, xk+1] = [xk, xk + sk] (k ∈ {0, . . . , ke}) using f (j)0 = f (j)k andf (j)1 = f (j)k+1 for j ∈ {0, . . . , p}, and then set

image

image

and where  s−1 and skeare chosen sufficiently large to ensure that (5.6) also holds on intervals -1 and  ke. We next set

image

5.2 Slow convergence to (ǫ,δ)-approximate q-th-order-necessary minimizers

We now consider an unconstrained univariate instance of problem (2.1). Our aim is first to show that, for each choice of  p ≥ 1 and q ∈ {1, . . . , p}, there exists an objective function f for problem (2.1) with  f ∈ Cp,1(IR) (i.e. β= 1) such that obtaining an (ǫ, δ)-approximate q-th-order-necessary minimizer may require at least

image

evaluations of the objective function and its derivatives using Algorithm 2.1, matching, in order of  ǫ ∈ (0,1], the upper bound (3.12). Our development follows the broad outline of [12] but extends it to approximate minimizers of arbitrary order. Given a model degree  p ≥ 1 andan optimality order  q ∈ {1, . . . , p}, we first define the sequences  {f (j)k } for j ∈ {0, . . . , p} and

image

by

image

as well as

image

and

image

Thus

image

and, assuming  δk−1= 1 for all k (we verify below that this is acceptable),

image

We also set  σk = p! for all k ∈ {0, . . . , kǫ}(we again verify below that is acceptable). Note that

image

(and (2.12) fails at  xk), while

image

(and (2.12) holds at  xkǫ). It is easy to verify using (5.11) that the model (2.16) is then globally minimized for

image

Hence this step satisfies (2.19) if we choose  ̟= 1. Because of this fact, we are free to choose δkarbitrarily in (0, 1] and we choose  δk= 1. Thus, provided we make the choice  δ−1 = 1ensuring (5.12) for k = 0 , the value  δk= 1 is admissible for all k. The step (5.15) yields that

image

where

image

Thus  mk(sk) < mk(0) and (2.18) holds. We then define

image

which provides the identity

image

(ensuring that iteration k is successful because  ρk= 1 in (2.21) and thus that our choice of a constant  σkis acceptable). In addition, using (5.18), (5.13), (5.17), the equality  δk = 1 andthe inequality  kǫ ≤ 1 + ǫ− p+1p−q+1from (5.7) gives that, for  k ∈ {0, . . . , kǫ},

image

We also set

image

Then (5.19) and (2.16) give that

image

Now note that, using (5.11) and the first equality in (5.15),

image

where  δ[·]is the standard indicator function. We may now verify that, for  j ∈ {1, . . . , q − 1},

image

while, for j = q, we have that

image

Combining (5.21), (5.22), (5.23) and (5.24), we deduce that (5.6) holds with  κf = (q − 1)!.We may thus apply Theorem 5.2 with  β = 1, κf = (q − 1)! and κs= 1, and deduce the existence of a p times continuously differentiable function f from IR to IR with Lipschitz continuous derivatives of order 0 to p which interpolates the  {f (j)k } at {xk} for k ∈ {0, . . . , n}and  j ∈ {0, . . . , p}. Moreover, (5.20) and Theorem 5.2 imply that the range of f only depends on p and q. In addition, (5.19) ensures that every iteration is successful and thus, because of (2.22), that the value  σk = p! may be used at all iterations.

This argument allows us to state the following lower bound on the complexity of the regularization algorithm using a p-th degree model.

image

This implies the following important consequence for higher dimensional problems.

image

Proof. The first conclusion directly follows from Lemma 5.3 since it is always possible to include the unimodal example as an independent component of a multivariate one.

The second conclusion follows from the observation that our univariate example of slow convergence is only defined on IR+ (even if Theorem 5.2 provides an extension to the complete real line). As a consequence, it may be used on any feasible ray. ✷

We now make a few observations.

1. Theorem 5.4 generalizes to arbitrary q the bound obtained in [3] for the case q = 1 and also shows that, at variance with the result derived in this reference, the generalized bound applies for arbitrary problem’s dimension, but depends on  ǫ, p and q.

2. For simplicity, we have chosen, in the above example, to minimize the model  mk(s)globally at every iteration, but we might consider other pairs (sk, δk). A similar example of slow convergence may in fact be constructed along the lines used above(10) for anysequence of acceptable(11) model reducing steps and associated optimality radii (in the sense of Lemma 2.5), provided the optimality radii remain bounded away from zero. This means that our example of slow convergence applies not only to Algorithm 2.1 but also to a much broader class of minimization methods. Moreover, it is also possible to weaken the constraints on the step further by relaxing (5.19) and only insisting on acceptable decrease of the objective function value in Step 3 of the algorithm.

In [3], the authors derive their upper bound for q = 1 for the general class of “zero-preserving” algorithms, which are algorithms that “never explore (from  xk) coordinates which appear not to affect the function”, that is directions d along which  Tp(xk, ·)is constant. This property is obviously shared by Algorithm 2.1 because it attempts to reduce the Taylors’ expansion of f around the current iterate (the presence of the isotropic regularization term is irrelevant for this).

3. Our example does not apply, for instance, to a linesearch method using global univariate minimization in a direction of search computed from the Taylor’s expansion of f, which

is another zero-preserving method. Note however that this method, just as every other linesearch method (including possibly randomized coordinate searches), is bound to fail when attempting to compute approximate minimizers of order beyond three, because the Taylor’s expansion at a non-optimal point then needs no longer decrease along lines. This is demonstrated by the following old example [16, 20]. Let

image

Then f(0, 0) = 0 and the origin is not a minimizer since f decreases along the arc x2 =34x1. Yet the origin is the global minimizer along every line passing through the origin, preventing any linesearch method to progress away from (0, 0).

Let us now consider an alternative unconstrained minimization method which would attempt to reduce the unregularized model (that is (2.16) with  σk= 0) in order to find an unconstrained first-order minimizer. It is easy to see that if one chooses

image

the same reasoning as above yields that the largest obtainable decrease with this model occurs at

image

and is given by

image

This then implies that at least a multiple of  ǫ− pp−1evaluations may be needed to find approximate first-order-necessary minimizers, which is worse than the bound in  ǫ− p+1pholding for the regularized algorithm. This is consistent with the known lower  O(ǫ−2) bound for first-order points that holds for the (unregularized) Newton method (and hence the trust-region method), both of which use p = 2. Adding the regularization term thus not only provides a mechanism to limit the stepsize and make the step well-defined when  Tp(xk, s) is unbounded below, but also amounts to increasing the ’useful degree’ of the model by one, improving the worst-case complexity bound. Summing up the above discussion, we conclude that an example of slow convergence requiring at least (5.25) evaluations can be built for any method whose steps decrease the regularized (σk ≥ σmin) or unregularized (σk= 0) model (2.16) and whose approximate local optimality can be measured by (2.20) for some constant  θ and δk= 1 (which we can always enforce by adapting  ̟and (5.9)). For orders up to two, this includes most variants of steepestdescent and Newton’s methods including those globalized with regularization, trust-region, a linesearch or a mixture of these (see [12] for a discussion). General linesearch methods are excluded for high-order optimization as they may fail to converge to approximate minimizers of order four and beyond. Finally, one may wonder at what would happen if, for the interpolation data (5.9)-(5.10), the model

image

were used for some m > p + 1, resulting in a shorter step. The global model minimizer would then occur at  s = [q(ǫ + ωk)χq(1)]1/(m−1) and give an optimal model decrease equal to [q(ǫ+ωk)χq(1)]m/(m−1)(m−q)/m. However, (5.6) would then fail for j = 0 and the argument leading to an example of slow convergence would break down.

For any optimality order  q ≥1, we have provided the concept of an (ǫ, δ)-approximate q-th-order-necessary minimizer for the very general set-constrained problem (2.1). We have then proposed a conceptual regularization algorithm to find such approximate minimizers and have shown that, if  ∇pxf is β-H¨older continuous, this algorithm requires at most  O(ǫ− p+βp−q+β )evaluations of the objective function and its p first derivatives to terminate. When  ∇pxf isLipschitz continuous, we have used an unconstrained univariate version of the problem to show that this bound is sharp in terms of the order in  ǫfor any feasible set containing a ray and any problem dimension.

In view of the results in [7, 18], one may wonder at what would happen if the regularization power (i.e. the power of  ∥s∥used in the last term of the model (2.16)) is allowed to differ from p+β. The theory presented above must then be re-examined and the crucial point is whether a global upper bound  σmaxon the regularization parameter can still be ensured as in Lemma 3.2. One easily verifies that this is the case for regularization powers  r ∈ (p, p + β]. Arguments parallel to those presented above then yield an upper bound of  O(ǫ− rr−q) evaluations(12),recovering the bound given in Section 3.3 of [7] for q = 1. The situation is however more complicated (and beyond the scope of the present paper) for  r > p+βand the determination of a suitable general complexity upper bound for this latter case has not been formalized at this stage, but the analysis for q = 1 discussed in Section 3.2 of [7] suggests that an improvement of the bound for larger r is unlikely.

Although the results presented essentially solve the question of determining the optimal evaluation complexity for unconstrained problems and problems with general inexpensive constraints, some interesting issues remain open at this stage. A first such issue is whether an example of slow convergence for all  ǫ ∈ (0,1) can be found for feasible domains not containing a ray. A second is to extend the general complexity theory for problems whose constraints are not inexpensive: the discussion in [10] indicates that this is a challenging research area.

[1] A. Anandkumar and R. Ge. Efficient approaches for escaping high-order saddle points in nonconvex optimization. Proceedings of the Machine Learning Society, 49:81–102, 2016.

[2] E. G. Birgin, J. L. Gardenghi, J. M. Mart´ınez, S. A. Santos, and Ph. L. Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, Series A, 163(1):359–368, 2017.

[3] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. arXiv:1710.11606, 2018.

[4] C. Cartis, N. I. M. Gould, and Ph. L. Toint. On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization. SIAM Journal on Optimization, 20(6):2833–2852, 2010.

[5] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Adaptive cubic overestimation methods for unconstrained optimization. Part II: worst-case function-evaluation complexity. Mathematical Programming, Series A, 130(2):295–319, 2011.

[6] C. Cartis, N. I. M. Gould, and Ph. L. Toint. An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis, 32(4):1662–1695, 2012.

[7] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Universal regularization methods – varying the power, the smoothness and the accuracy. Technical Report naXys-7-2016, Namur Center for Complex Systems (naXys), University of Namur, Namur, Belgium, 2016.

[8] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Improved second-order evaluation complexity for uncon- strained nonlinear optimization using high-order regularized models. arXiv:1708.04044, 2017.

[9] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Worst-case evaluation complexity of regularization methods for smooth unconstrained optimization using H¨older continuous gradients. Optimization Methods and Software, 6(6):1273–1298, 2017.

[10] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Optimality of orders one to three and beyond: characterization and evaluation complexity in constrained nonconvex optimization. Journal of Complexity, (to appear), 2018.

[11] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Second-order optimality and beyond: characterization and evaluation complexity in convexly-constrained nonlinear optimization. Foundations of Computational Mathematics, 18(5):1073–1107, 2018.

[12] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Worst-case evaluation complexity and optimality of second- order methods for nonconvex smooth optimization. To appear in the Proceedings of the 2018 International Conference of Mathematicians (ICM 2018), Rio de Janeiro, 2018.

[13] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Trust-Region Methods. MPS-SIAM Series on Optimization. SIAM, Philadelphia, USA, 2000.

[14] S. Gratton, A. Sartenaer, and Ph. L. Toint. Recursive trust-region methods for multiscale nonlinear optimization. SIAM Journal on Optimization, 19(1):414–444, 2008.

[15] A. Griewank. The modification of Newton’s method for unconstrained optimization by bounding cubic terms. Technical Report NA/12, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, United Kingdom, 1981.

[16] H. Hancock. The Theory of Maxima and Minima. The Athenaeum Press, Ginn & Co, NewYork, USA, 1917. Available on line at https://archive.org/details/theoryofmaximami00hancroft.

[17] Yu. Nesterov. Introductory Lectures on Convex Optimization. Applied Optimization. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2004.

[18] Yu. Nesterov and G. N. Grapiglia. Globally convergent second-order schemes for minimizing twice-differentiable functions. Technical Report CORE Discussion paper 2016/28, CORE, Catholic University of Louvain, Louvain-la-Neuve, Belgium, 2016.

[19] Yu. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, Series A, 108(1):177–205, 2006.

[20] G. Peano. Calcolo differenziale e principii di calcolo integrale. Fratelli Bocca, Roma, Italy, 1884.

[21] S. A. Vavasis. Black-box complexity of local minimization. SIAM Journal on Optimization, 3(1):60–80, 1993.

[22] M. Weiser, P. Deuflhard, and B. Erdmann. Affine conjugate adaptive Newton methods for nonlinear elastomechanics. Optimization Methods and Software, 22(3):413–431, 2007.

[23] X. Zhang, C. Ling, and L. Qi. The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM Journal on Matrix Analysis, 33(3):806–821, 2012.

Proof of Lemma 2.1. We first establish the identity

image

To see this, integrating by parts, we have that

image

and thus, recursively, that

image

As in [11], consider the Taylor identity

image

involving a given univariate  Ck function ψ(t) and its k-th order Taylor approximation

image

expressed in terms of the value  ψ(0) = ψ and ith derivatives  ψ(i), i = 1, . . . , k. Then, picking ψ(t) = f(x + ts), for given  x, s ∈ IRn, and k = p, the identity (A.2), and the relationships ψ(p)(t) = ∇pxf(x + ts)[s]p and τp(1) = Tp(x, s) give that

image

and thus from the definition of the tensor norm (1.1), the H¨older bound (2.2) and the identity (A.1) when k = p that

f(x + s) − Tp(x, s) ≤ 1(p − 1)!

image

for all  x, s ∈ IRn, which is the required (2.4).

image

k = p − j,it follows from (A.2), the relationships  ψ(p−j)(t) = ∇pxf(x + ts)[v1, . . . , vj][s]p−j and

image

Then picking  v1, . . . , vjto maximize the absolute value of left-hand size of (A.3) and using the tensor norm (1.1), the H¨older bound (2.2) and the identity (A.1) when  k = p − j, we findthat

image

for all  x, s ∈ IRn, which gives (2.5). ✷

Proof of Lemma 2.3. The regularization parameter update (2.22) gives that, for each k,

image

where  Ukdef= {0, . . . , k}\Sk. Thus we deduce inductively that  σ0γ|Sk|1 γ|Uk|2 ≤ σk. We therefore obtain, using (2.23), that

image

which then implies that

image

since  γ2 >1. The desired result (2.24) then follows from the equality  k + 1 = |Sk| + |Uk| andthe inequality  γ1 <1 given by (2.17). ✷

Proof of Lemma 2.4. We first observe that  ∇js�∥s∥p+β�is a j-th order tensor, whose norm is defined using (1.1). Moreover, using the relationships

image

defining

image

and proceeding by induction, we obtain that, for some  µj,i ≥ 0 with µ1,1 = 1,

image

where the last equation uses the convention that  µj,0= 0 for all j. Thus we may write

image

with

image

where we used the identity

image

to deduce the second equality. Now (A.6) gives that

image

It is then easy to see that the maximum in (1.1) is achieved for  v = s/∥s∥, so that

image

with

image

Successively using this definition, (A.7), (A.8) (twice), the identity  µj−1,j= 0 and (A.10) again, we then deduce that

image

Since  π1 = p+βfrom the first part of (A.4), we obtain that  πj = (p+β)!/(p−j +β)!, which,combined with (A.9) and (A.10), gives (2.25). We obtain (2.26) from (A.9) and (A.10), the observation that  πp = (p + β)! and (A.11) for  j = p + 1. ✷

Proof of Lemma 3.1. (See [2, Lemma 2.1]) Observe that, because of (2.18) and (2.16),

image

which implies the desired bound. Note that  sk ̸= 0 as long as we can satisfy condition (2.18), and so (3.1) implies (2.21) is well defined. ✷

Proof of Lemma 3.2. (See [2, Lemma 2.2]) Assume that

image

Using (2.4) and (3.1), we may then deduce that

image

and thus that  ρk ≥ η2. Then iteration k is very successful in that  ρk ≥ η2 and σk+1 ≤ σk. Asa consequence, the mechanism of the algorithm ensures that (3.2) holds. ✷


Designed for Accessibility and to further Open Science