b

DiscoverSearch
About
Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives
2020·arXiv
Abstract
Abstract

We analyze the convergence rate of various momentum-based optimization algorithms from a dynamical systems point of view. Our analysis exploits fundamental topological properties, such as the continuous dependence of iterates on their initial conditions, to provide a simple characterization of convergence rates. In many cases, closed-form expressions are obtained that relate algorithm parameters to the convergence rate. The analysis encompasses discrete time and continuous time, as well as time-invariant and time-variant formulations, and is not limited to a convex or Euclidean setting. In addition, the article rigorously establishes why symplectic discretization schemes are important for momentum-based optimization algorithms, and provides a characterization of algorithms that exhibit accelerated convergence.

Keywords: Gradient-based optimization, convergence rate analysis, Nesterov acceleration, symplectic integration, nonconvex optimization

Optimization problems lie at the heart of many machine-learning formulations. As a result, a better understanding of optimization algorithms, combined with implementations that target distinctive properties of machine-learning problems, have contributed significantly to the recent progress in the field.

One of the most popular methods for large-scale optimization is the (stochastic) gradient method, due to its simplicity, wide applicability, and efficiency. However, in the deterministic setting, where gradients are evaluated exactly, it has been shown that better convergence rates can often be achieved by leveraging two successive gradients (Nesterov, 1983). Based on analogies to mechanical systems these methods are referred to momentum-based optimization algorithms, and can be viewed as particular discretizations of continuous-time harmonic oscillators, (Polyak, 1964). Yet, even for the class of strongly convex functions, most proofs that establish the superior convergence are algebraic and provide little qualitative understanding. As a consequence, there is little guide to the generality or robustness of the acceleration phenomenon across instances of optimization problems.

This article characterizes the convergence rate of momentum-based optimization algorithms by taking fundamental topological properties into account. In many cases our analysis leads to closed-

image

form expressions that relate the convergence rate to the different algorithm parameters, such as the step size and the damping. The analysis provides insight into the design of algorithms that, for example, require little tuning when applied to large-scale and ill-conditioned optimization problems. For simplicity of notation, we focus on the Euclidean setting, but we note that the scope of our analysis is not limited to Euclidean problems.

We will derive convergence rates of the form

image

where  |x(t) − x∗|is a distance measure between the iterate x(t) and the isolated local minimum  x∗,t refers to time or the iteration number,  Ccis a constant which does not depend on  x(0), and ρc(t) ismonotonically decreasing, satisfies  ρc(0) = 1, and characterizes the convergence rate. An important aspect of the analysis is to characterize how the convergence rate  ρcis affected by the shape of the objective function about an isolated local minimum. The local shape is summarized by the (condition) number  κ, which is defined as the ratio between the maximum and minimum curvature about a local minimum. We call an isolated minimum degenerate if the minimum curvature vanishes in at least one direction. The main results and insights, which will be rigorously derived in the remainder of the article, are summarized as follows:

• The convergence rate, characterized by  ρcin (1), is uniquely determined by the local shape of the objective function about an isolated minimum. The global shape of the objective function determines the constant  Ccand the set of initial conditions x(0) for which (1) holds. In other words, the global shape (for example described by convexity) determines the stability and region of attraction of a local minimum, whereas the local shape, i.e., the curvature at an isolated minimum, determines the convergence rate.

• An algorithm is called accelerated if the convergence rate scales favorably for ill-conditioned optimization problems, meaning that the convergence rate scales favorably for large  κ. Ac-celeration is therefore a statement about the robustness of the convergence rate with respect to changes in the curvature, which is well defined for continuous-time as well as for discrete-time formulations.

• Accelerated convergence is generic to momentum-based optimization algorithms, provided that the damping scales with  1/√κ for large κ. Neither the evaluation of the gradient at a shifted position, nor a specifically engineered damping parameter, as for example proposed in Nesterov (2004, Sec. 2.2), are necessary.

• From a physics perspective, a momentum-based optimization algorithm can be thought of as a mass-spring-damper system, where the spring potential is given by the objective function. The algorithm design specifies the damping. The fact that the system has inertia (due to the second-order dynamics) implies that the convergence rate is robust to small changes in the spring, i.e., robust to small changes in the curvature of the objective function. The inertia gives the system the tendency to keep its velocity, which, provided that the damping is chosen appropriately, implies that small changes in the spring will not slow down convergence. This captures the mechanism behind accelerated convergence.

• The convergence rate typically becomes arbitrarily small for large  κ(if the dynamics are time-varying this statement applies for  t → ∞). The underlying dynamics are therefore close to

conservative, which means that great care is required for the discretization. A discretization that introduces an artificial energy drift, such as the explicit forward Euler discretization, for example, might even lead to instability, which clearly makes a favorable scaling of the convergence rate with  κimpossible. This motivates the use of symplectic discretization schemes, which preserve the Hamiltonian structure of the underlying conservative part of the dynamics.

• By introducing time-varying dynamics, which are obtained by adjusting the damping parameters with the number of iterations, the linear convergence rate is improved with an additional sublinearly converging term. If the local minimum is close to degenerate, the convergence is dominated by this term, as the rate of the linear convergence becomes arbitrarily small. In that way, the analysis relates the non-degenerate and degenerate cases.

Related work: The phenomenon of acceleration, which fundamentally motivates the use of momentum, has puzzled many researchers for almost 40 years. We do not attempt to give a full overview of the literature, but highlight some of the most recent work. Important contributions were made by Bubeck et al. (2015), who proposed an accelerated algorithm that has a geometric interpretation, by Allen-Zhu and Orecchia (2014), who show that coupling of gradient and mirror descent can lead to acceleration, and Lessard et al. (2016), who propose a general control-theoretic analysis framework. The framework has subsequently been extended and refined, for example by Michalowsky et al. (2019), who analyzed and quantified robustness and convergence trade-offs. Other work includes Diakonikolas and Orecchia (2018), who unify the analysis of first-order methods by imposing certain decay conditions, and Scieur et al. (2017), who interpret the accelerated gradient method as a multi-step discretization of gradient flow.

The work of Su et al. (2016) and Krichene et al. (2015) showed that the trajectories of the accelerated gradient method approach the solutions of a certain second-order ordinary differential equation. The resulting differential equation was analyzed in further detail in Attouch et al. (2018) and placed within a variational framework by Wibisono et al. (2016). This motivated further research on structure-preserving integration schemes for discretizing continuous-time optimization algorithms (Betancourt et al., 2018). While the continuous-time formulation of Su et al. (2016) is based on an accelerated optimization algorithm for smooth and convex objective functions (i.e., the case where the minimum could be degenerate), an alternative for strongly convex objective functions (non-degenerate case) was proposed in D¨urr and Ebenbauer (2012). In Franc¸a et al. (2019) and Muehlebach and Jordan (2019), important geometric properties of the underlying dynamics are highlighted and corresponding structure-preserving discretization schemes are analyzed. In addition, Franc¸a et al. (2019) propose relativistic dynamics, as these naturally bound the rate of change of the position by the speed of light, which is supposed to prevent overshoot. The work of Maddison et al. (2018) suggests that convergence can be improved by a suitable choice of the kinetic energy (for example choosing the kinetic energy to be the convex conjugate of the objective function).

While this line of work suggests strong parallels between dynamical systems and optimization algorithms, a main part of the research focuses on engineering Lyapunov functions or estimate sequences that arise separately from the dynamical system. These typically arise from educated guesses, or from the solutions of semidefinite programs, and enable the precise characterization of the convergence rate for a given algorithm. The aim of our work is different: We exploit fundamental topological properties of dynamical systems in order to provide an explicit characterization of the convergence rate up to constants, which, in most cases, can be evaluated in closed form. Hence, our analysis does not aim at precisely bounding the number of iterations needed to reach a local minimum with a certain tolerance, but provides a qualitative and quantitative characterization of the convergence rate up to constants. The analysis does not rely on convexity and applies both to continuous-time and discrete-time formulations. This provides a tool to understand how the convergence rate depends on various algorithm parameters and the shape of the objective function, and it characterizes the mechanism leading to acceleration. Our characterization of the convergence rate is both necessary and sufficient.

In the context of nonconvex optimization, the aim is typically to find a local minimum of a twice continuously differentiable function that satisfies certain non-degeneracy conditions. It has been shown that gradient descent converges to a local minimum from almost every initial condition, which is due to the fact that local maxima and saddle points are unstable equilibria (Lee et al., 2016). The same reasoning applies to gradient descent with momentum. However, even though gradient descent and gradient descent with momentum are guaranteed (in an almost everywhere sense) to ultimately reach a local minimum, it may take an arbitrarily long time to escape saddle points and local maxima. A major concern arises due to the fact that an objective function with n decision variables might have  n−1different isolated saddle points, which have to be traversed before reaching a local minimum. This implies that even for generic initialization strategies, the worst-case convergence rate depends on n (Du et al., 2017). This article is concerned with characterizing the convergence of momentum-based algorithms up to a constant factor. We will not analyze how this factor scales with n; the example from Du et al. (2017) suggests that the factor scales exponentially in n. However, the results from Ge et al. (2015) indicate that adding small random perturbations to gradient descent can improve the convergence rate and reduce the dimension-dependence of the convergence rate on n to polylog factors. A similar result applies likewise to gradient descent with momentum (Jin et al., 2017). A recent account of the state-of-the art is given in Jin et al. (2019), for example.

The poor scaling in the dimension n in the nonconvex case can be avoided by explicitly leveraging curvature information, as for example proposed in Nesterov and Polyak (2006) and Curtis et al. (2017). However, the computational cost per iteration of these methods is generally higher and increases with n. For simplicity, this article focuses on first-order algorithms with momentum, even though many ideas generalize in straightforward ways.

Outline: This article views optimization algorithms as dynamical systems. Two cases are distinguished: Section 2 assumes that the algorithm parameters are fixed, leading to time-invariant dynamics. The more general case, where the parameters are allowed to vary with the number of iterations, is discussed in Section 3.

Section 2 starts by introducing the notation and defining the scope of the analysis. A momentum-based optimization algorithm is understood as a second-order dynamical system,1 where the local minima of the objective function correspond to asymptotically stable equilibria. Section 2.2 and Section 2.3 introduce two prototypical examples of momentum-based optimization algorithms. These are generalized versions of Nesterov’s acclerated gradient scheme (Nesterov, 2004, Ch. 2.2), and also include Heavy-Ball methods (Polyak, 1964). By focusing on smooth dynamical systems based on smooth objective functions, our analysis excludes, for example, the treatment of constraints via (non-differentiable) indicator functions, as well as optimization algorithms with restart schemes. The subsequent result derived in Section 2.4 highlights our assertion that fundamental topological properties can be exploited for characterizing the convergence rate of momentum-based optimization algorithms. The results are applied to the analysis of the two prototypical examples in Section 2.6 and Section 2.7, leading to a broad characterization of the phenomenon of acceleration. In addition, Section 2.7 rigorously motivates the use of symplectic discretization schemes in the context of optimization: The symplectic discretization enables the computation of a modified energy function that can be used for stability analysis.

The structure of Section 3 is analogous to Section 2. The general result for characterizing the convergence rate in the time-varying case is presented in Section 3.1 and illustrated with two subsequent examples in Section 3.2 and Section 3.3. The results highlight the fact that time-varying damping parameters can speed up the convergence rate by an additional sublinearly converging factor. For ill-conditioned problems, this additional gain in the convergence rate becomes significant. The analysis connects the non-degenerate case with the degenerate case and motivates the update rules for the parameters of well-known accelerated gradient schemes, such as Nesterov (2004, p. 90, Constant Step Scheme II).

The article concludes with a summary and final remarks in Section 4.

2.1 Introduction

Throughout the article we consider the problem of minimizing the function  f : Rn → R, whichsatisfies the following assumption:

Assumption 1 The function f has a Lipschitz-continuous gradient. The critical points are non-degenerate and isolated.1

The Lipschitz continuity of the gradient is important for ensuring that the resulting continuous-time trajectories exist and are unique. The fact that f has isolated non-degenerate critical points excludes pathological cases where the function has multiple connected local minima. The less pathological case, where the local minima are isolated but have vanishing curvature in certain directions, can be obtained via a limit argument, as will be discussed in Section 3.

Due to the Lipschitz continuity of the gradient, the Hessian exists almost everywhere and is essentially bounded. We will summarize the (essential) upper and lower bounds on the Hessian with the two constants  Cf ≥ 0 and ¯Cf ≥ 0:

image

where  | · |denotes the Euclidean norm. Thus for  Cf = 0the function is convex, for  Cf > 0 it isnonconvex. In order to simplify our exposition, we will consider functions that satisfy:

Assumption 2 In addition to Assumption 1, f has a second derivative that is Lipschitz continuous.

None of the subsequent results will explicitly depend on a Lipschitz constant related to the second derivative. Hence, under very mild conditions, all our results characterizing the convergence rate ρc(t)in (1) apply when f satisfies Assumption 1 instead of Assumption 2.1 This is further discussed in Appendix F.

Without loss of generality, we further assume that f has a local minimum at  x∗ = 0 and thatf(x∗) = f(0) = 0. The convergence rate of a momentum-based optimization algorithm will depend on the local shape of the local minimum in question, which is determined by the constants

image

In case Assumption 2 is replaced with Assumption 1, the above constants are defined via the essential supremum and essential infimum of the Hessian in a neighborhood of  x∗.

We model a momentum-based optimization algorithm either as a continuous-time or discrete-time dynamical system of the form

image

where the superscript + denotes either differentiation with respect to t (continuous-time setting), in which case  I = R≥0, or a unit time-shift (i.e.,  q+(t) = q(t + 1), discrete-time setting), in which case I = {0, 1, 2, . . . }. The nonnegative real numbers are denoted by  R≥0, whereas the positive real numbers are denoted by  R>0. The dynamics

image

are implicitly dependent on  ∇fand are assumed to satisfy the following assumption.2

Assumption 3 The dynamics  gq and gpare continuously differentiable in both arguments and the derivatives are Lipschitz continuous.

In the continuous-time case, Assumption 3 implies that the resulting trajectories exist and are unique for all times  t ∈ I(Arnol’d, 1992, p. 93, Corollary 3). Differentiability also implies that the dynamics can be linearized about an equilibrium, which typically provides a means to study the local behavior of the resulting trajectories. In addition, Assumption 3 implies that, over a finite time interval, trajectories are continuously dependent on their initial conditions (Arnol’d, 1992, p. 93, Corollary 4). These topological properties will be exploited in the following.

In order to simplify notation, we define

image

and introduce  z(t) := (q(t), p(t)) for all t ∈ I. Moreover, the map  (q0, p0) → (q(t), p(t)) isdenoted by  ϕt : R2n → R2n, t ∈ I, and is referred to as the flow of the dynamical system (4) - (5). Next, we provide a formal definition of a momentum-based optimization algorithm.

Definition 1 We call the dynamical system (4) - (5) satisfying Assumption 3 a momentum-based optimization algorithm for the function  f if x∗ = 0is an asymptotically stable equilibrium in the sense of Lyapunov.

This means that  ϕt(0) = 0, for all t ∈ I, ϕtis continuous at 0, uniformly in  t, and limt→∞ ϕt(z0) =0 for any z0in a neighborhood of the origin.

Even though the dynamics  gq and gpcan capture gradient flow, as a special case (e.g.,  gp = 0,gq = −∇f(q)in continuous time), we are interested in analyzing momentum methods, which arise from a nontrivial choice of  gq and gp.

Throughout the article we will illustrate our ideas with two examples, which are prototypical versions of momentum-based optimization algorithms.

2.2 Example 1

The first example is based on the following continuous-time dynamics

image

where  ∇fdenotes the gradient of  f and fdthe dissipative forces. These dynamics can be viewed as a mass-spring-damper system, where f represents the spring potential and  fdthe damping. The dissipative forces are assumed to take the form

image

where the parameters  d > 0, β ≥ 0are constant. Ideally, the parameters  d and βare designed to take into account additional information about the the function f; for example, upper and lower bounds on the curvature  d2f/dx2. If βis chosen to be zero, the dissipative forces reduce to  −2dp.In that case, the dynamics (8) describe a continuous-time heavy ball method (Polyak, 1964). In case  β > 0, the dynamics are related to Nesterov’s accelerated gradient method (Muehlebach and Jordan, 2019).

An intuitive interpretation of the dissipative forces (9) can be given in the following way: For β = 0, (9) describes linear isotropic damping. For  β > 0, (9) can be rewritten as

image

which implies that (9) includes an additional damping term that averages the local curvature in the interval between  q and q + βp. As a result, the damping increases if the local curvature is large, and reduces if the local curvature is small. As the velocity p is larger, the interval over which the average is taken is increased. The two forms of damping, linear isotropic and curvature dependent, are balanced by the coefficients  d and β.

The equilibria of (8) are given by the critical points of f. Moreover, if a given critical point is a non-degenerate local minimum, the corresponding equilibrium is asymptotically stable. This follows by evaluating the total energy,

image

along the trajectories of (8),

image

which shows that the energy necessarily decreases in a neighborhood of the equilibrium. Combined with the fact that H is locally positive definite about a non-degenerate local minimum, this implies stability in the sense of Lyapunov. Asymptotic stability of the non-degenerate local minimum can then be concluded from La Salle’s theorem (see, for example, Sastry, 1999, Ch. 5.4), which is based on examining the invariant sets satisfying dH/dt = 0. Thus, the dynamical system (8) is a momentum-based optimization algorithm for f according to Definition 1.

However, analyzing how the energy evolves along the trajectories of (8) also reveals global properties of the dynamics. For  β = 0, d > 0it follows that the set of initial conditions that do not converge to a local minimum is a set of measure zero, given by the critical points of f that are not local minima. The analysis holds without assuming convexity of f. However, if f happens to be convex and has a unique global minimum, then the corresponding equilibrium is globally asymptotically stable for any choice of parameters  β ≥ 0, d > 0.

The same reasoning applies in case f satisfies Assumption 1 instead of 2, or when f has degenerate local minima.

2.3 Example 2

The second example is obtained by discretizing (8) in the following way:

image

where T > 0 is the step size. The discretization consists of a forward Euler update of the momentum coordinates, and uses the newly computed momentum for the position update. As will be further discussed in the remainder of the article, the fact that the newly computed momentum coordinate is used for the position update makes the scheme symplectic for  fd = 0. This means that the transformation (13) from  (qk, pk) → (qk+1, pk+1)preserves the symplectic form (for  fd = 0), whichhas important consequences. One of these consequences concerns the spectrum of the linearization of (13). In case  fd = 0, the corresponding eigenvalues are guaranteed to lie on the unit circle (for T ≤ 2/√L). This is in sharp contrast to the standard explicit Euler discretization, which is not symplectic, and where the eigenvalues lie outside the unit circle even for arbitrarily small T > 0. Indeed, we will exploit the fact that the map (13) is symplectic (for  fd = 0) to construct a modified energy function, which will be used for a stability analysis that extends beyond the linearization. Additional background information on symplectic integration can be found in Sanz-Serna (1992) or Hairer et al. (2002), for example.

For  β = 0, the resulting algorithm is referred to as gradient descent with momentum (Polyak, 1964). For  β > 0, Nesterov’s accelerated gradient scheme (Nesterov, 2004, Constant step scheme III, p. 81) is obtained by choosing the parameters as follows:

image

where  κ and Lcharacterize the local shape of a local minimum, (Muehlebach and Jordan, 2019). Moreover, as pointed out in Muehlebach and Jordan (2019), the “Constant step scheme II” algorithm of Nesterov (2004) is obtained by a particular choice of time-varying coefficients  β and d. Thegeneralization to time-varying coefficients will be discussed in Section 3.1.

The equilibria of (13) are again the stationary points of f. In order to determine the stability of an isolated local minimum, we linearize the dynamics,

image

where  He := df/dx2|x=0(without loss of generality we consider  x∗ = 0). An eigenvalue analysis then reveals that the corresponding non-degenerate local minimum is asymptotically stable if

image

holds for all eigenvalues  h of He, i.e. µ ≤ h ≤ L.1 Asymptotic stability of the linearized dynamics implies that the same equilibrium is asymptotically stable under the nonlinear dynamics (13) (Sastry, 1999, p. 215).

For example, provided that the parameters  d and βare chosen according to (14), we obtain that the given equilibrium is asymptotically stable if

image

We thus conclude that (13) is an optimization algorithm for f according to Definition 1 provided that the constants  d, β, and Tare chosen such that (16) is satisfied.

Unlike Example 1, our analysis for Example 2 is valid only in a neighborhood about the equilibrium, which corresponds to a local minimum. However, the specific structure of the discretization and the ideas from Example 1 can be exploited for obtaining a nonlinear analysis that is valid beyond a neighborhood of the equilibrium. This will be illustrated in Section 2.7.

2.4 Characterizing the convergence rate

Guaranteeing mere convergence is often not enough, as we are primarily interested in how quickly the trajectories of the nonlinear dynamics (4) and (5) converge to a local minimum. In the following, we argue that in most cases, a linear analysis characterizes the convergence rate up to constants. The linear analysis typically reduces to the computation of eigenvalues.

Our main proposition is based on the following assumption and definition.

Assumption 4 Let the linearized dynamics

image

be such that there exists an estimate

image

where  Cl ≥ 1 and α > 0are constant.

Definition 2 The region of attraction of the equilibrium at the origin (of the nonlinear dynamics (4)) is defined as the set

image

Proposition 3 Let Assumption 4 be satisfied. Then, for any compact set  A ⊂ Rthere exists a finite constant ˆC ≥ 1such that for all  z0 ∈ A

image

Proof If A is empty the statement is trivial. We therefore assume that A is non-empty. Lemma 14 (see Appendix A) implies that there exists an open ball  Bδ of radius δ > 0, centered at the origin, such that any trajectory starting in  Bδconverges with rate  α. We make the following claim. Claim: There exists a finite time  Tm > 0, Tm ∈ Isuch that for all  z0 ∈ A, ϕt(z0) ∈ Bδ, for allt ∈ I, t ≥ Tm.Proof of the claim: The origin is a stable equilibrium. Hence there exists a constant  ϵ > 0 such thatall trajectories starting in  Bϵ, the open ball of radius  ϵcentered at the origin, remain in  Bδ for alltimes. In addition, each  z0 ∈ A satisfies limt→∞ ϕt(z0) = 0, since A ⊂ R. Thus, for each  z0 ∈ Athere exists a time  T(z0) such that ϕT(z0)(z0) < ϵ/2. The continuity assumptions on g imply that ϕT(z0)is continuous, which can be verified by the Gr¨onwall inequality. Therefore, for each  z0 ∈ Athere exists an open ball  B(z0)centered about  z0such that for all  ¯z0 ∈ B(z0), ϕT(z0)(¯z0) < ϵ,which implies  ϕt(¯z0) ∈ Bδ for all t ≥ T(z0). The collection of all  B(z0), z0 ∈ Ais an open cover for A. Due to the fact that A is compact, there exists a finite sub-cover (according to the HeineBorel theorem), which we denote  B(ˆzi), i = 1, 2, . . . , N, where Nis finite. As a result, choosing Tm := maxi∈{1,2,...,N} T(ˆzi) implies ϕt(z0) ∈ Bδ for all z0 ∈ A, t ≥ Tm, which proves the claim.

Moreover, the continuity of  ϕt for all t ∈ I, 0 ≤ t ≤ Tm, implies further that  ϕt(A)is bounded for any  t ∈ I, 0 ≤ t ≤ Tm. Combined with Lemma 14 (see Appendix A), this yields the following bound

image

for all  z0 ∈ A, where CA ≥ δ and ˜C ≥ 1are positive constants. We fix  z0 ∈ A, consider the trajectory  z(t) := ϕt(z0), t ∈ I, and apply the mean value theorem,

image

where  ξ(t)lies between z(t) and the origin. Due to the fact that the dynamics are assumed to have Lipschitz-continuous derivatives, we obtain the following bound:

image

where ¯CAdenotes a Lipschitz constant of  ∂g/∂z on A. According to (19), the trajectory |z(t)| is integrable (in continuous time) and absolutely summable (in discrete time). We obtain, by virtue of Lemma 13 (see Appendix A),

image

where  Czis constant. The constant  Czis related to an upper bound on the integral (in continuous time) or the sum (in discrete time) of  |z(t)| over t ∈ I, which according to (19), is guaranteed to be finite.

Remarks:

• Proposition 3 characterizes the convergence rate and states that the number of iterations required to obtain an  ϵ-accuracy approaches  log(1/ϵ)/α for small ϵ. This does not provide a tight bound on the number of iterations required to achieve a certain accuracy, as the constant ˆCmight depend on  αor grow rapidly with the size of A. Nevertheless, it enables a qualitative and quantitative discussion of the convergence rate  α. In particular Proposition 3 highlights the fact that the convergence rate  αis determined by the local properties of the dynamics, which depend on the local shape of the objective function f.

• The assumptions required for invoking Proposition 3 are often straightforward to verify, as Assumption 4 hinges on an eigenvalue analysis of  ∂g/∂z at z = 0. Proposition 3 can also be generalized to the case where f satisfies Assumption 1 instead of Assumption 2, as shown in Appendix F.

• The proof highlights the following alternative statement of Proposition 3: There exists a finite time  Tm > 0after which all trajectories starting in A are guaranteed to be within a small neighborhood of the origin. Within this neighborhood the convergence is exponential with rate  α.

• Given the smoothness properties of g, the assumption that the linearized dynamics converge with rate  α > 0(Assumption 4) is necessary and sufficient for the convergence of the nonlinear dynamics with rate  α, as can be shown with the arguments of Lemma 14 in Appendix A.

2.5 Implications for optimization algorithms

Proposition 3 enables a characterization of the convergence rate based on a linear analysis of the dynamics about an equilibrium. In the following, we will use Proposition 3 to discuss the convergence rate of the optimization algorithms given in Example 1 and Example 2. In particular, this provides conditions guaranteeing accelerated convergence.

2.6 Example 1

The total energy of the dynamical system is given by (11), and according to (12), energy is dissipated along trajectories. As we will show in the following, the energy function can therefore be used to characterize the region of attraction of the equilibrium  z∗ = 0, whereby the topology of the level sets of H and f will play an important role. This will be discussed next.

By assumption, f has a local non-degenerate minimum at  x∗ = 0, which implies that  f−1([0, c])contains a compact, connected component containing the origin for sufficiently small c > 0. The set  f−1([0, c])describes all values  x ∈ Rn such that 0 ≤ f(x) ≤ c. Morse theory (Milnor, 1963) concludes that the topology of the set  f−1([0, c])is determined by the critical points of f. Thus, the set  f−1([0, c])includes a compact, connected component that contains the origin (and no other

image

Figure 1: The figure illustrates the set  f−1([0, c]), with c > 0, when the function f is scalar. In this example,  f−1([0, c])contains two connected components  C1 and C2. The function f has a local minimum at the origin, a saddle point at  ˆx, and a local maximum at  xmax. Theorigin is contained in the set  C1. Morse theory states that  C1is guaranteed to be compact provided that  0 < c < f(ˆx). Due to the fact that f is one-dimensional  C1 remainscompact for  0 < c < f(xmax); this is, however, no longer true in higher dimensions.

critical point), as long as  c < ˆf = f(ˆx), where ˆxis any other critical point. The situation is illustrated with an example in Figure 1.

From the definition of H we infer that the critical points of H are all of the form  (q∗, 0), whereq∗ corresponds to a critical point of f. The above reasoning therefore also applies to the total energy H, and concludes that the set  H−1([0, c])includes a compact, connected component that contains the origin (and no other critical point), as long as  c < ˆf. This motivates the following definition, which will be used throughout the remainder of the article.

Definition 4 The set  Afis defined as the connected component of  H−1([0, ˆf))that contains the origin.

Analyzing the rate of change of the energy along the trajectories of (8) (cf. (12)),

image

reveals that by a suitable choice of the parameters  d and β, the energy necessarily decays. In particular, this is the case for  d > 0, 0 ≤ β < 2d/Cf, where Cf ≥ 0denotes a lower bound on the Hessian of f, as defined in (2). We therefore conclude as follows.

Proposition 5 Provided that  d > 0 and 0 ≤ β ≤ 2d/Cf, the origin is an asymptotically stable equilibrium in the sense of Lyapunov. Its region of attraction contains the set  Af.

Proof We consider any initial condition  z(0) ∈ Af and define H0 := H(z(0)). For d > 0 and0 ≤ β ≤ 2d/Cf, the energy necessarily decays along the trajectories of (8), c.f. (23). The trajectory z(t) = (q(t), p(t)) is therefore confined to the connected component of  H−1([0, H0])that contains the origin, which, according to the above discussion, is necessarily compact. This implies that the origin is stable. In addition, the energy strictly decreases except when p(t) = 0. Hence, according to La Salle’s theorem, (see, for example, Sastry, 1999, Ch. 5.4), q(t) necessarily converges to a critical point of f, whereas p(t) converges to zero. The origin is the only critical point contained in Af, which implies asymptotic stability of the origin.

An eigenvalue analysis of the linearized dynamics (about the equilibrium) reveals that the eigenvalues are given by

image

where h is any eigenvalue of  d2f/dx2|x=0. Thus, Proposition 3 asserts that the convergence rate for all initial conditions in  Afis directly determined by the real part of the eigenvalues, provided that  d > 0 and 0 ≤ β < 2d/Cf.

The eigenvalues h satisfy the upper and lower bounds  1/κ ≤ h/L ≤ 1, cf. (3). An appropriate normalization of the constants  d and βreduces the analysis to the case L = 1, which we consider in the following. Figure 2 shows how the eigenvalues vary as a function of  d, β and h, according to the formula (24).

We consider first the case  β = 0: For small  d—i.e., 0 < d ≤ 1/√κ—the eigenvalues are complex conjugates, have real part  −d(independent of h), and their imaginary part varies between �1/κ − d2 and√1 − d2 (as hchanges). As d is increased above  1/√κthe eigenvalues can be

image

Figure 2: This figure shows how the eigenvalues of the linearization change as a function of the damping parameters  d and β, and as a function of the curvature at the equilibrium. The top left plot shows the behavior of the eigenvalues for  β = 0, as the curvature h is varied from 0 to 1 (the different colors represent the different values of d). The top right plot shows the behavior of the eigenvalues for  β = 0.2and the bottom right plot the behavior for  β = 0.4, where the curvature h is varied from 0 to 1. The figure indicates that for  β = 0, the eigenvalues are real for large values of d and small values of h. As h is increased the eigenvalues may become complex conjugated, in which case a further increase in h affects only the imaginary part. The additional parameter  βhas the effect of reducing the real part for larger values of h. This increases the convergence rate for large values of h.

both real or complex conjugates, depending on h. However, their worst-case real part is given by −d+�d2 − 1/κ, which rapidly increases for larger d. The worst-case convergence rate (for a fixed d) is given by the maximum real part as h is varied between  1/κ and 1. Thus, in that sense, the optimal worst-case convergence rate is  1/√κ for d = 1/√κ.

Increasing  βhas the effect of increasing the convergence rate for larger values of h, since the parameter  βintroduces additional damping as h becomes large. Provided that  β ≤ 1, the qualitative behavior of the eigenvalues remains the same: For small  d, i.e., 0 < d ≤ 1/√κ − β/(2κ), theeigenvalues are complex conjugates with worst-case convergence rate  −d − β/(2κ). As d is in-creased above  1/√κ + β/(2κ), the eigenvalues can be both real or complex conjugated, depending on h. Their worst-case real part is again achieved for  h = 1/κand is rapidly increasing for larger d. Again, the optimal worst-case convergence rate is  1/√κobtained for  d = −β/(2κ) + 1/√κ.

In continuous time, any desired convergence rate can be realized, by a simple reparametrization of time. A linear reparameterization, ˆt = ctt, where ct > 0is constant, will simply scale the real parts and imaginary parts of the eigenvalues with  ct. A general, nonlinear, but diffeomorphic transformation will lead to time-varying dynamics, which will be discussed in Section 3. Thus, it might seem that any discussion of continuous-time convergence rates in the context of optimization is pointless. However, the analysis above tells us something different; it reveals how the convergence rate is affected by the condition number, which characterizes the shape of a local minimum. The analysis should be interpreted in the following way: provided that the time scale is fixed such that a convergence rate of  1 (or 1s−1 if the physical units are kept) is achieved for  κ = 1, the analysis reveals how the convergence rate of any optimization algorithm of the type (8) deteriorates as  κincreases.

For the following analysis we introduce the notation  u1 ≻ u2if the function  u1dominates the function  u2for large arguments; that is,  limκ→∞ u1(κ)/u2(κ) = ∞, where u1 and u2 are real-valued functions that are positive for large arguments. In the same way, the notation  u1 ≺ u2implies that the function  u2 dominates u1for large arguments. We further use  u1 ∼ u2 to implythat neither  u1 nor u2are dominant for large arguments; that is,

image

We will analyze the performance of the algorithm given in Example 1 not only on a single function, but on a whole class of functions. In order to make the following statements precise, we will fix a compact set  A ⊂ R2n that contains the origin and introduce the following class of functions (parametrized by the constants  ¯κ ≥ 1, Cf > 0, ¯Cf > 0).

Definition 6 Let  F¯κ,Cf, ¯Cfdenote the set of all functions such that each element  f ∈ F¯κ,Cf, ¯Cf satisfiesthe following conditions: 1) f has an isolated local minimum at the origin with condition number κ ≤ ¯κ, 2) fsatisfies Assumption 2 and the bounds in (2), and 3)  Af ⊃ A, where Af is definedaccording to Definition 4.

These conditions are motivated as follows. The first prescribes the local geometry about the local minimum at the origin, which influences the convergence rate in significant ways. The second ensures smoothness of the gradient, and the third guarantees that there are no critical points in A other than the origin.1

Acceleration is obtained whenever the convergence rate (again, relative to the convergence rate achieved for  κ = 1) scales with  1/√κfor large values of  κ. More precisely:

Definition 7 A momentum-based algorithm is accelerated if there exists constants  κ0 ≥ 1 andca > 0, such that for any  κ ≥ κ0 and f ∈ Fκ,Cf, ¯Cf, the following bound holds (for some constant

image

We will now proceed to derive conditions on the parameters  d and βof the algorithm given in Example 1 that guarantee accelerated convergence. We start with the case  β = 0. For d ≻ 1/√κit follows that the real part of the worst-case convergence rate scales with  −d +�d2 − 1/κ ≈−1/(2dκ), which makes acceleration impossible. For  d ≺ 1/√κthe damping is too small; i.e., the real part of (24) scales worse than  1/√κ. Hence, acceleration is only achieved for  d ∼ 1/√κ. Asimilar argument applies to the case  0 < βand yields  d + β/(2κ) ∼ 1/√κ.

The above analysis is summarized with the following proposition.

Proposition 8 In the nonconvex case (Cf > 0), the algorithm given in Example 1 is accelerated for the set of parameters  d ∼ 1/√κ, 0 < d, and 0 ≤ β < 2d/Cf.

In the convex case (Cf = 0), the algorithm given in Example 1 is accelerated for the set of parameters  d + β/(2κ) ∼ 1/√κ, 0 < d, 0 ≤ β.

We would like to emphasize that the bound in Definition 7 is checked for each function  f ∈Fκ,Cf, ¯Cfindividually, whereby the constant  Camay depend on the specific function f. Given the (potentially large) compact set A, Proposition 8 therefore answers the following question: What are the algorithm parameters ensuring that for any  f ∈ Fκ,Cf, ¯Cfthe time for reaching an  ϵ accuracyscales with √κln(1/ϵ)up to constants, i.e., for small  ϵ? One important aspect of the analysis is that the convergence rate depends only on the local geometry of the objective function, as characterized by the constant  κ.

The basic notion of local analysis, obtained from a derivation of eigenvalues, is well known (see, for example, Polyak (1987)). But Proposition 8 goes beyond classical local analysis in that, by virtue of Proposition 3, the local rate can be guaranteed for a large portion of the region of attraction of a given equilibrium. Key for this result is a nonlinear and global stability analysis. In the strongly convex setting, the results regarding the case  β = 0have been derived in Gadat et al. (2018). Similar results regarding the case  β ̸= 0can be found in Attouch et al. (2018).

The bounds that we establish throughout the manuscript are restricted to initial conditions that are contained in a compact set within the region of attraction of a non-degenerate and isolated local minimum. This can be further motivated by considering the function shown in Figure 1. Provided that the algorithm dissipates energy, we infer from the above discussion that the region of attraction of the origin (an open set) will comprise points in the state space that are arbitrarily close to the saddle  (ˆx, 0)and the maximum  (xmax, 0)(see Figure 1). Saddle points or maxima are equilibria, since any gradient-based algorithm cannot distinguish between them and minima. Hence, when initializing the algorithm close enough to  (ˆx, 0) and (xmax, 0), convergence of the algorithm can potentially take an arbitrarily long time (see Du et al. (2017) for a formal analysis). In that sense, the restriction of initial conditions to a compact set within the region of attraction of a given equilibrium appears to be natural.

2.7 Example 2

In discrete time, the stability analysis is not as straightforward, since the energy H is in general not dissipated along trajectories. However, the specific structure of the discretization can be exploited for obtaining a nonlinear stability analysis that is valid beyond a neighborhood of a local minimum.

As remarked in Muehlebach and Jordan (2019), the discretization (13) can be divided into two parts, an energy dissipation step and a symplectic Euler step (Hairer et al., 2002, p. 3),

image

with intermediate state  ¯zk = (¯qk, ¯pk). In order to simplify notation, we introduce the maps  Φd,T andΦTto denote the energy dissipation and the symplectic Euler step, respectively. The symplectic Euler step is a well-known structure-preserving first-order integration scheme. Due to the symplectic integration, there exists a modified energy function that is nearly (up to exponentially small terms) conserved by the symplectic Euler step (Hairer et al., 2002, Chapter VI). The modified energy function can be computed by means of truncated Taylor-series expansions. In order to make the analysis rigorous, f is assumed to be analytic, which enables the estimation of higher-order derivatives using Cauchy’s integral formula. It will be shown that for the subsequent stability analysis the assumption of f being analytic poses essentially no restriction, as any continuous function can be approximated arbitrarily closely by an analytic function on a compact domain (Stone-Weierstrass Theorem; Rudin, 1976, p. 159). The modified energy function is characterized by the following result.

Proposition 9 Let f be analytic on  Bcr, the closed n-dimensional ball of radius r centered at the origin, and let  LHbe a Lipschitz constant of  ∇H on Bcr × Bcr. Then there exists a perturbed Hamiltonian ˜H : Bcr × Bcr → R such that

image

for all  0 < T ≤ T0/3and for all  |z0| ≤ r2(1 + 3.63LHT(1 + eT0/3))−1, where

image

are constant. The perturbed Hamiltonian has the form

image

where  F : Bcr × Bcr → Ris an analytic function. The perturbed Hamiltonian  ˜Hhas the same critical points as H and satisfies

image

Remarks:

• A more general version of Proposition 9 is stated and proved in Appendix C.

• The proof follows the reasoning of Hairer et al. (2002, p. 307), which enables the construction of ˜Hby means of a truncated series expansion. Cauchy’s integral formula is used to assert the convergence of the truncated series. The result from Hairer et al. (2002, p. 307) is extended by exploiting the specific structure of the underlying dynamics, which leads to the additional statements about the modified energy function ˜H. These are crucial in the context of stability analysis.

• The constants  T0 and C∆ ˜Hare determined as a function of the Lipschitz constant of  ∇H,which can be regarded as the natural time constant for the dynamics governed by the Hamiltonian  H. For LH = 1we obtain the following values:  T0 ≈ 0.071, C∆ ˜H ≈ 8.4,

image

for all  0 < T ≤ 0.023 and all z0 such that |z0| ≤ 0.45r. Choosing, for example, a time step T = 0.001 leads to the bound  1.2·10−33 on the right-hand side of expression (29), indicating that ˜His virtually exactly conserved by the symplectic Euler scheme.

• The estimate for the maximum time step  T0/3is typically conservative. However, the proposition rigorously establishes that for a small enough time step, the perturbed Hamiltonian will be almost exactly conserved. The importance lies in the fact that the upper bound on the time step is only dependent on the Lipschitz constant of  ∇H, which is directly related to the Lipschitz constant of  ∇f. Due to the fact that Proposition 9 is a statement about the integration of the conservative part of the dynamics, the maximum time step  T0/3is independent of the parameters  d, β, and κ.

• The bounds (28) enable a nonlinear stability analysis, based on the modified Hamiltonian ˜H.Due to the fact that  15LHT < 1 for all T ≤ T0/3, ˜His necessarily positive in a neighborhood of the origin, cf. (28). Combined with the fact that the perturbed Hamiltonian has the same critical points as H, this concludes that the level sets of ˜Hare compact in a region about the origin. The size of this region is determined by the critical points of H, as argued in the analysis of Example 1; see Section 2.6.

In the following, we will use the modified energy function ˜H, whose existence is ensured by Proposition 9, for analyzing the stability of the dynamics (13) in the large. As pointed out, the dynamics (13) can be subdivided into a dissipation step and a symplectic Euler step.

In order to simplify the presentation we focus on the case where T is small and the map  Φd,Tis close to the identity (little damping). For stability analysis this is the most challenging setting, since, due to the almost vanishing damping, the convergence can be arbitrarily slow (the equilibrium is almost non-attractive). In case  Φd,Tis not close to the identity, which is, for example, obtained for a constant parameter  β > 0, independent of  κ, stability can be analyzed by means of the unperturbed energy function H, as shown in Appendix E. We will thus concentrate on the following result.

Proposition 10 Let the dissipative forces  fd(qk, pk)be such that  −d2|pk|2 ≤ pTk fd(qk, pk) ≤−d1|pk|2 with 0 < d1 ≤ d2.

Then, there exists a maximum time step  Tmax > 0, such that the origin is an asymptotically stable equilibrium of the dynamics (13) for all  T ≤ Tmax, with domain of attraction at least  Af ∩A,where A is any compact set. Up to exponentially small terms due to (27), the maximum time step Tmaxdepends on an upper bound on  d2, d1/d2, and LH, the Lipschitz constant of  ∇H.

Remarks:

• The proof of Proposition 10, which is included in Appendix D, is based on the Lyapunov function V (q, p) = ˜H(q, p) + Td12 ∇f(q)Tp,

where ˜Hdenotes the perturbed energy function introduced in Proposition 9. The function V is motivated by the following observation. As the damping parameter  d1decreases and Φd,Tapproaches the identity, V reduces to the perturbed energy function ˜H, which (up to exponentially small terms) is known to be conserved by  ΦT. The correction  Td1pT∇f(q)/2is required due to the fact that  Φd,Tis a pure contraction in the momentum variable, and as a result ˜His not necessarily decreasing through the application of  Φd,T. A similar correction has been previously used in the literature on dissipative systems; see, for example, Hale (1988) or Haraux (1991), and in the context of the heavy-ball method in Gadat et al. (2018).

• Proposition 10 ensures that the region of attraction is the same as the continuous-time counterpart, and that the requirements on the time step for guaranteeing asymptotic stability is independent of the minimum damping  d1(up to the exponentially small terms). This will be important in the following, where we will analyze how the convergence rate scales with  κ.

• Provided that  β > 0and the function f is convex, the stability analysis simplifies substantially. In particular, a sufficient condition for stability is obtained by analyzing the total energy H along the trajectories of (13), as shown in Appendix E.

Next we will analyze the specific implications for the dynamics (13). We recall that the contraction step  Φd,Tis given by

image

which concludes that the upper and lower bounds  d2 and d1of Proposition 10 are given by  d2 =2d+ ¯Cfβ and d1 = 2d−Cfβ (assuming β ≥ 0). The upper and lower bounds  d2 and d1are therefore functions of  κ.

Hence, according to Proposition 10, provided that  d1 > 0 and d2/d1 and d2are bounded with respect to  κ, there exists a time step T, independent of  κ, such that the origin is asymptotically stable, where the region of attraction includes any compact subset of  Af. Then, according to Proposition 3, the convergence rate of any trajectory starting in  A ⊂ Af, Acompact, is determined by the magnitude of the eigenvalues of the linearized dynamics. The eigenvalues are given by

image

enabling the calculation of the convergence rate for a given choice of  T, d, β and h.

For the following discussion, we again assume that  d and βare normalized such that  1/κ ≤h ≤ 1. We fix d, β, and Tand analyze how the eigenvalues change as a function of h, as done in Figure 3. The worst-case convergence rate is determined by the maximum magnitude of the eigenvalues (over  1/κ ≤ h ≤ 1). For very small values of h, the eigenvalues are real, and one eigenvalue is very close to 1. As h is increased, the eigenvalues become complex conjugates, where for  β = 0, the eigenvalues are located along circles centered at the origin, whereas for  β > 0their magnitude slightly decreases. If h is increased further, the eigenvalues become real again and their magnitude increases. The worst-case convergence rate, i.e., the largest magnitude of  |λ1,2| istherefore either achieved for  h = 1/κ or h = 1.

We are interested in determining the conditions on  d and βsuch that the convergence rate scales with  1/√κ. We consider first the case  h = 1/κ and β = 0, and assume that d is large enough such that the eigenvalues are real. Then, provided that  d ∼ 1/√κit follows that  λ1,2 ∼ 1 − T/√κ.However, if d is chosen to be larger, i.e.,  d ≻ 1/√κ, it follows that for large  κ,

image

that is, the convergence rate scales worse than  1/√κ. In case dis small enough, such that the eigenvalues are complex conjugates, their magnitude is given by

image

which indicates that a scaling of the convergence rate with  1/√κis only achieved for  d ∼ 1/√κ.For  h = 1 and β = 0it follows from  d ∼ 1/√κ that λ1,2 approach

image

for large  κ. Thus, for  T ≤ 1, for example, the eigenvalues are complex conjugates and the worst-case convergence rate scales with  1/√κ. The condition  d ∼ 1/√κis therefore necessary and sufficient for ensuring that the worst-case magnitude of  |λ1,2|scales with  1/√κ. A similar analysis applies to the case where  β > 0, as shown below.

As in the discussion of Section 2.6 we say that the optimization algorithm (13) is accelerated if the convergence rate scales with  1/√κ(see Definition 7), where exponentially small terms resulting from the application of Proposition 9 are neglected. As a consequence, the above discussion allows us to translate the results from Proposition 8 almost verbatim to the discrete-time setting. This results in a broad characterization of the parameters  d and βleading to acceleration.

Proposition 11 In the nonconvex case, (Cf > 0), there exists a time step T > 0, such that the algorithm given in Example 2 is accelerated provided that  d ∼ 1/√κ, 0 < d, 0 ≤ β < 2d/Cf.

Example 2 is accelerated provided that  d + β/(2κ) ∼ 1/√κ, 0 < d, 0 ≤ βand either  β ∼ 1,β ∼ 1/√κ, or β ≺ 1/√κ.

Proof The requirements on  βare needed for guaranteeing asymptotic stability of the origin for a time step T that is small enough, but independent of  κ, as implied by Proposition 10 and Appendix E. It therefore remains to analyze the behavior of the eigenvalues  λ1,2as a function of h for 1/κ ≤ h ≤ 1.

image

and if h is increased further they become real again. In case the eigenvalues are complex conjugates, their magnitude is given by

image

We consider first the case  h = 1/κ: The eigenvalues are real provided that  d + β/(2κ) +T/(2κ) ≥ 1/√κ, which, given the assumptions on  β, implies that  d ≥ 1/√κ for large κ. In cased ∼ 1/√κit follows that  λ1,2 ∼ 1 − T/√κ, which therefore yields accelerated convergence. In case  d ≻ 1/√κit follows that

image

due to the fact that  1/(d2κ) → 0 for large κ. Thus, accelerated convergence is impossible for d ≻ 1/√κ. In case the eigenvalues are complex conjugates for  h = 1/κ, it follows from (33) that d ∼ 1/√κis necessary for accelerated convergence.

We therefore proceed to the case h = 1, where due to the fact that d vanishes for large  κ, thetwo eigenvalues approach constant values,

image

for  β∞ := limκ→∞ β, which is either constant or zero. This results in a convergence rate that is independent of  κ, since Tis chosen small enough to guarantee convergence.

In particular, this leads to the conclusion that, for example, the following heavy-ball scheme,

image

leads to accelerated convergence for a sufficiently small value of T. In case the Lipschitz constant of  ∇fis bounded by one, the bound  T ≤ 0.001for guaranteeing acceleration is obtained by following the proof of Proposition 10.1 Note that compared to Nesterov’s original scheme, a gradient evaluation at  qk + βpkis not required.

We would like to emphasize that our analysis captures the computational complexity up to constants. For a given objective function, choosing  β ̸= 0typically allows larger steps T, which is likely to reduce the number of iterations needed for convergence. On strongly convex functions, for example, the “Constant step scheme III” presented in Nesterov (2004, p. 81), seems to provide a good compromise between a straightforward implementation and fast convergence (considering constants).

Local results akin to Proposition 11 which are based on a quadratic objective function are well known. A summary can be found for example in Lessard et al. (2016). In addition to unifying the continuous-time and discrete-time analysis, an important aspect of Proposition 11 is to establish a rate that is valid for a large portion of the region of attraction of a given equilibrium. Furthermore, the fact that the heavy-ball scheme leads to accelerated convergence for a sufficiently small value of T appears to be new. If T is chosen to be too large, however, for example by tuning T on quadratic functions, this may lead to convergence issues even on strongly convex functions, as has been reported in Lessard et al. (2016).

image

Figure 3: This figure shows how the eigenvalues of the linearization change as a function of the damping parameters  d and β, and as a function of the curvature at the equilibrium, h. The top left plot shows the behavior of the eigenvalues for  β = 0, as the curvature h is varied from 0.1 to 1 (the different colors represent the different values of d). The top right plot shows the behavior of the eigenvalues for  β = 0.2and the bottom right plot the behavior for  β = 0.4, as the curvature h is varied from 0.1 to 1. For all the plots the time step is set to T = 0.8. The figure indicates that for  β = 0, the eigenvalues move first along the real axis and then along concentric circles (as h changes). The additional parameter  β has theeffect of reducing the magnitude of the eigenvalues for larger values of h. While this can increase the convergence rate, it bears also the risk of instability, when  d, β, and T arechosen to be too large.

Next we will consider the case where the dynamics g are time dependent. The main motivation for introducing time-dependent dynamics lies in improving the convergence rate for functions that have an almost vanishing curvature at a local minimum.

We slightly abuse notation and reuse the variables introduced in Section 2.4. It will be clear from context whether we refer to the time-varying or time-invariant case.

Following Section 2, we model a momentum-based optimization algorithm as a continuous-time or discrete-time dynamical system of the form

image

where the dynamics satisfy the following assumptions:1

Assumption 5 The dynamics  gq and gpare continuously differentiable in all their arguments. The derivative in the second and third arguments is Lipschitz continuous, uniformly in t.

We define the map from  (q0, p0, t0) → (q(t), p(t)) as ϕt : R2n × I → R2n, for t ∈ I, t ≥ t0,and define the dynamical system (37)-(38) to be a momentum-based optimization algorithm for the function  f if x∗ = 0is an asymptotically stable equilibrium in the sense of Lyapunov (uniformly in the initial time  t0). Due to the continuity assumptions on the dynamics  gq and gp, the continuous-time existence and uniqueness results, as well as the continuity of trajectories with respect to their initial conditions continues to hold (Arnol’d, 1992, p. 93, Corollaries 3 and 4).

The examples presented in Section 2.2 and Section 2.3 are modified by allowing the constants d > 0 and β ≥ 0to be time varying. We restrict ourselves to the case where  d and β convergefor large time, which leads to asymptotically time-invariant dynamics. This has the advantage that the asymptotic convergence rate is independent of the initial time  t0, and, under mild continuity conditions, the (optimal) convergence rates from Section 2.4 will be recovered for large t. Thus, (8) and (13) are extended by replacing the constants  d and βwith the functions  d : R≥0 → R>0 andβ : R≥0 → R≥0, which are assumed to be continuously differentiable and convergent,

image

The stability analysis of Example 1 (Section 2.2) translates to the time-varying case.2 This implies that the algorithm (8) with time-varying parameters  d and βis a momentum-based optimization algorithm according to the above definition.

The stability analysis of Example 2 is extended to the time-varying case by rewriting the dynamics (37) in the following way:

image

where the linear time-varying part vanishes for  t → ∞. The remainder  rNLis of second order in zt(uniformly in  t ∈ I). Thus, if the linear time-invariant part has all eigenvalues strictly within the unit circle, the dynamics (40) are asymptotically stable, uniformly in  t0.1 Consequently, the analysis from Section 2.3 yields the following conditions, cf. (16):

image

for all eigenvalues  h of He = d2f/dx2��x=0.

3.1 Characterizing the convergence rate

As in Section 2.4 we argue that a linear analysis can be used to characterize the convergence rate of the nonlinear dynamics up to constants. The results are derived only for non-degenerate isolated local minima. We believe, however, that the discussion also provides insights into the case where the curvature vanishes at a local minimum; see Section 3.2 for further discussion of this point.

A similar analysis of the role of choosing a time-varying damping parameter can be found in Attouch et al. (2018). The emphasis of the following section lies in providing intuition through a unified treatment between the degenerate and non-degenerate case.

We make the following assumption.

Assumption 6 Let the linearized dynamics

image

be such that there is an estimate

image

where  Cl ≥ 1 and α > 0are constant, and  ρ : R≥0 → R≥0is continuous, and monotonically decreasing.

Proposition 12 Let Assumption 6 be satisfied and let the region of attraction of the equilibrium at the origin of the nonlinear dynamics (37) be denoted by  R ⊂ R2n. Then, for any compact set A ⊂ Rand any initial time  t0 ∈ I, there exists a finite constant ˆC ≥ 1such that for all  z0 ∈ A,

image

Proof The set A is compact, nonempty, and contains the origin. Lemma 16 (see Appendix B) implies that there is a ball  Bδ of radius δ > 0, centered at the origin, such that any trajectory starting in  Bδ at time τ ∈ Iconverges with rate  (ρ(t)/ρ(τ)) exp(−α(t − τ)). The following claim can be verified with the arguments of Proposition 3.

Claim: There exists a finite time  Tm > t0, Tm ∈ Isuch that for all  z0 ∈ A, ϕTm(z0, t0) ∈ Bδ.

Moreover, the uniform continuity of  ϕt for all t ∈ I, t0 ≤ t ≤ Tm, implies further that  ϕt(A, t0)is bounded for any  t ∈ I, t0 ≤ t ≤ Tm. Combined with Lemma 16, this yields the following bound

image

for all  z0 ∈ A, where CA ≥ δ and ˜C ≥ 1are positive constants. We fix  z0 ∈ A, consider the trajectory  z(t) := ϕt(z0, t0), t ∈ I, and apply the mean value theorem,

image

where  ξ(t)lies between z(t) and the origin. Due to the fact that the dynamics are assumed to have Lipschitz continuous derivatives (uniformly in t), we obtain the following bound

image

where ¯CAdenotes the Lipschitz constant of  ∂g/∂z on A(uniformly in t). According to (42), the trajectory |z(t)| is integrable (in continuous time) and absolutely summable (in discrete time). We obtain, by virtue of Lemma 15,

image

where  Czis constant. The constant  Czis related to an upper bound on the integral (in continuous time) or the sum (in discrete time) of  |z(t)| over t ∈ I, which according to (42), is guaranteed to be finite.

3.2 Example 1

The energy function, as defined in (11), is not explicitly dependent on time, even if the parameters d and βare chosen to be time-varying. Following the discussion of Section 2.6, we conclude that Afis contained in the region of attraction of the equilibrium at the origin, as long as

image

As a result, Proposition 12 implies that the convergence rate of the trajectories starting from  A ⊂Af, Acompact, is given by the linearization of (8) about the origin.

After a change of coordinates that diagonalizes  He, the linearized dynamics of a single coordinate are given by

image

image

Figure 4: The figure shows d(t) defined according to (49) with  d(0) = 1 and d∞ = 1/√10 as anexample. Compared to the time-invariant dynamics (obtained in the asymptotic limit), which converge with rate  d∞, the time-varying terms improve the convergence rate by the shaded area, as indicated with (47).

where  δq(t) ∈ R, δp(t) ∈ R, and hdenotes the corresponding eigenvalue of  He. As in Section 2.6, upper and lower bounds on the eigenvalues of  Heare given by  1/κ ≤ h ≤ 1. The analysis simplifies by expressing the dynamics using the coordinates  δ˜q, which are defined by

image

with ¯d(t) := d(t) + β(t)h/2, and yields

image

The dynamics (48) represent an undamped linear harmonic oscillator whose frequency varies with time. This time-varying frequency can be interpreted as the time-varying generalization of the square-root term in (24). The non-square-root term in (24) is captured by the coordinate change (47).

In the following, the choice

image

is discussed. This leads to a time-invariant right-hand side of (48), and simplifies the subsequent analysis. The more general case (no restriction on ¯d) can be analyzed numerically, or by approximating ¯dwith a rational function and applying tools from asymptotic analysis (White, 2010). The solutions of (49) monotonically approach  d∞ as tincreases, whereby the time-invariant case discussed in Section 2.6 is recovered with  d(0) = d∞ (d(0) = d∞ implies d(t) = d∞ for all t ≥ 0).The coordinate transformation (47) therefore implies that, compared to the time-invariant case, the convergence rate can be improved by choosing  d(0) > d∞, as this increases the area underneath the curve d(t). Figure 4 illustrates the situation.

This intuition can be made quantitative by noticing that the solutions of (49) are given by

image

where  Cdis constant and related to  d(t0). This implies

image

Due to the fact that (48) reduces to a time-invariant harmonic oscillator, the solutions to (46) can be expressed in closed form with elementary functions. The fundamental solutions to (46) are therefore given by1

image

The rate of the exponential decay behaves in the same way as the real part of the expression (24) (d∞corresponds to d in (24)) and as a result, the discussion for the time-invariant case applies here in the same way. This concludes that accelerated convergence occurs as long as d∞ ∼ 1/√κ. Compared to the time-invariant case, however, the convergence is improved by the factor  ρtv(t)/ρtv(t0). For large values of  d∞the difference is not substantial (at most a constant factor of size  limt→∞ ρtv(t)/ρtv(t0) = 2d∞/(d∞ + d(t0))can be gained). However, the improvement becomes crucial for small values of  d∞, since ρtv(t)/ρtv(t0)behaves as

image

for small t. Thus choosing  d∞ ∼ 1/√κ, ensures convergence roughly with  1/(1+d(t0)(t−t0)) forsmall t, and exponential convergence with rate  1/√κ for large t; the transition occurs for  (t−t0) ≈1/d∞. The situation is shown in Figure 5.

Summarizing, we therefore conclude according to Proposition 12 that choosing  d∞ ∼ 1/√κimplies that the trajectories of the time-varying generalization of (8) satisfy the estimate

image

for some constant  Ctv(that might strongly depend on  κ) and for all  (q(0), p(0)) ∈ A ⊂ Af, whereA is compact. For any  κ ≥ 1, the right-hand side can be bounded by

image

which implies that the convergence rate is at least O(1/t) for any fixed function f that has non-degenerate isolated local minima (the minima can have an arbitrarily small curvature). We believe that this provides useful insights for the case where the isolated minimum at the origin is degenerate. For example, given a function with an isolated minimum that is degenerate, we can add a regularization of the type  ϵ|x|2/2, where ϵ > 0is small. The bound (54) applies to the resulting regularized function, and implies a convergence rate of the order O(1/t).

image

Figure 5: The plot shows the convergence rate of the fundamental solutions (52) as a function of  κ

image

It is important to notice that the results in this section are stated in terms of the distance between the current iterate and the optimizer, i.e.,  |(q(t) − x∗, p(t))|. In the case of smooth and convex functions, this yield the following bound on the function value:

image

provided that the algorithm is initialized with p(0) = 0. Thus, our analysis recovers the well-known optimal rate  O(1/t2)(in terms of function value) for smooth and convex functions.

3.3 Example 2

We consider the time-varying generalization of (13), where the parameters  d > 0 and β ≥ 0are assumed to be time varying. We focus on the case where f is convex and  βkis chosen to be  βk = T(1 − 2dkT), as this simplifies the stability analysis and enables explicit closed-form evaluation of certain expressions. The stability analysis carried out in Appendix E applies likewise to the time-varying case; it suffices to replace  d with dk and β with βk. This concludes that the origin is asymptotically stable as long as  0 < dkT < 1, 0 < T ≤ 1/√L.

After a change of coordinates that diagonalizes  He, the linearized dynamics of a single coordinate are given by

image

where  δq(k) ∈ R, δp(k) ∈ R, and hdenotes the corresponding eigenvalue of  Hewith upper and lower bounds  1/κ ≤ h ≤ 1. In analogy to the transformation (47) in the continuous-time case, we define

image

which transforms (55) into

image

This transformation is well known in the literature on linear difference equations (see, for example, Elaydi, 2005, p. 369). As in the continuous-time case, the analysis is considerably simplified if  βkis chosen in such a way that the coefficient multiplying  δ˜q(k)in (57) remains constant. This can be achieved, for example, with

image

where  αtvis constant and independent of h. This equation defines a recurrence relation for  βk, andthrough  βk = T(1 − 2dkT)a relation for  dk. The choice (58) is motivated by the fact that 1) the recurrence relation is independent of h and 2) the difference equations (57) is time-invariant, which enables closed-form solutions. Thus, any trajectory  δq(k)satisfying (55) is a linear combination of the two fundamental solutions

image

This explicitly characterizes the convergence rate of (55) and, by virtue of Proposition 12, the convergence rate of (13), (for time-varying  dk and βk). Similar to the continuous-time discussion in Section 3.2, the convergence rate can be improved by choosing  dkto be close to one for small k. For the specific choice  T = 0.5, β0 = 0, d0 = 1, limk→∞ dk = d∞ = 1/√2κ, L = 1, thedependence of the convergence rate on  κis illustrated in Figure 6. As in the continuous-time case, the introduction of the time-varying parameter  dkensures that the fundamental solutions converge approximately with  1/(1 + d0Tk) for small k, whereas for large k the convergence is linear with rate  1 − T/√2κ. The transition occurs for  k ∼√2κ/T. This indicates that even for very large  κ,the fundamental solutions converge roughly with  1/(1 + d0Tk).

We have presented a convergence-rate analysis of momentum-based optimization algorithms from a dynamical systems point of view. Our analysis emphasizes the importance of the curvature properties about an isolated local minimum, which, up to a multiplicative constant, dictate the convergence rate. In addition, we find that reducing the damping over time improves the convergence rate by sublinear terms, which is important for objective functions that have minima with almost vanishing curvature. The use of momentum ensures robustness of the convergence rate against small changes in the curvature and leads, in many cases, to acceleration.

We also provided a rigorous motivation for the use of symplectic discretization schemes, showing that they enable the computation of a modified energy function that is (almost exactly) preserved by the conservative parts of the dynamics. The modified energy function provides a means for stability analysis, which implies, for example, that certain heavy-ball-type methods are in fact accelerated. Thus, an evaluation of the gradient at a shifted position is not necessary for achieving convergence rates that scale with  1/√κ for large κ.

image

Figure 6: The plot shows the convergence rate of the fundamental solutions (59) as a function of  κ.For small values of k, the convergence rate is roughly  1/(1 + d0Tk), whereas for larger values of k the convergence is linear with rate  1 − T/√2κ.

Our analysis emphasizes fundamental similarities between convergence rate analysis in continuous time and discrete time, and provides intuitive and rigorous explanations for various phenomena encountered in optimization, without resorting to convexity.

We thank the Branco Weiss Fellowship, administered by ETH Zurich, for the generous support and the Office of Naval Research under grant number N00014-18-1-2764. This work was also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)–MU 4710/2-1.

For proving the proposition we rely on the following lemmas.

Lemma 13 Consider the trajectories of a time-varying dynamical system

image

with  A ∈ R2n×2n and B : I → R2n×2n, where Bis either continuous (in the continuous-time case) or A + B(t) is invertible (in the discrete-time case). Let A be such that the trajectories w+(t) = Aw(t)converge exponentially with rate  α, i.e., |w(t)| ≤ Cw|w(0)| exp(−αt), for allw(0) ∈ R2n, t ∈ I, and for some constant  Cw ≥ 1. Then, x(t)satisfies the estimate

image

Proof We consider the continuous-time case first. Due to the continuity assumptions on B(t), the dynamics are guaranteed to have a unique solution, x(t), satisfying

image

where exp denotes the matrix exponential. The two-norm of  exp(A(t−τ))is, by assumption, upper bounded by  Cw exp(−α(t − τ)), for all t, τ ∈ I, t ≥ τ, yielding the following estimate:

image

Applying the Gr¨onwall inequality to (64) yields the desired result. The discrete-time case is analogous. Due to the assumption of A + B(t) being full rank, the dynamics are guaranteed to have a unique solution x(t), which satisfies

image

see, for example, Agarwal (2000, p. 59). By assumption, the two-norm of  At−τ is bounded by Cw exp(−α(t − τ)), for all t, τ ∈ I, t ≥ τ. As a result, the following estimate is obtained:

image

Applying the Gr¨onwall inequality (Agarwal, 2000, p. 186), yields

image

Lemma 14 Let Assumption 4 be fulfilled. Then there exists an open ball  Bδ(with radius  δ > 0)centered at the origin and a constant ˜C ≥ 1, such that for all  z0 ∈ Bδ,

image

Proof We consider the continuous-time case first. Assumption 4 implies that the dynamics (4) and (5) are (asymptotically) stable. This implies that for any  ε > 0there exists a constant  δ(ε) > 0such that  |ϕt(z0)| < ε for all z0 ∈ R2n with |z0| < δ(ε). For any ε > 0, we consider the trajectory starting at  z0 ∈ R2n, with |z0| < δ(ε)and reformulate the dynamics by applying the mean value theorem:

image

where  ξ(t) ∈ R2n lies between the origin and z(t). The continuity assumption on the dynamics leads to the following estimate:

image

for all  t ∈ I and where Cgis a Lipschitz constant of  ∂g/∂zin a neighborhood of the origin. Applying Lemma 13 then yields

image

We fix  ε > 0 such that ClCgε < α/2. This leads, in turn, to a refinement of the estimate (72), ��� ∂g∂z

image

Applying Lemma 13 once more therefore implies

image

for all  z0 ∈ R2n with |z0| < δ(ε), which leads to the desired result. The same arguments (with slightly modified constants) apply to the discrete-time case.

For proving the proposition we rely on the following lemmas.

Lemma 15 Consider the trajectories of a time-varying dynamical system

image

with  A : I → R2n×2n and B : I → R2n×2n, where A and Bare either continuous (in the continuous-time case) or A(t) + B(t) is invertible for all  t ∈ I(in the discrete-time case). Let φ(t, τ) ∈ R2n×2n, t, τ ∈ I, t ≥ τ, be defined by  φ(t, τ)+ = A(t)φ(t, τ)(for a fixed  τ) andφ(τ, τ) = I, and assume  φ(t, τ) satisfies

image

where  Cφ ≥ 1is constant,  ρ : I → R≥0is continuous, monotonically decreasing, and  limt→∞ ρ(t) =0. Then, x(t) satisfies the estimate

image

Proof We consider the continuous-time case first. Due to the continuity assumptions on A(t) and B(t), the dynamics are guaranteed to have a unique solution x(t), satisfying

image

The bound on the fundamental solution matrix  φ(t, τ)yields the following estimate:

image

Applying the Gr¨onwall inequality to (81) yields the desired result. The discrete-time case is analogous. Due to the assumption of A(t) + B(t) being full rank, the dynamics are guaranteed to have a unique solution x(t), which satisfies

image

see, for example, (Agarwal, 2000, p. 59). As a result, the bound on  φ(t, τ)leads to the following estimate

image

Applying the Gr¨onwall inequality (Agarwal, 2000, p. 186), yields

image

where the fact that  1 + x ≤ ex, for all x ≥ 0has been used.

Lemma 16 Let Assumption 6 be fulfilled. Then there exists an open ball  Bδ(with radius  δ > 0)centered at the origin and a constant ˜C ≥ 1, such that for all  z0 ∈ Bδ, and all τ ∈ I,

image

Proof We consider the continuous-time case first and fix the initial time  τ ∈ I. Assumption 6 implies that the dynamics (37) are (asymptotically) stable, uniformly in  τ. This implies that for any ε > 0there exists a constant  δ(ε) > 0(independent of  τ) such that  |ϕt(z0, τ)| < ε for all t ≥ τ andfor all  z0 ∈ R2n with |z0| < δ(ε). For any ε > 0, we consider the trajectory starting at  z0 ∈ R2n attime  τ ∈ I, with |z0| < δ(ε)and reformulate the dynamics by applying the mean value theorem:

image

where  ξ(t) ∈ R2n lies between the origin and z(t). The assumptions on the dynamics, leads to the following estimate:

image

for all  t ∈ I, t ≥ τ, and where  Cgis a time-independent Lipschitz constant of  ∂g/∂zin a neighborhood of the origin.

By assumption, the linear dynamics decay at least with  exp(−α(t − τ)).Thus, applying Lemma 15 yields

image

We fix  ε > 0 such that ClCgε < α/2. This leads, in turn, to a refinement of the estimate (88),

image

Applying Lemma 15 once more therefore implies

image

for all  z0 ∈ R2n with |z0| < δ(ε), which leads to the desired result. The same arguments (with slightly modified constants) apply to the discrete-time case.

We prove the following generalization of Proposition 9:

Proposition 17 Let f be analytic on  Bcr, the closed ball of radius r about the origin, and let  LHbe a Lipschitz constant of  ∇H, i.e., |∇H(z)| ≤ LH|z| for all z ∈ Bcr × Bcr. Then there exists a perturbed Hamiltonian ˜H : Bcr × Bcr → R, whose trajectories

image

are such that

image

for all  0 < T ≤ T0/3 and all z0 such that

image

where  T0 and CEare constants given by

image

The perturbed Hamiltonian has the form

image

where ˜H and F : Bcr × Bcr → R satisfy

image

with  CF := 356L2H. Moreover,  ˜Hhas the same critical points as H and satisfies

image

Finally, at time T, the difference of the perturbed energy is bounded by

image

where

image

Proof We will complexify the position and momentum variables; i.e., we take  q ∈ Cn, p ∈ Cn. Dueto the fact that f is analytic, it can be interpreted as a mapping from  Cn to C; a similar reasoning applies to  H. Let Bcρ ⊂ C2nbe the closed ball of radius  ρ > 0, centered about the origin, and let the norm  || · ||ρbe defined as

image

where  ˜z(t) := (˜q(t), ˜p(t))is a trajectory satisfying (93), and  fiare analytic functions, i = 1, 2, . . . . Performing a Taylor expansion of  ˜z(t) about 0and evaluating at t = T yields, after rearranging terms,

image

where the Lie derivative  Di : C∞ → C∞ is defined by

image

where  C∞ denotes the set of infinitely differentiable functions mapping from  C2n to C2n (see alsoHairer et al., 2002, Ch. IX). The notation on the left-hand side of the previous equation should be read in the following order: The operator  Diis applied to  ¯g, yielding the function  Di¯g, which isthen evaluated at z. Applying the operator  Dito the function  Di¯gis denoted by  D2i ¯g.

We require  ˜z(T) = (qk+1, pk+1), which leads due to (26) to

image

Equating equal powers of T in (103) and (105) yields the following recursive scheme for computing the functions  fi:

image

where the last sum ranges over all strictly positive integers  k1, . . . , kithat sum up to j. We will construct the perturbed Hamiltonian ˜Hby appropriately truncating the series � fi(z)T i, and willshow that the resulting truncated series has the desired properties.

We start by explicitly computing the function  f2:

image

and note that it is of the form  A2(z)f1(z), where A2(z) ∈ C2n×2n. We further note that  f2(z) is infact a Hamiltonian vector field with corresponding Hamiltonian  −pT∇f(q)/2. From an induction argument relying on (107) and the definition of the Lie derivative, it follows that each  fj can bewritten as  fj(z) = Aj(z)f1(z), where A1(z)is the identity,  A2(z)is defined in (108), and

image

Moreover, as shown in Hairer et al. (2002, p. 295,Theorem 3.2), the vector fields  fjare guaranteed to be Hamiltonian for all  j ≥ 1.

image

integral formula. More precisely, Cauchy’s integral formula provides us with the following bound on  ∂¯g/∂zfor any function  ¯ganalytic in  Bcr:

image

for any two constants  ρ, δ such that 0 ≤ δ < ρ ≤ r. A similar argument yields the following bound on the Lie-derivative of a function  ¯ganalytic in  Bcr (Hairer et al., 2002, p. 308),

image

where  0 ≤ σ < ρ ≤ r. In the following these two inequalities are used to bound the right-hand side of (109). Claim: The function  Ajis bounded above by

image

for all  j ≥ 1.

Proof of the claim: Explicit calculations show that the claim holds for j = 1. We thus fix the index J > 1 and estimate  ||Aj||r−(j−1)δ for all 1 ≤ j ≤ J, where δ = r/(2(J −1))is fixed, which, for j = J, yields the desired bound on  AJ. To simplify notation we denote  || · ||r−(j−1)δ as || · ||j.The right-hand side of (109) is upper bounded by

image

Due to the fact that  k1 + · · · + ki = j and k1 > 0, . . . ki > 0, it follows  ki ≤ j − (i − 1), whichimplies  ||¯g||j−(i−1) ≤ ||¯g||ki for all i ≤ jand for any function  ¯ganalytic on  Bcr. Therefore, the above bound reduces to

image

By introducing the constant ˆδ := δ/r, which, by definition of  δ satisfies ˆδ ≤ 1/2, we can reformulate the above bound as

image

By exploiting the structure of  A2(z)it can be verified that

image

which will simplify the following development. Following Hairer et al. (2002, p. 309), we introduce the constants  bj for all j ≥ 1in the following way

image

It can be verified with an induction argument taking (115) and (116) into account, that

image

The variables  bjare well-defined for any  j ≥ 1and not just  1 ≤ j ≤ J. In order to bound the constants  bj, we formally introduce the generating function  b(ζ), b : C → C, ζ ∈ C, and note that

image

where the order of summation has been interchanged to arrive at (118). It can be verified by means of the implicit function theorem that, given any  v ∈ C, the equation  2b − exp(b) + 1 = v can beuniquely solved for b as long as  exp(b) ̸= 2. This implies that  b(ζ)is well-defined and analytic for |b1ζ| ≤ 2ln(2) − 1, or equivalently,  |ζ| ≤ (2ln(2) − 1)/b1. Furthermore, explicit calculations show that  |b(ζ)|is bounded by ln(2) for |ζ| ≤ (2ln(2) − 1)/b1(see Hairer et al., 2002, p. 310). Cauchy’s inequality therefore implies

image

Evaluating the above bound for j = J yields

image

where N is chosen to be the largest integer such that  NT ≤ T0, with T0 := (2ln(2) − 1)/(2eLH).The choice for the integer N is motivated by the following observation: Due to the bound on ||Aj||r/2the terms in the above sum behave as  (Tj/(eT0))j, which attains a minimum for  j ≈T0/T. The fact that all  fj, j ≥ 1, are Hamiltonian vector fields proves (96).

image

which proves the first bound in (97). The second bound is proved analogously; that is,

image

The fact that ˜Hhas the same critical points is derived from the following observation. First, all critical points of H are necessarily critical points of ˜H, which follows from (124) with  ∇H(z) =0, and second,  ∇ ˜H(ˆz) = 0implies, according to (124), that  |∇H(ˆz)| ≤ 15LHT|∇H(ˆz)| ≤0.36|∇H(ˆz)| (for T ≤ T0/3), concluding  ∇H(ˆz) = 0.

The first bound in (98) is derived by noticing that

image

for any path  γ : [0, 1] → C2n that connects the origin with  z; i.e., γ(0) = 0, γ(1) = z. We choose γ(t) such that ∇H(γ(t))T ˙γ(t) = |∇H(γ(t))||˙γ(t)|. This can be done by considering the gradient flow  ˜γ(t),

image

which is guaranteed to converge to the origin for any  z ∈ Af ∩(Bcr/2 ×Bcr/2). As a result, choosing

image

The same argument (with the same path) applies to  |∇F(z)|, which yields the second bound in (98).

image

(the choice becomes evident), and as above, N to be the largest integer such that  NT ≤ T0. Forsmall enough  ζ ∈ C, we denote the map from  ˜z(0) = z0 to ˜z(t = ζ), where ˜z(t) satisfies ˙˜z(t) =�Nj=1 fj(˜z(t))ζj, by ˜ϕζ : C2n → C2n. The symplectic Euler step  Φζ(z0) and ˜ϕζ(z0) are bothanalytic in  ζand by the above construction, their Taylor expansion in  ζ about ζ = 0agrees for the first  N terms (z0is fixed). Thus, the function  ˜g(ζ)/ζN+1, where ˜g(ζ) := Φζ(z0) − ˜ϕζ(z0) isanalytic. Applying the maximum modulus principle implies

image

where  ϵis chosen as  ϵ = eT0/N(again, the choice becomes evident subsequently). It remains to bound  |˜g(ζ)| for |ζ| = ϵ, which will be done by estimating  |Φζ(z0) − z0| and | ˜ϕζ(z0) − z0|separately. For the first term we obtain (by definition of (105))

image

where  N ≥ 3has been used for the last inequality. For bounding the second term we first derive a bound on  ∇ ˜H(z), for all z ∈ Bcr/2, similar to (124) (unlike (124) we have  ϵ = eT0/N). We obtain

image

for all  z ∈ Bcr/2. The gradient of  ∇ ˜Hhas therefore a Lipschitz constant of less than 2.8LH forz ∈ Bcr/2, which implies that

image

as long as the corresponding trajectory stays within  Bcr/2. Note that the last inequality stems from the fact that  T0 < T(N + 1) and thus ϵ < eT(N + 1)/N ≤ 4eT/3. Due to the condition (132), the trajectory  ˜ϕϵ(z0)necessarily stays within  Bcr/2. Moreover, from  ϵ ≤ eT0/3 (N ≥ 3 sinceT ≤ T0/3), it can be inferred that

image

which implies that  | ˜ϕϵ(z0) − z0| ≤ 3.36ϵLH|z0|. Combined with (134), this leads to  |˜g(ζ)| ≤ ϵ ˜Cgfor  |ζ| = ϵ and ˜Cg := (4.36 + eT0/3)LH|z0|, which, according to (133), implies

image

The fact that  N ≤ T0/T < N + 1has been used to obtain the last inequality. The above result justifies the choice  ϵ = eT0/N.

We derive the bound on ˜H( ˜ϕT (z0)) − ˜H(ΦT (z0))in a similar way. Consider the analytic function  ˆg(ζ) = ˜H( ˜ϕζ(z0)) − ˜H(Φζ(z0)), where z0 ∈ Bcr/2, T, and Nare fixed. The modified Hamiltonian ˜His dependent on  ζand has the form

image

where  H1 = H, and Ω∇Hj(z) = fj(z) for all j with 1 ≤ j ≤ N. The function  ˆg(ζ)/(ζN+1) isanalytic, as can be seen, for example, by a Taylor expansion of ˜H in z about z = Φζ(z0), where˜ϕζ(z0) and Φζ(z0)agree up to Nth order. We invoke the maximum principle, similar to (133), which implies,

image

and where  ϵis again set to  ϵ = eT0/N. A bound for  ˆg(ζ)is obtained in the following way (applying Taylor’s theorem):

image

where  ηlies between  Φζ(z0) and z0. The gradient of ˜His necessarily bounded by  |∇ ˜H(z)| ≤2.8LH|z|, which therefore implies that the Hessian of ˜Hin the expression is bounded by  2.8LHprovided that  η is in Bcr/2. The bound (134) implies that this is the case for

image

where  ϵ ≤ eT0/3 and ϵ < 4eT/3have been used. Combining (142) with the bound from (134) yields

image

Applying the same chain of arguments as in (138) allows us to conclude:

image

Proof We choose  r > 0 such that A ⊂ Bcr×Bcr, where Bcr is the closed n-dimensional ball of radius r centered at the origin. We use the Stone-Weierstrass theorem (Rudin, 1976, p. 159) to approximate

image

where  ϵ > 0will be chosen subsequently. We integrate  d2 ˜f/dx2 and impose ∇ ˜f(0) = 0, whichyields the function  ∇ ˜f on Bc2r that is ϵ-close to ∇f. We integrate once more and impose  ˜f(0) = 0.Moreover, by choosing  ϵsmall enough, the origin is guaranteed to be the only critical point of H(z) (with f replaced with ˜f) for all z ∈ A, since A ⊂ Af and A ⊂ Bcr × Bcr. Due to the fact that  ˜f isa polynomial and therefore analytic, we can invoke Proposition 9 with f replaced by ˜f. Next, wewill show that the origin is an asymptotically stable equilibrium of the dynamics (13), where f is replaced by ˜f, with region of attraction at least A.

image

in the following way:

image

where by assumption  Λ(qk, pk)is symmetric and positive definite. Due to the assumption on the dissipative forces, the singular values of  Λ(qk, pk)are upper and lower bounded by  d2 and d1,respectively. Provided that the time step T is small enough,  T ≤ Tmax, we conclude from Proposition 9 that there exists a modified energy function ˜Hthat has the same critical points as H, and such that the bound (27) holds for all  z0 ∈ Bcr × Bcr. Morse theory implies that  ˜H(z)is locally positive definite and has compact level sets for  z ∈ A. The same applies therefore to the function

image

provided that  T ≤ Tmax, where Tmaxis chosen to be small enough (independently of  fd; this can be done since the positive definiteness of ˜His not affected by the dissipative forces  fd).

image

simple structure of  Φd,T. In order to simplify the notation we hide all constants that may depend on d2/d1, an upper bound on  d2, LH, or higher orders of T with the notation  O(·). It will be shown that  V (qk, pk)necessarily decays along the step  Φd,T ◦ΦT. To that extent, we define  z2 = Φd,T (z1)and  z1 = ΦT (z0)and consider  V (z2) − V (z0), which, due to the fact that  q2 = q1, is bounded by

image

where the arguments of  Λhave been dropped and the results from Proposition 9 have been used, which enable the following bound:

image

Due to Proposition 9, it follows that ˜H(ΦT (z0)) − ˜H(z0)is bounded by a term of the order |∇H(z0)|2Te−T0/T . This yields

V (q2, p2) ≤ ˜H(q0, p0) − TpT1 Λp1 + Td12 ∇ ˜f(q1)Tp1 + |∇H(z0)|2O(Te−T0/T )+  O(d1T 2)|p1|2 + O(d1T 2)|∇ ˜f(q1)||p1|

image

Due to the continuity of  ∇ ˜fwe can bound  |∇ ˜f(q1) − ∇ ˜f(q0)| by T(L + ϵ)|p1|, where L is theLipschitz constant of  ∇f. Moreover,  LHis an upper bound on L. This implies

image

provided that T is chosen small enough such that  −pT1 Λp1T/2dominates the  O(d1T 2)|p1|2 termsand  −T 2d1|∇ ˜f(q0)|2/4dominates the  O(d1T 3)|∇ ˜f(q0)|2 terms. Expanding the term  −TpT1 Λp1/2and applying Young’s inequality,

image

CYd1T 2pT0 ∇ ˜f(q0) ≤

image

where  CY > 0is an arbitrary constant, results in

image

where for the last inequality  T ≤ 1has been used. Note that the term  Te−T0/T decays quickly for small  T. Thus for d1 ≥ O(e−T0/T /T)the function V strictly decreases along the trajectories of Φd,T ◦ ΦT. The lower bound on  d1vanishes exponentially for small T. This concludes that the origin is asymptotically stable under the dynamics (13) (with ˜finstead of f) provided that T is chosen small enough, i.e.  T ≤ Tmax, where up to the exponential terms in  d1, Tmaxonly depends on the ratio  d2/d1, and LH. Due to the fact that the first-order approximations of the dynamics resulting from  f and ˜fabout the equilibrium are  ϵ-close (by construction of ˜f), the origin is likewise asymptotically stable for the dynamics resulting from f, and its region of attraction contains at least a small neighborhood of the origin. Consequently, due to the continuity of V , we may choose  ϵsmall enough such that  V (˜q2, ˜p2) − V (q0, p0) < 0 for all (q0, p0)outside a neighborhood of the origin, where  ˜q2, ˜p2denotes the trajectory resulting from (13) (with f instead of ˜f). This shows that A is likewise contained in the region of attraction of the origin under the dynamics (13) with f instead of ˜f.

By using the change of variables  (qk, pk) → (ˆqk, ˆpk),

image

the discrete-time algorithm (13) can be reformulated as

image

where  τ := β/(1 − 2dT), and

image

The change of variables is motivated by the fact that the convexity and smoothness of the objective function implies  f(ˆqk+1) ≤ f(yk),

image

where L is the Lipschitz constant of the gradient of f. In the following it will be shown that the energy H decreases along  ˆqk and ˆpkprovided that certain conditions on the parameters  T, d, and βare met. We evaluate  H(ˆqk+1, ˆpk+1),

image

Due to the fact that f is convex it holds that

image

and as a result

image

The right-hand side of the above expression is guaranteed to decrease provided that the following matrix

image

is positive definite. Thus, the minimum of f is an asymptotically stable equilibrium of the dynamics (13) provided that (160) is positive semidefinite. In particular, choosing  T = τ, i.e., β = T(1 −2dT), 0 < T ≤ 1/√L, 0 < dT < 1, ensures asymptotic stability of the minimum. Moreover, for any fixed  β > 0the above matrix can be made negative definite by choosing T and d sufficiently small.

This section discusses the implications of Assumption 1 compared to Assumption 2. We therefore consider any function f that satisfies Assumption 1 and construct a sequence of functions  fj, wherej > 0 is an integer, in the following way: Let  fj : Rn → Rbe such that  fj(0) = 0, ∇fj(0) = 0,

image

where  sj : Rn → R≥0is an infinitely differentiable function that has support on  Bc1/(2j) and satisfies �Rn sj(¯x)d¯x = 1. As a consequence, it holds that

image

for all  x ∈ Rn and j > 0 (assuming ¯Cf > Cf). In other words,  ∇fjconverges uniformly to  ∇f; thesame holds for  fj → fon any compact subset of  Rn. As a result of (161), any essential upper or lower bound on the curvature of f are translated to the functions  fj. This implies, for example, that for large enough j the origin is an isolated non-degenerate critical point of  fj. In addition, each  fj,j > 0, is infinitely differentiable and satisfies Assumption 2. We therefore consider the dynamical system (37), where f is replaced with  fj:

image

where  zj(t) := (qj(t), pj(t)). Compared to (37), the dependence on  ∇fjis made explicit. Let z(t) be the trajectory satisfying (162), where  fjis replaced with f. In the continuous-time case, (162) implies

image

where  Cg1 > 0 and Cg2 > 0are Lipschitz constants related to  gs. By virtue of the Gr¨onwall inequality, (163) readily implies  zj(t) → z(t)pointwise for any  t ∈ I. A similar argument applies to the discrete-time case.

We show next that, in case the equilibrium at the origin is uniformly stable (uniformity with respect to  j and t0), the convergence  zj(t) → z(t)is in fact uniform in  t ∈ I. As a consequence, the results from Proposition 12 and similarly Proposition 3 generalize to the case where f satisfies merely Assumption 1 instead of Assumption 2, as shown below.

Proposition 18 Let the origin be a stable equilibrium for (162), uniformly in  j and t0 (for j > 0sufficiently large). If z(t) converges to the origin for  t → ∞, zj(t)converges to z(t), uniformly in t ∈ I.

Proof We pick any  ϵ > 0and show that there exists an integer N, independent of t, such that |zj(t) − z(t)| < ϵ for all j > N and all t ≥ t0. Due to the uniform stability of the origin, there exists a  δ > 0(independent of  t0 ∈ I and j) such that  |zj(t0)| < δ implies |zj(t)| < ϵ/2 for allt ≥ t0 and j > 0sufficiently large. The trajectory z(t) converges to the origin. Hence, there exists a finite time  T ∈ I such that |z(T)| < δ/2. Applying the Gr¨onwall inequality to (163) yields

image

a similar conclusion holds in the discrete-time case. Thus, choosing j large enough guarantees that |zj(t) − z(t)| < δ/2 ≤ ϵ/2 for all t ∈ I, t0 ≤ t ≤ T. Therefore  |zj(T)| < δ, which readily implies |zj(t)| < ϵ/2 for all t ≥ T. Combined with  |z(t)| < ϵ/2 for all t ≥ T, this yields  |zj(t) − z(t)| < ϵfor all  t ≥ t0.

Verifying that the origin is stable (uniformly in j) is typically straightforward. In order to illustrate the ideas we consider the discussion of Section 2.6. The total energy  Hj (where f isreplaced with  fj) is well-defined for any j > 0. For large enough j, the curvature of  fj is alocalized average of the curvature of f. The origin is a non-degenerate isolated critical point of f, and therefore the same applies to  fjfor large enough j. This concludes that in a neighborhood of the origin the total energy  Hjis upper and lower bounded:

image

for all j > 0 sufficiently large. These upper and lower bounds are independent of j, which, combined with the fact that  Hjis non-increasing along trajectories, readily implies stability of the origin (uniformly in  j and t0) (see, for example, Sastry, 1999, p. 189). The same argument applies in the discrete-time case, where, by assumption,  f and fjare locally strongly convex. Thus, uniform stability follows from Appendix D.

The fact that the convergence  zj → zis uniform in t implies that the convergence estimates form Proposition 12 and Proposition 3 apply for  j → ∞. More precisely, we have

Proposition 19 Let the origin be a stable equilibrium for (38) uniformly in  j and t0 (for j > 0sufficiently large). Let  A ⊂ R2n be compact and such that  z0 ∈ A implies z(t) → 0 for t → ∞.Then, provided that for sufficiently large j > 0 the trajectories  zj(t)satisfy the estimates

image

where ˆCj > 0 and α > 0are constant, and  ρ : R≥0 → R>0is continuous and monotonically decreasing, there exists a constant ˆC such that z(t)satisfies the estimate

image

for any constant  ¯α < α.

Proof For any  ϵ > 0, (165) implies that for any j > 0 sufficiently large,

image

Due to the fact that  zjconverges uniformly to z, it follows that

image

which yields the desired result.

Ravi P. Agarwal. Difference Equations and Inequalities. Marcel Dekker, Inc., second edition, 2000.

Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv:1407.1537 [cs.DS], 2014.

Vladimir I. Arnol’d. Ordinary Differential Equations. Springer, third edition, 1992.

Hedy Attouch, Zaki Chbani, Juan Peypouquet, and Patrick Redont. Fast convergence of inertial dy- namics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, Series B, 168(1-2):123–175, 2018.

Richard Bellman. Stability Theory of Differential Equations. Dover, 1969.

Michael Betancourt, Michael I. Jordan, and Ashia C. Wilson. On symplectic optimization. arXiv:1802.03653v2, pages 1–20, 2018.

S´ebastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov’s accelerated gradient descent. arXiv:1506.08187 [math.OC], 2015.

S´ebastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Near-optimal method for highly smooth convex optimization. Proceedings of Machine Learning Research, 99:492– 507, 2019.

Frank E. Curtis, Daniel P. Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of  O(ϵ−3/2)for nonconvex optimization. Mathematical Programming, Series A, 162(1-2):1–32, 2017.

Jelena Diakonikolas and Lorenzo Orecchia. The approximate duality gap technique: A unified theory of first-order methods. SIAM Journal on Optimization, 29(1):660–689, 2018.

Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnab´as P´oczos, and Aarti Singh. Gradi- ent descent can take exponential time to escape saddle points. Advances in Neural Information Processing Systems 30, pages 1067–1077, 2017.

Hans-Bernd D¨urr and Christian Ebenbauer. On a class of smooth optimization algorithms with applications in control. Proceedings of the IFAC Nonlinear Model Predictive Control Conference, pages 291–298, 2012.

Saber Elaydi. An Introduction to Difference Equations. Springer, third edition, 2005.

Guilherme Franc¸a, Jeremias Sulam, Daniel Robinso, and Ren´e Vidal. Conformal symplectic and relativistic optimization. arXiv:1903.04100 [math.OC], pages 1–27, 2019.

S´ebastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. Electronic Journal of Statistics, 12(1):461–529, 2018.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. Proceedings of Machine Learning Research, 40:797–842, 2015.

Ernst Hairer, Gerhard Wanner, and Christian Lubich. Geometric Numerical Integration. Springer, 2002.

Jack K. Hale. Asymptotic Behavior of Dissipative Systems. American Mathematical Society, 1988.

Alain Haraux. Syst`emes Dynamiques Dissipatifs et Applications. Masson, 1991.

Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv:1711.10456, pages 1–43, 2017.

Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan. On nonconvex op- timization for machine learning: Gradients, stochasticity, and saddle points. arXiv:1902.04811, pages 1–31, 2019.

Walid Krichene, Alexandre M. Bayen, and Peter L. Bartlett. Accelerated mirror descent in continu- ous and discrete time. Advances in Neural Information Processing Systems 28, pages 2845–2853, 2015.

Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. Proceedings of Machine Learning Research, 49(1):1246–1257, 2016.

Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algo- rithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.

Chris J. Maddison, Daniel Paulin, Yee Whye Teh, and Brendan O’Donoghue. Hamiltonian descent methods. arXiv:1809.05042 [math.OC], pages 1–72, 2018.

Simon Michalowsky, Carsten Scherer, and Christian Ebenbauer. Robust and structure exploiting optimization algorithms: An integral quadratic constraint approach. arXiv:1905.00279, pages 1–30, 2019.

J. Milnor. Morse Theory. Princeton University Press, 1963.

Michael Muehlebach and Michael I. Jordan. A dynamical systems perspective on Nesterov accel- eration. Proceedings of Machine Learning Research, 97:4656–4662, 2019.

Yuri E. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k2).Soviet Mathematics Doklady, 27(2):372–376, 1983.

Yuri E. Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global per- formance. Mathematical Programming, Series A, 108:177–205, 2006.

Yurii Nesterov. Introductory Lectures on Convex Optimization - A Basic Course. Springer Science+Business Media, LLC, 2004.

Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Boris T. Polyak. Introduction to Optimization. Optimization Software, Inc., 1987.

Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, third edition, 1976.

J. M. Sanz-Serna. Symplectic integrators for Hamiltonian problems: an overview. Acta Numerica, pages 243–286, 1992.

Shankar Sastry. Nonlinear Systems. Springer, 1999.

Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d’Aspremont. Integration methods and optimization algorithms. Advances in Neural Information Processing Systems 30, pages 1109–1118, 2017.

Weijie Su, Stephen Boyd, and Emmanuel J. Cand`es. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17 (153):1–43, 2016.

Roscoe B White. Asymptotic Analysis of Differential Equations. Imperial College Press, 2010.

Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351– E7358, 2016.


Designed for Accessibility and to further Open Science