Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives

2020·arXiv

Abstract

1. Introduction

Optimization problems lie at the heart of many machine-learning formulations. As a result, a better understanding of optimization algorithms, combined with implementations that target distinctive properties of machine-learning problems, have contributed significantly to the recent progress in the field.

One of the most popular methods for large-scale optimization is the (stochastic) gradient method, due to its simplicity, wide applicability, and efficiency. However, in the deterministic setting, where gradients are evaluated exactly, it has been shown that better convergence rates can often be achieved by leveraging two successive gradients (Nesterov, 1983). Based on analogies to mechanical systems these methods are referred to momentum-based optimization algorithms, and can be viewed as particular discretizations of continuous-time harmonic oscillators, (Polyak, 1964). Yet, even for the class of strongly convex functions, most proofs that establish the superior convergence are algebraic and provide little qualitative understanding. As a consequence, there is little guide to the generality or robustness of the acceleration phenomenon across instances of optimization problems.

This article characterizes the convergence rate of momentum-based optimization algorithms by taking fundamental topological properties into account. In many cases our analysis leads to closed-

form expressions that relate the convergence rate to the different algorithm parameters, such as the step size and the damping. The analysis provides insight into the design of algorithms that, for example, require little tuning when applied to large-scale and ill-conditioned optimization problems. For simplicity of notation, we focus on the Euclidean setting, but we note that the scope of our analysis is not limited to Euclidean problems.

We will derive convergence rates of the form

where is a distance measure between the iterate x(t) and the isolated local minimum t refers to time or the iteration number, is a constant which does not depend on monotonically decreasing, satisfies , and characterizes the convergence rate. An important aspect of the analysis is to characterize how the convergence rate is affected by the shape of the objective function about an isolated local minimum. The local shape is summarized by the (condition) number , which is defined as the ratio between the maximum and minimum curvature about a local minimum. We call an isolated minimum degenerate if the minimum curvature vanishes in at least one direction. The main results and insights, which will be rigorously derived in the remainder of the article, are summarized as follows:

• The convergence rate, characterized by in (1), is uniquely determined by the local shape of the objective function about an isolated minimum. The global shape of the objective function determines the constant and the set of initial conditions x(0) for which (1) holds. In other words, the global shape (for example described by convexity) determines the stability and region of attraction of a local minimum, whereas the local shape, i.e., the curvature at an isolated minimum, determines the convergence rate.

• An algorithm is called accelerated if the convergence rate scales favorably for ill-conditioned optimization problems, meaning that the convergence rate scales favorably for large celeration is therefore a statement about the robustness of the convergence rate with respect to changes in the curvature, which is well defined for continuous-time as well as for discrete-time formulations.

• Accelerated convergence is generic to momentum-based optimization algorithms, provided that the damping scales with . Neither the evaluation of the gradient at a shifted position, nor a specifically engineered damping parameter, as for example proposed in Nesterov (2004, Sec. 2.2), are necessary.

• From a physics perspective, a momentum-based optimization algorithm can be thought of as a mass-spring-damper system, where the spring potential is given by the objective function. The algorithm design specifies the damping. The fact that the system has inertia (due to the second-order dynamics) implies that the convergence rate is robust to small changes in the spring, i.e., robust to small changes in the curvature of the objective function. The inertia gives the system the tendency to keep its velocity, which, provided that the damping is chosen appropriately, implies that small changes in the spring will not slow down convergence. This captures the mechanism behind accelerated convergence.

• The convergence rate typically becomes arbitrarily small for large (if the dynamics are time-varying this statement applies for ). The underlying dynamics are therefore close to

conservative, which means that great care is required for the discretization. A discretization that introduces an artificial energy drift, such as the explicit forward Euler discretization, for example, might even lead to instability, which clearly makes a favorable scaling of the convergence rate with impossible. This motivates the use of symplectic discretization schemes, which preserve the Hamiltonian structure of the underlying conservative part of the dynamics.

• By introducing time-varying dynamics, which are obtained by adjusting the damping parameters with the number of iterations, the linear convergence rate is improved with an additional sublinearly converging term. If the local minimum is close to degenerate, the convergence is dominated by this term, as the rate of the linear convergence becomes arbitrarily small. In that way, the analysis relates the non-degenerate and degenerate cases.

Related work: The phenomenon of acceleration, which fundamentally motivates the use of momentum, has puzzled many researchers for almost 40 years. We do not attempt to give a full overview of the literature, but highlight some of the most recent work. Important contributions were made by Bubeck et al. (2015), who proposed an accelerated algorithm that has a geometric interpretation, by Allen-Zhu and Orecchia (2014), who show that coupling of gradient and mirror descent can lead to acceleration, and Lessard et al. (2016), who propose a general control-theoretic analysis framework. The framework has subsequently been extended and refined, for example by Michalowsky et al. (2019), who analyzed and quantified robustness and convergence trade-offs. Other work includes Diakonikolas and Orecchia (2018), who unify the analysis of first-order methods by imposing certain decay conditions, and Scieur et al. (2017), who interpret the accelerated gradient method as a multi-step discretization of gradient flow.

The work of Su et al. (2016) and Krichene et al. (2015) showed that the trajectories of the accelerated gradient method approach the solutions of a certain second-order ordinary differential equation. The resulting differential equation was analyzed in further detail in Attouch et al. (2018) and placed within a variational framework by Wibisono et al. (2016). This motivated further research on structure-preserving integration schemes for discretizing continuous-time optimization algorithms (Betancourt et al., 2018). While the continuous-time formulation of Su et al. (2016) is based on an accelerated optimization algorithm for smooth and convex objective functions (i.e., the case where the minimum could be degenerate), an alternative for strongly convex objective functions (non-degenerate case) was proposed in D¨urr and Ebenbauer (2012). In Franc¸a et al. (2019) and Muehlebach and Jordan (2019), important geometric properties of the underlying dynamics are highlighted and corresponding structure-preserving discretization schemes are analyzed. In addition, Franc¸a et al. (2019) propose relativistic dynamics, as these naturally bound the rate of change of the position by the speed of light, which is supposed to prevent overshoot. The work of Maddison et al. (2018) suggests that convergence can be improved by a suitable choice of the kinetic energy (for example choosing the kinetic energy to be the convex conjugate of the objective function).

While this line of work suggests strong parallels between dynamical systems and optimization algorithms, a main part of the research focuses on engineering Lyapunov functions or estimate sequences that arise separately from the dynamical system. These typically arise from educated guesses, or from the solutions of semidefinite programs, and enable the precise characterization of the convergence rate for a given algorithm. The aim of our work is different: We exploit fundamental topological properties of dynamical systems in order to provide an explicit characterization of the convergence rate up to constants, which, in most cases, can be evaluated in closed form. Hence, our analysis does not aim at precisely bounding the number of iterations needed to reach a local minimum with a certain tolerance, but provides a qualitative and quantitative characterization of the convergence rate up to constants. The analysis does not rely on convexity and applies both to continuous-time and discrete-time formulations. This provides a tool to understand how the convergence rate depends on various algorithm parameters and the shape of the objective function, and it characterizes the mechanism leading to acceleration. Our characterization of the convergence rate is both necessary and sufficient.

In the context of nonconvex optimization, the aim is typically to find a local minimum of a twice continuously differentiable function that satisfies certain non-degeneracy conditions. It has been shown that gradient descent converges to a local minimum from almost every initial condition, which is due to the fact that local maxima and saddle points are unstable equilibria (Lee et al., 2016). The same reasoning applies to gradient descent with momentum. However, even though gradient descent and gradient descent with momentum are guaranteed (in an almost everywhere sense) to ultimately reach a local minimum, it may take an arbitrarily long time to escape saddle points and local maxima. A major concern arises due to the fact that an objective function with n decision variables might have different isolated saddle points, which have to be traversed before reaching a local minimum. This implies that even for generic initialization strategies, the worst-case convergence rate depends on n (Du et al., 2017). This article is concerned with characterizing the convergence of momentum-based algorithms up to a constant factor. We will not analyze how this factor scales with n; the example from Du et al. (2017) suggests that the factor scales exponentially in n. However, the results from Ge et al. (2015) indicate that adding small random perturbations to gradient descent can improve the convergence rate and reduce the dimension-dependence of the convergence rate on n to polylog factors. A similar result applies likewise to gradient descent with momentum (Jin et al., 2017). A recent account of the state-of-the art is given in Jin et al. (2019), for example.

The poor scaling in the dimension n in the nonconvex case can be avoided by explicitly leveraging curvature information, as for example proposed in Nesterov and Polyak (2006) and Curtis et al. (2017). However, the computational cost per iteration of these methods is generally higher and increases with n. For simplicity, this article focuses on first-order algorithms with momentum, even though many ideas generalize in straightforward ways.

Outline: This article views optimization algorithms as dynamical systems. Two cases are distinguished: Section 2 assumes that the algorithm parameters are fixed, leading to time-invariant dynamics. The more general case, where the parameters are allowed to vary with the number of iterations, is discussed in Section 3.

Section 2 starts by introducing the notation and defining the scope of the analysis. A momentum-based optimization algorithm is understood as a second-order dynamical system,1 where the local minima of the objective function correspond to asymptotically stable equilibria. Section 2.2 and Section 2.3 introduce two prototypical examples of momentum-based optimization algorithms. These are generalized versions of Nesterov’s acclerated gradient scheme (Nesterov, 2004, Ch. 2.2), and also include Heavy-Ball methods (Polyak, 1964). By focusing on smooth dynamical systems based on smooth objective functions, our analysis excludes, for example, the treatment of constraints via (non-differentiable) indicator functions, as well as optimization algorithms with restart schemes. The subsequent result derived in Section 2.4 highlights our assertion that fundamental topological properties can be exploited for characterizing the convergence rate of momentum-based optimization algorithms. The results are applied to the analysis of the two prototypical examples in Section 2.6 and Section 2.7, leading to a broad characterization of the phenomenon of acceleration. In addition, Section 2.7 rigorously motivates the use of symplectic discretization schemes in the context of optimization: The symplectic discretization enables the computation of a modified energy function that can be used for stability analysis.

The structure of Section 3 is analogous to Section 2. The general result for characterizing the convergence rate in the time-varying case is presented in Section 3.1 and illustrated with two subsequent examples in Section 3.2 and Section 3.3. The results highlight the fact that time-varying damping parameters can speed up the convergence rate by an additional sublinearly converging factor. For ill-conditioned problems, this additional gain in the convergence rate becomes significant. The analysis connects the non-degenerate case with the degenerate case and motivates the update rules for the parameters of well-known accelerated gradient schemes, such as Nesterov (2004, p. 90, Constant Step Scheme II).

The article concludes with a summary and final remarks in Section 4.

2. The Time-Invariant Case

2.1 Introduction

Throughout the article we consider the problem of minimizing the function satisfies the following assumption:

Assumption 1 The function f has a Lipschitz-continuous gradient. The critical points are non-degenerate and isolated.1

The Lipschitz continuity of the gradient is important for ensuring that the resulting continuous-time trajectories exist and are unique. The fact that f has isolated non-degenerate critical points excludes pathological cases where the function has multiple connected local minima. The less pathological case, where the local minima are isolated but have vanishing curvature in certain directions, can be obtained via a limit argument, as will be discussed in Section 3.

Due to the Lipschitz continuity of the gradient, the Hessian exists almost everywhere and is essentially bounded. We will summarize the (essential) upper and lower bounds on the Hessian with the two constants

where denotes the Euclidean norm. Thus for the function is convex, for nonconvex. In order to simplify our exposition, we will consider functions that satisfy:

Assumption 2 In addition to Assumption 1, f has a second derivative that is Lipschitz continuous.

None of the subsequent results will explicitly depend on a Lipschitz constant related to the second derivative. Hence, under very mild conditions, all our results characterizing the convergence rate in (1) apply when f satisfies Assumption 1 instead of Assumption 2.1 This is further discussed in Appendix F.

Without loss of generality, we further assume that f has a local minimum at . The convergence rate of a momentum-based optimization algorithm will depend on the local shape of the local minimum in question, which is determined by the constants

In case Assumption 2 is replaced with Assumption 1, the above constants are defined via the essential supremum and essential infimum of the Hessian in a neighborhood of

We model a momentum-based optimization algorithm either as a continuous-time or discrete-time dynamical system of the form

where the superscript + denotes either differentiation with respect to t (continuous-time setting), in which case , or a unit time-shift (i.e., , discrete-time setting), in which case I = {0, 1, 2, . . . }. The nonnegative real numbers are denoted by , whereas the positive real numbers are denoted by . The dynamics

are implicitly dependent on and are assumed to satisfy the following assumption.2

Assumption 3 The dynamics are continuously differentiable in both arguments and the derivatives are Lipschitz continuous.

In the continuous-time case, Assumption 3 implies that the resulting trajectories exist and are unique for all times (Arnol’d, 1992, p. 93, Corollary 3). Differentiability also implies that the dynamics can be linearized about an equilibrium, which typically provides a means to study the local behavior of the resulting trajectories. In addition, Assumption 3 implies that, over a finite time interval, trajectories are continuously dependent on their initial conditions (Arnol’d, 1992, p. 93, Corollary 4). These topological properties will be exploited in the following.

In order to simplify notation, we define

and introduce . Moreover, the map denoted by , and is referred to as the flow of the dynamical system (4) - (5). Next, we provide a formal definition of a momentum-based optimization algorithm.

Definition 1 We call the dynamical system (4) - (5) satisfying Assumption 3 a momentum-based optimization algorithm for the function is an asymptotically stable equilibrium in the sense of Lyapunov.

This means that is continuous at 0, uniformly in in a neighborhood of the origin.

Even though the dynamics can capture gradient flow, as a special case (e.g., in continuous time), we are interested in analyzing momentum methods, which arise from a nontrivial choice of

Throughout the article we will illustrate our ideas with two examples, which are prototypical versions of momentum-based optimization algorithms.

2.2 Example 1

The first example is based on the following continuous-time dynamics

where denotes the gradient of the dissipative forces. These dynamics can be viewed as a mass-spring-damper system, where f represents the spring potential and the damping. The dissipative forces are assumed to take the form

where the parameters are constant. Ideally, the parameters are designed to take into account additional information about the the function f; for example, upper and lower bounds on the curvature is chosen to be zero, the dissipative forces reduce to In that case, the dynamics (8) describe a continuous-time heavy ball method (Polyak, 1964). In case , the dynamics are related to Nesterov’s accelerated gradient method (Muehlebach and Jordan, 2019).

An intuitive interpretation of the dissipative forces (9) can be given in the following way: For , (9) describes linear isotropic damping. For , (9) can be rewritten as

which implies that (9) includes an additional damping term that averages the local curvature in the interval between . As a result, the damping increases if the local curvature is large, and reduces if the local curvature is small. As the velocity p is larger, the interval over which the average is taken is increased. The two forms of damping, linear isotropic and curvature dependent, are balanced by the coefficients

The equilibria of (8) are given by the critical points of f. Moreover, if a given critical point is a non-degenerate local minimum, the corresponding equilibrium is asymptotically stable. This follows by evaluating the total energy,

along the trajectories of (8),

which shows that the energy necessarily decreases in a neighborhood of the equilibrium. Combined with the fact that H is locally positive definite about a non-degenerate local minimum, this implies stability in the sense of Lyapunov. Asymptotic stability of the non-degenerate local minimum can then be concluded from La Salle’s theorem (see, for example, Sastry, 1999, Ch. 5.4), which is based on examining the invariant sets satisfying dH/dt = 0. Thus, the dynamical system (8) is a momentum-based optimization algorithm for f according to Definition 1.

However, analyzing how the energy evolves along the trajectories of (8) also reveals global properties of the dynamics. For it follows that the set of initial conditions that do not converge to a local minimum is a set of measure zero, given by the critical points of f that are not local minima. The analysis holds without assuming convexity of f. However, if f happens to be convex and has a unique global minimum, then the corresponding equilibrium is globally asymptotically stable for any choice of parameters

The same reasoning applies in case f satisfies Assumption 1 instead of 2, or when f has degenerate local minima.

2.3 Example 2

The second example is obtained by discretizing (8) in the following way:

where T > 0 is the step size. The discretization consists of a forward Euler update of the momentum coordinates, and uses the newly computed momentum for the position update. As will be further discussed in the remainder of the article, the fact that the newly computed momentum coordinate is used for the position update makes the scheme symplectic for . This means that the transformation (13) from preserves the symplectic form (for has important consequences. One of these consequences concerns the spectrum of the linearization of (13). In case , the corresponding eigenvalues are guaranteed to lie on the unit circle (for ). This is in sharp contrast to the standard explicit Euler discretization, which is not symplectic, and where the eigenvalues lie outside the unit circle even for arbitrarily small T > 0. Indeed, we will exploit the fact that the map (13) is symplectic (for ) to construct a modified energy function, which will be used for a stability analysis that extends beyond the linearization. Additional background information on symplectic integration can be found in Sanz-Serna (1992) or Hairer et al. (2002), for example.

For , the resulting algorithm is referred to as gradient descent with momentum (Polyak, 1964). For , Nesterov’s accelerated gradient scheme (Nesterov, 2004, Constant step scheme III, p. 81) is obtained by choosing the parameters as follows:

where characterize the local shape of a local minimum, (Muehlebach and Jordan, 2019). Moreover, as pointed out in Muehlebach and Jordan (2019), the “Constant step scheme II” algorithm of Nesterov (2004) is obtained by a particular choice of time-varying coefficients generalization to time-varying coefficients will be discussed in Section 3.1.

The equilibria of (13) are again the stationary points of f. In order to determine the stability of an isolated local minimum, we linearize the dynamics,

where (without loss of generality we consider ). An eigenvalue analysis then reveals that the corresponding non-degenerate local minimum is asymptotically stable if

holds for all eigenvalues Asymptotic stability of the linearized dynamics implies that the same equilibrium is asymptotically stable under the nonlinear dynamics (13) (Sastry, 1999, p. 215).

For example, provided that the parameters are chosen according to (14), we obtain that the given equilibrium is asymptotically stable if

We thus conclude that (13) is an optimization algorithm for f according to Definition 1 provided that the constants are chosen such that (16) is satisfied.

Unlike Example 1, our analysis for Example 2 is valid only in a neighborhood about the equilibrium, which corresponds to a local minimum. However, the specific structure of the discretization and the ideas from Example 1 can be exploited for obtaining a nonlinear analysis that is valid beyond a neighborhood of the equilibrium. This will be illustrated in Section 2.7.

2.4 Characterizing the convergence rate

Guaranteeing mere convergence is often not enough, as we are primarily interested in how quickly the trajectories of the nonlinear dynamics (4) and (5) converge to a local minimum. In the following, we argue that in most cases, a linear analysis characterizes the convergence rate up to constants. The linear analysis typically reduces to the computation of eigenvalues.

Our main proposition is based on the following assumption and definition.

Assumption 4 Let the linearized dynamics

be such that there exists an estimate

where are constant.

Definition 2 The region of attraction of the equilibrium at the origin (of the nonlinear dynamics (4)) is defined as the set

Proposition 3 Let Assumption 4 be satisfied. Then, for any compact set there exists a finite constant such that for all

Proof If A is empty the statement is trivial. We therefore assume that A is non-empty. Lemma 14 (see Appendix A) implies that there exists an open ball , centered at the origin, such that any trajectory starting in converges with rate . We make the following claim. Claim: There exists a finite time such that for all Proof of the claim: The origin is a stable equilibrium. Hence there exists a constant all trajectories starting in , the open ball of radius centered at the origin, remain in times. In addition, each . Thus, for each there exists a time . The continuity assumptions on g imply that is continuous, which can be verified by the Gr¨onwall inequality. Therefore, for each there exists an open ball centered about such that for all which implies . The collection of all is an open cover for A. Due to the fact that A is compact, there exists a finite sub-cover (according to the HeineBorel theorem), which we denote is finite. As a result, choosing , which proves the claim.

Moreover, the continuity of , implies further that is bounded for any . Combined with Lemma 14 (see Appendix A), this yields the following bound

for all are positive constants. We fix , consider the trajectory , and apply the mean value theorem,

where lies between z(t) and the origin. Due to the fact that the dynamics are assumed to have Lipschitz-continuous derivatives, we obtain the following bound:

where denotes a Lipschitz constant of . According to (19), the trajectory |z(t)| is integrable (in continuous time) and absolutely summable (in discrete time). We obtain, by virtue of Lemma 13 (see Appendix A),

where is constant. The constant is related to an upper bound on the integral (in continuous time) or the sum (in discrete time) of , which according to (19), is guaranteed to be finite.

Remarks:

• Proposition 3 characterizes the convergence rate and states that the number of iterations required to obtain an -accuracy approaches . This does not provide a tight bound on the number of iterations required to achieve a certain accuracy, as the constant might depend on or grow rapidly with the size of A. Nevertheless, it enables a qualitative and quantitative discussion of the convergence rate . In particular Proposition 3 highlights the fact that the convergence rate is determined by the local properties of the dynamics, which depend on the local shape of the objective function f.

• The assumptions required for invoking Proposition 3 are often straightforward to verify, as Assumption 4 hinges on an eigenvalue analysis of . Proposition 3 can also be generalized to the case where f satisfies Assumption 1 instead of Assumption 2, as shown in Appendix F.

• The proof highlights the following alternative statement of Proposition 3: There exists a finite time after which all trajectories starting in A are guaranteed to be within a small neighborhood of the origin. Within this neighborhood the convergence is exponential with rate

• Given the smoothness properties of g, the assumption that the linearized dynamics converge with rate (Assumption 4) is necessary and sufficient for the convergence of the nonlinear dynamics with rate , as can be shown with the arguments of Lemma 14 in Appendix A.

2.5 Implications for optimization algorithms

Proposition 3 enables a characterization of the convergence rate based on a linear analysis of the dynamics about an equilibrium. In the following, we will use Proposition 3 to discuss the convergence rate of the optimization algorithms given in Example 1 and Example 2. In particular, this provides conditions guaranteeing accelerated convergence.

2.6 Example 1

The total energy of the dynamical system is given by (11), and according to (12), energy is dissipated along trajectories. As we will show in the following, the energy function can therefore be used to characterize the region of attraction of the equilibrium , whereby the topology of the level sets of H and f will play an important role. This will be discussed next.

By assumption, f has a local non-degenerate minimum at , which implies that contains a compact, connected component containing the origin for sufficiently small c > 0. The set describes all values . Morse theory (Milnor, 1963) concludes that the topology of the set is determined by the critical points of f. Thus, the set includes a compact, connected component that contains the origin (and no other

Figure 1: The figure illustrates the set , when the function f is scalar. In this example, contains two connected components . The function f has a local minimum at the origin, a saddle point at , and a local maximum at origin is contained in the set . Morse theory states that is guaranteed to be compact provided that . Due to the fact that f is one-dimensional compact for ; this is, however, no longer true in higher dimensions.

critical point), as long as is any other critical point. The situation is illustrated with an example in Figure 1.

From the definition of H we infer that the critical points of H are all of the form corresponds to a critical point of f. The above reasoning therefore also applies to the total energy H, and concludes that the set includes a compact, connected component that contains the origin (and no other critical point), as long as . This motivates the following definition, which will be used throughout the remainder of the article.

Definition 4 The set is defined as the connected component of that contains the origin.

Analyzing the rate of change of the energy along the trajectories of (8) (cf. (12)),

reveals that by a suitable choice of the parameters , the energy necessarily decays. In particular, this is the case for denotes a lower bound on the Hessian of f, as defined in (2). We therefore conclude as follows.

Proposition 5 Provided that , the origin is an asymptotically stable equilibrium in the sense of Lyapunov. Its region of attraction contains the set

Proof We consider any initial condition , the energy necessarily decays along the trajectories of (8), c.f. (23). The trajectory z(t) = (q(t), p(t)) is therefore confined to the connected component of that contains the origin, which, according to the above discussion, is necessarily compact. This implies that the origin is stable. In addition, the energy strictly decreases except when p(t) = 0. Hence, according to La Salle’s theorem, (see, for example, Sastry, 1999, Ch. 5.4), q(t) necessarily converges to a critical point of f, whereas p(t) converges to zero. The origin is the only critical point contained in , which implies asymptotic stability of the origin.

An eigenvalue analysis of the linearized dynamics (about the equilibrium) reveals that the eigenvalues are given by

where h is any eigenvalue of . Thus, Proposition 3 asserts that the convergence rate for all initial conditions in is directly determined by the real part of the eigenvalues, provided that

The eigenvalues h satisfy the upper and lower bounds , cf. (3). An appropriate normalization of the constants reduces the analysis to the case L = 1, which we consider in the following. Figure 2 shows how the eigenvalues vary as a function of , according to the formula (24).

We consider first the case : For small —the eigenvalues are complex conjugates, have real part (independent of h), and their imaginary part varies between changes). As d is increased above the eigenvalues can be

Figure 2: This figure shows how the eigenvalues of the linearization change as a function of the damping parameters , and as a function of the curvature at the equilibrium. The top left plot shows the behavior of the eigenvalues for , as the curvature h is varied from 0 to 1 (the different colors represent the different values of d). The top right plot shows the behavior of the eigenvalues for and the bottom right plot the behavior for , where the curvature h is varied from 0 to 1. The figure indicates that for , the eigenvalues are real for large values of d and small values of h. As h is increased the eigenvalues may become complex conjugated, in which case a further increase in h affects only the imaginary part. The additional parameter has the effect of reducing the real part for larger values of h. This increases the convergence rate for large values of h.

both real or complex conjugates, depending on h. However, their worst-case real part is given by , which rapidly increases for larger d. The worst-case convergence rate (for a fixed d) is given by the maximum real part as h is varied between . Thus, in that sense, the optimal worst-case convergence rate is

Increasing has the effect of increasing the convergence rate for larger values of h, since the parameter introduces additional damping as h becomes large. Provided that , the qualitative behavior of the eigenvalues remains the same: For small eigenvalues are complex conjugates with worst-case convergence rate creased above , the eigenvalues can be both real or complex conjugated, depending on h. Their worst-case real part is again achieved for and is rapidly increasing for larger d. Again, the optimal worst-case convergence rate is obtained for

In continuous time, any desired convergence rate can be realized, by a simple reparametrization of time. A linear reparameterization, is constant, will simply scale the real parts and imaginary parts of the eigenvalues with . A general, nonlinear, but diffeomorphic transformation will lead to time-varying dynamics, which will be discussed in Section 3. Thus, it might seem that any discussion of continuous-time convergence rates in the context of optimization is pointless. However, the analysis above tells us something different; it reveals how the convergence rate is affected by the condition number, which characterizes the shape of a local minimum. The analysis should be interpreted in the following way: provided that the time scale is fixed such that a convergence rate of if the physical units are kept) is achieved for , the analysis reveals how the convergence rate of any optimization algorithm of the type (8) deteriorates as increases.

For the following analysis we introduce the notation if the function dominates the function for large arguments; that is, valued functions that are positive for large arguments. In the same way, the notation implies that the function for large arguments. We further use that neither are dominant for large arguments; that is,

We will analyze the performance of the algorithm given in Example 1 not only on a single function, but on a whole class of functions. In order to make the following statements precise, we will fix a compact set that contains the origin and introduce the following class of functions (parametrized by the constants

Definition 6 Let denote the set of all functions such that each element the following conditions: 1) f has an isolated local minimum at the origin with condition number satisfies Assumption 2 and the bounds in (2), and 3) according to Definition 4.

These conditions are motivated as follows. The first prescribes the local geometry about the local minimum at the origin, which influences the convergence rate in significant ways. The second ensures smoothness of the gradient, and the third guarantees that there are no critical points in A other than the origin.1

Acceleration is obtained whenever the convergence rate (again, relative to the convergence rate achieved for ) scales with for large values of . More precisely:

Definition 7 A momentum-based algorithm is accelerated if there exists constants , such that for any , the following bound holds (for some constant

We will now proceed to derive conditions on the parameters of the algorithm given in Example 1 that guarantee accelerated convergence. We start with the case it follows that the real part of the worst-case convergence rate scales with , which makes acceleration impossible. For the damping is too small; i.e., the real part of (24) scales worse than . Hence, acceleration is only achieved for similar argument applies to the case and yields

The above analysis is summarized with the following proposition.

Proposition 8 In the nonconvex case (), the algorithm given in Example 1 is accelerated for the set of parameters

In the convex case (), the algorithm given in Example 1 is accelerated for the set of parameters

We would like to emphasize that the bound in Definition 7 is checked for each function individually, whereby the constant may depend on the specific function f. Given the (potentially large) compact set A, Proposition 8 therefore answers the following question: What are the algorithm parameters ensuring that for any the time for reaching an scales with up to constants, i.e., for small ? One important aspect of the analysis is that the convergence rate depends only on the local geometry of the objective function, as characterized by the constant

The basic notion of local analysis, obtained from a derivation of eigenvalues, is well known (see, for example, Polyak (1987)). But Proposition 8 goes beyond classical local analysis in that, by virtue of Proposition 3, the local rate can be guaranteed for a large portion of the region of attraction of a given equilibrium. Key for this result is a nonlinear and global stability analysis. In the strongly convex setting, the results regarding the case have been derived in Gadat et al. (2018). Similar results regarding the case can be found in Attouch et al. (2018).

The bounds that we establish throughout the manuscript are restricted to initial conditions that are contained in a compact set within the region of attraction of a non-degenerate and isolated local minimum. This can be further motivated by considering the function shown in Figure 1. Provided that the algorithm dissipates energy, we infer from the above discussion that the region of attraction of the origin (an open set) will comprise points in the state space that are arbitrarily close to the saddle and the maximum (see Figure 1). Saddle points or maxima are equilibria, since any gradient-based algorithm cannot distinguish between them and minima. Hence, when initializing the algorithm close enough to , convergence of the algorithm can potentially take an arbitrarily long time (see Du et al. (2017) for a formal analysis). In that sense, the restriction of initial conditions to a compact set within the region of attraction of a given equilibrium appears to be natural.

2.7 Example 2

In discrete time, the stability analysis is not as straightforward, since the energy H is in general not dissipated along trajectories. However, the specific structure of the discretization can be exploited for obtaining a nonlinear stability analysis that is valid beyond a neighborhood of a local minimum.

As remarked in Muehlebach and Jordan (2019), the discretization (13) can be divided into two parts, an energy dissipation step and a symplectic Euler step (Hairer et al., 2002, p. 3),

with intermediate state . In order to simplify notation, we introduce the maps to denote the energy dissipation and the symplectic Euler step, respectively. The symplectic Euler step is a well-known structure-preserving first-order integration scheme. Due to the symplectic integration, there exists a modified energy function that is nearly (up to exponentially small terms) conserved by the symplectic Euler step (Hairer et al., 2002, Chapter VI). The modified energy function can be computed by means of truncated Taylor-series expansions. In order to make the analysis rigorous, f is assumed to be analytic, which enables the estimation of higher-order derivatives using Cauchy’s integral formula. It will be shown that for the subsequent stability analysis the assumption of f being analytic poses essentially no restriction, as any continuous function can be approximated arbitrarily closely by an analytic function on a compact domain (Stone-Weierstrass Theorem; Rudin, 1976, p. 159). The modified energy function is characterized by the following result.

Proposition 9 Let f be analytic on , the closed n-dimensional ball of radius r centered at the origin, and let be a Lipschitz constant of . Then there exists a perturbed Hamiltonian

for all and for all

are constant. The perturbed Hamiltonian has the form

where is an analytic function. The perturbed Hamiltonian has the same critical points as H and satisfies

Remarks:

• A more general version of Proposition 9 is stated and proved in Appendix C.

• The proof follows the reasoning of Hairer et al. (2002, p. 307), which enables the construction of by means of a truncated series expansion. Cauchy’s integral formula is used to assert the convergence of the truncated series. The result from Hairer et al. (2002, p. 307) is extended by exploiting the specific structure of the underlying dynamics, which leads to the additional statements about the modified energy function . These are crucial in the context of stability analysis.

• The constants are determined as a function of the Lipschitz constant of which can be regarded as the natural time constant for the dynamics governed by the Hamiltonian we obtain the following values:

for all . Choosing, for example, a time step T = 0.001 leads to the bound on the right-hand side of expression (29), indicating that is virtually exactly conserved by the symplectic Euler scheme.

• The estimate for the maximum time step is typically conservative. However, the proposition rigorously establishes that for a small enough time step, the perturbed Hamiltonian will be almost exactly conserved. The importance lies in the fact that the upper bound on the time step is only dependent on the Lipschitz constant of , which is directly related to the Lipschitz constant of . Due to the fact that Proposition 9 is a statement about the integration of the conservative part of the dynamics, the maximum time step is independent of the parameters

• The bounds (28) enable a nonlinear stability analysis, based on the modified Hamiltonian Due to the fact that is necessarily positive in a neighborhood of the origin, cf. (28). Combined with the fact that the perturbed Hamiltonian has the same critical points as H, this concludes that the level sets of are compact in a region about the origin. The size of this region is determined by the critical points of H, as argued in the analysis of Example 1; see Section 2.6.

In the following, we will use the modified energy function , whose existence is ensured by Proposition 9, for analyzing the stability of the dynamics (13) in the large. As pointed out, the dynamics (13) can be subdivided into a dissipation step and a symplectic Euler step.

In order to simplify the presentation we focus on the case where T is small and the map is close to the identity (little damping). For stability analysis this is the most challenging setting, since, due to the almost vanishing damping, the convergence can be arbitrarily slow (the equilibrium is almost non-attractive). In case is not close to the identity, which is, for example, obtained for a constant parameter , independent of , stability can be analyzed by means of the unperturbed energy function H, as shown in Appendix E. We will thus concentrate on the following result.

Proposition 10 Let the dissipative forces be such that

Then, there exists a maximum time step , such that the origin is an asymptotically stable equilibrium of the dynamics (13) for all , with domain of attraction at least where A is any compact set. Up to exponentially small terms due to (27), the maximum time step depends on an upper bound on , the Lipschitz constant of

Remarks:

• The proof of Proposition 10, which is included in Appendix D, is based on the Lyapunov function

where denotes the perturbed energy function introduced in Proposition 9. The function V is motivated by the following observation. As the damping parameter decreases and approaches the identity, V reduces to the perturbed energy function , which (up to exponentially small terms) is known to be conserved by . The correction is required due to the fact that is a pure contraction in the momentum variable, and as a result is not necessarily decreasing through the application of . A similar correction has been previously used in the literature on dissipative systems; see, for example, Hale (1988) or Haraux (1991), and in the context of the heavy-ball method in Gadat et al. (2018).

• Proposition 10 ensures that the region of attraction is the same as the continuous-time counterpart, and that the requirements on the time step for guaranteeing asymptotic stability is independent of the minimum damping (up to the exponentially small terms). This will be important in the following, where we will analyze how the convergence rate scales with

• Provided that and the function f is convex, the stability analysis simplifies substantially. In particular, a sufficient condition for stability is obtained by analyzing the total energy H along the trajectories of (13), as shown in Appendix E.

Next we will analyze the specific implications for the dynamics (13). We recall that the contraction step is given by

which concludes that the upper and lower bounds of Proposition 10 are given by ). The upper and lower bounds are therefore functions of

Hence, according to Proposition 10, provided that are bounded with respect to , there exists a time step T, independent of , such that the origin is asymptotically stable, where the region of attraction includes any compact subset of . Then, according to Proposition 3, the convergence rate of any trajectory starting in compact, is determined by the magnitude of the eigenvalues of the linearized dynamics. The eigenvalues are given by

enabling the calculation of the convergence rate for a given choice of

For the following discussion, we again assume that are normalized such that and analyze how the eigenvalues change as a function of h, as done in Figure 3. The worst-case convergence rate is determined by the maximum magnitude of the eigenvalues (over ). For very small values of h, the eigenvalues are real, and one eigenvalue is very close to 1. As h is increased, the eigenvalues become complex conjugates, where for , the eigenvalues are located along circles centered at the origin, whereas for their magnitude slightly decreases. If h is increased further, the eigenvalues become real again and their magnitude increases. The worst-case convergence rate, i.e., the largest magnitude of therefore either achieved for

We are interested in determining the conditions on such that the convergence rate scales with . We consider first the case , and assume that d is large enough such that the eigenvalues are real. Then, provided that it follows that However, if d is chosen to be larger, i.e., , it follows that for large

that is, the convergence rate scales worse than is small enough, such that the eigenvalues are complex conjugates, their magnitude is given by

which indicates that a scaling of the convergence rate with is only achieved for For it follows from

for large . Thus, for , for example, the eigenvalues are complex conjugates and the worst-case convergence rate scales with . The condition is therefore necessary and sufficient for ensuring that the worst-case magnitude of scales with . A similar analysis applies to the case where , as shown below.

As in the discussion of Section 2.6 we say that the optimization algorithm (13) is accelerated if the convergence rate scales with (see Definition 7), where exponentially small terms resulting from the application of Proposition 9 are neglected. As a consequence, the above discussion allows us to translate the results from Proposition 8 almost verbatim to the discrete-time setting. This results in a broad characterization of the parameters leading to acceleration.

Proposition 11 In the nonconvex case, (), there exists a time step T > 0, such that the algorithm given in Example 2 is accelerated provided that

Example 2 is accelerated provided that and either

Proof The requirements on are needed for guaranteeing asymptotic stability of the origin for a time step T that is small enough, but independent of , as implied by Proposition 10 and Appendix E. It therefore remains to analyze the behavior of the eigenvalues as a function of h for

and if h is increased further they become real again. In case the eigenvalues are complex conjugates, their magnitude is given by

We consider first the case : The eigenvalues are real provided that , which, given the assumptions on , implies that it follows that , which therefore yields accelerated convergence. In case it follows that

due to the fact that . Thus, accelerated convergence is impossible for . In case the eigenvalues are complex conjugates for , it follows from (33) that is necessary for accelerated convergence.

We therefore proceed to the case h = 1, where due to the fact that d vanishes for large two eigenvalues approach constant values,

for , which is either constant or zero. This results in a convergence rate that is independent of is chosen small enough to guarantee convergence.

In particular, this leads to the conclusion that, for example, the following heavy-ball scheme,

leads to accelerated convergence for a sufficiently small value of T. In case the Lipschitz constant of is bounded by one, the bound for guaranteeing acceleration is obtained by following the proof of Proposition 10.1 Note that compared to Nesterov’s original scheme, a gradient evaluation at is not required.

We would like to emphasize that our analysis captures the computational complexity up to constants. For a given objective function, choosing typically allows larger steps T, which is likely to reduce the number of iterations needed for convergence. On strongly convex functions, for example, the “Constant step scheme III” presented in Nesterov (2004, p. 81), seems to provide a good compromise between a straightforward implementation and fast convergence (considering constants).

Local results akin to Proposition 11 which are based on a quadratic objective function are well known. A summary can be found for example in Lessard et al. (2016). In addition to unifying the continuous-time and discrete-time analysis, an important aspect of Proposition 11 is to establish a rate that is valid for a large portion of the region of attraction of a given equilibrium. Furthermore, the fact that the heavy-ball scheme leads to accelerated convergence for a sufficiently small value of T appears to be new. If T is chosen to be too large, however, for example by tuning T on quadratic functions, this may lead to convergence issues even on strongly convex functions, as has been reported in Lessard et al. (2016).

Figure 3: This figure shows how the eigenvalues of the linearization change as a function of the damping parameters , and as a function of the curvature at the equilibrium, h. The top left plot shows the behavior of the eigenvalues for , as the curvature h is varied from 0.1 to 1 (the different colors represent the different values of d). The top right plot shows the behavior of the eigenvalues for and the bottom right plot the behavior for , as the curvature h is varied from 0.1 to 1. For all the plots the time step is set to T = 0.8. The figure indicates that for , the eigenvalues move first along the real axis and then along concentric circles (as h changes). The additional parameter effect of reducing the magnitude of the eigenvalues for larger values of h. While this can increase the convergence rate, it bears also the risk of instability, when chosen to be too large.

3. The Time-Varying Case

Next we will consider the case where the dynamics g are time dependent. The main motivation for introducing time-dependent dynamics lies in improving the convergence rate for functions that have an almost vanishing curvature at a local minimum.

We slightly abuse notation and reuse the variables introduced in Section 2.4. It will be clear from context whether we refer to the time-varying or time-invariant case.

Following Section 2, we model a momentum-based optimization algorithm as a continuous-time or discrete-time dynamical system of the form

where the dynamics satisfy the following assumptions:1

Assumption 5 The dynamics are continuously differentiable in all their arguments. The derivative in the second and third arguments is Lipschitz continuous, uniformly in t.

We define the map from and define the dynamical system (37)-(38) to be a momentum-based optimization algorithm for the function is an asymptotically stable equilibrium in the sense of Lyapunov (uniformly in the initial time ). Due to the continuity assumptions on the dynamics , the continuous-time existence and uniqueness results, as well as the continuity of trajectories with respect to their initial conditions continues to hold (Arnol’d, 1992, p. 93, Corollaries 3 and 4).

The examples presented in Section 2.2 and Section 2.3 are modified by allowing the constants to be time varying. We restrict ourselves to the case where for large time, which leads to asymptotically time-invariant dynamics. This has the advantage that the asymptotic convergence rate is independent of the initial time , and, under mild continuity conditions, the (optimal) convergence rates from Section 2.4 will be recovered for large t. Thus, (8) and (13) are extended by replacing the constants with the functions , which are assumed to be continuously differentiable and convergent,

The stability analysis of Example 1 (Section 2.2) translates to the time-varying case.2 This implies that the algorithm (8) with time-varying parameters is a momentum-based optimization algorithm according to the above definition.

The stability analysis of Example 2 is extended to the time-varying case by rewriting the dynamics (37) in the following way:

where the linear time-varying part vanishes for . The remainder is of second order in (uniformly in ). Thus, if the linear time-invariant part has all eigenvalues strictly within the unit circle, the dynamics (40) are asymptotically stable, uniformly in Consequently, the analysis from Section 2.3 yields the following conditions, cf. (16):

for all eigenvalues

3.1 Characterizing the convergence rate

As in Section 2.4 we argue that a linear analysis can be used to characterize the convergence rate of the nonlinear dynamics up to constants. The results are derived only for non-degenerate isolated local minima. We believe, however, that the discussion also provides insights into the case where the curvature vanishes at a local minimum; see Section 3.2 for further discussion of this point.

A similar analysis of the role of choosing a time-varying damping parameter can be found in Attouch et al. (2018). The emphasis of the following section lies in providing intuition through a unified treatment between the degenerate and non-degenerate case.

We make the following assumption.

Assumption 6 Let the linearized dynamics

be such that there is an estimate

where are constant, and is continuous, and monotonically decreasing.

Proposition 12 Let Assumption 6 be satisfied and let the region of attraction of the equilibrium at the origin of the nonlinear dynamics (37) be denoted by . Then, for any compact set and any initial time , there exists a finite constant such that for all

Proof The set A is compact, nonempty, and contains the origin. Lemma 16 (see Appendix B) implies that there is a ball , centered at the origin, such that any trajectory starting in converges with rate . The following claim can be verified with the arguments of Proposition 3.

Claim: There exists a finite time such that for all

Moreover, the uniform continuity of , implies further that is bounded for any . Combined with Lemma 16, this yields the following bound

for all are positive constants. We fix , consider the trajectory , and apply the mean value theorem,

where lies between z(t) and the origin. Due to the fact that the dynamics are assumed to have Lipschitz continuous derivatives (uniformly in t), we obtain the following bound

where denotes the Lipschitz constant of (uniformly in t). According to (42), the trajectory |z(t)| is integrable (in continuous time) and absolutely summable (in discrete time). We obtain, by virtue of Lemma 15,

where is constant. The constant is related to an upper bound on the integral (in continuous time) or the sum (in discrete time) of , which according to (42), is guaranteed to be finite.

3.2 Example 1

The energy function, as defined in (11), is not explicitly dependent on time, even if the parameters are chosen to be time-varying. Following the discussion of Section 2.6, we conclude that is contained in the region of attraction of the equilibrium at the origin, as long as

As a result, Proposition 12 implies that the convergence rate of the trajectories starting from compact, is given by the linearization of (8) about the origin.

After a change of coordinates that diagonalizes , the linearized dynamics of a single coordinate are given by

Figure 4: The figure shows d(t) defined according to (49) with example. Compared to the time-invariant dynamics (obtained in the asymptotic limit), which converge with rate , the time-varying terms improve the convergence rate by the shaded area, as indicated with (47).

where denotes the corresponding eigenvalue of . As in Section 2.6, upper and lower bounds on the eigenvalues of are given by . The analysis simplifies by expressing the dynamics using the coordinates , which are defined by

with , and yields

The dynamics (48) represent an undamped linear harmonic oscillator whose frequency varies with time. This time-varying frequency can be interpreted as the time-varying generalization of the square-root term in (24). The non-square-root term in (24) is captured by the coordinate change (47).

In the following, the choice

is discussed. This leads to a time-invariant right-hand side of (48), and simplifies the subsequent analysis. The more general case (no restriction on ) can be analyzed numerically, or by approximating with a rational function and applying tools from asymptotic analysis (White, 2010). The solutions of (49) monotonically approach increases, whereby the time-invariant case discussed in Section 2.6 is recovered with The coordinate transformation (47) therefore implies that, compared to the time-invariant case, the convergence rate can be improved by choosing , as this increases the area underneath the curve d(t). Figure 4 illustrates the situation.

This intuition can be made quantitative by noticing that the solutions of (49) are given by

where is constant and related to . This implies

Due to the fact that (48) reduces to a time-invariant harmonic oscillator, the solutions to (46) can be expressed in closed form with elementary functions. The fundamental solutions to (46) are therefore given by1

The rate of the exponential decay behaves in the same way as the real part of the expression (24) (corresponds to d in (24)) and as a result, the discussion for the time-invariant case applies here in the same way. This concludes that accelerated convergence occurs as long as . Compared to the time-invariant case, however, the convergence is improved by the factor . For large values of the difference is not substantial (at most a constant factor of size can be gained). However, the improvement becomes crucial for small values of behaves as

for small t. Thus choosing , ensures convergence roughly with small t, and exponential convergence with rate ; the transition occurs for . The situation is shown in Figure 5.

Summarizing, we therefore conclude according to Proposition 12 that choosing implies that the trajectories of the time-varying generalization of (8) satisfy the estimate

for some constant (that might strongly depend on ) and for all A is compact. For any , the right-hand side can be bounded by

which implies that the convergence rate is at least O(1/t) for any fixed function f that has non-degenerate isolated local minima (the minima can have an arbitrarily small curvature). We believe that this provides useful insights for the case where the isolated minimum at the origin is degenerate. For example, given a function with an isolated minimum that is degenerate, we can add a regularization of the type is small. The bound (54) applies to the resulting regularized function, and implies a convergence rate of the order O(1/t).

Figure 5: The plot shows the convergence rate of the fundamental solutions (52) as a function of

It is important to notice that the results in this section are stated in terms of the distance between the current iterate and the optimizer, i.e., . In the case of smooth and convex functions, this yield the following bound on the function value:

provided that the algorithm is initialized with p(0) = 0. Thus, our analysis recovers the well-known optimal rate (in terms of function value) for smooth and convex functions.

3.3 Example 2

We consider the time-varying generalization of (13), where the parameters are assumed to be time varying. We focus on the case where f is convex and is chosen to be , as this simplifies the stability analysis and enables explicit closed-form evaluation of certain expressions. The stability analysis carried out in Appendix E applies likewise to the time-varying case; it suffices to replace . This concludes that the origin is asymptotically stable as long as

After a change of coordinates that diagonalizes , the linearized dynamics of a single coordinate are given by

where denotes the corresponding eigenvalue of with upper and lower bounds . In analogy to the transformation (47) in the continuous-time case, we define

which transforms (55) into

This transformation is well known in the literature on linear difference equations (see, for example, Elaydi, 2005, p. 369). As in the continuous-time case, the analysis is considerably simplified if is chosen in such a way that the coefficient multiplying in (57) remains constant. This can be achieved, for example, with

where is constant and independent of h. This equation defines a recurrence relation for through a relation for . The choice (58) is motivated by the fact that 1) the recurrence relation is independent of h and 2) the difference equations (57) is time-invariant, which enables closed-form solutions. Thus, any trajectory satisfying (55) is a linear combination of the two fundamental solutions

This explicitly characterizes the convergence rate of (55) and, by virtue of Proposition 12, the convergence rate of (13), (for time-varying ). Similar to the continuous-time discussion in Section 3.2, the convergence rate can be improved by choosing to be close to one for small k. For the specific choice dependence of the convergence rate on is illustrated in Figure 6. As in the continuous-time case, the introduction of the time-varying parameter ensures that the fundamental solutions converge approximately with , whereas for large k the convergence is linear with rate . The transition occurs for . This indicates that even for very large the fundamental solutions converge roughly with

4. Conclusions

We have presented a convergence-rate analysis of momentum-based optimization algorithms from a dynamical systems point of view. Our analysis emphasizes the importance of the curvature properties about an isolated local minimum, which, up to a multiplicative constant, dictate the convergence rate. In addition, we find that reducing the damping over time improves the convergence rate by sublinear terms, which is important for objective functions that have minima with almost vanishing curvature. The use of momentum ensures robustness of the convergence rate against small changes in the curvature and leads, in many cases, to acceleration.

We also provided a rigorous motivation for the use of symplectic discretization schemes, showing that they enable the computation of a modified energy function that is (almost exactly) preserved by the conservative parts of the dynamics. The modified energy function provides a means for stability analysis, which implies, for example, that certain heavy-ball-type methods are in fact accelerated. Thus, an evaluation of the gradient at a shifted position is not necessary for achieving convergence rates that scale with

Figure 6: The plot shows the convergence rate of the fundamental solutions (59) as a function of For small values of k, the convergence rate is roughly , whereas for larger values of k the convergence is linear with rate

Our analysis emphasizes fundamental similarities between convergence rate analysis in continuous time and discrete time, and provides intuitive and rigorous explanations for various phenomena encountered in optimization, without resorting to convexity.

Acknowledgments

We thank the Branco Weiss Fellowship, administered by ETH Zurich, for the generous support and the Office of Naval Research under grant number N00014-18-1-2764. This work was also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)–MU 4710/2-1.

Appendix A. Proof of Proposition 3

For proving the proposition we rely on the following lemmas.

Lemma 13 Consider the trajectories of a time-varying dynamical system

with is either continuous (in the continuous-time case) or A + B(t) is invertible (in the discrete-time case). Let A be such that the trajectories converge exponentially with rate , and for some constant satisfies the estimate

Proof We consider the continuous-time case first. Due to the continuity assumptions on B(t), the dynamics are guaranteed to have a unique solution, x(t), satisfying

where exp denotes the matrix exponential. The two-norm of is, by assumption, upper bounded by , yielding the following estimate:

Applying the Gr¨onwall inequality to (64) yields the desired result. The discrete-time case is analogous. Due to the assumption of A + B(t) being full rank, the dynamics are guaranteed to have a unique solution x(t), which satisfies

see, for example, Agarwal (2000, p. 59). By assumption, the two-norm of is bounded by . As a result, the following estimate is obtained:

Applying the Gr¨onwall inequality (Agarwal, 2000, p. 186), yields

Lemma 14 Let Assumption 4 be fulfilled. Then there exists an open ball (with radius centered at the origin and a constant , such that for all

Proof We consider the continuous-time case first. Assumption 4 implies that the dynamics (4) and (5) are (asymptotically) stable. This implies that for any there exists a constant such that , we consider the trajectory starting at and reformulate the dynamics by applying the mean value theorem:

where lies between the origin and z(t). The continuity assumption on the dynamics leads to the following estimate:

for all is a Lipschitz constant of in a neighborhood of the origin. Applying Lemma 13 then yields

We fix . This leads, in turn, to a refinement of the estimate (72),

Applying Lemma 13 once more therefore implies

for all , which leads to the desired result. The same arguments (with slightly modified constants) apply to the discrete-time case.

Appendix B. Proof of Proposition 12

For proving the proposition we rely on the following lemmas.

Lemma 15 Consider the trajectories of a time-varying dynamical system

with are either continuous (in the continuous-time case) or A(t) + B(t) is invertible for all (in the discrete-time case). Let , be defined by (for a fixed , and assume

where is constant, is continuous, monotonically decreasing, and 0. Then, x(t) satisfies the estimate

Proof We consider the continuous-time case first. Due to the continuity assumptions on A(t) and B(t), the dynamics are guaranteed to have a unique solution x(t), satisfying

The bound on the fundamental solution matrix yields the following estimate:

Applying the Gr¨onwall inequality to (81) yields the desired result. The discrete-time case is analogous. Due to the assumption of A(t) + B(t) being full rank, the dynamics are guaranteed to have a unique solution x(t), which satisfies

see, for example, (Agarwal, 2000, p. 59). As a result, the bound on leads to the following estimate

Applying the Gr¨onwall inequality (Agarwal, 2000, p. 186), yields

where the fact that has been used.

Lemma 16 Let Assumption 6 be fulfilled. Then there exists an open ball (with radius centered at the origin and a constant , such that for all

Proof We consider the continuous-time case first and fix the initial time . Assumption 6 implies that the dynamics (37) are (asymptotically) stable, uniformly in . This implies that for any there exists a constant (independent of ) such that for all , we consider the trajectory starting at time and reformulate the dynamics by applying the mean value theorem:

where lies between the origin and z(t). The assumptions on the dynamics, leads to the following estimate:

for all , and where is a time-independent Lipschitz constant of in a neighborhood of the origin.

By assumption, the linear dynamics decay at least with Thus, applying Lemma 15 yields

We fix . This leads, in turn, to a refinement of the estimate (88),

Applying Lemma 15 once more therefore implies

for all , which leads to the desired result. The same arguments (with slightly modified constants) apply to the discrete-time case.

Appendix C. Proof of Proposition 9

We prove the following generalization of Proposition 9:

Proposition 17 Let f be analytic on , the closed ball of radius r about the origin, and let be a Lipschitz constant of . Then there exists a perturbed Hamiltonian , whose trajectories

are such that

for all

where are constants given by

The perturbed Hamiltonian has the form

where

with . Moreover, has the same critical points as H and satisfies

Finally, at time T, the difference of the perturbed energy is bounded by

where

Proof We will complexify the position and momentum variables; i.e., we take to the fact that f is analytic, it can be interpreted as a mapping from ; a similar reasoning applies to be the closed ball of radius , centered about the origin, and let the norm be defined as

where is a trajectory satisfying (93), and are analytic functions, i = 1, 2, . . . . Performing a Taylor expansion of and evaluating at t = T yields, after rearranging terms,

where the Lie derivative is defined by

where denotes the set of infinitely differentiable functions mapping from Hairer et al., 2002, Ch. IX). The notation on the left-hand side of the previous equation should be read in the following order: The operator is applied to , yielding the function then evaluated at z. Applying the operator to the function is denoted by

We require , which leads due to (26) to

Equating equal powers of T in (103) and (105) yields the following recursive scheme for computing the functions

where the last sum ranges over all strictly positive integers that sum up to j. We will construct the perturbed Hamiltonian by appropriately truncating the series show that the resulting truncated series has the desired properties.

We start by explicitly computing the function

and note that it is of the form . We further note that fact a Hamiltonian vector field with corresponding Hamiltonian . From an induction argument relying on (107) and the definition of the Lie derivative, it follows that each written as is the identity, is defined in (108), and

Moreover, as shown in Hairer et al. (2002, p. 295,Theorem 3.2), the vector fields are guaranteed to be Hamiltonian for all

integral formula. More precisely, Cauchy’s integral formula provides us with the following bound on for any function analytic in

for any two constants . A similar argument yields the following bound on the Lie-derivative of a function analytic in (Hairer et al., 2002, p. 308),

where . In the following these two inequalities are used to bound the right-hand side of (109). Claim: The function is bounded above by

for all

Proof of the claim: Explicit calculations show that the claim holds for j = 1. We thus fix the index J > 1 and estimate is fixed, which, for j = J, yields the desired bound on . To simplify notation we denote The right-hand side of (109) is upper bounded by

Due to the fact that , it follows implies and for any function analytic on . Therefore, the above bound reduces to

By introducing the constant , which, by definition of , we can reformulate the above bound as

By exploiting the structure of it can be verified that

which will simplify the following development. Following Hairer et al. (2002, p. 309), we introduce the constants in the following way

It can be verified with an induction argument taking (115) and (116) into account, that

The variables are well-defined for any and not just . In order to bound the constants , we formally introduce the generating function , and note that

where the order of summation has been interchanged to arrive at (118). It can be verified by means of the implicit function theorem that, given any , the equation uniquely solved for b as long as . This implies that is well-defined and analytic for , or equivalently, . Furthermore, explicit calculations show that is bounded by ln(see Hairer et al., 2002, p. 310). Cauchy’s inequality therefore implies

Evaluating the above bound for j = J yields

where N is chosen to be the largest integer such that The choice for the integer N is motivated by the following observation: Due to the bound on the terms in the above sum behave as , which attains a minimum for . The fact that all , are Hamiltonian vector fields proves (96).

which proves the first bound in (97). The second bound is proved analogously; that is,

The fact that has the same critical points is derived from the following observation. First, all critical points of H are necessarily critical points of , which follows from (124) with 0, and second, implies, according to (124), that ), concluding

The first bound in (98) is derived by noticing that

for any path that connects the origin with . We choose . This can be done by considering the gradient flow

which is guaranteed to converge to the origin for any . As a result, choosing

The same argument (with the same path) applies to , which yields the second bound in (98).

(the choice becomes evident), and as above, N to be the largest integer such that small enough , we denote the map from . The symplectic Euler step analytic in and by the above construction, their Taylor expansion in agrees for the first is fixed). Thus, the function analytic. Applying the maximum modulus principle implies

where is chosen as (again, the choice becomes evident subsequently). It remains to bound , which will be done by estimating separately. For the first term we obtain (by definition of (105))

where has been used for the last inequality. For bounding the second term we first derive a bound on , similar to (124) (unlike (124) we have ). We obtain

for all . The gradient of has therefore a Lipschitz constant of less than 2.8, which implies that

as long as the corresponding trajectory stays within . Note that the last inequality stems from the fact that . Due to the condition (132), the trajectory necessarily stays within . Moreover, from ), it can be inferred that

which implies that . Combined with (134), this leads to for , which, according to (133), implies

The fact that has been used to obtain the last inequality. The above result justifies the choice

We derive the bound on in a similar way. Consider the analytic function are fixed. The modified Hamiltonian is dependent on and has the form

where . The function analytic, as can be seen, for example, by a Taylor expansion of agree up to Nth order. We invoke the maximum principle, similar to (133), which implies,

and where is again set to . A bound for is obtained in the following way (applying Taylor’s theorem):

where lies between . The gradient of is necessarily bounded by , which therefore implies that the Hessian of in the expression is bounded by provided that . The bound (134) implies that this is the case for

where have been used. Combining (142) with the bound from (134) yields

Applying the same chain of arguments as in (138) allows us to conclude:

Appendix D. Proof of Proposition 10

Proof We choose is the closed n-dimensional ball of radius r centered at the origin. We use the Stone-Weierstrass theorem (Rudin, 1976, p. 159) to approximate

where will be chosen subsequently. We integrate yields the function . We integrate once more and impose Moreover, by choosing small enough, the origin is guaranteed to be the only critical point of H(z) (with f replaced with . Due to the fact that a polynomial and therefore analytic, we can invoke Proposition 9 with f replaced by will show that the origin is an asymptotically stable equilibrium of the dynamics (13), where f is replaced by , with region of attraction at least A.

in the following way:

where by assumption is symmetric and positive definite. Due to the assumption on the dissipative forces, the singular values of are upper and lower bounded by respectively. Provided that the time step T is small enough, , we conclude from Proposition 9 that there exists a modified energy function that has the same critical points as H, and such that the bound (27) holds for all . Morse theory implies that is locally positive definite and has compact level sets for . The same applies therefore to the function

provided that is chosen to be small enough (independently of ; this can be done since the positive definiteness of is not affected by the dissipative forces

simple structure of . In order to simplify the notation we hide all constants that may depend on , an upper bound on , or higher orders of T with the notation . It will be shown that necessarily decays along the step . To that extent, we define and and consider , which, due to the fact that , is bounded by

where the arguments of have been dropped and the results from Proposition 9 have been used, which enable the following bound:

Due to Proposition 9, it follows that is bounded by a term of the order . This yields

+

Due to the continuity of we can bound Lipschitz constant of . Moreover, is an upper bound on L. This implies

provided that T is chosen small enough such that dominates the and dominates the terms. Expanding the term and applying Young’s inequality,

CYd

where is an arbitrary constant, results in

where for the last inequality has been used. Note that the term decays quickly for small the function V strictly decreases along the trajectories of . The lower bound on vanishes exponentially for small T. This concludes that the origin is asymptotically stable under the dynamics (13) (with instead of f) provided that T is chosen small enough, i.e. , where up to the exponential terms in only depends on the ratio . Due to the fact that the first-order approximations of the dynamics resulting from about the equilibrium are -close (by construction of ), the origin is likewise asymptotically stable for the dynamics resulting from f, and its region of attraction contains at least a small neighborhood of the origin. Consequently, due to the continuity of V , we may choose small enough such that outside a neighborhood of the origin, where denotes the trajectory resulting from (13) (with f instead of ). This shows that A is likewise contained in the region of attraction of the origin under the dynamics (13) with f instead of

Appendix E. Discrete-time stability (convex case)

By using the change of variables

the discrete-time algorithm (13) can be reformulated as

where

The change of variables is motivated by the fact that the convexity and smoothness of the objective function implies

where L is the Lipschitz constant of the gradient of f. In the following it will be shown that the energy H decreases along provided that certain conditions on the parameters are met. We evaluate

Due to the fact that f is convex it holds that

and as a result

The right-hand side of the above expression is guaranteed to decrease provided that the following matrix

is positive definite. Thus, the minimum of f is an asymptotically stable equilibrium of the dynamics (13) provided that (160) is positive semidefinite. In particular, choosing , ensures asymptotic stability of the minimum. Moreover, for any fixed the above matrix can be made negative definite by choosing T and d sufficiently small.

Appendix F. Smoothness assumptions on f

This section discusses the implications of Assumption 1 compared to Assumption 2. We therefore consider any function f that satisfies Assumption 1 and construct a sequence of functions j > 0 is an integer, in the following way: Let be such that

where is an infinitely differentiable function that has support on and satisfies . As a consequence, it holds that

for all ). In other words, converges uniformly to same holds for on any compact subset of . As a result of (161), any essential upper or lower bound on the curvature of f are translated to the functions . This implies, for example, that for large enough j the origin is an isolated non-degenerate critical point of . In addition, each j > 0, is infinitely differentiable and satisfies Assumption 2. We therefore consider the dynamical system (37), where f is replaced with

where . Compared to (37), the dependence on is made explicit. Let z(t) be the trajectory satisfying (162), where is replaced with f. In the continuous-time case, (162) implies

where are Lipschitz constants related to . By virtue of the Gr¨onwall inequality, (163) readily implies pointwise for any . A similar argument applies to the discrete-time case.

We show next that, in case the equilibrium at the origin is uniformly stable (uniformity with respect to ), the convergence is in fact uniform in . As a consequence, the results from Proposition 12 and similarly Proposition 3 generalize to the case where f satisfies merely Assumption 1 instead of Assumption 2, as shown below.

Proposition 18 Let the origin be a stable equilibrium for (162), uniformly in sufficiently large). If z(t) converges to the origin for converges to z(t), uniformly in

Proof We pick any and show that there exists an integer N, independent of t, such that . Due to the uniform stability of the origin, there exists a (independent of ) such that sufficiently large. The trajectory z(t) converges to the origin. Hence, there exists a finite time . Applying the Gr¨onwall inequality to (163) yields

a similar conclusion holds in the discrete-time case. Thus, choosing j large enough guarantees that . Therefore , which readily implies . Combined with , this yields for all

Verifying that the origin is stable (uniformly in j) is typically straightforward. In order to illustrate the ideas we consider the discussion of Section 2.6. The total energy replaced with ) is well-defined for any j > 0. For large enough j, the curvature of localized average of the curvature of f. The origin is a non-degenerate isolated critical point of f, and therefore the same applies to for large enough j. This concludes that in a neighborhood of the origin the total energy is upper and lower bounded:

for all j > 0 sufficiently large. These upper and lower bounds are independent of j, which, combined with the fact that is non-increasing along trajectories, readily implies stability of the origin (uniformly in ) (see, for example, Sastry, 1999, p. 189). The same argument applies in the discrete-time case, where, by assumption, are locally strongly convex. Thus, uniform stability follows from Appendix D.

The fact that the convergence is uniform in t implies that the convergence estimates form Proposition 12 and Proposition 3 apply for . More precisely, we have

Proposition 19 Let the origin be a stable equilibrium for (38) uniformly in sufficiently large). Let be compact and such that Then, provided that for sufficiently large j > 0 the trajectories satisfy the estimates

where are constant, and is continuous and monotonically decreasing, there exists a constant satisfies the estimate

for any constant

Proof For any , (165) implies that for any j > 0 sufficiently large,

Due to the fact that converges uniformly to z, it follows that

which yields the desired result.

References

Ravi P. Agarwal. Difference Equations and Inequalities. Marcel Dekker, Inc., second edition, 2000.

Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv:1407.1537 [cs.DS], 2014.

Vladimir I. Arnol’d. Ordinary Differential Equations. Springer, third edition, 1992.

Hedy Attouch, Zaki Chbani, Juan Peypouquet, and Patrick Redont. Fast convergence of inertial dy- namics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, Series B, 168(1-2):123–175, 2018.

Richard Bellman. Stability Theory of Differential Equations. Dover, 1969.

Michael Betancourt, Michael I. Jordan, and Ashia C. Wilson. On symplectic optimization. arXiv:1802.03653v2, pages 1–20, 2018.

S´ebastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov’s accelerated gradient descent. arXiv:1506.08187 [math.OC], 2015.

S´ebastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Near-optimal method for highly smooth convex optimization. Proceedings of Machine Learning Research, 99:492– 507, 2019.

Frank E. Curtis, Daniel P. Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of for nonconvex optimization. Mathematical Programming, Series A, 162(1-2):1–32, 2017.

Jelena Diakonikolas and Lorenzo Orecchia. The approximate duality gap technique: A unified theory of first-order methods. SIAM Journal on Optimization, 29(1):660–689, 2018.

Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnab´as P´oczos, and Aarti Singh. Gradi- ent descent can take exponential time to escape saddle points. Advances in Neural Information Processing Systems 30, pages 1067–1077, 2017.

Hans-Bernd D¨urr and Christian Ebenbauer. On a class of smooth optimization algorithms with applications in control. Proceedings of the IFAC Nonlinear Model Predictive Control Conference, pages 291–298, 2012.

Saber Elaydi. An Introduction to Difference Equations. Springer, third edition, 2005.

Guilherme Franc¸a, Jeremias Sulam, Daniel Robinso, and Ren´e Vidal. Conformal symplectic and relativistic optimization. arXiv:1903.04100 [math.OC], pages 1–27, 2019.

S´ebastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. Electronic Journal of Statistics, 12(1):461–529, 2018.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. Proceedings of Machine Learning Research, 40:797–842, 2015.

Ernst Hairer, Gerhard Wanner, and Christian Lubich. Geometric Numerical Integration. Springer, 2002.

Jack K. Hale. Asymptotic Behavior of Dissipative Systems. American Mathematical Society, 1988.

Alain Haraux. Syst`emes Dynamiques Dissipatifs et Applications. Masson, 1991.

Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv:1711.10456, pages 1–43, 2017.

Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan. On nonconvex op- timization for machine learning: Gradients, stochasticity, and saddle points. arXiv:1902.04811, pages 1–31, 2019.

Walid Krichene, Alexandre M. Bayen, and Peter L. Bartlett. Accelerated mirror descent in continu- ous and discrete time. Advances in Neural Information Processing Systems 28, pages 2845–2853, 2015.

Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. Proceedings of Machine Learning Research, 49(1):1246–1257, 2016.

Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algo- rithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.

Chris J. Maddison, Daniel Paulin, Yee Whye Teh, and Brendan O’Donoghue. Hamiltonian descent methods. arXiv:1809.05042 [math.OC], pages 1–72, 2018.

Simon Michalowsky, Carsten Scherer, and Christian Ebenbauer. Robust and structure exploiting optimization algorithms: An integral quadratic constraint approach. arXiv:1905.00279, pages 1–30, 2019.

J. Milnor. Morse Theory. Princeton University Press, 1963.

Michael Muehlebach and Michael I. Jordan. A dynamical systems perspective on Nesterov accel- eration. Proceedings of Machine Learning Research, 97:4656–4662, 2019.

Yuri E. Nesterov. A method of solving a convex programming problem with convergence rate Soviet Mathematics Doklady, 27(2):372–376, 1983.

Yuri E. Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global per- formance. Mathematical Programming, Series A, 108:177–205, 2006.

Yurii Nesterov. Introductory Lectures on Convex Optimization - A Basic Course. Springer Science+Business Media, LLC, 2004.

Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Boris T. Polyak. Introduction to Optimization. Optimization Software, Inc., 1987.

Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, third edition, 1976.

J. M. Sanz-Serna. Symplectic integrators for Hamiltonian problems: an overview. Acta Numerica, pages 243–286, 1992.

Shankar Sastry. Nonlinear Systems. Springer, 1999.

Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d’Aspremont. Integration methods and optimization algorithms. Advances in Neural Information Processing Systems 30, pages 1109–1118, 2017.

Weijie Su, Stephen Boyd, and Emmanuel J. Cand`es. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17 (153):1–43, 2016.

Roscoe B White. Asymptotic Analysis of Differential Equations. Imperial College Press, 2010.

Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351– E7358, 2016.

Designed for Accessibility and to further Open Science