Convergence of a Stochastic Gradient Method with Momentum for Non-Smooth Non-Convex Optimization

2020·Arxiv

Abstract

Abstract

Stochastic gradient methods with momentum are widely used in applications and at the core of optimization subroutines in many popular machine learning libraries. However, their sample complexities have never been obtained for problems that are non-convex and non-smooth. This paper establishes the convergence rate of a stochastic subgradient method with a momentum term of Polyak type for a broad class of non-smooth, non-convex, and constrained optimization problems. Our key innovation is the construction of a special Lyapunov function for which the proven complexity can be achieved without any tunning of the momentum parameter. For smooth problems, we extend the known complexity bound to the constrained case and demonstrate how the unconstrained case can be analyzed under weaker assumptions than the state-of-the-art. Numerical results confirm our theoretical developments.

1 Introduction

We study the stochastic optimization problem

where is a random variable; f(x; s) is the instantaneous loss parameterized by x on a sample is a closed convex set. In this paper, we move beyond convex and/or smooth optimization and consider f that belongs to a broad class of non-smooth and non-convex functions called -weakly convex, meaning that

This function class is very rich and important in optimization [39, 47]. It trivially includes all convex functions, all smooth functions with Lipschitz continuous gradient, and all additive composite functions of the two former classes. More broadly, it includes all compositions of the form

where is convex and -Lipschitz and is a smooth map with -Lipschitz Jacobian. Indeed, the composite function is then weakly convex with ]. Some representative applications in this problem class include nonlinear least squares [10], robust phase retrieval [13], Robust PCA [6], robust low rank matrix recovery [7], optimization of the Conditional Value-at-Risk [40], graph synchronization [44], and many others.

Stochastic optimization algorithms for solving (1), based on random samples from P, are of fundamental importance in many applied sciences [4, 43]. Since the introduction of the classical stochastic (sub)gradient descent method (SGD) in [38], several modifications of SGD have been proposed to improve its practical and theoretical performance. A notable example is the use of a momentum term to construct an update direction [24, 37, 42, 46, 45, 19]. The basis form of such a method (when

where is the search direction, is a stochastic subgradient; is the stepsize, and (0, 1] is the momentum parameter. For instance, the scheme (3) reduces to the stochastic heavy ball method (SHB) [37]:

with . Methods of this type enjoy widespread empirical success in large-scale convex and non-convex optimization, especially in training neural networks, where they have been used to produce several state-of-the-art results on important learning tasks, e.g., [29, 45, 25, 28].

Sample complexity, namely, the number of observations required to reach a desired accuracy , has been the most widely adapted metric for evaluating the performance of stochastic optimization algorithms. Although sample complexity results for the standard SGD on problems of the form (1) have been obtained for convex and/or smooth problems [30, 18], much less is known in the non-smooth non-convex case [8]. The problem is even more prevalent in momentum-based methods as there is virtually no known complexity results for problems beyond those that are convex or smooth.

1.1 Related work

As many applications in modern machine learning and signal processing cannot be captured by convex models, (stochastic) algorithms for solving non-convex problems have been studied extensively. Below we review some of the topics most closely related to our work.

Stochastic weakly convex minimization Earlier works on this topic date back to Nurminskii who showed subsequential convergence to stationary points for the subgradient method applied to deterministic problems [34]. The work [41] proposes a stochastic gradient averaging-based method and shows the first almost sure convergence for this problem class. Basic suffi-cient conditions for convergence of stochastic projected subgradient methods is established in [15]. Thanks to the recent advances in statistical learning and signal processing, the problem class has been reinvigorated with several new theoretical results and practical applications (see, e.g., [13, 12, 9, 8] and references therein). In particular, based on the theory of non-convex differential inclusions, almost sure convergence is derived in [14] for a collection of model-based minimization strategies, albeit no rates of convergence are given. An important step toward non-asymptotic convergence of stochastic methods is made in [9]. There, the authors employ a proximal point technique for which they can show the sample complexity ) with a certain stationarity measure. Later, the work [8] shows that the (approximate) proximal point step in [9] is not necessary and establishes the similar complexity for a class of model-based methods including the standard SGD. We also note that there has been a large volume of works in smooth non-convex optimization, e.g., [18, 19].

Stochastic momentum for non-convex functions Optimization algorithms based on momentum averaging techniques go back to Polyak [36] who proposed the heavy ball method. In [31], Nesterov introduced the accelerated gradient method and showed its optimal iteration complexity for the minimization of smooth convex functions. In the last few decades, research on accelerated first-order methods has exploded both in theory and in practice [3, 33, 5]. The effectiveness of such techniques in the deterministic context has inspired researchers to incorporate momentum terms into stochastic optimization algorithms [37, 41, 46, 45, 19]. Despite evident success, especially, in training neural networks [29, 45, 25, 50, 28], the theory for stochastic momentum methods is not as clear as its deterministic counterpart (cf. [22]). As a result, there has been a growing interest in obtaining convergence guarantees for those methods under noisy gradients [27, 22, 49, 16, 19]. In non-convex optimization, almost certain convergence of Algorithm (3) for smooth and unconstrained problems is derived in [42]. Under the bounded gradient hypothesis, the convergence rate of the same algorithm has been established in [49]. The work [21] obtains the complexity of a gradient averaging-based method for constrained problems. In [19], the authors study a variant of Nesterov acceleration and establish a similar complexity for smooth and unconstrained problems, while for the constrained case, a mini-batch of samples at each iteration is required to guarantee convergence.

1.2 Contributions

Minimization of weakly convex functions has been a challenging task, especially for stochastic problems, as the objective is neither smooth nor convex. With the recent breakthrough in [8], this problem class has been the widest one for which provable sample complexity of the standard SGD is known. It is thus intriguing to ask whether such a result can also be obtained for momentum-based methods. The work in this paper aims to address this question. To that end, we make the following contributions:

• We establish the sample complexity of a stochastic subgradient method with momentum of Polyak type for a broad class of non-smooth, non-convex, and constrained optimization problems. Concretely, using a special Lyapunov function, we show the complexity for the minimization of weakly convex functions. The proven complexity is attained in a parameter-free and single time-scale fashion, namely, the stepsize and the momentum constant are independent of any problem parameters and they have the same scale with respect to the iteration count. To the best of our knowledge, this is the first complexity guarantee for a stochastic method with momentum on non-smooth and non-convex problems.

• We also study the sample complexity of the considered algorithm for smooth and constrained optimization problems. Note that even in this setting, no complexity guarantee of SGD with Polyak momentum has been established before. Under a bounded gradient assumption, we obtain a similar ) complexity without the need of forming a batch of samples at each iteration, which is commonly required for constrained non-convex stochastic optimization [20]. We then demonstrate how the unconstrained case can be analyzed without the above assumption.

Interestingly, the stated result is achieved in the regime where can be as small as ), i.e., one can put much more weight to the momentum term than the fresh subgradient in a search direction. This complements the complexity of SGD attained as Note that the worst-case complexity ) is unimprovable even in the smooth case [1].

2 Background

In this section, we first introduce the notation and then provide the necessary preliminaries for the paper.

For any , we denote by the Euclidean inner product of x and y. We use to denote the Euclidean norm. For a closed and convex set denotes the orthogonal projection onto the indicator function of otherwise. Finally, we denote by -field generated by the first k + 1 random variables

For a function , the Fr´echet subdifferential of f at x, denoted by ), consists of all vectors

The Fr´echet and conventional subdifferentials coincide for convex functions, while for smooth functions ) reduces to the gradient is said to be stationary for problem (1) if 0

The following lemma collects standard properties of weakly convex functions [47].

Lemma 2.1 (Weak convexity)-weakly convex function. Then the following hold:

Weakly convex functions admit an implicit smooth approximation through the Moreau envelope:

For small enough , the point achieving ) in (4), denoted by prox), is unique and given by:

The lemma below summarizes two well-known properties of the Moreau envelope and its associated proximal map [26].

Lemma 2.2 (Moreau envelope). Suppose that -weakly convex function. Then, for a fixed parameter , the following hold:

-smooth with the gradient given by

-smooth, i.e., for all

Failure of stationary test A major source of difficulty in convergence analysis of non-smooth optimization methods is the lack of a controllable stationary measure. For smooth functions, it is natural to use the norm of the gradient as a surrogate for near stationary. However, this rule does not make sense in the non-smooth case, even if the function is convex and its gradient existed at all iterates. For example, the convex function f(x) = |x| has = 1 at each = 0, no mater how close x is to the stationary point x = 0. To circumvent this difficulty, we adopt the techniques pioneered in [8] for convergence of stochastic methods on weakly convex problems. More concretely, we rely on the connection of the Moreau envelope to (near) stationarity: For any , the point ˆwhere ), satisfies:

Therefore, a small gradient implies that x is close to a point ˆthat is near stationary for F. Note that ˆx is just a virtual point, there is no need to compute it.

3 Algorithm and convergence analysis

We assume that the only access to f is through a stochastic subgradient oracle. In particular, we study algorithms that attempt to solve problem (1) using i.i.d. samples

Assumption A (Stochastic oracle). Fix a probability space (S, F, P). Let S be a sample drawn from . We make the following assumptions:

(A2) There exists a real L > 0 such that for all

The above assumptions are standard in stochastic optimization of non-smooth functions (see, e.g., [30, 8]).

Algorithm To solve problem (1), we use an iterative procedure that starts from ) and generates sequences of points by repeating the following steps for k = 0, 1, 2, . . .:

where , this algorithm reduces to the procedure (3). For a general convex set X, this scheme is known as the iPiano method in the smooth and deterministic setting [35]. For simplicity, we refer to Algorithm 7 as stochastic heavy ball (SHB).

Throughout the paper, we will frequently use the following two quantities:

Before detailing our convergence analysis, we note that most proofs of complexity for subgradient-based methods rely on establishing an iterate relationship on the form (see, e.g., [30, 8, 18]):

where opt mea denotes some suboptimality measure such as for convex and for smooth (possibly non-convex) problems, are certain Lyapunov functions, stepsize, and C is some constant. Once (8) is given, a simple manipulation results in the desired complexity provided that is chosen appropriately. The case with decaying stepsize can be analyzed in the same way with minor adjustments. We follow the same route and identify a Lyapunov function that allows to establish relation (8) for the quantity in (6).

Since our Lyapunov function is nontrivial, we shall build it up through a series of key results. We start by presenting the following lemma, which quantifies the averaged progress made by one step of the algorithm.

Lemma 3.1. Let Assumptions (A1)–(A2) hold. Let for some constant that be generated by procedure (7). It holds for any

Proof. See Appendix A.

In view of (8), the lemma shows that the quantity ] can be made arbitrarily small. However, this alone is not sufficient to show convergence to stationary points. Nonetheless, we shall show that a small ] indeed implies a small (averaged) value of the norm of the Moreau envelope defined at a specific point. Toward this goal, we first need to detail the points in (6). It seems that taking the most natural candidate is unlikely to produce the desired result. Instead, we rely on the following iterates:

and construct corresponding virtual reference points:

Lemma 3.2. Assume the same setting of Lemma 3.1. Let be such that and define the function:

Then, for any

The proof of this lemma is rather involved and can be found in Appendix B. Lemma 3.2

has established a relation akin to (8) with the Lyapunov function defined in (10). We can now use standard analysis to obtain our sample complexity.

Theorem 1. Let Assumptions (A1)-(A2) hold. Let be sampled uniformly at random from and denote ∆ = . If we set for some real , then under the same setting of Lemma 3.2:

where . Furthermore, if obtain

Proof. Taking the expectation on both sides of (11) and summing the result over k = 0, . . . , K yield

Let , the left-hand-side of the above inequality can be lowerbounded by . Using the facts that Consequently,

where the last expectation is taken with respect to all random sequences generated by the method and the uniformly distributed random variable . Note that and 1 1. Thus, letting ), the constants can be upper-bounded by + 1, respectively. Therefore, if we let , we arrive at

as desired.

Some remarks regarding Theorem 1 are in order:

i) The choice is just for simplicity; one can pick any constant such that (0, 1]. Note that the stepsize used to achieve the rate in (12) does not depend on any problem parameters. Once is set, the momentum parameter selection is completely parameter-free. Since both scale like ), Algorithm 7 can be seen as a single time-scale method [21, 42]. Such methods contrast those that require at least two time-scales to ensure convergence. For example, stochastic dual averaging for convex optimization [48] requires one fast scale O(1/K) for averaging the subgradients, and one slower scale stepsize. To show almost sure convergence of SHB for smooth and unconstrained problems, the work [22] requires that both the stepsize and the momentum parameter tend to zero but the former one must do so at a faster speed.

ii) To some extent, Theorem 1 supports the use of a small momentum parameter such as 01, which corresponds to the default value 1 9 in PyTorchthe smaller 1 99 suggested in [23] Indeed, the theorem allows to have as ), i.e., one can put much more weight to the momentum term than the fresh subgradient and still preserve the complexity. Recall also that SHB reduces to SGD as which also admits a similar complexity. It is thus quite flexible to set , without sacrificing the worst-case complexity. We refer to [22, Theorem 2] for a similar discussion in the context of almost sure convergence on smooth problems.

iii) In view of (6), the theorem indicates that ¯is nearby a near-stationary point ˆ¯may not belong to X, it is thus more preferable to have the similar guarantee for the iterate . Indeed, we have

Since both terms on the right converge at the rate ), it immediately translates into the same guarantee for the term on the left, as desired.

In summary, we have established the convergence rate ) or, equivalently, the sample complexity ) of SHB for the minimization of weakly convex functions.

4 Extension to smooth non-convex functions

In this section, we study the convergence property of Algorithm (7) for the minimization of -smooth functions:

Note that -smooth functions are automatically -weakly convex. In this setting, it is more common to replace Assumption (A2) by the following. Assumption (A3). There exists a real 0 such that for all

Deriving convergence rates of stochastic schemes with momentum for non-convex functions under Assumption (A3) can be quite challenging. Indeed, even in unconstrained optimization, previous studies often need to make the assumption that the true gradient is bounded, i.e., (see, e.g., [49, 16, 22]). This assumption is strong and does not hold even for quadratic convex functions. It is more realistic in constrained problems, for example when X is compact, albeit the constant G could then be large.

Our objective in this section is twofold: First, we aim to extend the convergence results of SHB in the previous section to smooth optimization problems under Assumption (A3). Note that even in this setting, the sample complexity of SHB has not been established before. The rate is obtained without the need of forming a batch of samples at each iteration, which is commonly required for constrained non-convex stochastic optimization [19, 20]. Second, for unconstrained problems, we demonstrate how to achieve the same complexity without the bounded gradient assumption above. Let ) be its convex conjugate. Our convergence analysisrelies on the function:

The use of this function is inspired by [41]. Roughly speaking, is the negative of the optimal value of the function on the RHS of (7a), and hence, . This function also underpins the analysis of the dual averaging scheme in [32].

The following result plays a similar role as Lemma 3.1.

Lemma 4.1. Let Assumptions (A1) and (A3) hold. Let constant , and define the function:

Then, it holds for any

Proof. Since the proof is rather technical, we defer details to Appendix C and sketch only the main arguments here.

By smoothness of , weak convexity of f and the optimality condition for the update formula (7a) we get

The proof of this relation can be found in Lemma C.1. The preceding inequality admits very useful properties as we have terms that form a telescoping sum, and the constant associated with has the right order-dependence on the stepsize. However, we still have a re- maining term . Thus, in view of relation (8), our next strategy is to bound this term in a way that still keeps all the favourable features described above, and at the most introduces an additional term of order ). As shown in Lemma C.2, we can establish the following inequality

Now, (14) follows immediately from combining the two previous inequalities and the fact that

We remark that Lemma 4.1 does not require the bounded gradient assumption and readily indicates the convergence rate However, to establish the rate for ], we need to impose such an assumption in the theorem below. Nonetheless, the assumption is much more realistic in this setting than the unconstrained case.

Theorem 2. Let Assumptions (A1) and (A3) hold. Assume further that all be defined as in Theorem 1. If we set for some real

Furthermore, if , we obtain

Proof. The proof is a verbatim copy of that of Theorem 1 with replaced by Appendix D.

Some remarks are in order:

i) To the best of our knowledge, this is the first convergence rate result of a stochastic (or even deterministic) method with Polyak momentum for smooth, non-convex, and constrained problems.

ii) The algorithm enjoys the same single time-scale and parameter-free properties as in the non-smooth case.

iii) The rate in the theorem readily translates into an analogous estimate for the norm of the so-called , which is commonly adapted in the literature, e.g., [20]. This is because for -smooth functions [11]:

Since the bounded gradient assumption is rather restrictive in the unconstrained case, our final result shows how the desired complexity can be attained without this assumption.

Theorem 3. Let Assumptions (A1) and (A3) hold. Let uniformly at random from , then under the same setting of Lemma 4.1:

where . Furthermore, let , we obtain

Proof. See Appendix E.

It should be mentioned that a similar result has been attained very recently in [21] using a different analysis, albeit no sample complexity is given for the non-smooth case.

5 Numerical evaluations

In this section, we perform experiments to validate our theoretical developments and to demonstrate that despite sharing the same worst-case complexity, SHB can be much better in terms of speed and robustness to problem and algorithm parameters than SGD.

We consider the robust phase retrieval problem [13, 14]: Given a set of m measurements (, the phase retrieval problem seeks for a vector most i = 1, . . . , m. Whenever the problem is corrupted with gross outliers, a natural exact penalty form of this (approximate) system of equations yields the minimization problem:

This objective function is non-smooth and non-convex. In view of (2), it is the composition of the Lipschitz-continuous convex function and the smooth map . Hence, it is weakly convex.

In each experiment, we set m = 300, n = 100 and select uniformly from the unit sphere. We generate is a matrix with standard normal distributed entries, and D is a diagonal matrix with linearly spaced elements between 1and 1, with 1 playing the role of a condition number. The elements of the vector b are generated as 25) models the corruptions, and

Figure 1: The function gap ) versus iteration count for phase retrieval with

Figure 2: The number of epochs to achieve -accuracy versus initial stepsize retrieval with

a binary random variable taking the value 1 with probability measurements are corrupted. The algorithms are all randomly initialized at 1). The stochastic

Figure 3: The number of epochs to achieve -accuracy versus initial stepsize retrieval with = 10 and popular choices of

subgradient is simply given as an element of the subdifferential of

In each of our experiments, we set the stepsize as + 1, where initial stepsize. We also refer m stochastic iterations as one epoch (pass over the data). Within each individual run, we allow the considered stochastic methods to perform K = 400m iterations. We conduct 50 experiments for each stepsize and report the median of the epochsto--accuracy; the shaded areas in each plot cover the 10th to 90th percentile of convergence times.

with . It is evident that SHB converges with a theoretically justified parameter and is much less sensitive to problem and algorithm parameters than the vanilla SGD. Note that the sensitivity issue of SGD is rather well documented; a slight change in its parameters can have a severe effect on the overall performance of the algorithm [30, 2]. For example, Fig. 1d shows that SGD exhibits a transient exponential growth before eventual convergence. This behaviour can occur even when minimizing the smooth quadratic function

, Example 2]. In contrast, SHB converges in all settings of the figure, suggesting that using a momentum term can help to improve the robustness of the standard SGD. This is expected as the update formula (3b) acts like a lowpass filter, averaging past stochastic subgradients, which may have stabilizing effect on the sequence

To further clarify this observation, in the next set of experiments, we test the sensitivity of SHB and SGD to the initial stepsize . Figure 2 shows the number of epochs required to reach -accuracy for phase retrieval with 3. We can see that the standard SGD has good performance for a narrow range of stepsizes, while wider convergence range can be achieved with SHB.

Finally, it would be incomplete without reporting experiments for some of the most popular momentum parameters used by practitioners. Figure 3 shows a similar story to Fig. 2 for the parameter 1 99 as discussed in remark ii) after Theorem 1. This together with Fig. 2 demonstrates that SHB is able to find good approximate solutions for diverse values of the momentum constant over a wider (often significantly so) range of algorithm parameters than SGD.

6 Conclusion

Using a carefully constructed Lyapunov function, we established the first sample complexity results for the SHB method on a broad class of non-smooth, non-convex, and constrained optimization problems. The complexity is attained in a parameter-free fashion in a single time-scale. A notable feature of our results is that they justify the use of a large amount of momentum in search directions. We also improved some complexity results for SHB on smooth problems. Numerical results show that SHB exhibits good performance and low sensitivity to problem and algorithm parameters compared to the standard SGD.

References

[1] Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

[2] H. Asi and J. C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019.

[3] A. Beck. First-order methods in optimization, volume 25. SIAM, 2017.

[4] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings , pages 177–186. Springer, 2010.

[5] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

[6] E. J. Cand`es, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.

[7] V. Charisopoulos, Y. Chen, D. Davis, M. D´ıaz, L. Ding, and D. Drusvyatskiy. Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. arXiv preprint arXiv:1904.10020, 2019.

[8] D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.

[9] D. Davis and B. Grimmer. Proximally guided stochastic subgradient method for nons- mooth, nonconvex problems. SIAM Journal on Optimization, 29(3):1908–1930, 2019.

[10] D. Drusvyatskiy. The proximal point method revisited. arXiv preprint arXiv:1712.06038, 2017.

[11] D. Drusvyatskiy and C. Paquette. Efficiency of minimizing compositions of convex func- tions and smooth maps. Mathematical Programming, 178(1-2):503–558, 2019.

[12] J. C. Duchi and F. Ruan. Asymptotic optimality in stochastic optimization. arXiv preprint arXiv:1612.05612, 2016.

[13] J. C. Duchi and F. Ruan. Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. Information and Inference: A Journal of the IMA, 8(3):471–529, 2018.

[14] J. C. Duchi and F. Ruan. Stochastic methods for composite and weakly convex opti- mization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018.

[15] Y. M. Ermol’ev and V. Norkin. Stochastic generalized gradient method for nonconvex nonsmooth stochastic optimization. Cybernetics and Systems Analysis, 34(2):196–215, 1998.

[16] S. Gadat, F. Panloup, and S. Saadane. Stochastic heavy ball. Electronic Journal of Statistics, 12(1):461–529, 2018.

[17] E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson. Global convergence of the heavy-ball method for convex optimization. In 2015 European Control Conference (ECC), pages 310–315. IEEE, 2015.

[18] S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochas- tic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[19] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.

[20] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267– 305, 2016.

[21] S. Ghadimi, A. Ruszczy´nski, and M. Wang. A single time-scale stochastic approximation method for nested stochastic optimization. SIAM J. on Optimization, 2020. Accepted for publication (arXiv preprint arXiv:1812.01094).

[22] I. Gitman, H. Lang, P. Zhang, and L. Xiao. Understanding the role of momentum in stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 9630–9640, 2019.

[23] G. Goh. Why momentum really works. Distill, 2017.

[24] A. M. Gupal and L. G. Bazhenov. Stochastic analog of the conjugant-gradient method. Cybernetics and Systems Analysis, 8(1):138–140, 1972.

[25] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[26] J.-B. Hiriart-Urruty and C. Lemar´echal. Convex analysis and minimization algorithms, volume 305. Springer science & business media, 1993.

[27] C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pages 781– 789, 2009.

[28] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convo- lutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.

[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convo- lutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[30] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.

[31] Y. Nesterov. A method for solving the convex programming problem with convergence rate , volume 269, pages 543–547, 1983.

[32] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009.

[33] Y. Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.

[34] E. A. Nurminskii. The quasigradient method for the solving of the nonlinear programming problems. Cybernetics, 9(1):145–150, 1973.

[35] P. Ochs, Y. Chen, T. Brox, and T. Pock. iPiano: Inertial proximal algorithm for non- convex optimization. SIAM Journal on Imaging Sciences, 7(2):1388–1419, 2014.

[36] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[37] B. T. Polyak. Introduction to optimization. Optimization Software, 1987.

[38] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.

[39] R. T. Rockafellar. Favorable classes of Lipschitz continuous functions in subgradient optimization. In E. A. Nurminski, editor, Progress in Nondifferentiable Optimization, CP-82-S8, pages 125–143, 1982.

[40] R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of risk, (2):21–42, 2000.

[41] A. Ruszczy´nski. A linearization method for nonsmooth stochastic programming prob- lems. Mathematics of Operations Research, 12(1):32–49, 1987.

[42] A. Ruszczynski and W. Syski. Stochastic approximation method with gradient averaging for unconstrained problems. IEEE Transactions on Automatic Control, 28(12):1097– 1105, 1983.

[43] A. Shapiro, D. Dentcheva, and A. Ruszczy´nski. Lectures on stochastic programming: modeling and theory. SIAM, 2014.

[44] A. Singer. Angular synchronization by eigenvectors and semidefinite programming. Applied and computational harmonic analysis, 30(1):20–36, 2011.

[45] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.

[46] P. Tseng. An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.

[47] J.-P. Vial. Strong and weak convexity of sets and functions. Mathematics of Operations Research, 8(2):231–259, 1983.

[48] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimiza- tion. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.

[49] T. Yang, Q. Lin, and Z. Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257, 2016.

[50] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[51] S. Zavriev and F. Kostyuk. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling, 4(4):336–341, 1993.

A Proof of Lemma 3.1

First, we write the update formula of in (7) on a more compact form as:

We have

Since (1 ) is nonexpansive, it holds that

Next, we decompose the right-hand-side of the preceding inequality as

Let ), then by multiplying both sides of (17) by 0, together with (18) and the definition of

By the weak convexity of f and Assumption (A1), it holds that

Next using the optimality condition of

we deduce (by selecting

Therefore, by taking the conditional expectation in (19), combining the result with (20) and (22), and rearranging terms, we arrive at

Next, we show by induction that ), the base case follows directly from Assumption (A2). Suppose the hypothesis holds for i = 0, . . . , k, we have

where (a) is true since is convex; (b) follows from Assumption (A2) and the nonexpan- siveness of Π); and (c) follows from the induction hypothesis. This together with the nonexpansiveness of Π) imply that

Finally, using (25) and Assumption (A2), we obtain

completing the proof.

B Proof of Lemma 3.2

Let ¯) and define the virtual iterates:

In view of Lemma 2.2, we have

where ) denotes the Moreau envelope of By the definition of ), it holds that

Since (1 ) is nonexpansive, we obtain

Combining (28) and (29) yields

Using the definition of ¯

It follows from (30) and (31) that

We next bound the first inner product in (32). Since -weakly convex, it holds that

where the last step is true since We also have that the function )-strongly convex with ˆbeing its minimizer, it follows from [3, Theorem 5.25] that

We thus have

where (a) is due to (34), (b) follows from the definition of ), and (c) holds since . For the second inner product in (32), we have

Therefore, combining (33), (35), (36) and plugging the result into (32) yield

Now, using (25), Assumption (A2), the definition of ¯, and the fact that

For simplicity, define . Finally, to form a telescoping sum, we simply multiply both sides of eq. (9) in Lemma (3.1) by ) and combine the result with (37) to get

where

The proof is complete.

C Proof of Lemma 4.1

In this section, we prove Lemma 4.1. We begin with the following lemma.

Lemma C.1. Let Assumptions (A1) and (A3) hold. Let constant . Then, for any , the iterates generated by procedure (7) satisfy:

Proof. We first rewrite the update of in (7a) on the following form:

Let ) and define its convex conjugate is well-known that is 1-Lipschitz with gradient Chapter X]. Therefore, the update formula (38) implies that smoothness of

We next apply the identity

to obtain

Using the identity (40) again, we get

Plugging (41)–(43) into (39) yields

By identity (40), we have

Note that the optimality condition of in (21) implies that

By the definition of , it holds that

We can also deduce from (46) that , and hence (47) can be further bounded by

Thus, by combining (44), (45), (46), and (48), we arrive at

Next, by the weak convexity of f and Assumption (A1), we have

Thus, multiplying both sides of (49) by 1), taking the conditional expectation, and adding the result to (50) give

Using the definition of completes the proof.

The next lemma bounds the term

First, by Assumption (A3), we have

Define be the last term in (52), then

Since is a function of ), it follows from Assumption (A1) that

Similarly, conditioned on , both terms in the inner product of deterministic, and hence

Similar to (55), we obtain

To bound the remaining term ], we introduce the following virtual iterate:

and define

We have

where (a) holds since is convex; (b) follows since -Lipschitz; (c) follows from the definition of , and the nonexpansiveness of Π); and (d) is true since We thus have

Therefore, taking the conditional expectation in (54) and using (55), (56), (57), (59) yield

Plugging (53) and (60) into (52), we arrive at

where we used the convexity of in the second inequality. We now decompose and bound the term

First, it follows from the optimality condition of in (21) and the definition of

Thus,

We have

For the first term, since

following the same steps leading to (58), it can be bounded as

By the smoothness of , we also have

Hence, it follows from (63)–(66) that

Plugging (68) into (61) yields

Multiplying both sides of (69) by 1) and noting that the proof.

Having established Lemmas C.1 and C.2, the result of Lemma 4.1 follows immediately from the definition of the function and the fact that

D Proof of Theorem 2

We first show by induction that , it follows from Assumptions (A1) and (A3) that

where we used the convexity of in the last step. Since nonexpansiveness of Π) and the induction hypothesis imply that

Since 1], we have , and hence , as desired. From this, together with the fact that , the theorem follows immediately from replacing everywhere in the proofs of Lemmas 3.1–3.2 and Theorem 1.

E Proof of Theorem 3

First, from (30), we have

We next follow [8] and write ˆ

which is true since in this case 1), we have

Using Young’s inequality

Since -Lipschitz, it follows that

where we also used the inequality in the last step. We have

Note that for ], the term associated with is nonpositive, and hence

Therefore, it follows from (70) and (71) that

We also have

where last two step hold since . Therefore, using the fact that ¯, we obtain

Now, multiplying both sides of (14) by and combining the result with (72), we obtain

Let denote the left-hand-side of (73), summing the result over

Note that

where we used the facts that = 0 (since ). Therefore, lower-bounding the left-hand-side by (1 + 2ranging terms, we obtain

where the last expectation is taken with respect to all random sequences generated by the method and the uniformly distributed random variable 1

Finally, letting ) completes the proof.