In this paper we consider the question of whether it is possible to apply a gradient averaging strategy to improve on the sublinear convergence rates without any increase in storage for stochastic gradient (SG) methods. The SG method is the popular methodology (Zinkevich, 2003; Zhang, 2004; Bottou and Bousquet, 2007; Nemirovski et al., 2009; Shalev-Shwartz et al., 2011) for solving the following class of stochastic optimization problems:
where the real-valued function F is defined by
and can be defined as a collection of real-valued functions with a certain probability distribution P over the index set Ξ.
With an initial point , these methods are characterized by the iteration
where 0 is the stepsize and
) is the stochastic gradient defined by
which is usually assumed to be an unbiased estimate of the socalled full gradient i.e.,
) (Shapiro et al., 2009; Bottou et al., 2018). The SG method was originally developed by Robbins and Monro (1951) for smooth stochastic approximation problems. It is guaranteed to achieve the sublinear convergence rate O(1/k) for strongly convex objectives (Nemirovski et al., 2009; Bottou et al., 2018) and this theoretical rate is also supported by practical experience (Bottou et al., 2018). In particular, the practical performance of stochastic gradient methods with momentum has made them popular in the community working on training DNNs (Sutskever et al., 2013); in fact, the approach could be viewed as a gradient averaging strategy. While momentum can lead to improved practical performance, it is still not known to lead to a faster convergence rate. Usually, the gradient averaging strategy and its variants can improve the constants in the convergence rate (Xiao, 2010), it does not improve on the sublinear convergence rates for SG methods. However, owing to the successful practical performance of gradient averaging, it is worth considering whether it is possible to improve the convergence rate, which forms the starting point of this work.
The primary contribution of this work is to show that under the variance dominant condition (Assumption 3), the proposed gradient averaging strategy could improve the convergence rate ) in probability without any increase in storage for the strongly convex objectives with Lipschitz gradients. This result also suggests how we should control the stochastic gradient iterations to improve the rate of convergence in practice. In particular, our averaging strategy coordinates the relationship between the mean and variance of the increment of the iteration, so that the growth of expectation can be controlled when the variance is reduced.
1.1 Related Work
We briefly review several methods related to the new strategy, mainly including stochastic gradient with momentum (SGM), gradient averaging, stochastic variance reduced gradient (SVRG), SAGA and iterate averaging.
SGM. With an initial point , two scalar sequences
that are either predetermined or set dynamically, and
, SGM uses iterations of the form (Tseng, 1998; Bottou et al., 2018)
They are procedures in which each step is chosen as a combination of the stochastic gradient direction and the most recent iterate displacement. It is common to set
as some fixed constants, and in this case we can rewrite the SGM iteration as
So it is clear that SGM is a weighted average of all previous stochastic gradient directions. In deterministic settings, it is referred to as the heavy ball method (Polyak, 1964). While SGM can lead to improved practical performance, it is still not known to lead to a faster convergence rate. Moreover, see Remark 8 in Section 4 for a variance analysis of SGM.
Gradient Averaging. Similar to SGM, gradient averaging is using the average of all previous gradients,
This approach is used in the dual averaging method (Nesterov, 2009). Compared with our new strategy, this method reduces the variance to a similar order O(1/k) without considering the stepsize , however, its expectation is not well controlled, for details see Remark 5 in Section 4. So as mentioned above, it can improve the constants in the convergence rate (Xiao, 2010) but does not improve on the sublinear convergence rates.
SVRG. SVRG is designed to minimize the objective function of the form of a finite sum (Johnson and Zhang, 2013), i.e.,
The method is able to achieve a linear convergence rate for strongly convex problems, i.e.,
SVRG needs to compute the batch gradients ) and has two parameters that needs to be set: besides the stepsize
, there is an additional parameter m, namely the number of iterations per inner loop. In order to guarantee a linear convergence theoretically, one needs to choose
where l and L are given in Assumption 1. Without explicit knowledge l and L, the lengths of the inner loop m and the stepsize are typically both chosen by experimentation. This improved rate is achieved by either an increase in computation or an increase in storage. Hence, SVRG usually can not beat SG for very large n (Bottou et al., 2018).
SAGA. SAGA has its origins in the stochastic average gradient (SAG) algorithm (Le Roux et al., 2012; Schmidt et al., 2017); moreover, the SAG algorithm is a randomized version of the incremental aggregated gradient (IAG) method proposed in Blatt et al. (2007) and analyzed in G¨urb¨uzbalaban et al. (2017). Compared with SVRG, SAGA is to apply an iteration that is closer in form to SG in that it does not compute batch gradients except possibly at the initial point, and SAGA has a practical advantage that there is only one parameter (the stepsize ) to tune instead of two. Beyond its initialization phase, the per-iteration cost of SAGA is the same as in a SG method; but it has been shown that the method can also achieve a linear rate of convergence for strongly convex problems (Defazio et al., 2014). However, the price paid to reach this rate is the need to store n stochastic gradient vectors for general cases except logistic and least squares regression (Bottou et al., 2018), which would be prohibitive in many large-scale applications.
Iterate Averaging. Rather than averaging the gradients, some authors propose to perform the basic SG iteration and try to use an average over iterates as the final estimator (Polyak, 1991; Polyak and Juditsky, 1992). Since SG generates noisy iterate sequences that tend to oscillate around minimizers during the optimization process, the iterate averaging would possess less noisy behavior (Bottou et al., 2018). It is shown that suitable iterate averaging strategies obtain an O(1/k) rate for strongly convex problems even for non-smooth objectives (Hazan and Kale, 2014; Rakhlin et al., 2012). However, none of these methods improve on the sublinear convergence rates O(1/k) (Schmidt et al., 2017).
1.2 Paper Organization
The next section introduces the assumptions we used for establishing convergence results, especially, the variance dominant condition ( Assumption 3). Then the new strategy is discussed in detail in Section 3. In Section 4, we show that under the variance dominant condition, the proposed strategy could increase the convergence rate in probability for the strongly convex objectives with Lipschitz gradients, which suggests how we should control the stochastic gradient iterations to improve the rate of convergence. And we draw some conclusions in Section 5.
2.1 Assumption on the objective
First, let us begin with a basic assumption of smoothness of the objective function. Such an assumption is essential for convergence analyses of most gradient-based methods (Bottou et al., 2018).
The objective function is continuously differentiable and there exist 0
such that, for all
The inequality (7) ensures that the gradient of the objective F is bounded and does not change arbitrarily quickly with respect to the parameter vector, which implies that
is called a strong convexity condition, which is often used to ensure a sublinear convergence for the stochastic gradient methods; and the role of strong sonvexity may be essential for such rates of convergence (Nemirovski et al., 2009; Bottou et al., 2018). Under the strong sonvexity assumption, the gap between the value of the objective and the minima can be bounded by the squared -norm of the gradient of the objective:
This is referred to as the Polyak-�Lojasiewicz inequality which was originally introduced by Polyak (1963). It is a sufficient condition for gradient descent to achieve a linear convergence rate; and it is also a special case of the �Lojasiewicz inequality proposed in the same year (�Lojasiewicz, 1963), which gives an upper bound for the distance of a point to the nearest zero of a given real analytic function.
2.2 Assumption on the variance
We follow Bottou et al. (2018) to make the following assumption about the variance of stochastic gradients, i.e., ). It states that the variance of
) is restricted in a relatively minor manner.
Assumption 2 (Variance limit) The objective function F and the stochastic gradient satisfy there exist scalars
such that, for all
2.3 Assumption on the iteration
Now we make the following variance dominant assumption. It states that the iterative variance ] is always dominant even a little bit in the stochastic gradient iterations. This assumption guarantees that the proposed strategy could achieve the convergence rate
) in probability for the strongly convex objectives.
Assumption 3 (Variance dominant) The sequence of iterates satisfy for all 1
, there is a fixed
(could be arbitrarily small) such that
where denotes the historical expectation operator defined as
the variance
is defined as
When is strictly less than
obviously, Assumption 3 implies that
3.1 Methods
Our accelerated method is procedures in which each step is chosen as a weighted average of all historical stochastic gradients. And specifically, with an initial point , the method is characterized by the iteration
where the weighted average direction
Here, is the weighted average of past gradients and the values of p mean different weighting methods. A larger value of p makes us focus on more recent gradients, as shown in Figure 1; and the recommended weighting method is to choose p = 20, which uses about the nearest 20% historical gradients.
Figure 1: Illustration of the weight coefficients with different values of
and 10
, where the coefficient
Moreover, the method (14) can be equivalently rewritten by the iteration
where the direction vector is recursively defined as
which could be viewed as the classical stochastic gradient with momentum ) where the decay factor
depends on k.
We now define our accelerated method as Algorithm 1. The algorithm presumes that three computational tools exist: (i) a mechanism for generating a realization of random variable representing a sequence of jointly independent random variables); (ii) given an iteration number
, a mechanism for computing a scalar stepsize
and (iii) given an iterate
and the realization of
, a mechanism for computing a stochastic vector
and a scalar
3.2 Stepsize Policy
For strongly convex objectives, we consider the stepsize sequence taking the form
Notice that the accelerated method and the stochastic gradient method are exactly the same in the first iteration. So we assume, without loss of generality, that the first k iterations generated by (14) has the sublinear convergence rate under Assumptions 1 and 2, that is, for every 1
then we shall prove by induction on k that the accelerated method maintains the sublinear convergence rate O(1/k) under Assumptions 1 and 2; and furthermore, we shall also prove that this method can achieve a convergence rate ) under Assumptions 1 to 3.
It follows from (18) and Assumption 1 that
Since 1 , it follows from (19) that
and then, we obtain
Together with Assumption 3, we further obtain
And we will finally show that (21) implies actually ) in Section 4. On the basis of (20), (21) and the stepsize policy (17), we first prove two Lemmas which are necessary for the following convergence analysis.
Lemma 1 Under the conditions of (18), suppose that the sequence of iterates is generated by (14) with a stepsize sequence
taking the form (17). Then, there is
such that for any given diagonal matrix Λ = diag(
the inequality
Proof Note that for any
together with (17), we have ; thus, to prove (22), we only need to show that
holds in probability. Using the mean-variance analysis, it follows from and (20) that
meanwhile, according to (20), we also have
Using Chebyshev’s inequality, there is a C > 0 such that for
which gives the inequality (22) in probability.
Under Assumption 3, it is clear that (22) could be further strengthened.
Lemma 2 Suppose the conditions of Lemma 1 and Assumption 3 hold. Then, there is such that for any given diagonal matrix Λ = diag(
the inequality
Proof Note that (23), we have ; thus, to prove (25), we only need to show that
holds in probability. First, it follows from and (21) that
together with (24) and using Chebyshev’s inequality, there is a C > 0 such that
It is worth noting that when 1, the variance will become the principal part; so the proof is complete.
4.1 Mean-Variance Framework
The mean-variance framework can be described as a fundamental lemma for any iteration based on random steps, which relies only on Assumption 1 and is a slight generalization of Lemma 4.2 in Bottou et al. (2018).
Lemma 3 Under Assumption 1, if for every is any random vector independent of
is a stochastic step depending on
, then the iteration
Proof According to the inequality (9), the iteration
Noting that is independent of
and taking expectations in these inequalities with respect to the distribution of
, we obtain
Recalling (26), we finally get the desired bound.
Regardless of the states before , the expected decrease in the objective function yielded by the kth stochastic step
), could be bounded above by a quantity involving the expectation
)] and variance
4.2 Expectation Analysis
Now we will analyze the mean of to get the bounds of
where
] is the historical expectation of
. First, according to the definition (15) of the weighted average direction
Further, from Assumption 1, we have
then there is a diagonal matrix Λ = diag(] such that
Therefore, ] could be written as
Theorem 4 Suppose the conditions of Lemma 1 hold. Then for every , the following conditions
Remark 5 For the gradient averaging method mentioned (6) in Subsection 1.1, it is using the average of all previous gradients,
And the bound of , which decays slower than
in Theorem 4.
Proof According to (28) and Lemma 1, it follows that
and
so the proof is complete.
Under Assumption 3, both (29) and (30) can be further improved.
Theorem 6 Suppose the conditions of Lemma 2 hold. Then for every , the following conditions
desired results could be proved in the same way as the proof of Theorem 4.
4.3 Variance Analysis
Now we will analyze the variance of to get the bound of
]. As an important result, we will show that the variance of
tends to zero with a rate
Lemma 7 Under Assumption 2, suppose that the sequence of iterates is generated by (14) with
Remark 8 For the SGM method (5) mentioned in Subsection 1.1, it is using the weighted average of all previous gradients,
Together with the proof below, one can find that the variance of could be controlled by a by a fixed fraction
Proof It follows from (27) and
together with the Arithmetic Mean Geometric Mean inequality, we have
Hence, along with Assumption 2, we obtain
According to (23), ); therefore, we have
), and the proof is complete.
Combining Theorem 4 and Lemma 7, we can obtain a bound for each iteration of the accelerated method.
Lemma 9 Under the conditions of Theorem 4 and Lemma 7, suppose that the stepsize sequence . Then, the inequality
holds in probability, where further,
Proof According to (29), together with the Arithmetic Mean Geometric Mean inequality, one obtains
Similarly, by (30) and the Arithmetic Mean Geometric Mean inequality, it holds that
Finally, accoding to Lemma 3, Assumption 2, and , the iterates satisfy
and the proof is complete.
From Theorem 6, the bound we obtained could be further improved in the same way as the proof of Lemma 9.
Lemma 10 Under the conditions of Theorem 6 and Lemma 7, suppose that the stepsize sequence . Then, the inequality
4.4 Average Behavior of Iterations
According to Lemmas 9 and 10, it is easy to analyze the average behavior of iterations of the accelerated method for strong convex functions.
Theorem 11 Under the conditions of Lemma 9, suppose that the stepsize sequence satisfies
. Then, the inequality
Proof According to Lemma 9 and 0
Subtracting from both sides and applying (10), this yields
and it follows from taking historical expectations that
Therefore, the desired result follows by repeatedly applying this inequality above through iteration from 1 to k.
According to Lemma 10, the result could be further improved.
Theorem 12 Under the conditions of Lemma 10, suppose that the stepsize sequence satisfies
. Then, the inequality
4.5 Convergence
According to Theorems 11 and 12, the convergence of the accelerated methods is closely related to the following two limits:
where 1. Therefore, the results in Theorems 11 and 12 can be rewritten as
and
respectively. In the following, we will use the properties of the gamma function to analyze the asymptotic behavior of
Let Γ(z) denote the gamma function for all , and let (
denote the Pochhammer symbol or shifted factorial
1) for all
have two recursive formulas
And we also have the following lemma which gives the first-order asymptotic expansion of the ratio of two gamma functions:
Now we can prove the following first-order asymptotic expansions of
Lemma 14 If a stepsize sequence takes the form (17), then we have the following first-order asymptotic expansions
Proof According to the stepsize policy (17), i.e.,
we have
By using the Pochhammer symbol, can be written as
together with the recursive formula (37), can be further written as
then the first-order asymptotic expansion of can be obtained from Lemma 13. Similarly, according to the stepsize policy (17), we have
By using the Pochhammer symbol, the cumulative product term in the sum above can be written as
together with the recursive formula (37), this cumulative product term above can be further written as
then ) can be rewritten as
Further, it follows from Lemma 13 that
and it follows from (23) and Lemma 13 that
where C is a positive real number. Hence, we finally get
as desired.
Combining Theorem 11 and Lemma 14, we see that
Lemma 15 Suppose the conditions of Theorem 11 hold. Then there are that the bound
Note that for the (k + 1)th iteration, the entire mean-variance analysis process is only related to the first k iterations. Thus, combining (18) and Lemma 15, we prove that
Theorem 16 Suppose the conditions of Theorem 11 hold. Then for every , there are
such that the bound
Similarly, combining Lemma 14, Theorems 12 and 16, we see that
Theorem 17 Suppose the conditions of Theorem 12 hold. Then for every , there are
such that the bound
Therefore, when 0 1, Assumption 1 and Theorem 17 implies that for every 1
, it holds that
then we have
together with Assumption 3, we obtain
Hence, from (21) and (38), it is clear that ) for every 0
1, which actually means
that is, = 1; thus, we have the final result.
Theorem 18 Suppose the conditions of Theorem 12 hold. Then for every , there are
such that the bound
In this work, we discussed the question of whether it is possible to apply a gradient averaging strategy to improve on the sublinear convergence rates without any increase in storage for SG methods. We proposed a gradient averaging strategy and proved that under the variance dominant condition, the proposed strategy could improve the convergence rate O(1/k) to ) in probability without any increase in storage for the strongly convex objectives with Lipschitz gradients. This result suggests how we should control the stochastic gradient iterations to improve the rate of convergence in practice.
D Blatt, A O Hero, and H Gauchman. A convergent incremental gradient method with a constant step size. SIAM J. Optim., 18:29–51, 2007.
L Bottou and O Bousquet. The tradeoffs of large scale learning. In J C Platt, D Koller, Y Singer, and S T Roweis, editors, Adv. Neural Inf. Process. Syst. 20, pages 161–168, Curran Associates, Red Hook, NY, 2007.
L´eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
A Defazio, F Bach, and S Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, volume 27, page 1646C1654, 2014.
M G¨urb¨uzbalaban, A Ozdaglar, and P A Parrilo. On the convergence rate of incremental aggregated gradient algorithms. SIAM J. Optim., 27:1035–1048, 2017.
Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15: 2489–2512, 2014.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26, pages 315–323, 2013.
N Le Roux, M Schmidt, and F R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, volume 25, page 2663C2671, 2012.
S �Lojasiewicz. A topological property of real analytic subsets (in french). Coll. du CNRS, , pages 87–89, 1963.
Xiaopeng Luo and Xin Xu. Stochastic gradient-free descents. preprint, https://arxiv.org/abs/1912.13305v5, 2019.
A Nemirovski, A Juditsky, G Lan, and A Shapiro. Robust stochastic approximation ap- proach to stochastic programming. SIAM J. Optim., 19:1574–1609, 2009.
Y Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009.
B T Polyak. Gradient methods for minimizing functionals (in russian). Zh. Vychisl. Mat. Mat. Fiz., 3(4):643–653, 1963.
B T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys., 4:1–17, 1964.
B T Polyak. New method of stochastic approximation type. Automat. Remote Control, 51: 937–946, 1991.
B T Polyak and A B Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30:838–855, 1992.
Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning (ICML), pages 1571–1578, 2012.
H Robbins and S Monro. A stochastic approximation method. Ann. Math. Statist., 22: 400–407, 1951.
M Schmidt, N Le Roux, and F Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162:83–112, 2017.
S Shalev-Shwartz, Y Singer, N Srebro, and A Cotter. PEGASOS: Primal estimated sub- gradient solver for SVM. Math. Program., 127:3–30, 2011.
A Shapiro, D Dentcheva, and A Ruszczy´nski. Lectures on Stochastic Programming: Modeling and Theory. MPS-SIAM Ser. Optim., SIAM, Philadelphia, PA, 2009.
I Sutskever, J Martens, G Dahl, and G Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on International Conference on Machine Learning (ICML), volume 28, pages 1139–1147, 2013.
F G Tricomi and A Erd´elyi. The asymptotic expansion of a ratio of gamma functions. Pacific Journal of Mathematics, 1:133–142, 1951.
P Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
L Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
T Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM Press, New York, 2004.
M Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, pages 928–935, AAAI Press, Palo Alto, CA, 2003.