Convex optimization problems are at the heart of many machine learning algorithms. These problems are typically huge in scale due to either large number of samples or large number of features or both. Gradient descent type algorithms are standard approaches to solve these problems. However, these algorithms are computationally expensive, i.e., have huge per-iteration complexity, in large scale problems. We thus require alternative iterative algorithms where
1. Only one feature parameter (or, a block of parameters) is updated at each iteration. These are coordinate descent algorithms [1]; these check per-iteration computation complexity when the number of features are humongous.
2. Data samples are processed one (or, one block) at a time. This is preferred when data sets have enormous samples, and is necessitated in scenario where samples are availed in real time. Processing randomly chosen samples from the whole available data set leads to stochastic gradient type algorithms [2], whereas processing samples in real time is referred to as online convex optimization or online learning [3].
None of these alternatives is adequate if both, the number of samples as well as the number of features, are huge.
We propose iterative descent algorithms that first choose a sample and then randomly choose a feature, and update the corresponding parameter only based on the chosen sample. In other words, we propose algorithms that combine characteristics of coordinate descent and stochastic gradient descent (or, online gradient descent) algorithms. It is well known that stochastic coordinate descent algorithms suffer from slow convergence due to variance of stochastic gradients of random samples. On the other hand, randomized coordinate descent algorithms have same convergence rate as gradient descent but worse constants. We have proposed “accelerated” gradient algorithms in order to alleviate these deficiencies.
1.1 Related Work
Stochastic gradient descent (SGD) [2] was introduced by Nemirovski et al. where as Online convex optimization and the associated projected gradient (OGD) [3] were introduced by Zinkevich. Hu et al. derived accelerated versions of SGD and OGD; they refer to these algorithms as Stochastic Accelerated GradiEnt (SAGE) [4]. Recently, Roux et al. proposed stochastic average gradient (SAG) [5] and Johnson and Zhang proposed Stochastic variation reduced gradient (SVRG) [6], both aimed at improving the convergence rate of SGD. Langford et al. [7] and McMahan and Streeter studied delay tolerant OGD algorithms [8] where parameter updates are based on stale gradient information.
Cyclic block coordinate descent algorithms were introduced by Luo and Tseng [9,10]. Nesterov proposed randomized block coordinate descent (RBCD) [1] algorithms for large scale optimization problems. Fercoq and Richtarik [11] and Singh et al. [12] proposed accelerated randomized block coordinate algorithms. More recently, Allen-zhu et al. [13] proposed faster accelerated coordinate descent methods in which sampling frequencies depend on coordinate wise smoothness parameters (i.e., Lipschitz parameters of the corresponding partial derivatives).
Dang and Lan proposed stochastic block mirror descent (SBMD) [14] which combines SGD and RBCD. Similar algorithms were proposed by Wang and Banerjee [15], Hua et al. [16], Zhao et al. [17] and Zhang and Gu [18] also, who called their algorithms ORBCD, R-BCG, MRBCD and ASBCD, respectively. ORBCD and MRBCD were enhanced by variance reduction techniques to attain linear rate of convergence for strongly convex loss functions. On the other hand, ASBCD used optimal sampling of training data samples to achieve the same.
More recently, there has been interest in machine learning settings in which training data and features are distributed across nodes of a computing cluster, or more generally, of a network. Nathan and Klabjan [19] proposed an algorithm where nodes parallelly update (possibly overlapping) blocks of feature parameters based on locally available data samples; this was seen as a combination of SVRG and block coordinate descent. Konecny et al. [20] studied algorithms for nodes connected through a network; this scenario was referred to as federated learning.
1.2 Our Contribution
We have proposed two accelerated randomized coordinate descent algorithms for stochastic optimization and online learning, which we refer to as SARCD and OARCD, respectively. Expectedly, these algorithms have significantly less per-iteration computation complexity than accelerated gradient descent algorithms, e.g., SAGE. Moreover, the proposed algorithms have the following properties.
1. SARCD for general convex objective functions exhibits convergence rate which matches the best known rates.
2. SARCD for strongly convex objective functions exhibits convergence rate which is strictly better than rate of ORBCD.
3. OARCD for general convex loss functions yields regret bound which is strictly better than , the regret bound of ORBCD and R-BCG.
4. OARCD for strongly convex loss functions yields regret bound O(n log(T)) which is strictly better than , the regret bound of ORBCD.
The proposed algorithms can easily be generalized to accelerated randomized block coordinate descent algorithms.
We consider machine learning models characterized by input-output pairs, model parameters (or, feature parameters) and a loss function. The model parameters are used to estimate or predict outputs (also called labels) from the corresponding inputs (also called features). The loss function is used to measure discrepancy between the predicted and the actual outputs. To illustrate, let be an input-output pair and be a vector of the model parameters. Then and y yield an estimate of . Clearly, the loss function, which provides a measure of discrepancy between and , can be seen as a mapping from the couple to real numbers; let us denote this function as . For example, considering -losses,
Machine learning aims at identifying the model parameters that minimize the losses for all input-output pairs. We make this notion precise in the following two subsections which focus on two different premises.
2.1 Stochastic Optimization
Let us assume that we have a collection of input-output pairs, also called training samples. An input-output pair may appear more than once in the collection, and different pairs can have different relative frequencies. We aim at determining the model parameters that minimize the average loss over all the input-output pairs. Towards this, we let denote a random input-output pair with appropriate distribution and consider the optimization problem1
Let . We assume that
1. is convex and differentiable. Moreover, we assume that is Lipschitz continuous with parameter L, 2. is an unbiased estimator of , i.e., , 3. is strongly convex with parameter yields better iteration complexity.2
In Section 3, we propose an algorithm, SARCD, to solve the above problem. We establish convergence rates of SARCD for general convex loss functions (or, cost functions) and strongly convex loss functions in Theorems 1 and 2, respectively.
1 The proposed algorithm does not need the distribution of 2 We allow to accommodate general convex loss functions. Strong convexity warrants
2.2 Online Learning
Here the input-output pairs arrive sequentially in steps and the model parameters are updated after each step. More precisely, we start with an arbitrary modelling parameter vector . Assuming that we have model parameters on arrival of the tth input-output pair , we incur a loss and update to based on . For a given T > 0, we aim at generating a sequence of model parameters that minimize the T-step “regret” R(T), defined as
Here we assume that are convex and differentiable for all . Moreover, we also assume that, for all are Lipschitz continuous with parameter L and are strongly convex with parameter .
In Section 4, we propose an algorithm, OARCD, to solve the above learning problem. We provide regret bounds of OARCD for general convex loss functions and strongly convex loss functions in Theorems 3 and 4, respectively.
SARCD is the “coordinate descent” version of SAGE and an “accelerated” version of ORBCD. In other words, it is an iterative algorithm in which, at each iteration, we randomly choose an input-output pair and then a coordinate, and update only this coordinate of the vector of model parameters. As is typical of accelerated gradient methods, we update two other sequences and apart from , and we also maintain two parameter sequences and (see 2.1, [1,4]). Further, we use two constants a(n) and b(n) which we later set to achieve best convergence results. We let indicate the random input-output pair chosen at tth iteration; are i.i.d. We also use a random diagonal matrix to indicate the coordinate chosen at tth iteration; each has only one nonzero entry and are i.i.d. Formally, our algorithm is as follows.
Clearly, and are independent and are also independent of . Let . Let be the gradient mapping involved in updating . Also, let denote the history of the algorithm
until time t. More explicitly,
Notice that and are functions of . We first establish the following lower bound on f(x).
Proof. Using strong convexity of f (see assumption 3, [1])
On the other hand, by Descent Lemma [1],
Taking expectation in (2) (conditioned on ) and then combining with (1),
⟩|H−
which is the desired inequality.
Proposition 1. Assume that and for all . Then, for all ,
Proof. Let us first notice that (see Algorithm 1)
and also that the objective function in this minimization problem is strongly convex with parameter . Hence,
Using this in Lemma 1,
where we have dropped the term from the right hand side without affecting the inequality. Also, substituting in Lemma 1,
where again we have dropped the term from the right hand side. Now, multiplying (3) by and (4) by and adding,
Rearranging the terms,
where
We can see that
where we use the update rule of to get the second equality (see Algorithm 1) and then the Young’s inequality.3 Further,
where we again use the update rule of in the last equality. Using the above bound on A, the expression for B and Cauchy-Schwartz inequality (to infer ), and setting ,
We now set a =
Finally, we get the desired result by using the bound
Let be the optimal solution of the optimization problem (2.1). Then
because owing to unbiasedness of . We set in Proposition 1 and take expectation on both the sides to obtain the following corollary.
Corollary 1. Assume that and for all . Then, for all ,
We now appropriately set and a(n) to obtain rapid convergence of to . We first consider the case when , i.e., f is not strongly convex.
Theorem 1. Assume that that and . Set and for , where and b > 0 is a constant. Then
Proof. Dividing both sides in Corollary 1 by ,
where the last inequality holds because
Similar inequalities hold for all t = 1, . . . , T, which, when put together, yield
where the second inequality is obtained by using Corollary 1 for t = 0 and the last inequality by substituting the value of . We thus get
where the second inequality is obtained by using and also substituting the value of . Finally, we get the desired bound by setting a(n) = n and
We now consider the case when , i.e., f is strongly convex.
Theorem 2. Assume that that and
Dfor some D > 0. Set α, Land α
for where and for . Then
Proof. We first make a few observations.
(a) Using definitions of and ,
(c) For all ,
which contradicts the former. This completes the argument.
Using (a), (b), (c) above and Corollary 1 as in the proof of Theorem 1,
Using Corollary 1 once more for t = 0 and ,
Therefore,
where the second inequality follows from our setting of , the third from (b) and the fourth from (d).
Remark 1. 1. Settings of and in Theorems 1 and 2 do not require knowledge of
and the number of iterations T.
2. The convergence rate bound in the case of strongly convex objective functions is independent of the choice of a(n).
OARCD can also be seen as the “coordinate descent” version of the SAGE-based online learning algorithm and an “accelerated” version of ORBCD for online learning. Here, at each step t, we encounter an input-output pair and incur a loss . We then randomly choose a coordinate of , and update only this coordinate based on . As in the case of SARCD, we maintain two other sequences, and , and two parameter sequences, and , and also use two constants a(n) and b(n) to achieve optimal regret bounds. We again use a random diagonal matrix to indicate the coordinate chosen at tth step. Formally, this algorithm is as follows.
Let and denote the history of the algo- rithm until time t. We first establish the following lower bound on .
Lemma 2. For ,
Proof. From strong convexity of ,
On the other hand, by Descent Lemma,
Taking expectation in (6) (conditioned on ) and then combining with (5),
which is the desired inequality.
Proposition 2. Assume that and for all . Then, for all ,
Proof. From the update equation of (see Algorithm 2),
The objective function in this minimization problem is strongly convex with parameter . Hence,
Using this in Lemma 2,
where we have dropped the term from the right hand side without affect- ing the inequality.
On the other hand,
Hence, using the update equation of x(t) (see Algorithm 2),
where the inequality follows from convexity of . Using this inequality in (7) with b(n) = 1,
Further, from convexity of ,
where the second inequality follows from Young’s inequality, the third inequality from the bound on , the first equality from the update rule of and the second equality from convexity of . Taking conditional expectation in the above inequality and adding with (8) we get
Finally we get the desired result by using the bound and then taking expectation.
Let us assume that minimizes (see (2.2)). We now set anda(n) to obtain best regret bounds. As in case of SARCD, we first consider .
Theorem 3. Assume that and for . Set and , where is a constant, and . Then the regret of OARCD can be bounded as
Proof. From Proposition 2, using ,
where
Substituting ,
Further,
However, for ,
Hence,
So, setting and using the bounds on A, B and C,
which is the desired bound.
Finally, we consider the case when .
Theorem 4. Assume that and for . Set and , where is a constant, and . Then the regret of OARCD can be bounded as
Proof. As in the proof of Theorem 3, we can write
where A, C are exactly same as before and
Substituting ,
Further,
However, for ,
Hence,
So, setting a(n) = n and using the bounds on A, B and C,
which is the desired bound.
The algorithm proposed in this paper OARCD gives an improvement over ORBCD [15,16] as shown in red in the figures 1,2 and 3. The regret is much lower for both classification as well as regression. Table 1 shows the dataset description taken from UCI Repository. We have performed several experiments on RCV1 dataset also. We have observed that regularization is needed for the well posed-ness of the problem. Adaptive coordinated descent method is also used for setting the learning rates as a function of the sum of the gradients which reduced the regret significantly. We have observed that normalization and choosing L as sum of the squares of maximum features are related in the sense that if we do not normalize but set L to be the sum of the squares of maximum features then also it gives the same result as we obtained after normalizing and setting L to be the number of features. We don’t require per coordinate Lipschitz continuity if normalization is done. Online gradient methods will not work in cases where number of features are more. In the dataset 3 and 4 as shown in Table 1 and 2, OGD [3] and SAGE [4] took more than 1 minute while OARCD proposed in this paper took only 6 seconds to compute the regret. Also, OARCD shows improvent in accuracy and performs less number of mistakes than ORBCD [15,16] as shown in Table 2. When we add more number of examples, we see that the regret becomes constant since the difference between the best algorithm upto time t and the online algorithm becomes less. Figure 4 shows the comparison between the loss of APPROX [11] and SARCD proposed in this paper.
Fig. 1. Regret comaprison on abalone dataset Fig. 2. Regret comaprison on breastcancer dataset
Fig. 3. Regret comaprison on dorothea dataset Fig. 4. Loss of different algorithms on abalone dataset
Table 1. Dataset Description
Table 2. Accuracy and number of mistakes
We have proposed two accelerated randomized coordinate descent algorithms for stochastic optimization and online learning, respectively. Our algorithms exhibit performance as good as the best known randomized coordinate descent algorithms and yield strictly better regret bounds in case of online learning.
Our ongoing and future work entails extending these algorithms to regularized loss functions. We would like to investigate adaptation of feature selection probabilities to coordinate wise smoothness parameters. We would also like to consider online learning problems where update of model parameters takes considerable time, and so, the updated parameters are available only after a certain, potentially random, number of data samples have passed.
Acknowledgments: The second author acknowledges support of INSPIRE Faculty Research Grant (DSTO-1363).
1. Nesterov, Y.: Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization. 22, 341-362 (2012).
2. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization. 19, 1574-1609 (2009).
3. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient as- cent. Proceedings of the 20th International Conference on Machine Learning (ICML-03). pp. 928-936 (2003).
4. Hu, C., Pan, W., Kwok, J.: Accelerated gradient methods for stochastic optimization and online learning. Advances in Neural Information Processing Systems. pp. 781-789 (2009).
5. Roux, N.Le., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In Neural Information Processing Systems. (2012).
6. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive vari- ance reduction. Advances in neural information processing systems. pp. 315-323 (2013).
7. Langford, J., Smola, A., Zinkevich, M.: Slow learners are fast. Advances in Neural Information Processing Systems. 22, 2331-2339 (2009).
8. McMahan, B., Streeter, M.: Delay-tolerant algorithms for asynchronous distributed online learning. Advances in Neural Information Processing Systems. pp. 2915-2923 (2014).
9. Luo, Z-Q., Tseng, P.: On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72:735, (2002).
10. Tseng, P.: Convergence of a block coordinate descent method for non differentiable minimization. Journal of Optimization Theory and Applications, 109:475494, (2001).
11. Fercoq, O., Richtarik, P.: Accelerated, Parallel, and Proximal Coordinate Descent. SIAM Journal on Optimization. 25, 1997-2023 (2015).
12. Singh, C., Nedic, A., Srikant, R.: Random Block-Coordinate Gradient Projection Al- gorithms. Decision and Control (CDC). pp. 185-190. IEEE (2014).
13. Allen-Zhu, Z., Qu, Z., Richtarik, P., Yuan, Y.: Even faster accelerated coordinate de- scent using non-uniform sampling. International Conference on Machine Learning. pp. 1110-1119 (2016).
14. Deng, Q., Lan, G., Rangarajan, A.: Randomized Block Subgradient Methods for Con- vex Nonsmooth and Stochastic Optimization., (2015).
15. Wang, H., Banerjee, A.: Randomized block coordinate descent for online and stochas- tic optimization., (2014).
16. Hua, X., Kadomoto, S., Yamshita, N.: Regret Analysis of Block Coordinate Gradient Methods for Online Convex Programming, (2015).
17. Zhao, T., Yu, M., Wang, Y., Arora, R., Liu, H.: Accelerated mini-batch randomized block coordinate descent method. Advances in neural information processing systems. pp. 3329-3337 (2014).
18. Zhang, A., Gu, Q.: Accelerated Stochastic Block Coordinate Descent with Optimal Sampling. KDD. pp. 2035-2044 (2016).
19. Nathan, A., Klabjan, D.: Optimization for Large-Scale Machine Learning with Dis- tributed Features and Observations. International Conference on Machine Learning and Data Mining in Pattern Recognition. pp. 132-146. Springer, Cham (2017).
20. Konecny, J., McMahan, H., Ramage, D., Richtarik, P.: Federated optimization: dis- tributed machine learning for on-device intelligence., (2016).