Mathematical optimization plays an important role in machine learning. Many learning tasks can be formulated as a problem of minimizing a finite sum objective:
where N is the number of samples, d is the dimension of parameters and denotes the loss function for sample i. In order to solve Eq.(1), many first- and second-order methods have been proposed and has the following update paradigm:
where is the gradient and
is the step size at t-th iteration. The term
can be set differently in different methods. First-order methods set
as an identity matrix. The resulting updating procedure becomes gradient descent (GD) or stochastic gradient descent (SGD) depending on whether the gradient
is calculated over the whole sample set or one random sample [1, 2, 3]. A series of first-order linearly convergent methods and their variance reduction variants were proposed to accelerate the above iteration updates, including SVRG [4], SAGA [5], SDCA [6], etc. Their main intuition is to balance the computational cost of
and the approximation gap between the stochastic gradient
and the global gradient
Compared with first-order optimizers, second-order optimization methods regard in Eq. (2) as the Hessian inverse (the standard Newton Method) or certain carefully constructed Hessian inverse (quasi-Newton methods). The matrix
can be thought to adjust the step size and gradient coordinates through the high order information or the quasi-Newton condition. Second-order methods usually achieve a better convergence rate than first-order ones.
However, second-order algorithms are not widely used in large-scale optimization problems due to expensive computation cost. Computing the Hessian matrix and its inverse requires high computation complexity and consumes large memory. Standard Newton method takes per iteration, where N and d are the number of samples and the number of unknown parameters, respectively. Quasi-Newton methods like BFGS and L-BFGS [7] are faster, but still requires
. However, they do not maintain the local quadratic convergence rate. For problems with large parameters, it is not affordable even if the inverse calculation could fit in memory.
Our goal is to develop an approximate yet efficient Newton method with provable convergence guarantee. We aim to reduce per-iteration computation cost for the Hessian inverse calculation while maintaining the approximation accuracy. To this end, we propose Stochastic Projected Approximate Newton method (SPAN), a novel and generic approach to speed up second-order optimization calculation. The main contributions of the paper are as follows:
• We propose a novel second-order optimization method, SPAN, to achieve a better trade-off between Hessian approximation error and per-iteration efficiency. Inside the method, we propose a stochastic sampling technique to construct the Hessian approximately. Since it only requires first-order oracles and Hessian-vector products, the Hessian and its inverse can be computed very efficiently.
• We present a theoretical analysis about the Hessian approximation error and the convergence rate. SPAN achieves linear-quadratic convergence with a provable Hessian approximation error bound.
• We conduct experiments on multiple real datasets against existing state-of-the-art methods. The results validate that our proposed method achieves state-of-the-art performance in terms of the wall-clock time, with almost no sacrificing on the construction accuracy of the Hessian.
In order to improve the per-iteration efficiency, several stochastic second-order methods have tried to seek the trade-off between per-iteration computational complexity and convergence rate, which can be divided into two main categories.
Stochastic Newton Methods. A series of sub-sampled Newton methods are proposed to solve the problems where the sample size N is far larger than the feature size d. In these methods, the Hessian matrix is approximated with a small subset of training samples. NewSamp [8] and similar methods by [9, 10] sample from function set and construct a regularized m-rank approximate Hessian with truncated singular value decomposition to improve the per-iteration efficiency. However, they need to compute the sub-sampled Hessian, which has an
computational cost even if there is only one single instance. Removing the requirements of second-order oracle in NewSamp-type methods, our SPAN further accelerates the iteration with stochastic low-rank approximation techniques, which improves the complexity by a factor nearly O(d/m).
Sketch Newton method in [11] adopts sketching techniques to approximate Hessian. A non-uniform probability distribution was introduced to sample rows ofin [12]. Both of them use new approximate Hessian construction and proper sampling methods to improve Hessian estimation efficiency. Differ from such sketch Newton methods, our SPAN does not need the Hessian Decomposition assumption for sketched updates in [12, 11].
Rather than estimating the Hessian, LiSSA by [13] approximates the Hessian inverse with matrix Taylor expansion. The most elegant part of LiSSA is that the Hessian inverse estimation is unbiased, and the approximation error only depends on the matrix concentration inequalities. Compared with LiSSA, SPAN guarantees more robust per-iteration complexity when the objective function is not Generalized Linear Model (GLM). Moreover, we do not require the initial solution is close to the optimum. Additionally, the procedure of calculating the approximate Hessian for each sample in SPAN is independent, which means our method has better parallelism compared with LiSSA.
Table 1: Important mathematical notations in this paper.
Stochastic Quasi-Newton Methods. The quasi-Newton methods can also be improved with stochastic approximation techniques. S-LBFGS by [14] adopts the randomization to the classical L-BFGS and integrates the widely used gradient variance reduction techniques. Subsequently, SB-BFGS in [15] extends BFGS with matrix sketching to approximate Hessian inverse. Stochastic quasi-Newton methods have a better per-iteration complexity compared with stochastic Newton methods. However, most of them cannot explicitly demonstrate the benefit of introducing the curvature information in the convergence analysis. Such problem is solved in our SPAN by establishing the Hessian approximation error bound which can hardly be analyzed in stochastic quasi-Newton methods.
In this section, we will first present the overall idea and intuition of our proposed SPAN. Then, we will describe three technical components and the full algorithm. For better illustration, we list some important notations and their descriptions in Table 1.
The goal of SPAN method is to optimize Eq. (1) with respect to the decision variable x (i.e., model parameters) using the mini-batched iterative update:
where t is the iteration index. Note that the straightforward calculation of Eq. (3) requires time-consuming for models with a large d.
To make the computation faster, instead of calculating the batch Hessian (the abbreviation of
) and then taking the inverse explicitly, our idea is to replace
with an approximate Hessian
be bounded within a small region around the true Hessian
. We propose to use the projected approximation for the Hessian
where P is a carefully constructed orthogonal projector [16]. To further ensure
invertible, we add an additional perturbation term
. Note that such formulation is sufficient to approximate
, since we can consider
to obtain the optimal k-rank approximation of
. However, in practice, we can hardly find
exactly without knowing the batch Hessian. Thus, it is challenging to construct P with desired efficiency and approximation accuracy. At the first glance, this seems infeasible, while our main intuition is based on the observation that affine transformation A over random vectors tends to have larger components on the top singular vectors of A. Such intuition is helpful to find an accurate approximation of orthonormal projector
. More details follow later.
In the remaining of this section, we present the construction of an approximate Hessian and the resulting optimization algorithm. In particular, we design the algorithm components guided by the following questions.
1. How to design a structure for can be computed efficiently without knowing
2. How to make the inverse robust — permitting to be invertible in any circumstance?
3. How to balance the Hessian approximation error and iteration complexity flexibly?
3.1 Stochastic Projected Approximation
We construct the approximate Hessian as which can be calculated without knowing
essentially decides the direction where the Hessian
would project onto. We decompose P into the product of the form
is an orthonormal matrix. With that, we construct an approximate Hessian
One can easily verify that the Moore–Penrose inverse of can be computed efficiently via
since the size of
is much smaller than
and can be calculated by the product of
In the following, we develop a method to obtain a U while keeping small. We derive a method to calculate
without requiring Hessian
Our method is to randomly choose a set of column vectors and project them to the space expanded by the singular vectors of
. From the projection, one can extract an orthonormal basis U to expand a low-dimensional space for
to project to. The procedure is inspired by the Proto-Algorithm [17], combining with the secant equation for Hessian-vector products calculation. We utilize the standard Gaussian distribution to generate such
and calculate the projection
directly. To make the computation more efficient, the set of random vectors
contains l elements where
Notice that for any matrix V whose i-th column vector is presented as , the extended Hessian-vector product on a sample batch B, denoted as
In practice, the Hessian-vector product can be calculated by the finite difference of gradients like Algorithm 1.
U is then constructed as follows:
With the above-constructed U, can be further constructed again using the extended Hessian-vector product,
. But there is no need to calculate
explicitly, it suffices to calculate its Moore-Penrose inverse. One can verify that with such constructed
, the error in the term of
is well-bounded [17].
3.2 Robust Hessian Inversion
The above constructed approximate Hessian is simplified with one flaw ( is actually not invertible) since the orthonormal matrix U is low-rank. This can be alleviated with a perturbed version of Eq.(4).
where is a carefully chosen constant with
The purpose of the perturbation term is to introduce the invertibility of the approximate matrix
presented in Eq. (4). We will give an informal analysis here to show the importance of choosing a proper
. On one hand, a large
will impair captures of the main actions of
, since it increases the lower bound of the Hessian approximation error
Figure 1: The normalized projection, i.e., and
(LTR) of standard gaussian vectors
(blue points) where
On the other hand, from an iteration perspective, we can neither choose a tiny since it will lose the benefit of the curvature information induced by approximate Newton. In particular, if we regard the SVDs of
and
as follows
The SVD of constructed Hessian in Eq. (5) and its inverse can be formulated as
It can be observed that a tiny will make the singular vectors associated with
dominate the action of
, which impairs the introduction of curvature information taken by
Hence, a proper is needed to balance the Hessian approximation error and the
norm of constructed Hessian inverse. Specifically, we will further discuss
(see Theorem 4.2) and the Hessian approximation error
Lemma 4.1) in Section 4. Rigorous proof will be deferred to our supplementary materials.
3.3 Power Iteration
In this part, we introduce a power iteration technique to balance and iteration complexity more flexibly. That is to say, with limit iteration complexity sacrificing, the construction of
in Eq. (5) can be improved through some auxiliary steps in the main algorithm. Generally speaking, any matrix U obtained as in Section 3.1 is valid for
’s construction. However, in practice, a better orthogonal projector
maintains that top singular vectors of
are rotated less after performing the projection on
. The matrix U is usually obtained from the projection
of random vectors. Thus, we hope basis vectors of the projection have larger components on top singular vectors e.g.,
, compared with those on
and
. Such a requirement can be satisfied by taking the power of the Hessian.
Specifically, as shown in Figure 1 showed, normalized (green dots) seems to like a unit ball, while they degenerate to a unit circle (red dots) when random vectors
are projected through
(red dots) because the component of projection on [1, 0, 0] almost have become 0. A similar phenomenon also happens for the component of projection on [0, 1, 0] when taking a larger power for Hessian, i.e.,
(yellow dots). As a result, the normalized projections nearly collapse to two points (yellow dots), i.e., [0, 0, 1] and
. That is to say, the component of projection on bottom singular vectors tends to vanish. Besides, such a phenomenon becomes more significant as the power of Hessian increases. With such an observation, we can calculate a better U with the following steps:
1. generate a standard Gaussian matrix
2. iteratively use the extended Hessian-vector product to get
3. calculate the basis U via QR decomposition
3.4 Details of SPAN
In this section, we show the complete algorithm about SPAN in Algorithm 2. Also, we explain the iteration complexity of SPAN, and compare it with the state-of-the-art optimizers.
Table 2: Per-iteration complexity comparisons among stochastic first-order optimization methods, sub-sampled Newton Methods, LiSSA and ours. Notice that
In Algorithm 2, we use as the abbreviation of
. At each iteration, we first select a sample batch (Step 3) and generate some random matrix (Step 4). With the random matrix
can be found through the process we introduced in Section 3.3 (Step 5 to Step 8). After that, we compute
(Step 9) as an intermediate variable and select the perturbation constant
. Finally, we calculate the constructed Hessian inverse
(Step 10) like Eq. (5) and update the decision variables (Step 11).
Iteration Efficiency. In order to illustrate the computational complexity of each iteration, we first introduce some condition numbers, which are designed with respect to the component functions, i.e., . In such case, one typically assumes that each component is bounded by
and
like LiSSA [13], we define
Such condition numbers have the following relations
We compare per-iteration complexity among state-of-the-art optimizers, including SPAN, and list the results in Table 2 where we ignore log terms of d and different s for brevity. The iteration complexity of SPAN consists of three terms. The first term O(Nd) represents the time complexity of the full gradient computation (Step 11). The second term indicates the complexity of power iteration (Step 5 to Step 7). The calculation of
in Step 6 requires O(bld) where the fact
and
will be detailedly demonstrated in our supplementary materials. The last term
shows the complexity of QR decomposition (Step 8). In particular, the most time-consuming step in constructing
is to calculate
, which leads to an
time complexity. For a given
, the computational cost of
is only O(ld) when the order of matrix multiplication is appropriately arranged. Owing to the fact that
, such computational cost is much smaller than
(the third term).
Comparing with the NewSamp [8] which updates decision variables with a sub-sampled constructed Hessian inverse, SPAN takes a significant acceleration at each iteration through the only first-order oracle and the Hessian-vector product requirements when the m-th singular value is close to the minimum singular value with . In addition, per-iteration complexity in LiSSA [13] is even worse than NewSamp in general cases. However, in GLM models, LiSSA has better performance because of the fast calculation of Hessian-vector products. Even in GLM models, our SLAN can still be faster than LiSSA when the approximate Hessian is good enough, or the condition number
is large with
. Notice that, the condition numbers of batch loss chosen, i.e.,
, will not be too large. Otherwise, per-iteration efficiency is governed by batch Hessian or Hessian-vector products calculation, which impairs the acceleration achieved by matrix approximation techniques.
We will give a theoretical analysis of our results in this section. We will first introduce some standard assumptions for the optimization problems, and then bound the error between our approximate Hessian and the sub-sampled one. Finally, we will show the local linear-quadratic convergence of our main algorithm (Algorithm 2), with the detailed coefficients of the convergence rate. We further compare the convergence rate of our method with NewSamp [8] and LiSSA [13]. Due to space limitations, the details of proof arguments are provided in the supplementary materials.
Assumption 1. (Hessian Lipschitz Continuity) For any subset and a second-order differentiable objective function F like the Eq.(1), there exists a constant
depending on b, such that
Assumption 2. (Gradient Lipschitz Continuity and Strong Convexity) For any subset and a second-order differentiable function F like the Eq.(1), there exists constants
depending on the b, such that for any
Assumption 3. (Bounded Hessian) For any , and the Hessian of the function
is upper bounded by an absolute constant K, i.e.,
Lemma 4.1. Suppose the Assumption 1, 2 and 3 hold. For every iteration t in Algorithm 2, if the parameters satisfy:
As far as we know, if we approximate the Hessian with an m-rank+sparse construction, the optimal Hessian approximation error will be which comes from NewSamp [8]. Corresponding to such an approximation error, the construction complexity of NewSamp requires
which can hardly be endured with a slightly larger instance dimension. While, from Lemma 4.1, SPAN improves the complexity by a factor at least
(
) when the stochastic Hessian can be well approximated and keeps the approximation error nearly three times the optimal (
is a tiny constant). Such a guarantee cannot be promised for any quasi-Newton method. Although LiSSA [13] has a similar error bound for the approximate Hessian inverse, that approximation error is not comparable because it depends almost entirely on the concentration inequalities of the sub-sample processing but not the matrix approximation techniques. Additionally, we only bound the Hessian approximation error when q = O(log d) and
. In fact, such upper bounds on q and
are possibly pessimistic and can likely be improved to a more average quantity. However, since the parameters q = O(1) and
suffice for convergence in our experimental settings, we have not tried to optimize it further.
Theorem 4.2. Suppose . Frame the hypotheses of Lemma. 4.1, if the parameters satisfy:
The coefficients
Figure 2: Training loss versus running time. The first two columns are for MNIST4-9 dataset. The last two are for CovType dataset. Note SPAN achieves the best or near the best performance with respect to wall-clock time.
Remark. Notice that, we require in Theorem 4.2, which is only introduced to simplify the formulation of coefficients, i.e.,
and
. For any
, we can obtain a similar convergence rate by setting
. Thus, the theorem is without loss of generality.
Theorem 4.2 shows that SPAN has a similar composite convergence rate like NewSamp [8] and LiSSA [13] where and
are coefficients of the linear and quadratic terms, respectively. Comparing the convergence with NewSamp whose first-order coefficient
is abbreviated as
, ours can be similar to
. In particular, truncated SVD is introduced in NewSamp to find an optimal m-rank approximate Hessian so that
in NewSamp can be regarded as the optimal coefficient in most stochastic Newton method settings. For LiSSA, its convergence rate is constrained by the initial decision variable
and the sample size for calculating Hessian-vector product strictly, which causes that we cannot compare its convergence rate with ours directly. However, SPAN has a better convergence wall-clock time which is validated in our experimental results (see Section 5). In addition, during the analysis of the convergence rate, we notice that promoting the Hessian approximation error
not only reduces the first-order coefficient in the linear-quadratic convergence rate but also expands the range of step size. This coincides with our intuition that a faster convergence usually requires a smaller Hessian approximation error. In summary, SPAN is designed to achieve a better trade-off between the iteration complexity and the convergence rate. With randomizing the process of constructing the special approximate Hessian, SPAN both accelerates the iteration and limits Hessian approximation error.
To evaluate the performances and fully demonstrate the advantages of our proposed method, we conduct our experiments on several machine learning tasks with different objective functions, including binary image classification and text classification. As we get similar conclusions on these two tasks, we only present the experimental result on the binary image classification in this section. We will further show all experimental details, including the parameters of all optimizers and the results on text classification tasks in the supplementary materials due to space limitations.
For the image classification tasks, we refer to the experimental settings in [13] and [8]. We utilize the following objective function:
where and
are the instances and labels, respectively, and a is the
regularization parameter which influences the condition number of the objective.
We choose state-of-the-art first- and second-order optimization methods as baselines, including SVRG [4], NewSamp [8], S-LBFGS [14], SB-BFGS [15], and LiSSA [13]. We implement all the methods in C++ and Intel Math Kernel Library(MKL). All the code for our experiments can be found in our supplementary material.
We adopt two datasets in our experiment, including the MNIST4-9 dataset [13] consisting of approximate instances, and CovType dataset [8] consisting of approximate
instances. For each dataset, we evaluate all the methods on two different condition numbers, respectively. Besides, for the fairness of comparison, in each set of experiments, we pick the optimal hyperparameters for every method and present the wall-clock time of all the methods in Figure 2.
Figure 3: Empirical convergence rate comparison on MNIST4-9 dataset. Four columns represent different batch-size for sub-sampled Hessian.
Figure 4: Hessian approximation error. Notice that SPAN achieves near lowest error.
From Figure 2, we can find that our method SPAN achieves the best results in almost all experimental settings. NewSamp gets a very close performance on the CovType dataset but consumes a lot of time on the MNIST dataset. LiSSA performs very close on the MNIST dataset but does not perform well on the CovType dataset. Meanwhile, there are significant gaps between SPAN and the other methods in all the experimental settings.
To further explain the gaps between the second-order baselines and SPAN, we evaluate the empirical approximation error between the real full Hessian and the constructed approximate Hessian in Figure 4. We can see that the approximation error of our method is very close to that of the theoretical optimal solution (NewSamp) but much more accurate than the other methods. The results of the experiment demonstrates that our constructed approximate Hessian in SPAN makes better use of second-order information than other stochastic second-order optimizers (S-LBFGS, SB-BFGS, and LiSSA). Moreover, Figure 3 illustrates the empirical convergence rate of these methods. As we can see, the empirical convergence rate of our proposed method is much better than LiSSA but slightly worse than NewSamp, which confirms our theoretical results and insights again.
Even though NewSamp guarantees an excellent convergence rate, the efficiency of the algorithm mainly depends on the number of parameters. NewSamp needs a long time to perform an iteration, even when there are only 784 parameters in MNIST dataset. As a comparison, our proposed SPAN is much more competitive in terms of the robustness of the feature dimension and the per-iteration efficiency in various scenarios.
Considering the experiments mentioned above, SPAN has a tolerable Hessian approximation error which results in the runner up with respect to the empirical convergence rate. Besides, it enjoys the benefit of per-iteration efficiency, which makes it outperform others in practice. In summary, we conclude that SPAN achieves a better trade-off between per-iteration efficiency and convergence rate. Furthermore, it is robust for the sub-sampled batch size.
Newton and quasi-Newton methods converge at faster rates than gradient descent methods. However, they are often expensive, computationally. In this paper, we propose SPAN, a novel method to optimize a smooth-strongly convex objective function. SPAN utilizes the first-order oracle for Hessian approximation. Therefore, it is much faster than Newton method and its alike. We give a theoretical analysis of its approximation accuracy and convergence rate. Experiments on several real datasets demonstrate that our proposed method outperforms previous state-of-the-art methods.
For the future work, the constructed approximate Hessian of SPAN can be combined with advanced first-order methods, i.e., SVRG [4] and SAGA [5], or help to introduce second-order information to complex objective functions, e.g., cubic regularization [18, 19, 20], neural networks with low complexity and competitive convergence rate.
Thanks the anonymous reviewers for their helpful comments and suggestions. We also thank Zhe Wang, Linyun Yu, Yitan Li, and Hao Zhou for their valuable discussions.
[1] Herbert Robbins and Sutton Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102–109. Springer, 1985.
[2] Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670. ACM, 2014.
[3] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in neural information processing systems, pages 1647–1655, 2011.
[4] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
[5] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
[6] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72, 2014.
[7] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.
[8] Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled newton methods. arXiv preprint arXiv:1508.02810, 2015.
[9] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled newton methods ii: local convergence rates. arXiv preprint arXiv:1601.04738, 2016.
[10] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
[11] Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.
[12] Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher Ré, and Michael W Mahoney. Sub-sampled newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems, pages 3000–3008, 2016.
[13] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187, 2017.
[14] Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic l-bfgs algorithm. In Artificial Intelligence and Statistics, pages 249–258, 2016.
[15] Robert Gower, Donald Goldfarb, and Peter Richtárik. Stochastic block bfgs: Squeezing more curvature out of data. In International Conference on Machine Learning, pages 1869–1878, 2016.
[16] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.
[17] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
[18] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
[19] Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1895–1904. JMLR. org, 2017.
[20] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2904–2913, 2018.
[21] David Gross and Vincent Nesme. Note on sampling without replacing from a finite collection of matrices. arXiv preprint arXiv:1001.2738, 2010.
[22] Rajendra Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013.
[23] Vladimir Igorevich Bogachev. Gaussian measures. Number 62. American Mathematical Soc., 1998.
[24] Roger A Horn and Charles R Johnson. Matrix analysis cambridge university press. New York, 37, 1985.
[25] Mnist dataset http://yann.lecun.com/exdb/mnist/.
[26] Covertype dataset https://archive.ics.uci.edu/ml/datasets/covertype.
[27] Steve Webb, James Caverlee, and Calton Pu. Introducing the webb spam corpus: Using email spam to identify web spam automatically. In CEAS, 2006.
[28] Yelp dataset https://www.yelp.com/dataset/challenge.
In this section, we give several important definitions and lemmas which will be used in the proof of the convergence analysis (Section. 4 in this paper).
Definition 1. (Orthogonal Projectors [17]) An orthogonal projector is an Hermitian matrix P that satisfies the polynomial . This identity implies
. An orthogonal projector is completely determined by its range(column space). For a given matrix M, we write
for the unique orthogonal projector with
range (M). When M has full column rank, we can express this projector explicitly:
where denotes the conjugate transpose matrix of M. The orthogonal projector onto the complementary subspace,
Lemma A.1. (Matrix Concentration Inequalities [21]) Let X be a finite set of Hermitian matrices in
where
Given its size, let S denote a uniformly random sample from {1, 2, ..., |X|} with or without replacement. Then we have
Lemma A.2. (Conjugation Rule) Suppose that , for every real matrix A, the matrix
. In particular,
Proof. The positive semidefinite matrices form a convex cone, which induces a partial ordering on the linear space of Hermitian matrices: if and only if
is positive semidefinite, which means
For any real vector u, v = Au satisfies
Lemma A.3. Let P be an orthogonal projector, and let M be a real square matrix. For each positive number q, we have
Proof. Define the SVD of M as where U and V are unitary matrices. Thus, we have
where
denotes the i-th largest singular value of matrix M. Then, we have
where is an orthogonal projector, X is a nonnegative diagonal matrix and
. This relation follows immediately from Theorem. IX.2.10 in [22].
Lemma A.4. (Proposition 10.2 in [17]) Draw a standard Gaussian matrix
Lemma A.5. (Proposition 10.4 in [17]) Let Gaussian matrix where
Lemma A.6. (Theorem. 4.5.7 in [23]). Suppose that h is a Lipschitz function on matrices:
Draw a standard Gaussian matrix G, then
Lemma A.7. Suppose . Then, for each real matrix A, it holds that
and that
Proof. By applying Definition 1 and Lemma A.2, we have and
. Since
That is to say, . Then, we have
The second statement follows from the first
by taking orthogonal complements.
Lemma A.8. Suppose a real matrix satisfies
Proof. Define as the positive semidefinite square root of M promised by Theorem 7.2.6 in [24], we have
The first equality can be verified algebraically. The second holds because rational functions of a diagonalizable matrix, such as A, commute. The last relation is because the conjugate rule
Lemma A.9. (Proposition 8.3 in [17]) We have for each partitioned positive semidefinite matrix
Lemma A.10. Let be two positive definite matrices. If there is existing that
the minimum singular values of
Proof. To make a contradiction with Eq. (8), we assume that the minimum singular values of A and B have
For the singular vectors which respectively correspond to
where the inequality. Since the inequality
contradict with the given condition Eq. (7), the assumption Eq. (8) will never be satisfied. Hence, we have
Lemma A.11. Let be a positive definite matrix. A and a matrix
which can be presented as
where Q is an unitary matrix and
Choose a test matrix (
), and construct the sample matrix
. Assuming that
has full row rank, the orthogonal projector
wheremeans the Moore–Penrose inverse of matrix
Proof. For the orthogonal projector and the unitary matrix Q, it is clear that
is an orthogonal projector due to Definition 1. Evidently,
Since ranges determine orthogonal projectors, we have and the following equation
where the first euqation is because the unitary invariance of the spectral norm. Subsequently, we construct a matrix Z whose column space belongs to the range of to scale the right-most item of the Eq.(11) as follows
uses the full row rank properties of
, and the construction of Z denotes that
Lemma A.7 implies
where the matrix Z has a full column rank because of Eq. (12). Thus, we have
Applying Lemma A.8, the top-left block verifies
Besides, the bottom-right block satisfies
For convenience, we abbreviate , the matrix
As our construction in Eq. (12), we have . Thus, we have
Finally, after introducing the inequality into Eq. (14), the proof is completed.
B.1 Proof of Lemma 4.1
Proof. According to the construction of (Step. 5 to Step. 7 in Algorithm 1) and Z (Step. 9 in Algorithm 1), we have
where
Here, we denote the orthogonal projector of Y as . With the approximate Hessian
(Eq. (6) in Section. 2.2), we have the Hessian approximation error satisfies
With algorithm settings and positive definite properties of , we reformulate the Y and introduce an auxiliary variable
in the following
where we set the SVD of the batch Hessian and diagonal elements of
Thus, we have the following inequality according to Lemma A.3
Subsequently, we can introduce a test matrix with some special properties to scale Eq.(18). We set
as a standard Gaussian matrix, and
. As a result, the matrix
is also a standard Gaussian matrix because of the unitary of Q and the rotational invariance of
. Besides,
and
are nonoverlapping submatrices of
, and these two matrices are stochastically independent. Hence, we can learn how the error depends on the matrix
by conditioning on the event that
is relatively regular. We set a event as
According to Lemma A.5, we find
Consider the function whose Lipschitz constant L can be calculated in the following
That is to say,
Applying the concentration measure inequality, Lemma A.6, conditionally to the random variable
According to Eq. (19), we have a explicit bound under the event as follows
With the fact Eq.(20), we can remove the conditioning and obtain
Because of the SVDs presented in Eq. (16), we have
After introducing Eq.(21) and Eq.(22) to Eq.(18), we have
with probability at least . For making the upper bound clear, we set
, and have
Hence, we make some abbreviations as
and Eq.(23) can be reformulated and scaled as follows
Finally, we introduce Eq.(24) and Eq.(17) to Eq.(15) to have the following error upper bound with probability at least
Due to the fact
That is to say, when the power iteration q satisfy
Corollary 1. Frame the hypotheses of Lemma A.11, and assume
with failure probability at most
B.2 Proof of Theorem 4.2
Proof. According to Step. 11 in SPAN Algorithm, we have
whereestablish because of the symmetry of
With part 1.1 in Eq. (28) and the construction of
where we denote
for abbreviation. Besides, for the definition of v, there are
For part 1.1.1, we have
whereestablish because of the Young’s inequality and the Cauchy–Schwarz inequality.
For part 1.1.2, with the similar proof technique, we obtain
For part 1.1.3, we can easily find
Submitting Eq.(32), Eq.(33) and Eq.(34) back into Eq.(29), we obtain that, for any feasible v, the part 1.1 of Eq. (28) satisfies
where establish because of the fact
. For any specific
and
, there is existing some
which satisfies the inequality
. To clarify the proof procedure, we assume
Notice that, the proof techniques presented in the following are not constrained by such specific . To minimize our convergence constant, we required that
where we set . Hence, we request that the positive root about
in Eq.(37) should satisfy
. Then, we present the analytic form of
According to Corollary 1, the sufficient condition for the establishment of the last equation of Eq. (38) can be formulated as
with probability at least . Combining Eq.(39) and Eq.(36), when k = 1, we have
To summarize these equations, we have the following conclusion. When , we can set
with probability at least . In Eq.(41),
is from Eq. (38). Part 1.1 in Eq.(28) has
With Similar to the tricks on the part 1.1, for part 1.2 in Eq.(28), we have
where establish because of Eq. (29). Utilizing the similar proof techniques presented in Eq. (32) and Eq. (33) on part 1.2.2 and part 1.2.3, we respectively have
where are introduced as Eq. (30). Besides
satisfies Eq. (37). With such settings, we have the following inequalities for part 1.2.1 and part 1.2.4
Therefore, with Eq. (43), Eq. (44) and Eq. (45), we can obtain that
With the condition shown in Eq. (40) and our requirements, we have
Thus, we can derive
as
when
The sufficient conditioncomes from the upper bound of
shown in Eq. (41).
For the part 2 in Eq.(27): We denote the stochastic bias matrix and the sub-sample hessian matrix
as follows
With Assumption 2.3 and the construction Eq.(54), we can get some properties
According to the matrix Bernstein’s inequality given in Lemma A.1, for any available we have bound the part 2 in Eq.(27) as
Hence, for any , if we want
with probability at least
, the sampled size of B, b, should satisfy
Hence, the Eq.(27) can be rewrited in the following with probability at least
due to Eq. (50), Eq. (52), Eq. (53) and Eq. (54). Since , we can conclude the convergence rate as
with probability at least . For clarify the coefficients of convergence further, we have
Thus, the coefficients of convergence satisfy
C.1 Exact Experimental setting on Logistic Regression
In this section, we detail our experiments on Logistic Regression. We set the objective function as
Datasets: In this experiment, we use two datasets for the binary classification task including MNIST [25] and CovType [26].
• MNIST: The MNIST database is a dataset of handwritten digits. It has 60,000 training samples, and 10,000 test samples. Each image is represented by 28x28 pixels, each containing a value 0 - 255 with its grayscale value.
• CovType: Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).
Data Preprocessing: For the two datasets proposed above, we will list our data preprocessing in the following
• MNIST: We first select the samples whose label is 4 or 9 (Logistic Regression is used in binary classification tasks). Then we normalize the samples as
Preiteration: As the preiteration requirement of LiSSA [13], for the sake of fairness, we will perform 2 epoch SVRG [4] iteration for all the methods after initializing the decision variable
Experimental Results: We have shown our experimental results about Logistic Regression in our paper.
Parameters Selection: We have shown all the experimental parameters in Path: ./code/lr_data/params
C.2 Extra Experiments on Huber SVM
In this section, we conduct our experiments on Support Vector Machines. To satisfy assumptions of all baselines and SPAN, we set the loss function l(.) in the SVM
as smoothed Huber loss presented as
Figure 5: Training loss versus running time. The first two columns are for webspam dataset. The last two are for Yelp Review Polarity dataset. Note SPAN achieves the best or near the best performance with respect to wall-clock time.
Figure 6: Hessian approximation error. Note that SPAN achieves near lowest error.
Datasets: In this experiment, we use two text classification datasets including webspam [27] and Yelp Review Polarity [28].
• webspam: For each instance, continuous 1 bytes are treated as a word, and use word count as the feature value. In the end, there are 254 features for each sample, and the number of sample is about 160, 000.
• Yelp Review Polarity: The Yelp reviews polarity dataset is constructed by considering stars 1 and 2 negative, and 3 and 4 positive. For each polarity 280,000 training samples and 19,000 testing samples are take randomly. In total there are 560,000 trainig samples and 38,000 testing samples.
Data Preprocessing: For the two datasets proposed above, we will list our data preprocessing in the following
• Yelp Review Polarity: We want all baselines (including the second-order methods) to converge in a relatively short period of time. Hence, we use CountVectorizer to preprocess the data, and normalize each sample as we do in webspam. Then, we uniformly pick 1000 features to describe each sample. If the number of samples which are described as the zero vector is over a half of the total sample number, we will re-sample the features
Preiteration: As the preiteration requirement of LiSSA [13], for the sake of fairness, we will perform 2 epoch SVRG [4] iteration for all the methods after initializing the decision variable
Experimental Results: We first show the graph about the training loss versus running time in different condition numbers in Fig 5. Second, we give the Hessian approximation error to validate the quality of our approximate Hessian in Fig 6.
Parameters Selection: We have shown all the experimental parameters in Path: ./code/svm_data/params