SPAN: A Stochastic Projected Approximate Newton Method

2020·Arxiv

ABSTRACT

ABSTRACT

Second-order optimization methods have desirable convergence properties. However, the exact Newton method requires expensive computation for the Hessian and its inverse. In this paper, we propose SPAN, a novel approximate and fast Newton method. SPAN computes the inverse of the Hessian matrix via low-rank approximation and stochastic Hessian-vector products. Our experiments on multiple benchmark datasets demonstrate that SPAN outperforms existing first-order and second-order optimization methods in terms of the convergence wall-clock time. Furthermore, we provide a theoretical analysis of the per-iteration complexity, the approximation error, and the convergence rate. Both the theoretical analysis and experimental results show that our proposed method achieves a better trade-off between the convergence rate and the per-iteration efficiency.

1 Introduction

Mathematical optimization plays an important role in machine learning. Many learning tasks can be formulated as a problem of minimizing a finite sum objective:

where N is the number of samples, d is the dimension of parameters and denotes the loss function for sample i. In order to solve Eq.(1), many first- and second-order methods have been proposed and has the following update paradigm:

where is the gradient and is the step size at t-th iteration. The term can be set differently in different methods. First-order methods set as an identity matrix. The resulting updating procedure becomes gradient descent (GD) or stochastic gradient descent (SGD) depending on whether the gradient is calculated over the whole sample set or one random sample [1, 2, 3]. A series of first-order linearly convergent methods and their variance reduction variants were proposed to accelerate the above iteration updates, including SVRG [4], SAGA [5], SDCA [6], etc. Their main intuition is to balance the computational cost of and the approximation gap between the stochastic gradient and the global gradient

Compared with first-order optimizers, second-order optimization methods regard in Eq. (2) as the Hessian inverse (the standard Newton Method) or certain carefully constructed Hessian inverse (quasi-Newton methods). The matrix can be thought to adjust the step size and gradient coordinates through the high order information or the quasi-Newton condition. Second-order methods usually achieve a better convergence rate than first-order ones.

However, second-order algorithms are not widely used in large-scale optimization problems due to expensive computation cost. Computing the Hessian matrix and its inverse requires high computation complexity and consumes large memory. Standard Newton method takes per iteration, where N and d are the number of samples and the number of unknown parameters, respectively. Quasi-Newton methods like BFGS and L-BFGS [7] are faster, but still requires . However, they do not maintain the local quadratic convergence rate. For problems with large parameters, it is not affordable even if the inverse calculation could fit in memory.

Our goal is to develop an approximate yet efficient Newton method with provable convergence guarantee. We aim to reduce per-iteration computation cost for the Hessian inverse calculation while maintaining the approximation accuracy. To this end, we propose Stochastic Projected Approximate Newton method (SPAN), a novel and generic approach to speed up second-order optimization calculation. The main contributions of the paper are as follows:

• We propose a novel second-order optimization method, SPAN, to achieve a better trade-off between Hessian approximation error and per-iteration efficiency. Inside the method, we propose a stochastic sampling technique to construct the Hessian approximately. Since it only requires first-order oracles and Hessian-vector products, the Hessian and its inverse can be computed very efficiently.

• We present a theoretical analysis about the Hessian approximation error and the convergence rate. SPAN achieves linear-quadratic convergence with a provable Hessian approximation error bound.

• We conduct experiments on multiple real datasets against existing state-of-the-art methods. The results validate that our proposed method achieves state-of-the-art performance in terms of the wall-clock time, with almost no sacrificing on the construction accuracy of the Hessian.

2 Related Work

In order to improve the per-iteration efficiency, several stochastic second-order methods have tried to seek the trade-off between per-iteration computational complexity and convergence rate, which can be divided into two main categories.

Stochastic Newton Methods. A series of sub-sampled Newton methods are proposed to solve the problems where the sample size N is far larger than the feature size d. In these methods, the Hessian matrix is approximated with a small subset of training samples. NewSamp [8] and similar methods by [9, 10] sample from function set and construct a regularized m-rank approximate Hessian with truncated singular value decomposition to improve the per-iteration efficiency. However, they need to compute the sub-sampled Hessian, which has an computational cost even if there is only one single instance. Removing the requirements of second-order oracle in NewSamp-type methods, our SPAN further accelerates the iteration with stochastic low-rank approximation techniques, which improves the complexity by a factor nearly O(d/m).

Sketch Newton method in [11] adopts sketching techniques to approximate Hessian. A non-uniform probability distribution was introduced to sample rows ofin [12]. Both of them use new approximate Hessian construction and proper sampling methods to improve Hessian estimation efficiency. Differ from such sketch Newton methods, our SPAN does not need the Hessian Decomposition assumption for sketched updates in [12, 11].

Rather than estimating the Hessian, LiSSA by [13] approximates the Hessian inverse with matrix Taylor expansion. The most elegant part of LiSSA is that the Hessian inverse estimation is unbiased, and the approximation error only depends on the matrix concentration inequalities. Compared with LiSSA, SPAN guarantees more robust per-iteration complexity when the objective function is not Generalized Linear Model (GLM). Moreover, we do not require the initial solution is close to the optimum. Additionally, the procedure of calculating the approximate Hessian for each sample in SPAN is independent, which means our method has better parallelism compared with LiSSA.

Table 1: Important mathematical notations in this paper.

Stochastic Quasi-Newton Methods. The quasi-Newton methods can also be improved with stochastic approximation techniques. S-LBFGS by [14] adopts the randomization to the classical L-BFGS and integrates the widely used gradient variance reduction techniques. Subsequently, SB-BFGS in [15] extends BFGS with matrix sketching to approximate Hessian inverse. Stochastic quasi-Newton methods have a better per-iteration complexity compared with stochastic Newton methods. However, most of them cannot explicitly demonstrate the benefit of introducing the curvature information in the convergence analysis. Such problem is solved in our SPAN by establishing the Hessian approximation error bound which can hardly be analyzed in stochastic quasi-Newton methods.

3 The Proposed SPAN Method

In this section, we will first present the overall idea and intuition of our proposed SPAN. Then, we will describe three technical components and the full algorithm. For better illustration, we list some important notations and their descriptions in Table 1.

The goal of SPAN method is to optimize Eq. (1) with respect to the decision variable x (i.e., model parameters) using the mini-batched iterative update:

where t is the iteration index. Note that the straightforward calculation of Eq. (3) requires time-consuming for models with a large d.

To make the computation faster, instead of calculating the batch Hessian (the abbreviation of ) and then taking the inverse explicitly, our idea is to replace with an approximate Hessian be bounded within a small region around the true Hessian . We propose to use the projected approximation for the Hessian where P is a carefully constructed orthogonal projector [16]. To further ensure invertible, we add an additional perturbation term . Note that such formulation is sufficient to approximate , since we can consider to obtain the optimal k-rank approximation of . However, in practice, we can hardly find exactly without knowing the batch Hessian. Thus, it is challenging to construct P with desired efficiency and approximation accuracy. At the first glance, this seems infeasible, while our main intuition is based on the observation that affine transformation A over random vectors tends to have larger components on the top singular vectors of A. Such intuition is helpful to find an accurate approximation of orthonormal projector . More details follow later.

In the remaining of this section, we present the construction of an approximate Hessian and the resulting optimization algorithm. In particular, we design the algorithm components guided by the following questions.

1. How to design a structure for can be computed efficiently without knowing

2. How to make the inverse robust — permitting to be invertible in any circumstance?

3. How to balance the Hessian approximation error and iteration complexity flexibly?

3.1 Stochastic Projected Approximation

We construct the approximate Hessian as which can be calculated without knowing essentially decides the direction where the Hessian would project onto. We decompose P into the product of the form is an orthonormal matrix. With that, we construct an approximate Hessian

One can easily verify that the Moore–Penrose inverse of can be computed efficiently via since the size of is much smaller than and can be calculated by the product of

In the following, we develop a method to obtain a U while keeping small. We derive a method to calculate without requiring Hessian

Our method is to randomly choose a set of column vectors and project them to the space expanded by the singular vectors of . From the projection, one can extract an orthonormal basis U to expand a low-dimensional space for to project to. The procedure is inspired by the Proto-Algorithm [17], combining with the secant equation for Hessian-vector products calculation. We utilize the standard Gaussian distribution to generate such and calculate the projection directly. To make the computation more efficient, the set of random vectors contains l elements where

Notice that for any matrix V whose i-th column vector is presented as , the extended Hessian-vector product on a sample batch B, denoted as

In practice, the Hessian-vector product can be calculated by the finite difference of gradients like Algorithm 1.

U is then constructed as follows:

With the above-constructed U, can be further constructed again using the extended Hessian-vector product, . But there is no need to calculate explicitly, it suffices to calculate its Moore-Penrose inverse. One can verify that with such constructed , the error in the term of is well-bounded [17].

3.2 Robust Hessian Inversion

The above constructed approximate Hessian is simplified with one flaw ( is actually not invertible) since the orthonormal matrix U is low-rank. This can be alleviated with a perturbed version of Eq.(4).

where is a carefully chosen constant with

The purpose of the perturbation term is to introduce the invertibility of the approximate matrix presented in Eq. (4). We will give an informal analysis here to show the importance of choosing a proper . On one hand, a large will impair captures of the main actions of , since it increases the lower bound of the Hessian approximation error

Figure 1: The normalized projection, i.e., and (LTR) of standard gaussian vectors (blue points) where

On the other hand, from an iteration perspective, we can neither choose a tiny since it will lose the benefit of the curvature information induced by approximate Newton. In particular, if we regard the SVDs of and as follows

The SVD of constructed Hessian in Eq. (5) and its inverse can be formulated as

It can be observed that a tiny will make the singular vectors associated with dominate the action of , which impairs the introduction of curvature information taken by

Hence, a proper is needed to balance the Hessian approximation error and the norm of constructed Hessian inverse. Specifically, we will further discuss (see Theorem 4.2) and the Hessian approximation error Lemma 4.1) in Section 4. Rigorous proof will be deferred to our supplementary materials.

3.3 Power Iteration

In this part, we introduce a power iteration technique to balance and iteration complexity more flexibly. That is to say, with limit iteration complexity sacrificing, the construction of in Eq. (5) can be improved through some auxiliary steps in the main algorithm. Generally speaking, any matrix U obtained as in Section 3.1 is valid for ’s construction. However, in practice, a better orthogonal projector maintains that top singular vectors of are rotated less after performing the projection on . The matrix U is usually obtained from the projection of random vectors. Thus, we hope basis vectors of the projection have larger components on top singular vectors e.g., , compared with those on and . Such a requirement can be satisfied by taking the power of the Hessian.

Specifically, as shown in Figure 1 showed, normalized (green dots) seems to like a unit ball, while they degenerate to a unit circle (red dots) when random vectors are projected through (red dots) because the component of projection on [1, 0, 0] almost have become 0. A similar phenomenon also happens for the component of projection on [0, 1, 0] when taking a larger power for Hessian, i.e., (yellow dots). As a result, the normalized projections nearly collapse to two points (yellow dots), i.e., [0, 0, 1] and . That is to say, the component of projection on bottom singular vectors tends to vanish. Besides, such a phenomenon becomes more significant as the power of Hessian increases. With such an observation, we can calculate a better U with the following steps:

1. generate a standard Gaussian matrix

2. iteratively use the extended Hessian-vector product to get

3. calculate the basis U via QR decomposition

3.4 Details of SPAN

In this section, we show the complete algorithm about SPAN in Algorithm 2. Also, we explain the iteration complexity of SPAN, and compare it with the state-of-the-art optimizers.

Table 2: Per-iteration complexity comparisons among stochastic first-order optimization methods, sub-sampled Newton Methods, LiSSA and ours. Notice that

In Algorithm 2, we use as the abbreviation of . At each iteration, we first select a sample batch (Step 3) and generate some random matrix (Step 4). With the random matrix can be found through the process we introduced in Section 3.3 (Step 5 to Step 8). After that, we compute (Step 9) as an intermediate variable and select the perturbation constant . Finally, we calculate the constructed Hessian inverse (Step 10) like Eq. (5) and update the decision variables (Step 11).

Iteration Efficiency. In order to illustrate the computational complexity of each iteration, we first introduce some condition numbers, which are designed with respect to the component functions, i.e., . In such case, one typically assumes that each component is bounded by and like LiSSA [13], we define

Such condition numbers have the following relations

We compare per-iteration complexity among state-of-the-art optimizers, including SPAN, and list the results in Table 2 where we ignore log terms of d and different s for brevity. The iteration complexity of SPAN consists of three terms. The first term O(Nd) represents the time complexity of the full gradient computation (Step 11). The second term indicates the complexity of power iteration (Step 5 to Step 7). The calculation of in Step 6 requires O(bld) where the fact and will be detailedly demonstrated in our supplementary materials. The last term shows the complexity of QR decomposition (Step 8). In particular, the most time-consuming step in constructing is to calculate , which leads to an time complexity. For a given , the computational cost of is only O(ld) when the order of matrix multiplication is appropriately arranged. Owing to the fact that , such computational cost is much smaller than (the third term).

Comparing with the NewSamp [8] which updates decision variables with a sub-sampled constructed Hessian inverse, SPAN takes a significant acceleration at each iteration through the only first-order oracle and the Hessian-vector product requirements when the m-th singular value is close to the minimum singular value with . In addition, per-iteration complexity in LiSSA [13] is even worse than NewSamp in general cases. However, in GLM models, LiSSA has better performance because of the fast calculation of Hessian-vector products. Even in GLM models, our SLAN can still be faster than LiSSA when the approximate Hessian is good enough, or the condition number is large with . Notice that, the condition numbers of batch loss chosen, i.e., , will not be too large. Otherwise, per-iteration efficiency is governed by batch Hessian or Hessian-vector products calculation, which impairs the acceleration achieved by matrix approximation techniques.

4 Theoretical Results

We will give a theoretical analysis of our results in this section. We will first introduce some standard assumptions for the optimization problems, and then bound the error between our approximate Hessian and the sub-sampled one. Finally, we will show the local linear-quadratic convergence of our main algorithm (Algorithm 2), with the detailed coefficients of the convergence rate. We further compare the convergence rate of our method with NewSamp [8] and LiSSA [13]. Due to space limitations, the details of proof arguments are provided in the supplementary materials.

Assumption 1. (Hessian Lipschitz Continuity) For any subset and a second-order differentiable objective function F like the Eq.(1), there exists a constant depending on b, such that

Assumption 2. (Gradient Lipschitz Continuity and Strong Convexity) For any subset and a second-order differentiable function F like the Eq.(1), there exists constants depending on the b, such that for any

Assumption 3. (Bounded Hessian) For any , and the Hessian of the function is upper bounded by an absolute constant K, i.e.,

Lemma 4.1. Suppose the Assumption 1, 2 and 3 hold. For every iteration t in Algorithm 2, if the parameters satisfy:

As far as we know, if we approximate the Hessian with an m-rank+sparse construction, the optimal Hessian approximation error will be which comes from NewSamp [8]. Corresponding to such an approximation error, the construction complexity of NewSamp requires which can hardly be endured with a slightly larger instance dimension. While, from Lemma 4.1, SPAN improves the complexity by a factor at least () when the stochastic Hessian can be well approximated and keeps the approximation error nearly three times the optimal (is a tiny constant). Such a guarantee cannot be promised for any quasi-Newton method. Although LiSSA [13] has a similar error bound for the approximate Hessian inverse, that approximation error is not comparable because it depends almost entirely on the concentration inequalities of the sub-sample processing but not the matrix approximation techniques. Additionally, we only bound the Hessian approximation error when q = O(log d) and . In fact, such upper bounds on q and are possibly pessimistic and can likely be improved to a more average quantity. However, since the parameters q = O(1) and suffice for convergence in our experimental settings, we have not tried to optimize it further.

Theorem 4.2. Suppose . Frame the hypotheses of Lemma. 4.1, if the parameters satisfy:

The coefficients

Figure 2: Training loss versus running time. The first two columns are for MNIST4-9 dataset. The last two are for CovType dataset. Note SPAN achieves the best or near the best performance with respect to wall-clock time.

Remark. Notice that, we require in Theorem 4.2, which is only introduced to simplify the formulation of coefficients, i.e., and . For any , we can obtain a similar convergence rate by setting . Thus, the theorem is without loss of generality.

Theorem 4.2 shows that SPAN has a similar composite convergence rate like NewSamp [8] and LiSSA [13] where and are coefficients of the linear and quadratic terms, respectively. Comparing the convergence with NewSamp whose first-order coefficient is abbreviated as , ours can be similar to . In particular, truncated SVD is introduced in NewSamp to find an optimal m-rank approximate Hessian so that in NewSamp can be regarded as the optimal coefficient in most stochastic Newton method settings. For LiSSA, its convergence rate is constrained by the initial decision variable and the sample size for calculating Hessian-vector product strictly, which causes that we cannot compare its convergence rate with ours directly. However, SPAN has a better convergence wall-clock time which is validated in our experimental results (see Section 5). In addition, during the analysis of the convergence rate, we notice that promoting the Hessian approximation error not only reduces the first-order coefficient in the linear-quadratic convergence rate but also expands the range of step size. This coincides with our intuition that a faster convergence usually requires a smaller Hessian approximation error. In summary, SPAN is designed to achieve a better trade-off between the iteration complexity and the convergence rate. With randomizing the process of constructing the special approximate Hessian, SPAN both accelerates the iteration and limits Hessian approximation error.

5 Experiments

To evaluate the performances and fully demonstrate the advantages of our proposed method, we conduct our experiments on several machine learning tasks with different objective functions, including binary image classification and text classification. As we get similar conclusions on these two tasks, we only present the experimental result on the binary image classification in this section. We will further show all experimental details, including the parameters of all optimizers and the results on text classification tasks in the supplementary materials due to space limitations.

For the image classification tasks, we refer to the experimental settings in [13] and [8]. We utilize the following objective function:

where and are the instances and labels, respectively, and a is the regularization parameter which influences the condition number of the objective.

We choose state-of-the-art first- and second-order optimization methods as baselines, including SVRG [4], NewSamp [8], S-LBFGS [14], SB-BFGS [15], and LiSSA [13]. We implement all the methods in C++ and Intel Math Kernel Library(MKL). All the code for our experiments can be found in our supplementary material.

We adopt two datasets in our experiment, including the MNIST4-9 dataset [13] consisting of approximate instances, and CovType dataset [8] consisting of approximate instances. For each dataset, we evaluate all the methods on two different condition numbers, respectively. Besides, for the fairness of comparison, in each set of experiments, we pick the optimal hyperparameters for every method and present the wall-clock time of all the methods in Figure 2.

Figure 3: Empirical convergence rate comparison on MNIST4-9 dataset. Four columns represent different batch-size for sub-sampled Hessian.

Figure 4: Hessian approximation error. Notice that SPAN achieves near lowest error.

From Figure 2, we can find that our method SPAN achieves the best results in almost all experimental settings. NewSamp gets a very close performance on the CovType dataset but consumes a lot of time on the MNIST dataset. LiSSA performs very close on the MNIST dataset but does not perform well on the CovType dataset. Meanwhile, there are significant gaps between SPAN and the other methods in all the experimental settings.

To further explain the gaps between the second-order baselines and SPAN, we evaluate the empirical approximation error between the real full Hessian and the constructed approximate Hessian in Figure 4. We can see that the approximation error of our method is very close to that of the theoretical optimal solution (NewSamp) but much more accurate than the other methods. The results of the experiment demonstrates that our constructed approximate Hessian in SPAN makes better use of second-order information than other stochastic second-order optimizers (S-LBFGS, SB-BFGS, and LiSSA). Moreover, Figure 3 illustrates the empirical convergence rate of these methods. As we can see, the empirical convergence rate of our proposed method is much better than LiSSA but slightly worse than NewSamp, which confirms our theoretical results and insights again.

Even though NewSamp guarantees an excellent convergence rate, the efficiency of the algorithm mainly depends on the number of parameters. NewSamp needs a long time to perform an iteration, even when there are only 784 parameters in MNIST dataset. As a comparison, our proposed SPAN is much more competitive in terms of the robustness of the feature dimension and the per-iteration efficiency in various scenarios.

Considering the experiments mentioned above, SPAN has a tolerable Hessian approximation error which results in the runner up with respect to the empirical convergence rate. Besides, it enjoys the benefit of per-iteration efficiency, which makes it outperform others in practice. In summary, we conclude that SPAN achieves a better trade-off between per-iteration efficiency and convergence rate. Furthermore, it is robust for the sub-sampled batch size.

6 Conclusion and future work

Newton and quasi-Newton methods converge at faster rates than gradient descent methods. However, they are often expensive, computationally. In this paper, we propose SPAN, a novel method to optimize a smooth-strongly convex objective function. SPAN utilizes the first-order oracle for Hessian approximation. Therefore, it is much faster than Newton method and its alike. We give a theoretical analysis of its approximation accuracy and convergence rate. Experiments on several real datasets demonstrate that our proposed method outperforms previous state-of-the-art methods.

For the future work, the constructed approximate Hessian of SPAN can be combined with advanced first-order methods, i.e., SVRG [4] and SAGA [5], or help to introduce second-order information to complex objective functions, e.g., cubic regularization [18, 19, 20], neural networks with low complexity and competitive convergence rate.

Acknowledge

Thanks the anonymous reviewers for their helpful comments and suggestions. We also thank Zhe Wang, Linyun Yu, Yitan Li, and Hao Zhou for their valuable discussions.

References

[1] Herbert Robbins and Sutton Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102–109. Springer, 1985.

[2] Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670. ACM, 2014.

[3] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in neural information processing systems, pages 1647–1655, 2011.

[4] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.

[5] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.

[6] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72, 2014.

[7] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.

[8] Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled newton methods. arXiv preprint arXiv:1508.02810, 2015.

[9] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled newton methods ii: local convergence rates. arXiv preprint arXiv:1601.04738, 2016.

[10] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.

[11] Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.

[12] Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher Ré, and Michael W Mahoney. Sub-sampled newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems, pages 3000–3008, 2016.

[13] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187, 2017.

[14] Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic l-bfgs algorithm. In Artificial Intelligence and Statistics, pages 249–258, 2016.

[15] Robert Gower, Donald Goldfarb, and Peter Richtárik. Stochastic block bfgs: Squeezing more curvature out of data. In International Conference on Machine Learning, pages 1869–1878, 2016.

[16] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.

[17] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.

[18] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[19] Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1895–1904. JMLR. org, 2017.

[20] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2904–2913, 2018.

[21] David Gross and Vincent Nesme. Note on sampling without replacing from a finite collection of matrices. arXiv preprint arXiv:1001.2738, 2010.

[22] Rajendra Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013.

[23] Vladimir Igorevich Bogachev. Gaussian measures. Number 62. American Mathematical Soc., 1998.

[24] Roger A Horn and Charles R Johnson. Matrix analysis cambridge university press. New York, 37, 1985.

[25] Mnist dataset http://yann.lecun.com/exdb/mnist/.

[26] Covertype dataset https://archive.ics.uci.edu/ml/datasets/covertype.

[27] Steve Webb, James Caverlee, and Calton Pu. Introducing the webb spam corpus: Using email spam to identify web spam automatically. In CEAS, 2006.

[28] Yelp dataset https://www.yelp.com/dataset/challenge.

A Some Important Lemmas

In this section, we give several important definitions and lemmas which will be used in the proof of the convergence analysis (Section. 4 in this paper).

Definition 1. (Orthogonal Projectors [17]) An orthogonal projector is an Hermitian matrix P that satisfies the polynomial . This identity implies . An orthogonal projector is completely determined by its range(column space). For a given matrix M, we write for the unique orthogonal projector with range (M). When M has full column rank, we can express this projector explicitly:

where denotes the conjugate transpose matrix of M. The orthogonal projector onto the complementary subspace, Lemma A.1. (Matrix Concentration Inequalities [21]) Let X be a finite set of Hermitian matrices in where

Given its size, let S denote a uniformly random sample from {1, 2, ..., |X|} with or without replacement. Then we have

Lemma A.2. (Conjugation Rule) Suppose that , for every real matrix A, the matrix . In particular,

Proof. The positive semidefinite matrices form a convex cone, which induces a partial ordering on the linear space of Hermitian matrices: if and only if is positive semidefinite, which means

For any real vector u, v = Au satisfies

Lemma A.3. Let P be an orthogonal projector, and let M be a real square matrix. For each positive number q, we have

Proof. Define the SVD of M as where U and V are unitary matrices. Thus, we have where denotes the i-th largest singular value of matrix M. Then, we have

where is an orthogonal projector, X is a nonnegative diagonal matrix and . This relation follows immediately from Theorem. IX.2.10 in [22].

Lemma A.4. (Proposition 10.2 in [17]) Draw a standard Gaussian matrix

Lemma A.5. (Proposition 10.4 in [17]) Let Gaussian matrix where

Lemma A.6. (Theorem. 4.5.7 in [23]). Suppose that h is a Lipschitz function on matrices:

Draw a standard Gaussian matrix G, then

Lemma A.7. Suppose . Then, for each real matrix A, it holds that and that

Proof. By applying Definition 1 and Lemma A.2, we have and . Since

That is to say, . Then, we have

The second statement follows from the first by taking orthogonal complements.

Lemma A.8. Suppose a real matrix satisfies

Proof. Define as the positive semidefinite square root of M promised by Theorem 7.2.6 in [24], we have

The first equality can be verified algebraically. The second holds because rational functions of a diagonalizable matrix, such as A, commute. The last relation is because the conjugate rule

Lemma A.9. (Proposition 8.3 in [17]) We have for each partitioned positive semidefinite matrix

Lemma A.10. Let be two positive definite matrices. If there is existing that

the minimum singular values of

Proof. To make a contradiction with Eq. (8), we assume that the minimum singular values of A and B have

For the singular vectors which respectively correspond to

where the inequality. Since the inequalitycontradict with the given condition Eq. (7), the assumption Eq. (8) will never be satisfied. Hence, we have

Lemma A.11. Let be a positive definite matrix. A and a matrix which can be presented as

where Q is an unitary matrix and

Choose a test matrix (), and construct the sample matrix . Assuming that has full row rank, the orthogonal projector

wheremeans the Moore–Penrose inverse of matrix

Proof. For the orthogonal projector and the unitary matrix Q, it is clear that is an orthogonal projector due to Definition 1. Evidently,

Since ranges determine orthogonal projectors, we have and the following equation

where the first euqation is because the unitary invariance of the spectral norm. Subsequently, we construct a matrix Z whose column space belongs to the range of to scale the right-most item of the Eq.(11) as follows

uses the full row rank properties of , and the construction of Z denotes that Lemma A.7 implies

where the matrix Z has a full column rank because of Eq. (12). Thus, we have

Applying Lemma A.8, the top-left block verifies

Besides, the bottom-right block satisfies

For convenience, we abbreviate , the matrix

As our construction in Eq. (12), we have . Thus, we have

Finally, after introducing the inequality into Eq. (14), the proof is completed.

B Proofs in Section 4

B.1 Proof of Lemma 4.1

Proof. According to the construction of (Step. 5 to Step. 7 in Algorithm 1) and Z (Step. 9 in Algorithm 1), we have

where

Here, we denote the orthogonal projector of Y as . With the approximate Hessian (Eq. (6) in Section. 2.2), we have the Hessian approximation error satisfies

With algorithm settings and positive definite properties of , we reformulate the Y and introduce an auxiliary variable in the following

where we set the SVD of the batch Hessian and diagonal elements of

Thus, we have the following inequality according to Lemma A.3

Subsequently, we can introduce a test matrix with some special properties to scale Eq.(18). We set as a standard Gaussian matrix, and . As a result, the matrixis also a standard Gaussian matrix because of the unitary of Q and the rotational invariance of . Besides, and are nonoverlapping submatrices of, and these two matrices are stochastically independent. Hence, we can learn how the error depends on the matrix by conditioning on the event that is relatively regular. We set a event as

According to Lemma A.5, we find

Consider the function whose Lipschitz constant L can be calculated in the following

That is to say,

Applying the concentration measure inequality, Lemma A.6, conditionally to the random variable

According to Eq. (19), we have a explicit bound under the event as follows

With the fact Eq.(20), we can remove the conditioning and obtain

Because of the SVDs presented in Eq. (16), we have

After introducing Eq.(21) and Eq.(22) to Eq.(18), we have

with probability at least . For making the upper bound clear, we set , and have

Hence, we make some abbreviations as

and Eq.(23) can be reformulated and scaled as follows

Finally, we introduce Eq.(24) and Eq.(17) to Eq.(15) to have the following error upper bound with probability at least

Due to the fact

That is to say, when the power iteration q satisfy

Corollary 1. Frame the hypotheses of Lemma A.11, and assume

with failure probability at most

B.2 Proof of Theorem 4.2

Proof. According to Step. 11 in SPAN Algorithm, we have

whereestablish because of the symmetry of

With part 1.1 in Eq. (28) and the construction of

where we denote

for abbreviation. Besides, for the definition of v, there are

For part 1.1.1, we have

whereestablish because of the Young’s inequality and the Cauchy–Schwarz inequality.

For part 1.1.2, with the similar proof technique, we obtain

For part 1.1.3, we can easily find

Submitting Eq.(32), Eq.(33) and Eq.(34) back into Eq.(29), we obtain that, for any feasible v, the part 1.1 of Eq. (28) satisfies

where establish because of the fact . For any specific and , there is existing some which satisfies the inequality . To clarify the proof procedure, we assume

Notice that, the proof techniques presented in the following are not constrained by such specific . To minimize our convergence constant, we required that

where we set . Hence, we request that the positive root about in Eq.(37) should satisfy . Then, we present the analytic form of

According to Corollary 1, the sufficient condition for the establishment of the last equation of Eq. (38) can be formulated as

with probability at least . Combining Eq.(39) and Eq.(36), when k = 1, we have

To summarize these equations, we have the following conclusion. When , we can set

with probability at least . In Eq.(41),is from Eq. (38). Part 1.1 in Eq.(28) has

With Similar to the tricks on the part 1.1, for part 1.2 in Eq.(28), we have

where establish because of Eq. (29). Utilizing the similar proof techniques presented in Eq. (32) and Eq. (33) on part 1.2.2 and part 1.2.3, we respectively have

where are introduced as Eq. (30). Besides satisfies Eq. (37). With such settings, we have the following inequalities for part 1.2.1 and part 1.2.4

Therefore, with Eq. (43), Eq. (44) and Eq. (45), we can obtain that

With the condition shown in Eq. (40) and our requirements, we have

Thus, we can derive

as

when

The sufficient conditioncomes from the upper bound of shown in Eq. (41).

For the part 2 in Eq.(27): We denote the stochastic bias matrix and the sub-sample hessian matrix as follows

With Assumption 2.3 and the construction Eq.(54), we can get some properties

According to the matrix Bernstein’s inequality given in Lemma A.1, for any available we have bound the part 2 in Eq.(27) as

Hence, for any , if we want with probability at least , the sampled size of B, b, should satisfy

Hence, the Eq.(27) can be rewrited in the following with probability at least

due to Eq. (50), Eq. (52), Eq. (53) and Eq. (54). Since , we can conclude the convergence rate as

with probability at least . For clarify the coefficients of convergence further, we have

Thus, the coefficients of convergence satisfy

C Additional Experiments

C.1 Exact Experimental setting on Logistic Regression

In this section, we detail our experiments on Logistic Regression. We set the objective function as

Datasets: In this experiment, we use two datasets for the binary classification task including MNIST [25] and CovType [26].

• MNIST: The MNIST database is a dataset of handwritten digits. It has 60,000 training samples, and 10,000 test samples. Each image is represented by 28x28 pixels, each containing a value 0 - 255 with its grayscale value.

• CovType: Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).

Data Preprocessing: For the two datasets proposed above, we will list our data preprocessing in the following

• MNIST: We first select the samples whose label is 4 or 9 (Logistic Regression is used in binary classification tasks). Then we normalize the samples as

Preiteration: As the preiteration requirement of LiSSA [13], for the sake of fairness, we will perform 2 epoch SVRG [4] iteration for all the methods after initializing the decision variable

Experimental Results: We have shown our experimental results about Logistic Regression in our paper.

Parameters Selection: We have shown all the experimental parameters in Path: ./code/lr_data/params

C.2 Extra Experiments on Huber SVM

In this section, we conduct our experiments on Support Vector Machines. To satisfy assumptions of all baselines and SPAN, we set the loss function l(.) in the SVM

as smoothed Huber loss presented as

Figure 5: Training loss versus running time. The first two columns are for webspam dataset. The last two are for Yelp Review Polarity dataset. Note SPAN achieves the best or near the best performance with respect to wall-clock time.

Figure 6: Hessian approximation error. Note that SPAN achieves near lowest error.

Datasets: In this experiment, we use two text classification datasets including webspam [27] and Yelp Review Polarity [28].

• webspam: For each instance, continuous 1 bytes are treated as a word, and use word count as the feature value. In the end, there are 254 features for each sample, and the number of sample is about 160, 000.

• Yelp Review Polarity: The Yelp reviews polarity dataset is constructed by considering stars 1 and 2 negative, and 3 and 4 positive. For each polarity 280,000 training samples and 19,000 testing samples are take randomly. In total there are 560,000 trainig samples and 38,000 testing samples.

Data Preprocessing: For the two datasets proposed above, we will list our data preprocessing in the following

• Yelp Review Polarity: We want all baselines (including the second-order methods) to converge in a relatively short period of time. Hence, we use CountVectorizer to preprocess the data, and normalize each sample as we do in webspam. Then, we uniformly pick 1000 features to describe each sample. If the number of samples which are described as the zero vector is over a half of the total sample number, we will re-sample the features

Preiteration: As the preiteration requirement of LiSSA [13], for the sake of fairness, we will perform 2 epoch SVRG [4] iteration for all the methods after initializing the decision variable

Experimental Results: We first show the graph about the training loss versus running time in different condition numbers in Fig 5. Second, we give the Hessian approximation error to validate the quality of our approximate Hessian in Fig 6.

Parameters Selection: We have shown all the experimental parameters in Path: ./code/svm_data/params

designed for accessibility and to further open science