b

DiscoverSearch
About
My stuff
SPAN: A Stochastic Projected Approximate Newton Method
2020·arXiv
ABSTRACT
ABSTRACT

Second-order optimization methods have desirable convergence properties. However, the exact Newton method requires expensive computation for the Hessian and its inverse. In this paper, we propose SPAN, a novel approximate and fast Newton method. SPAN computes the inverse of the Hessian matrix via low-rank approximation and stochastic Hessian-vector products. Our experiments on multiple benchmark datasets demonstrate that SPAN outperforms existing first-order and second-order optimization methods in terms of the convergence wall-clock time. Furthermore, we provide a theoretical analysis of the per-iteration complexity, the approximation error, and the convergence rate. Both the theoretical analysis and experimental results show that our proposed method achieves a better trade-off between the convergence rate and the per-iteration efficiency.

Mathematical optimization plays an important role in machine learning. Many learning tasks can be formulated as a problem of minimizing a finite sum objective:

image

where N is the number of samples, d is the dimension of parameters and  fi(x)denotes the loss function for sample i. In order to solve Eq.(1), many first- and second-order methods have been proposed and has the following update paradigm:

image

where  g(xt)is the gradient and  ηtis the step size at t-th iteration. The term  H−1(xt)can be set differently in different methods. First-order methods set  H−1(xt)as an identity matrix. The resulting updating procedure becomes gradient descent (GD) or stochastic gradient descent (SGD) depending on whether the gradient  g(xt)is calculated over the whole sample set or one random sample [1, 2, 3]. A series of first-order linearly convergent methods and their variance reduction variants were proposed to accelerate the above iteration updates, including SVRG [4], SAGA [5], SDCA [6], etc. Their main intuition is to balance the computational cost of  g(xt)and the approximation gap between the stochastic gradient  g(xt)and the global gradient  ∇F(xt).

Compared with first-order optimizers, second-order optimization methods regard  H−1(xt)in Eq. (2) as the Hessian inverse (the standard Newton Method) or certain carefully constructed Hessian inverse (quasi-Newton methods). The matrix  H−1(xt)can be thought to adjust the step size and gradient coordinates through the high order information or the quasi-Newton condition. Second-order methods usually achieve a better convergence rate than first-order ones.

However, second-order algorithms are not widely used in large-scale optimization problems due to expensive computation cost. Computing the Hessian matrix and its inverse requires high computation complexity and consumes large memory. Standard Newton method takes  O(Nd2 + d3)per iteration, where N and d are the number of samples and the number of unknown parameters, respectively. Quasi-Newton methods like BFGS and L-BFGS [7] are faster, but still requires  O(Nd + d2). However, they do not maintain the local quadratic convergence rate. For problems with large parameters, it is not affordable even if the inverse calculation could fit in memory.

Our goal is to develop an approximate yet efficient Newton method with provable convergence guarantee. We aim to reduce per-iteration computation cost for the Hessian inverse calculation while maintaining the approximation accuracy. To this end, we propose Stochastic Projected Approximate Newton method (SPAN), a novel and generic approach to speed up second-order optimization calculation. The main contributions of the paper are as follows:

We propose a novel second-order optimization method, SPAN, to achieve a better trade-off between Hessian approximation error and per-iteration efficiency. Inside the method, we propose a stochastic sampling technique to construct the Hessian approximately. Since it only requires first-order oracles and Hessian-vector products, the Hessian and its inverse can be computed very efficiently.

We present a theoretical analysis about the Hessian approximation error and the convergence rate. SPAN achieves linear-quadratic convergence with a provable Hessian approximation error bound.

We conduct experiments on multiple real datasets against existing state-of-the-art methods. The results validate that our proposed method achieves state-of-the-art performance in terms of the wall-clock time, with almost no sacrificing on the construction accuracy of the Hessian.

In order to improve the per-iteration efficiency, several stochastic second-order methods have tried to seek the trade-off between per-iteration computational complexity and convergence rate, which can be divided into two main categories.

Stochastic Newton Methods. A series of sub-sampled Newton methods are proposed to solve the problems where the sample size N is far larger than the feature size d. In these methods, the Hessian matrix is approximated with a small subset of training samples. NewSamp [8] and similar methods by [9, 10] sample from function set  {fi} randomlyand construct a regularized m-rank approximate Hessian with truncated singular value decomposition to improve the per-iteration efficiency. However, they need to compute the sub-sampled Hessian, which has an  O(d2)computational cost even if there is only one single instance. Removing the requirements of second-order oracle in NewSamp-type methods, our SPAN further accelerates the iteration with stochastic low-rank approximation techniques, which improves the complexity by a factor nearly O(d/m).

Sketch Newton method in [11] adopts sketching techniques to approximate Hessian. A non-uniform probability distribution was introduced to sample rows of�∇2F(x)in [12]. Both of them use new approximate Hessian construction and proper sampling methods to improve Hessian estimation efficiency. Differ from such sketch Newton methods, our SPAN does not need the Hessian Decomposition assumption for sketched updates in [12, 11].

Rather than estimating the Hessian, LiSSA by [13] approximates the Hessian inverse with matrix Taylor expansion. The most elegant part of LiSSA is that the Hessian inverse estimation is unbiased, and the approximation error only depends on the matrix concentration inequalities. Compared with LiSSA, SPAN guarantees more robust per-iteration complexity when the objective function is not Generalized Linear Model (GLM). Moreover, we do not require the initial solution is close to the optimum. Additionally, the procedure of calculating the approximate Hessian for each sample in SPAN is independent, which means our method has better parallelism compared with LiSSA.

image

Table 1: Important mathematical notations in this paper.

Stochastic Quasi-Newton Methods. The quasi-Newton methods can also be improved with stochastic approximation techniques. S-LBFGS by [14] adopts the randomization to the classical L-BFGS and integrates the widely used gradient variance reduction techniques. Subsequently, SB-BFGS in [15] extends BFGS with matrix sketching to approximate Hessian inverse. Stochastic quasi-Newton methods have a better per-iteration complexity compared with stochastic Newton methods. However, most of them cannot explicitly demonstrate the benefit of introducing the curvature information in the convergence analysis. Such problem is solved in our SPAN by establishing the Hessian approximation error bound which can hardly be analyzed in stochastic quasi-Newton methods.

In this section, we will first present the overall idea and intuition of our proposed SPAN. Then, we will describe three technical components and the full algorithm. For better illustration, we list some important notations and their descriptions in Table 1.

The goal of SPAN method is to optimize Eq. (1) with respect to the decision variable x (i.e., model parameters) using the mini-batched iterative update:

image

where t is the iteration index. Note that the straightforward calculation of Eq. (3) requires  O(Nd + bd2 + d3) which istime-consuming for models with a large d.

To make the computation faster, instead of calculating the batch Hessian  HB(the abbreviation of  HB(xt)) and then taking the inverse explicitly, our idea is to replace  HB in Eq. (3)with an approximate Hessian ˆHB. Ideally, ˆHB shouldbe bounded within a small region around the true Hessian  HB. We propose to use the projected approximation for the Hessian ˆHB = PHBP Twhere P is a carefully constructed orthogonal projector [16]. To further ensure ˆHBinvertible, we add an additional perturbation term  ∆H, ˆHB = PHBP T + ∆H. Note that such formulation is sufficient to approximate  HB, since we can consider  P ∗ = Σki=1pi(HB)pi(HB)Tto obtain the optimal k-rank approximation of  HB. However, in practice, we can hardly find  pi(HB)exactly without knowing the batch Hessian. Thus, it is challenging to construct P with desired efficiency and approximation accuracy. At the first glance, this seems infeasible, while our main intuition is based on the observation that affine transformation A over random vectors tends to have larger components on the top singular vectors of A. Such intuition is helpful to find an accurate approximation of orthonormal projector  P ∗. More details follow later.

In the remaining of this section, we present the construction of an approximate Hessian and the resulting optimization algorithm. In particular, we design the algorithm components guided by the following questions.

1. How to design a structure for  P so that ˆHB = PHBP T can be computed efficiently without knowing  HB?

2. How to make the inverse robust — permitting ˆHBto be invertible in any circumstance?

3. How to balance the Hessian approximation error  ∥ ˆHB − HB∥and iteration complexity flexibly?

image

3.1 Stochastic Projected Approximation

We construct the approximate Hessian as ˆHB = PHBP T which can be calculated without knowing  HB. A proper Pessentially decides the direction where the Hessian  HBwould project onto. We decompose P into the product of the form  P = UU T , where U ∈ Rd×l is an orthonormal matrix. With that, we construct an approximate Hessian ˆHB by

image

One can easily verify that the Moore–Penrose inverse of ˆHBcan be computed efficiently via ˆH†B = U�U T HBU�−1 U Tsince the size of  U T HBU ∈ Rl×l is much smaller than  HBand can be calculated by the product of  U T and HBU.

In the following, we develop a method to obtain a U while keeping  ∥ ˆHB − HB∥small. We derive a method to calculate HBUwithout requiring Hessian  HB.

Our method is to randomly choose a set of column vectors  Ωand project them to the space expanded by the singular vectors of  HB. From the projection, one can extract an orthonormal basis U to expand a low-dimensional space for HBto project to. The procedure is inspired by the Proto-Algorithm [17], combining with the secant equation for Hessian-vector products calculation. We utilize the standard Gaussian distribution to generate such  Ω ∈ Rd×land calculate the projection  Y = HBΩdirectly. To make the computation more efficient, the set of random vectors  Ω onlycontains l elements where  l ≪ d.

Notice that for any matrix V whose i-th column vector is presented as  vi, the extended Hessian-vector product on a sample batch B, denoted as  ΨB(x, V ), is

image

In practice, the Hessian-vector product  HB(x)vican be calculated by the finite difference of gradients like Algorithm 1.

U is then constructed as follows:

image

With the above-constructed U, ˆHB(x)can be further constructed again using the extended Hessian-vector product, ˆHB(x) = UU T ΨB(x, U)U T . But there is no need to calculate  ˆHBexplicitly, it suffices to calculate its Moore-Penrose inverse. One can verify that with such constructed ˆHB, the error in the term of  ∥ ˆHB(x)−HB(x)∥is well-bounded [17].

3.2 Robust Hessian Inversion

The above constructed approximate Hessian is simplified with one flaw ( ˆHBis actually not invertible) since the orthonormal matrix U is low-rank. This can be alleviated with a perturbed version of Eq.(4).

image

where  λis a carefully chosen constant with  λ > 0.

The purpose of the perturbation term  λ�I − UU T �is to introduce the invertibility of the approximate matrix ˆHB(x)presented in Eq. (4). We will give an informal analysis here to show the importance of choosing a proper  λ. On one hand, a large  λwill impair captures of the main actions of  HB, since it increases the lower bound of the Hessian approximation error  ∥ ˆHB − HB∥ as

image

image

Figure 1: The normalized projection, i.e.,  HBΩ, H4BΩand  H10B Ω(LTR) of standard gaussian vectors  Ω(blue points) where  HB = diag{1, 2, 3}.

On the other hand, from an iteration perspective, we can neither choose a tiny  λsince it will lose the benefit of the curvature information induced by approximate Newton. In particular, if we regard the SVDs of  I − UU Tand U T HB(x)Uas follows

image

The SVD of constructed Hessian ˆHB(xt)in Eq. (5) and its inverse can be formulated as

image

It can be observed that a tiny  λwill make the singular vectors associated with  λ−1dominate the action of ˆH−1B (x), which impairs the introduction of curvature information taken by  U ˆU and Σ.

Hence, a proper  λis needed to balance the Hessian approximation error and the  l2norm of constructed Hessian inverse. Specifically, we will further discuss  λ(see Theorem 4.2) and the Hessian approximation error  ∥ ˆHB(x) − HB(x)∥(seeLemma 4.1) in Section 4. Rigorous proof will be deferred to our supplementary materials.

3.3 Power Iteration

In this part, we introduce a power iteration technique to balance  ∥ ˆHB − HB∥and iteration complexity more flexibly. That is to say, with limit iteration complexity sacrificing, the construction of ˆHBin Eq. (5) can be improved through some auxiliary steps in the main algorithm. Generally speaking, any matrix U obtained as in Section 3.1 is valid for ˆHB’s construction. However, in practice, a better orthogonal projector  UU Tmaintains that top singular vectors of HB(xt)are rotated less after performing the projection on  HB(xt), e.g., UU T HB(xt)UU T . The matrix U is usually obtained from the projection  HB(xt)Ωof random vectors. Thus, we hope basis vectors of the projection have larger components on top singular vectors e.g.,  p1(HB), p2(HB), compared with those on  pd−1(HB)and  pd(HB). Such a requirement can be satisfied by taking the power of the Hessian.

Specifically, as shown in Figure 1 showed, normalized  HBΩ(green dots) seems to like a unit ball, while they degenerate to a unit circle (red dots) when random vectors  Ωare projected through  H4B(red dots) because the component of projection on [1, 0, 0] almost have become 0. A similar phenomenon also happens for the component of projection on [0, 1, 0] when taking a larger power for Hessian, i.e.,  H10B Ω(yellow dots). As a result, the normalized projections nearly collapse to two points (yellow dots), i.e., [0, 0, 1] and  [0, 0, −1]. That is to say, the component of projection on bottom singular vectors tends to vanish. Besides, such a phenomenon becomes more significant as the power of Hessian increases. With such an observation, we can calculate a better U with the following steps:

1. generate a standard Gaussian matrix  Ω and set Y0 = Ω;

2. iteratively use the extended Hessian-vector product to get  Yj = [HB(x)]jΩ = ΨB(x, Yj−1);

3. calculate the basis U via QR decomposition  Yq = UR.

3.4 Details of SPAN

In this section, we show the complete algorithm about SPAN in Algorithm 2. Also, we explain the iteration complexity of SPAN, and compare it with the state-of-the-art optimizers.

image

Table 2: Per-iteration complexity comparisons among stochastic first-order optimization methods, sub-sampled Newton Methods, LiSSA and ours. Notice that  κb,k ≤ ˆκ and ˙κb,k ≤ ˜κb,k.

In Algorithm 2, we use  σi,tas the abbreviation of  σi(HB(xt)). At each iteration, we first select a sample batch (Step 3) and generate some random matrix (Step 4). With the random matrix  Ω, a proper Ucan be found through the process we introduced in Section 3.3 (Step 5 to Step 8). After that, we compute  HB(xt)U(Step 9) as an intermediate variable and select the perturbation constant  λ. Finally, we calculate the constructed Hessian inverse ˆH−1B (xt)(Step 10) like Eq. (5) and update the decision variables (Step 11).

Iteration Efficiency. In order to illustrate the computational complexity of each iteration, we first introduce some condition numbers, which are designed with respect to the component functions, i.e.,  fi(·) and fB(·). In such case, one typically assumes that each component is bounded by  βb,k(x) ≜ maxB σk(HB)and  αb,k(x) ≜ minB σk(HB)like LiSSA [13], we define

image

Such condition numbers have the following relations

image

We compare per-iteration complexity among state-of-the-art optimizers, including SPAN, and list the results in Table 2 where we ignore log terms of d and different  κs for brevity. The iteration complexity of SPAN consists of three terms. The first term O(Nd) represents the time complexity of the full gradient computation (Step 11). The second term indicates the complexity of power iteration (Step 5 to Step 7). The calculation of  ΨB(·)in Step 6 requires O(bld) where the fact  q ∝ log dand  b = Θ(˙κ2b,1 ˙κ2b,m log d)will be detailedly demonstrated in our supplementary materials. The last term  O(l2d)shows the complexity of QR decomposition (Step 8). In particular, the most time-consuming step in constructing ˆH−1B (xt)is to calculate  (ZT U)−1, which leads to an  O(l3)time complexity. For a given  g(xt), the computational cost of ˆH−1B (xt)g(xt)is only O(ld) when the order of matrix multiplication is appropriately arranged. Owing to the fact that  l ≪ d, such computational cost is much smaller than  O(l2d)(the third term).

Comparing with the NewSamp [8] which updates decision variables with a sub-sampled constructed Hessian inverse, SPAN takes a significant acceleration at each iteration through the only first-order oracle and the Hessian-vector product requirements when the m-th singular value is close to the minimum singular value with  ˙κ2b,m ≤�d/m. In addition, per-iteration complexity in LiSSA [13] is even worse than NewSamp in general cases. However, in GLM models, LiSSA has better performance because of the fast calculation of Hessian-vector products. Even in GLM models, our SLAN can still be faster than LiSSA when the approximate Hessian is good enough, or the condition number  κb,1is large with  ˙κ2b,ml ≤ κb,1. Notice that, the condition numbers of batch loss chosen, i.e.,  ˙κ2b,1 ≥ d, will not be too large. Otherwise, per-iteration efficiency is governed by batch Hessian or Hessian-vector products calculation, which impairs the acceleration achieved by matrix approximation techniques.

We will give a theoretical analysis of our results in this section. We will first introduce some standard assumptions for the optimization problems, and then bound the error between our approximate Hessian and the sub-sampled one. Finally, we will show the local linear-quadratic convergence of our main algorithm (Algorithm 2), with the detailed coefficients of the convergence rate. We further compare the convergence rate of our method with NewSamp [8] and LiSSA [13]. Due to space limitations, the details of proof arguments are provided in the supplementary materials.

Assumption 1. (Hessian Lipschitz Continuity) For any subset  B ⊂ [N]and a second-order differentiable objective function F like the Eq.(1), there exists a constant  Mbdepending on b, such that  ∀x, ˆx ∈ D

image

Assumption 2. (Gradient Lipschitz Continuity and Strong Convexity) For any subset  B ⊂ [N]and a second-order differentiable function F like the Eq.(1), there exists constants  µb and Lbdepending on the b, such that for any  x, ˆx ∈ D

image

Assumption 3. (Bounded Hessian) For any  i ∈ {1, 2, ..., N}, and the Hessian of the function  fi(x) in Eq. (1), ∇2fi(x)is upper bounded by an absolute constant K, i.e.,

image

Lemma 4.1. Suppose the Assumption 1, 2 and 3 hold. For every iteration t in Algorithm 2, if the parameters satisfy:

image

As far as we know, if we approximate the Hessian with an m-rank+sparse construction, the optimal Hessian approximation error will be  σm+1,t − σd,twhich comes from NewSamp [8]. Corresponding to such an approximation error, the construction complexity of NewSamp requires  O( ˙κb,1d2 + md2)which can hardly be endured with a slightly larger instance dimension. While, from Lemma 4.1, SPAN improves the complexity by a factor at least  O(d/(l ˙κ2b,m))(l ≪ d) when the stochastic Hessian can be well approximated and keeps the approximation error nearly three times the optimal (σn,tis a tiny constant). Such a guarantee cannot be promised for any quasi-Newton method. Although LiSSA [13] has a similar error bound for the approximate Hessian inverse, that approximation error is not comparable because it depends almost entirely on the concentration inequalities of the sub-sample processing but not the matrix approximation techniques. Additionally, we only bound the Hessian approximation error when q = O(log d) and λ ≤ min�σm+1,t, 12σmin(ZT U)�. In fact, such upper bounds on q and  λare possibly pessimistic and can likely be improved to a more average quantity. However, since the parameters q = O(1) and  λ = 12σm+1(ZT U)suffice for convergence in our experimental settings, we have not tried to optimize it further.

Theorem 4.2. Suppose  λmin ≤ 2σm+1,t. Frame the hypotheses of Lemma. 4.1, if the parameters satisfy:

image

The coefficients  c1,t and c2,t are

image

image

Figure 2: Training loss versus running time. The first two columns are for MNIST4-9 dataset. The last two are for CovType dataset. Note SPAN achieves the best or near the best performance with respect to wall-clock time.

Remark. Notice that, we require  λmin,t ≤ 2σm+1,tin Theorem 4.2, which is only introduced to simplify the formulation of coefficients, i.e.,  c1,tand  c2,t. For any  λmin,t ≤ γσm+1,t (γ ≥ 2), we can obtain a similar convergence rate by setting  λ = γ−1λmin,t. Thus, the theorem is without loss of generality.

Theorem 4.2 shows that SPAN has a similar composite convergence rate like NewSamp [8] and LiSSA [13] where ct1and  ct2are coefficients of the linear and quadratic terms, respectively. Comparing the convergence with NewSamp whose first-order coefficient  c1,tis abbreviated as  1 − α, ours can be similar to  1 − α/6. In particular, truncated SVD is introduced in NewSamp to find an optimal m-rank approximate Hessian so that  ct1in NewSamp can be regarded as the optimal coefficient in most stochastic Newton method settings. For LiSSA, its convergence rate is constrained by the initial decision variable  x0and the sample size for calculating Hessian-vector product strictly, which causes that we cannot compare its convergence rate with ours directly. However, SPAN has a better convergence wall-clock time which is validated in our experimental results (see Section 5). In addition, during the analysis of the convergence rate, we notice that promoting the Hessian approximation error  ϵHnot only reduces the first-order coefficient in the linear-quadratic convergence rate but also expands the range of step size. This coincides with our intuition that a faster convergence usually requires a smaller Hessian approximation error. In summary, SPAN is designed to achieve a better trade-off between the iteration complexity and the convergence rate. With randomizing the process of constructing the special approximate Hessian, SPAN both accelerates the iteration and limits Hessian approximation error.

To evaluate the performances and fully demonstrate the advantages of our proposed method, we conduct our experiments on several machine learning tasks with different objective functions, including binary image classification and text classification. As we get similar conclusions on these two tasks, we only present the experimental result on the binary image classification in this section. We will further show all experimental details, including the parameters of all optimizers and the results on text classification tasks in the supplementary materials due to space limitations.

For the image classification tasks, we refer to the experimental settings in [13] and [8]. We utilize the following objective function:

image

where  θi ∈ Rdand  yi ∈ {−1, 1}are the instances and labels, respectively, and a is the  L2regularization parameter which influences the condition number of the objective.

We choose state-of-the-art first- and second-order optimization methods as baselines, including SVRG [4], NewSamp [8], S-LBFGS [14], SB-BFGS [15], and LiSSA [13]. We implement all the methods in C++ and Intel Math Kernel Library(MKL). All the code for our experiments can be found in our supplementary material.

We adopt two datasets in our experiment, including the MNIST4-9 dataset [13] consisting of approximate  1.2 × 104instances, and CovType dataset [8] consisting of approximate  5.0 × 105instances. For each dataset, we evaluate all the methods on two different condition numbers, respectively. Besides, for the fairness of comparison, in each set of experiments, we pick the optimal hyperparameters for every method and present the wall-clock time of all the methods in Figure 2.

image

Figure 3: Empirical convergence rate comparison on MNIST4-9 dataset. Four columns represent different batch-size for sub-sampled Hessian.

image

Figure 4: Hessian approximation error. Notice that SPAN achieves near lowest error.

From Figure 2, we can find that our method SPAN achieves the best results in almost all experimental settings. NewSamp gets a very close performance on the CovType dataset but consumes a lot of time on the MNIST dataset. LiSSA performs very close on the MNIST dataset but does not perform well on the CovType dataset. Meanwhile, there are significant gaps between SPAN and the other methods in all the experimental settings.

To further explain the gaps between the second-order baselines and SPAN, we evaluate the empirical approximation error between the real full Hessian and the constructed approximate Hessian in Figure 4. We can see that the approximation error of our method is very close to that of the theoretical optimal solution (NewSamp) but much more accurate than the other methods. The results of the experiment demonstrates that our constructed approximate Hessian in SPAN makes better use of second-order information than other stochastic second-order optimizers (S-LBFGS, SB-BFGS, and LiSSA). Moreover, Figure 3 illustrates the empirical convergence rate of these methods. As we can see, the empirical convergence rate of our proposed method is much better than LiSSA but slightly worse than NewSamp, which confirms our theoretical results and insights again.

Even though NewSamp guarantees an excellent convergence rate, the efficiency of the algorithm mainly depends on the number of parameters. NewSamp needs a long time to perform an iteration, even when there are only 784 parameters in MNIST dataset. As a comparison, our proposed SPAN is much more competitive in terms of the robustness of the feature dimension and the per-iteration efficiency in various scenarios.

Considering the experiments mentioned above, SPAN has a tolerable Hessian approximation error which results in the runner up with respect to the empirical convergence rate. Besides, it enjoys the benefit of per-iteration efficiency, which makes it outperform others in practice. In summary, we conclude that SPAN achieves a better trade-off between per-iteration efficiency and convergence rate. Furthermore, it is robust for the sub-sampled batch size.

Newton and quasi-Newton methods converge at faster rates than gradient descent methods. However, they are often expensive, computationally. In this paper, we propose SPAN, a novel method to optimize a smooth-strongly convex objective function. SPAN utilizes the first-order oracle for Hessian approximation. Therefore, it is much faster than Newton method and its alike. We give a theoretical analysis of its approximation accuracy and convergence rate. Experiments on several real datasets demonstrate that our proposed method outperforms previous state-of-the-art methods.

For the future work, the constructed approximate Hessian of SPAN can be combined with advanced first-order methods, i.e., SVRG [4] and SAGA [5], or help to introduce second-order information to complex objective functions, e.g., cubic regularization [18, 19, 20], neural networks with low complexity and competitive convergence rate.

Thanks the anonymous reviewers for their helpful comments and suggestions. We also thank Zhe Wang, Linyun Yu, Yitan Li, and Hao Zhou for their valuable discussions.

[1] Herbert Robbins and Sutton Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102–109. Springer, 1985.

[2] Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670. ACM, 2014.

[3] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in neural information processing systems, pages 1647–1655, 2011.

[4] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.

[5] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.

[6] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72, 2014.

[7] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.

[8] Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled newton methods. arXiv preprint arXiv:1508.02810, 2015.

[9] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled newton methods ii: local convergence rates. arXiv preprint arXiv:1601.04738, 2016.

[10] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.

[11] Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.

[12] Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher Ré, and Michael W Mahoney. Sub-sampled newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems, pages 3000–3008, 2016.

[13] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187, 2017.

[14] Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic l-bfgs algorithm. In Artificial Intelligence and Statistics, pages 249–258, 2016.

[15] Robert Gower, Donald Goldfarb, and Peter Richtárik. Stochastic block bfgs: Squeezing more curvature out of data. In International Conference on Machine Learning, pages 1869–1878, 2016.

[16] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.

[17] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.

[18] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[19] Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1895–1904. JMLR. org, 2017.

[20] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2904–2913, 2018.

[21] David Gross and Vincent Nesme. Note on sampling without replacing from a finite collection of matrices. arXiv preprint arXiv:1001.2738, 2010.

[22] Rajendra Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013.

[23] Vladimir Igorevich Bogachev. Gaussian measures. Number 62. American Mathematical Soc., 1998.

[24] Roger A Horn and Charles R Johnson. Matrix analysis cambridge university press. New York, 37, 1985.

[25] Mnist dataset http://yann.lecun.com/exdb/mnist/.

[26] Covertype dataset https://archive.ics.uci.edu/ml/datasets/covertype.

[27] Steve Webb, James Caverlee, and Calton Pu. Introducing the webb spam corpus: Using email spam to identify web spam automatically. In CEAS, 2006.

[28] Yelp dataset https://www.yelp.com/dataset/challenge.

In this section, we give several important definitions and lemmas which will be used in the proof of the convergence analysis (Section. 4 in this paper).

Definition 1. (Orthogonal Projectors [17]) An orthogonal projector is an Hermitian matrix P that satisfies the polynomial  P 2 = P. This identity implies  0 ⪯ P ⪯ I. An orthogonal projector is completely determined by its range(column space). For a given matrix M, we write  PMfor the unique orthogonal projector with  range (PM) =range (M). When M has full column rank, we can express this projector explicitly:

image

where  M ∗ denotes the conjugate transpose matrix of M. The orthogonal projector onto the complementary subspace, range (P)⊥, is I − P.Lemma A.1. (Matrix Concentration Inequalities [21]) Let X be a finite set of Hermitian matrices in  Rd×dwhere

image

Given its size, let S denote a uniformly random sample from {1, 2, ..., |X|} with or without replacement. Then we have

image

Lemma A.2. (Conjugation Rule) Suppose that  M ⪰ 0, for every real matrix A, the matrix  AT MA ⪰ 0. In particular,

image

Proof. The positive semidefinite matrices form a convex cone, which induces a partial ordering on the linear space of Hermitian matrices:  M ⪯ Nif and only if  N − Mis positive semidefinite, which means

image

For any real vector u, v = Au satisfies

image

Lemma A.3. Let P be an orthogonal projector, and let M be a real square matrix. For each positive number q, we have

image

Proof. Define the SVD of M as  M = UΣV Twhere U and V are unitary matrices. Thus, we have  Σ =diag�σ1, σ2, . . . , σrank(M), 0, 0, ..., 0�where  σidenotes the i-th largest singular value of matrix M. Then, we have

image

where ˆPis an orthogonal projector, X is a nonnegative diagonal matrix and  t ≥ 1. This relation follows immediately from Theorem. IX.2.10 in [22].

Lemma A.4. (Proposition 10.2 in [17]) Draw a  m × lstandard Gaussian matrix  G with m ≥ 2 and l ≥ m + 2, then

image

Lemma A.5. (Proposition 10.4 in [17]) Let  G be a m × lGaussian matrix where  l ≥ m + 4. For all θ ≥ 1,

image

Lemma A.6. (Theorem. 4.5.7 in [23]). Suppose that h is a Lipschitz function on matrices:

image

Draw a standard Gaussian matrix G, then

image

Lemma A.7. Suppose  range(N) ⊆ range(M). Then, for each real matrix A, it holds that  ∥PNA∥ ≤ ∥PMA∥and that  ∥(I − PM) A∥ ≤ ∥(I − PN) A∥.

Proof. By applying Definition 1 and Lemma A.2, we have  PN ⪯ Iand  PMPNPM ⪯ PM. Since  range(N) ⊂

image

That is to say,  PN ⪯ PM provides AT PNA ⪯ AT PMA. Then, we have

image

The second statement  ∥(I − PM) A∥ ≤ ∥(I − PN) A∥follows from the first  ∥PNA∥ ≤ ∥PMA∥by taking orthogonal complements.

Lemma A.8. Suppose a real matrix satisfies  M ⪰ 0. Then

image

Proof. Define  A = M 1/2 as the positive semidefinite square root of M promised by Theorem 7.2.6 in [24], we have

image

The first equality can be verified algebraically. The second holds because rational functions of a diagonalizable matrix, such as A, commute. The last relation is because the conjugate rule�I + A2�−1 ⪯ I.

Lemma A.9. (Proposition 8.3 in [17]) We have  ∥M∥ ≤ ∥A∥ + ∥C∥for each partitioned positive semidefinite matrix

image

Lemma A.10. Let  M, N ∈ Rd×d be two positive definite matrices. If there is existing that

image

the minimum singular values of  A and B, σmin(M) and σmin(N), satisfy

image

Proof. To make a contradiction with Eq. (8), we assume that the minimum singular values of A and B have

image

For the singular vectors  vM and vN of M and Nwhich respectively correspond to  σmin(M) and σmin(N), we have

image

where the inequality1 uses vTNMvN ≥ vTMMvM. Since the inequality2contradict with the given condition Eq. (7), the assumption Eq. (8) will never be satisfied. Hence, we have

image

Lemma A.11. Let  A ∈ Rd×d be a positive definite matrix. A and a matrix ˜Awhich can be presented as

image

where Q is an unitary matrix and

image

Choose a test matrix  Ω ∈ Rd×l(m ≤ l), and construct the sample matrix  Y = AΩ. Assuming that  QT∥ Ωhas full row rank, the orthogonal projector  PY satisfies

image

where�QT∥ Ω�†means the Moore–Penrose inverse of matrix  QT∥ Ω.

Proof. For the orthogonal projector  PYand the unitary matrix Q, it is clear that  QT PY Qis an orthogonal projector due to Definition 1. Evidently,

image

Since ranges determine orthogonal projectors, we have  QT PY Q = PQT Yand the following equation

image

where the first euqation is because the unitary invariance of the spectral norm. Subsequently, we construct a matrix Z whose column space belongs to the range of  QT Yto scale the right-most item of the Eq.(11) as follows

image

1uses the full row rank properties of  QT∥ Ω, and the construction of Z denotes that  range(Z) ⊂ range�QT Y�. Thus,Lemma A.7 implies

image

where the matrix Z has a full column rank because of Eq. (12). Thus, we have

image

Applying Lemma A.8, the top-left block verifies

image

Besides, the bottom-right block satisfies

image

For convenience, we abbreviate  D = −�I + F T F�−1 F T, the matrix  I − PZ satisfies

image

As our construction in Eq. (12), we have  F = Σ2QT⊥Ω ·�QT∥ Ω�†Σ−11 . Thus, we have

image

Finally, after introducing the inequality into Eq. (14), the proof is completed.

B.1 Proof of Lemma 4.1

Proof. According to the construction of  Y2q+1(Step. 5 to Step. 7 in Algorithm 1) and Z (Step. 9 in Algorithm 1), we have

image

where  U and Y2q+1 satisfy

image

Here, we denote the orthogonal projector of Y as  PY = UU T. With the approximate Hessian ˆH−1B (xt)(Eq. (6) in Section. 2.2), we have the Hessian approximation error satisfies

image

With algorithm settings and positive definite properties of  ∇2fB(xt), we reformulate the Y and introduce an auxiliary variable ˜Ain the following

image

where we set the SVD of the batch Hessian and diagonal elements of  Σ as

image

Thus, we have the following inequality according to Lemma A.3

image

Subsequently, we can introduce a test matrix  Ωwith some special properties to scale Eq.(18). We set  Ω ∈ Rd×las a standard Gaussian matrix, and  m ≤ l−4 where l ≪ d. As a result, the matrix�QT∥ Ω, QT⊥Ω�is also a standard Gaussian matrix because of the unitary of Q and the rotational invariance of  Ω. Besides,  QT∥ Ωand  QT⊥Ωare nonoverlapping submatrices of�QT∥ Ω, QT⊥Ω�, and these two matrices are stochastically independent. Hence, we can learn how the error depends on the matrix  QT⊥Ωby conditioning on the event that  QT∥ Ωis relatively regular. We set a event as

image

According to Lemma A.5, we find

image

Consider the function  h(X) =����Σ2q+12 X�QT∥ Ω�†����whose Lipschitz constant L can be calculated in the following

image

That is to say,  L ≤���Σ2q+12 ���

image

Applying the concentration measure inequality, Lemma A.6, conditionally to the random variable  h�QT⊥Ω�results in

image

According to Eq. (19), we have a explicit bound under the event  Etas follows

image

With the fact Eq.(20), we can remove the conditioning and obtain

image

Because of the SVDs presented in Eq. (16), we have

image

After introducing Eq.(21) and Eq.(22) to Eq.(18), we have

image

with probability at least  1 − 5θm−l − e−u22. For making the upper bound clear, we set  θ = e, u =�2(l − m), and have

image

Hence, we make some abbreviations as

image

and Eq.(23) can be reformulated and scaled as follows

image

Finally, we introduce Eq.(24) and Eq.(17) to Eq.(15) to have the following error upper bound with probability at least

image

Due to the fact  λ ≤ min{σtm+1, σmin(U T ∇fBH(xt)U)}, we have γ < 1 and

image

That is to say, when the power iteration q satisfy

image

image

Corollary 1. Frame the hypotheses of Lemma A.11, and assume  l − m ≥ 4. Then

image

with failure probability at most  6em−l.

B.2 Proof of Theorem 4.2

Proof. According to Step. 11 in SPAN Algorithm, we have

image

where1 and2establish because of the symmetry of ˆH−1B (xt) and ∇2fB(xt).

With part 1.1 in Eq. (28) and the construction of ˆHB(xt), we have

image

where we denote

image

for abbreviation. Besides, for the definition of v, there are

image

For part 1.1.1, we have

image

where1 and2establish because of the Young’s inequality and the Cauchy–Schwarz inequality.

For part 1.1.2, with the similar proof technique, we obtain

image

For part 1.1.3, we can easily find

image

Submitting Eq.(32), Eq.(33) and Eq.(34) back into Eq.(29), we obtain that, for any feasible v, the part 1.1 of Eq. (28) satisfies

image

where 1establish because of the fact  ∥v∥2 = ∥a∥2 + ∥b∥2. For any specific  λmand  σm+1, there is existing some α ≤ 0.5which satisfies the inequality  λm ≤ α−1σm+1. To clarify the proof procedure, we assume

image

Notice that, the proof techniques presented in the following are not constrained by such specific  α. To minimize our convergence constant, we required that

image

where we set  λ = αλm. Hence, we request that the positive root about  ρin Eq.(37) should satisfy  ρ+ ∈ (0, 1). Then, we present the analytic form of  ρ+ as

image

According to Corollary 1, the sufficient condition for the establishment of the last equation of Eq. (38) can be formulated as

image

with probability at least  1 − 6em−l(l ≥ m + 4). Combining Eq.(39) and Eq.(36), when k = 1, we have

image

To summarize these equations, we have the following conclusion. When  k = 1, α = 0.5 and λm ≤ 2σm+1, we can set

image

with probability at least  1 − 6em−l(l ≥ m + 4). In Eq.(41),1is from Eq. (38). Part 1.1 in Eq.(28) has

image

With Similar to the tricks on the part 1.1, for part 1.2 in Eq.(28), we have

image

where 1establish because of Eq. (29). Utilizing the similar proof techniques presented in Eq. (32) and Eq. (33) on part 1.2.2 and part 1.2.3, we respectively have

image

where  a, b and ϵHare introduced as Eq. (30). Besides  ρ+satisfies Eq. (37). With such settings, we have the following inequalities for part 1.2.1 and part 1.2.4

image

Therefore, with Eq. (43), Eq. (44) and Eq. (45), we can obtain that

image

With the condition shown in Eq. (40) and our requirements, we have

image

Thus, we can derive

image

image

as

image

when  ηt satisfies

image

The sufficient condition1comes from the upper bound of  ρ+shown in Eq. (41).

For the part 2 in Eq.(27): We denote the stochastic bias matrix  ∆i(xt)and the sub-sample hessian matrix  ∇2fB(xt)as follows

image

With Assumption 2.3 and the construction Eq.(54), we can get some properties

image

According to the matrix Bernstein’s inequality given in Lemma A.1, for any available  xtwe have bound the part 2 in Eq.(27) as

image

Hence, for any  ˜ϵ, if we want  ∥∇2fB(xt) − ∇2F(xt)∥≤ ˜ϵwith probability at least  1 − em−l, the sampled size of B, b, should satisfy

image

Hence, the Eq.(27) can be rewrited in the following with probability at least  1 − 6em−l(l ≥ m + 4) as����I − ηt ˆH−1B (xt)� 10 ∇2F(x∗ + τ(xt − x∗))(xt − x∗)dτ����

image

due to Eq. (50), Eq. (52), Eq. (53) and Eq. (54). Since  λ ≤ min{σtm+1, σmin(U T ∇fB(xt)U)}, we can conclude the convergence rate as

image

with probability at least  1 − 6em−l(l ≥ m + 4). For clarify the coefficients of convergence further, we have

image

Thus, the coefficients of convergence satisfy

image

C.1 Exact Experimental setting on Logistic Regression

In this section, we detail our experiments on Logistic Regression. We set the objective function as

image

Datasets: In this experiment, we use two datasets for the binary classification task including MNIST [25] and CovType [26].

MNIST: The MNIST database is a dataset of handwritten digits. It has 60,000 training samples, and 10,000 test samples. Each image is represented by 28x28 pixels, each containing a value 0 - 255 with its grayscale value.

CovType: Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).

Data Preprocessing: For the two datasets proposed above, we will list our data preprocessing in the following

MNIST: We first select the samples whose label is 4 or 9 (Logistic Regression is used in binary classification tasks). Then we normalize the samples as

image

Preiteration: As the preiteration requirement of LiSSA [13], for the sake of fairness, we will perform 2 epoch SVRG [4] iteration for all the methods after initializing the decision variable  x0.

Experimental Results: We have shown our experimental results about Logistic Regression in our paper.

Parameters Selection: We have shown all the experimental parameters in Path: ./code/lr_data/params

C.2 Extra Experiments on Huber SVM

In this section, we conduct our experiments on Support Vector Machines. To satisfy assumptions of all baselines and SPAN, we set the loss function l(.) in the SVM

image

as smoothed Huber loss presented as

image

image

Figure 5: Training loss versus running time. The first two columns are for webspam dataset. The last two are for Yelp Review Polarity dataset. Note SPAN achieves the best or near the best performance with respect to wall-clock time.

image

Figure 6: Hessian approximation error. Note that SPAN achieves near lowest error.

Datasets: In this experiment, we use two text classification datasets including webspam [27] and Yelp Review Polarity [28].

webspam: For each instance, continuous 1 bytes are treated as a word, and use word count as the feature value. In the end, there are 254 features for each sample, and the number of sample is about 160, 000.

Yelp Review Polarity: The Yelp reviews polarity dataset is constructed by considering stars 1 and 2 negative, and 3 and 4 positive. For each polarity 280,000 training samples and 19,000 testing samples are take randomly. In total there are 560,000 trainig samples and 38,000 testing samples.

Data Preprocessing: For the two datasets proposed above, we will list our data preprocessing in the following

image

Yelp Review Polarity: We want all baselines (including the second-order methods) to converge in a relatively short period of time. Hence, we use CountVectorizer to preprocess the data, and normalize each sample as we do in webspam. Then, we uniformly pick 1000 features to describe each sample. If the number of samples which are described as the zero vector is over a half of the total sample number, we will re-sample the features

Preiteration: As the preiteration requirement of LiSSA [13], for the sake of fairness, we will perform 2 epoch SVRG [4] iteration for all the methods after initializing the decision variable  x0.

Experimental Results: We first show the graph about the training loss versus running time in different condition numbers in Fig 5. Second, we give the Hessian approximation error to validate the quality of our approximate Hessian in Fig 6.

Parameters Selection: We have shown all the experimental parameters in Path: ./code/svm_data/params


Designed for Accessibility and to further Open Science