b

DiscoverSearch
About
My stuff
A Note on Inexact Condition for Cubic Regularized Newton's Method
2018·arXiv
Abstract
Abstract

This note considers the inexact cubic-regularized Newton’s method (CR), which has been shown in Cartis et al. [2011a] to achieve the same order-level convergence rate to a secondary stationary point as the exact CR [Nesterov and Polyak, 2006]. However, the inexactness condition in Cartis et al. [2011a] is not implementable due to its dependence on future iterates variable. This note fixes such an issue by proving the same convergence rate for nonconvex optimization under an inexact adaptive condition that depends on only the current iterate. Our proof controls the sufficient decrease of the function value over the total iterations rather than each iteration as used in the previous studies, which can be of independent interest in other contexts.

The cubic-regularized (CR) Newton’s method [Nesterov and Polyak, 2006] is a popular approach that solves the following general nonconvex optimization problem

image

where f is a differentiable and nonconvex function. Starting from an arbitrary initial point  x0, the update rule of CR can be written as

image

Nesterov and Polyak [2006] showed that CR converges to a second-order stationary point x of the objective function, i.e.,

image

Such a desirable property allows CR to escape strict saddle points. However, the algorithm needs to compute a full Hessian at each iteration, and is hence computationally intensive. Cartis et al. [2011a,b] proposed to use an inexact approximation  Hkto replace the full Hessian  ∇2f(xk)in the CR update, leading to the following inexact CR algorithm

image

image

Cartis et al. [2011a,b] showed that if  Hksatisfies the following inexactness condition,

image

then inexact CR achieves the same order-level convergence rate to a second-order stationary point as exact CR.

This condition has been used in many situations [Cartis et al., 2012a,b, Kohler and Lucchi, 2017] Observe that the above inexact condition involves  ∥sk+1∥(and hence  xk+1), which is not available at iteration k. Thus, the inexact condition in eq. (6) is not practically implementable. More recent research studies Kohler and Lucchi [2017], Wang et al. [2018] used  skto replace  sk+1to implement inexact CR numerically, i.e., the condition in eq. (6) becomes

image

These studies demonstrated that inexact CR performs well in experiments under the condition in eq. (7), but did not provide theoretical convergence guarantee of inexact CR under such a condition. The main contribution of this note is to establish convergence guarantee for the inexact CR under the inexact condition for Hessian (eq. (7)) and a similar inexact condition for gradient (see eq. (12) below), which achieves the same order of convergence rate as the exact CR. In contrast to existing proof techniques, our proof relies on an idea of the overall control of the sufficient decrease of the function value rather than requiring a sufficient decrease at each iteration. More specifically, the inexact error  ∥Hk − ∇2f(xk)∥ ⩽ C∥sk∥at current iteration is incorporated into the bound on the previous iteration, which yields a successful analysis over all iterations under a more relaxed (and practical) condition eq. (7).

We also note that another inexact condition proposed by Xu et al. [2017], which takes the form

image

where  ǫis a pre-defined small constant and is required to smaller than the lower bound of the set  {∥si∥}ki=1before the algorithm terminates. Similar condition has been used in Ghadimi et al. [2017], Jiang et al. [2017], Tripuraneni et al. [2017], Yao et al. [2018]. Compared to eq. (8), eq. (7) is adapted to  ∥sk∥so that the increment  ∥sk∥can be large than  ǫin the most phase of the algorithm to enable more progress towards the convergent point.

Notation: For a vector  x, ∥x∥denotes the  ℓ2norm of the vector x. For a matrix  H, ∥H∥denotes the spectral norm of the matrix H. We let I denote the identity matrix. For a function  f(·), ∇f(·)and  ∇2f(·)denote its gradient and Hessian, respectively.  R, R+and  Rddenote the set of all real numbers, non-negative real numbers and d-dimension real vectors, respectively. S denotes the set of all symmetric matrices.

Our analysis takes the following standard assumption as in the previous studies of CR.

Assumption 1. The objective function in eq. (1) satisfies:

1.  f(·)is twice-continuously differentiable and bounded below, i.e.,  f ⋆ ≜ infx∈Rd f(x) > −∞; 2. The Hessian  ∇2f(·)is L-Lipschitz continuous.

In our analysis, we allow both the gradient and the Hessian to be replaced by their inexact approximation, and hence the CR iterate becomes

image

We assume that  gkand  Hksatisfy the following inexact conditions, which depend on only the current iteration information, and are hence implementable.

Assumption 2. There exist two constants  α, β ∈ R+, such that the inexact gradient  gkand inexact Hessian  Hksatisfy, for all  k ⩾ 0,

image

We next state our main theorem, which guarantees that after k iterations, the inexact gradient and Hessian CR must pass an approximate second order saddle point with error within  O(1/k2/3)and O(1/k1/3)for the gradient and Hessian, respectively, under the inexact condition in Assumption 2. Theorem 1. Let Assumptions 1 and 2 hold. Then, after k iterations, the sequence  {xi}i⩾1generated by inexact CR contains a point  ˜xsuch that

image

where  C1and  C2are universal constants, and are specified in the proof.

The proof of Theorem 1 is based on the following two useful lemmas. Lemma 2 (Nesterov and Polyak [2006], Lemma 1). Let the Hessian  ∇2f(·)of the function  f(·)be L-Lipschitz continuous with L > 0. Then, for any  x, y ∈ Rd, we have

image

We then establish Lemma 3, which provides the properties of the minimizer of eq. (10) for a more general setting. Lemma 3. Let  M ∈ R, g ∈ Rd, H ∈ Sd×d, and

image

Then, the following statements hold:

image

To further explain, eq. (16) corresponds to the first-order necessary optimality condition, eq. (17) corresponds to the second-order necessary optimality condition but with a tighter form due to the specific form of this optimization problem, and eq. (18) guarantees a sufficient decrease at this minimizer.

Proof of Lemma 3. First, eq. (16) follows from the first-order necessary optimality condition of eq. (15), and eq. (17) follows from the Proposition 1 in Nesterov and Polyak [2006]. We next prove eq. (18). Following similar steps as those in Nesterov and Polyak [2006], we obtain that

image

where (i) follows from eq. (16), and (ii) follows from eq. (17), which implies that − 12s⊤ �H + M2 ∥s∥I�s ⩽ 0.

Now, we are ready to prove our main theorem.

Proof of Theorem 1. Consider any iteration k, we obtain that

image

where (i) follows from Lemma 2 with  y = xk+1, x = xkand  sk+1 = xk+1 − xk, (ii) follows from eq. (18) in Lemma 3 with  g = gk, H = Hkand  s = sk+1, (iii) follows from Assumption 2, and (vi) follows from the inequality that for  a, b ∈ R+, a2b ⩽ a3 + b3, which can be verified by checking the cases with a < b and  a ⩾ b, respectively.

Summing eq. (19) from 0 to  k − 1, we obtain that

image

Therefore, we have

image

where (i) follows eq. (20).

Therefore, we have

image

Next, we prove the convergence rate of  ∇f(·)and  ∇2f(·). We first derive

image

where (i) follows from eq. (16) with  g = gm, H = Hmand  s = sm+1, (ii) follows from eq. (13) in Lemma 2 and Assumption 2, and (iii) follows from eq. (21).

We next prove the the convergence rate of  ∇2f(·).

image

where (i) follows from Weyl’s inequality, (ii) follows from eq. (17) with  H = Hmand  s = sm+1, (iii) follows from Assumption 2 and the fact that  ∇2f(·)is  L−Lipschitz, and (vi) follows from eq. (21).

In this note, we study the cubic-regularized Newton’s method under a more practical inexact condition, which depends only on the current iteration information, rather than the future iteration in previous studies. Under such an inexact condition for both the gradient and the Hessian, we establish the convergence of the inexact CR method to a second-order stationary point, and show that the convergence rate is as fast as that of CR in nonconvex optimization.

Cartis, C., Gould, N., and Toint, P. L. (2012a). Complexity bounds for second-order optimality in unconstrained optimization. Journal of Complexity, 28(1):93 – 108.

Cartis, C., Gould, N. I. M., and Toint, P. L. (2011a). Adaptive cubic regularization methods for unconstrained optimization. Part I : Motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295.

Cartis, C., Gould, N. I. M., and Toint, P. L. (2011b). Adaptive cubic regularization methods for un- constrained optimization. Part II worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319.

Cartis, C., Gould, N. I. M., and Toint, P. L. (2012b). An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis, 32(4).

Ghadimi, S., Liu, H., and Zhang, T. (2017). Second-order methods with cubic regularization under inexact information. arXiv: 1710.05782.

Jiang, B., Lin, T., and Zhang, S. (2017). A unified scheme to accelerate adaptive cubic regularization and gradient methods for convex optimization. arXiv:1710.04788.

Kohler, J. M. and Lucchi, A. (2017). Sub-sampled cubic regularization for non-convex optimization. In Proc. 34th International Conference on Machine Learning (ICML), volume 70, pages 1895– 1904.

Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of newton method and its global perfor- mance. Mathematical Programming, 108(1):177–205.

Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. (2017). Stochastic cubic regulariza- tion for fast nonconvex optimization. arXiv: 711.02838.

Wang, Z., Zhou, Y., Liang, Y., and Lan, G. (2018). Sample complexity of stochastic variance- reduced cubic regularization for nonconvex optimization. arXiv:1802.07372.

Xu, P., Roosta-Khorasani, F., and Mahoney, M. W. (2017). Newton-type methods for non-convex optimization under inexact hessian information. arXiv: 1708.07164.

Yao, Z., Xu, P., Roosta-Khorasani, F., and Mahoney, M. W. (2018). Inexact Non-Convex Newton- Type Methods. ArXiv:1802.06925.


Designed for Accessibility and to further Open Science