A Note on Inexact Condition for Cubic Regularized Newton's Method

2018·arXiv

Abstract

Abstract

This note considers the inexact cubic-regularized Newton’s method (CR), which has been shown in Cartis et al. [2011a] to achieve the same order-level convergence rate to a secondary stationary point as the exact CR [Nesterov and Polyak, 2006]. However, the inexactness condition in Cartis et al. [2011a] is not implementable due to its dependence on future iterates variable. This note fixes such an issue by proving the same convergence rate for nonconvex optimization under an inexact adaptive condition that depends on only the current iterate. Our proof controls the sufficient decrease of the function value over the total iterations rather than each iteration as used in the previous studies, which can be of independent interest in other contexts.

1 Introduction

The cubic-regularized (CR) Newton’s method [Nesterov and Polyak, 2006] is a popular approach that solves the following general nonconvex optimization problem

where f is a differentiable and nonconvex function. Starting from an arbitrary initial point , the update rule of CR can be written as

Nesterov and Polyak [2006] showed that CR converges to a second-order stationary point x of the objective function, i.e.,

Such a desirable property allows CR to escape strict saddle points. However, the algorithm needs to compute a full Hessian at each iteration, and is hence computationally intensive. Cartis et al. [2011a,b] proposed to use an inexact approximation to replace the full Hessian in the CR update, leading to the following inexact CR algorithm

Cartis et al. [2011a,b] showed that if satisfies the following inexactness condition,

then inexact CR achieves the same order-level convergence rate to a second-order stationary point as exact CR.

This condition has been used in many situations [Cartis et al., 2012a,b, Kohler and Lucchi, 2017] Observe that the above inexact condition involves (and hence ), which is not available at iteration k. Thus, the inexact condition in eq. (6) is not practically implementable. More recent research studies Kohler and Lucchi [2017], Wang et al. [2018] used to replace to implement inexact CR numerically, i.e., the condition in eq. (6) becomes

These studies demonstrated that inexact CR performs well in experiments under the condition in eq. (7), but did not provide theoretical convergence guarantee of inexact CR under such a condition. The main contribution of this note is to establish convergence guarantee for the inexact CR under the inexact condition for Hessian (eq. (7)) and a similar inexact condition for gradient (see eq. (12) below), which achieves the same order of convergence rate as the exact CR. In contrast to existing proof techniques, our proof relies on an idea of the overall control of the sufficient decrease of the function value rather than requiring a sufficient decrease at each iteration. More specifically, the inexact error at current iteration is incorporated into the bound on the previous iteration, which yields a successful analysis over all iterations under a more relaxed (and practical) condition eq. (7).

We also note that another inexact condition proposed by Xu et al. [2017], which takes the form

where is a pre-defined small constant and is required to smaller than the lower bound of the set before the algorithm terminates. Similar condition has been used in Ghadimi et al. [2017], Jiang et al. [2017], Tripuraneni et al. [2017], Yao et al. [2018]. Compared to eq. (8), eq. (7) is adapted to so that the increment can be large than in the most phase of the algorithm to enable more progress towards the convergent point.

Notation: For a vector denotes the norm of the vector x. For a matrix denotes the spectral norm of the matrix H. We let I denote the identity matrix. For a function and denote its gradient and Hessian, respectively. and denote the set of all real numbers, non-negative real numbers and d-dimension real vectors, respectively. S denotes the set of all symmetric matrices.

2 Main Result

Our analysis takes the following standard assumption as in the previous studies of CR.

Assumption 1. The objective function in eq. (1) satisfies:

1. is twice-continuously differentiable and bounded below, i.e., ; 2. The Hessian is L-Lipschitz continuous.

In our analysis, we allow both the gradient and the Hessian to be replaced by their inexact approximation, and hence the CR iterate becomes

We assume that and satisfy the following inexact conditions, which depend on only the current iteration information, and are hence implementable.

Assumption 2. There exist two constants , such that the inexact gradient and inexact Hessian satisfy, for all ,

We next state our main theorem, which guarantees that after k iterations, the inexact gradient and Hessian CR must pass an approximate second order saddle point with error within and for the gradient and Hessian, respectively, under the inexact condition in Assumption 2. Theorem 1. Let Assumptions 1 and 2 hold. Then, after k iterations, the sequence generated by inexact CR contains a point such that

where and are universal constants, and are specified in the proof.

The proof of Theorem 1 is based on the following two useful lemmas. Lemma 2 (Nesterov and Polyak [2006], Lemma 1). Let the Hessian of the function be L-Lipschitz continuous with L > 0. Then, for any , we have

We then establish Lemma 3, which provides the properties of the minimizer of eq. (10) for a more general setting. Lemma 3. Let , and

Then, the following statements hold:

To further explain, eq. (16) corresponds to the first-order necessary optimality condition, eq. (17) corresponds to the second-order necessary optimality condition but with a tighter form due to the specific form of this optimization problem, and eq. (18) guarantees a sufficient decrease at this minimizer.

Proof of Lemma 3. First, eq. (16) follows from the first-order necessary optimality condition of eq. (15), and eq. (17) follows from the Proposition 1 in Nesterov and Polyak [2006]. We next prove eq. (18). Following similar steps as those in Nesterov and Polyak [2006], we obtain that

where (i) follows from eq. (16), and (ii) follows from eq. (17), which implies that .

Now, we are ready to prove our main theorem.

Proof of Theorem 1. Consider any iteration k, we obtain that

where (i) follows from Lemma 2 with and , (ii) follows from eq. (18) in Lemma 3 with and , (iii) follows from Assumption 2, and (vi) follows from the inequality that for , which can be verified by checking the cases with a < b and , respectively.

Summing eq. (19) from 0 to , we obtain that

Therefore, we have

where (i) follows eq. (20).

Therefore, we have

Next, we prove the convergence rate of and . We first derive

where (i) follows from eq. (16) with and , (ii) follows from eq. (13) in Lemma 2 and Assumption 2, and (iii) follows from eq. (21).

We next prove the the convergence rate of .

where (i) follows from Weyl’s inequality, (ii) follows from eq. (17) with and , (iii) follows from Assumption 2 and the fact that is Lipschitz, and (vi) follows from eq. (21).

3 Conclusion

In this note, we study the cubic-regularized Newton’s method under a more practical inexact condition, which depends only on the current iteration information, rather than the future iteration in previous studies. Under such an inexact condition for both the gradient and the Hessian, we establish the convergence of the inexact CR method to a second-order stationary point, and show that the convergence rate is as fast as that of CR in nonconvex optimization.

References

Cartis, C., Gould, N., and Toint, P. L. (2012a). Complexity bounds for second-order optimality in unconstrained optimization. Journal of Complexity, 28(1):93 – 108.

Cartis, C., Gould, N. I. M., and Toint, P. L. (2011a). Adaptive cubic regularization methods for unconstrained optimization. Part I : Motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295.

Cartis, C., Gould, N. I. M., and Toint, P. L. (2011b). Adaptive cubic regularization methods for un- constrained optimization. Part II worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319.

Cartis, C., Gould, N. I. M., and Toint, P. L. (2012b). An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis, 32(4).

Ghadimi, S., Liu, H., and Zhang, T. (2017). Second-order methods with cubic regularization under inexact information. arXiv: 1710.05782.

Jiang, B., Lin, T., and Zhang, S. (2017). A unified scheme to accelerate adaptive cubic regularization and gradient methods for convex optimization. arXiv:1710.04788.

Kohler, J. M. and Lucchi, A. (2017). Sub-sampled cubic regularization for non-convex optimization. In Proc. 34th International Conference on Machine Learning (ICML), volume 70, pages 1895– 1904.

Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of newton method and its global perfor- mance. Mathematical Programming, 108(1):177–205.

Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. (2017). Stochastic cubic regulariza- tion for fast nonconvex optimization. arXiv: 711.02838.

Wang, Z., Zhou, Y., Liang, Y., and Lan, G. (2018). Sample complexity of stochastic variance- reduced cubic regularization for nonconvex optimization. arXiv:1802.07372.

Xu, P., Roosta-Khorasani, F., and Mahoney, M. W. (2017). Newton-type methods for non-convex optimization under inexact hessian information. arXiv: 1708.07164.

Yao, Z., Xu, P., Roosta-Khorasani, F., and Mahoney, M. W. (2018). Inexact Non-Convex Newton- Type Methods. ArXiv:1802.06925.

Designed for Accessibility and to further Open Science