Matrix Completion with Nonconvex Regularization: Spectral Operators and Scalable Algorithms

2018·Arxiv

Abstract

Abstract

In this paper, we study the popularly dubbed matrix completion problem, where the task is to “fill in” the unobserved entries of a matrix from a small subset of observed entries, under the assumption that the underlying matrix is of low-rank. Our contributions herein, enhance our prior work on nuclear norm regularized problems for matrix completion (Mazumder et al., 2010) by incorporating a continuum of nonconvex penalty functions between the convex nuclear norm and nonconvex rank functions. Inspired by Soft-Impute (Mazumder et al., 2010; Hastie et al., 2016), we propose NC-Impute — an EM-flavored algorithmic framework for computing a family of nonconvex penalized matrix completion problems with warm-starts. We present a systematic study of the associated spectral thresholding operators, which play an important role in the overall algorithm. We study convergence properties of the algorithm. Using structured low-rank SVD computations, we demonstrate the computational scalability of our proposal for problems up to the Netflix size (approximately, a 500000 matrix with 10observed entries). We demonstrate that on a wide range of synthetic and real data instances, our proposed nonconvex regularization framework leads to low-rank solutions with better predictive performance when compared to those obtained from nuclear norm problems. Implementations of algorithms proposed herein, written in the R language, are made available on github.

1 Introduction

In several problems of contemporary interest, arising for instance, in recommender system applications, for example, the Netflix Prize competition (SIGKDD and Netflix, 2007), observed data is in the form of a large sparse matrix, Ω, where Ω . Popularly dubbed as the matrix completion problem (Cand`es and Recht, 2009; Mazumder et al., 2010), the task is to predict the unobserved entries, under the assumption that the underlying matrix is of low-rank. This leads to the natural rank regularized optimization problem:

where, ) denotes the projection of onto the observed indices Ω and is zero otherwise; and denotes the usual Frobenius norm of a matrix. Problem (1), however, is computationally difficult due to the presence of the combinatorial rank constraint (Chistov and Grigor’ev, 1984). A natural convexification (Fazel, 2002; Recht et al., 2010) of rank(, the nuclear norm of X, which leads to the following surrogate of Problem (1):

Cand`es and Recht (2009); Cand`es and Plan (2010) show that under some assumptions on the underlying “population” matrix, a solution to Problem (2) approximates a solution to Problem (1) reasonably well. The estimator obtained from Problem (2) works quite well: the nuclear norm shrinks the singular values and simultaneously sets many of the singular values to zero, thereby encouraging low-rank solutions. It is thus not surprising that Problem (2) has enjoyed a significant amount of attention in the wider statistical community over the last decade. There have been impressive advances in understanding its statistical properties (Cand`es and Plan, 2010; Cand`es and Tao, 2010; Recht et al., 2010; Recht, 2011; Gross, 2011; Rohde and Tsybakov, 2011; Koltchinskii et al., 2011; Negahban and Wainwright, 2011; Chen, 2015; Lecu´e and Mendelson, 2018; Chen et al., 2019b). Motivated by the work of Cand`es and Recht (2009); Cai et al. (2010), the authors in Mazumder et al. (2010) proposed Soft-Impute, an EM-flavored (Dempster et al., 1977) algorithm for optimizing Problem (2). For some other computational work in developing scalable algorithms for Problem (2), see the papers Jaggi and Sulovsk´y (2010); Freund et al. (2015); Hastie et al. (2016), and references therein. Typical assumptions under which the nuclear norm works as a good proxy for the low-rank problem require the entries of the singular vectors of the “true” low-rank matrix to be sufficiently spread, and the missing pattern to be roughly uniform. The proportion of observed entries needs to be sufficiently larger than the number of parameters of the matrix O ((m + n)r), where, r denotes the rank of the true underlying matrix. Some extensions under general sampling distribution has been made in Klopp (2014); Alquier (2015). Negahban and Wainwright (2012) proposes improvements with a (convex) weighted nuclear norm penalty in addition to spikiness constraints for the noisy matrix completion problem.

The nuclear norm penalization framework, however, has limitations. If some conditions mentioned above fail, Problem (2) may fall short of delivering reliable low-rank estimators with good prediction performance (on the missing entries). Since the nuclear norm shrinks the singular values, in order to obtain an estimator with good explanatory power, it often results in a matrix estimator with high numerical rank — thereby leading to models that have higher rank than what might be desirable. The limitations mentioned above, however, should not come as a surprise to an expert — especially, if one draws a parallel connection to the Lasso (Tibshirani, 1996), a popular sparsity inducing shrinkage mechanism effectively used in the context of sparse linear modeling and regression. In the linear regression context, the Lasso often leads to dense models and suffers when the features are highly correlated — the limitations of the Lasso are quite well known in the statistics literature, and there have been major strides in moving beyond the convex -penalty to more aggressive forms of nonconvex penalties (Fan and Li, 2001; Zou and Li, 2008; Mazumder et al., 2011; Zhang, 2010; Zhang and Zhang, 2012; Loh and Wainwright, 2015; Bertsimas et al., 2016; Zheng et al., 2017; Feng and Zhang, 2017). The key principle in these methods is the use of nonconvex regularizers that better approximate the -penalty, leading to possibly nonconvex estimation problems. Thusly motivated, we study herein, the following family of nonconvex regularized estimators for the task of (noisy) matrix completion:

where, 1 are the singular values of ) is a concave penalty function on [0) that takes the value 0. We will denote an estimator obtained from Problem (3) by ˆThe family of penalty functions ) is indexed by the parameters (parameters together control the amount of nonconvexity and shrinkage — see for example Mazumder et al. (2011); Zhang and Zhang (2012) and also Section 2, herein, for examples of such nonconvex families.

A caveat in considering problems of the form (3) is that they lead to nonconvex optimization problems and thus obtaining a certifiably optimal global minimizer is generally difficult. Fairly recently, Bertsimas et al. (2016); Mazumder and Radchenko (2015) have shown that subset selection problems in sparse linear regression can be computed using advances in mixed integer quadratic optimization. Such global optimization methods, however, do not apply to matrix variate problems involving spectralpenalties, as in Problems (1) or (3). The main focus in our work herein is to develop a computationally scalable algorithmic framework that allows us to obtain high quality stationary points or upper boundsfor Problem (3) — we obtain a path of solutions ˆacross a grid of values of (Problem (3) by employing warm-starts, following the path-following scheme proposed in Mazumder et al. (2011). Leveraging problem structure, modern advances in computationally scalable low-rank SVDs and appropriately advancing the tricks successfully employed in Mazumder et al. (2010); Hastie et al. (2016), we empirically demonstrate the computational scalability of our method for problems of the size of the Netflix dataset, a matrix of size (approx.) 480observed entries. Perhaps most importantly, we demonstrate empirically that the resultant estimators lead to better statistical properties (i.e., the estimators have lower rank and enjoy better prediction performance) over nuclear norm based estimates, on a variety of problem instances.

Some recent works (Jain et al., 2010, 2013; Hardt, 2014; Hardt and Wootters, 2014; Chen and Wainwright, 2015; Ma et al., 2017; Chen et al., 2019a) study the scope of alternating minimization or (projected) gradient stylized algorithmic strategies for the rank constrained optimization problem, similar to Problem (1) — see also Hastie et al. (2016) for related discussions. We should emphasize that our work herein, studies the entire family of nonconvex spectral penalized problems of the form of Problem (3), and is hence more general than the class of estimation problems considered in those works. We establish empirically that this flexible family of nonconvex penalized estimators leads to solutions with better statistical properties than those available from particular instantiations of the penalty function — nuclear norm regularization (2) and rank regularization (1). Along the lines of the aforementioned works, there exists an active stream of research on characterizing the global optimality of local algorithms for various matrix factorization based formulations (Bhojanapalli et al., 2016; Ge et al., 2016; Sun and Luo, 2016; Zheng and Lafferty, 2016; Ge et al., 2017; Shapiro et al., 2018). Our paper focuses on a more general family of nonconvex regularization, with admittedly less strong algorithmic guarantees. Finally, a series of iterative reweighted algorithms have been proposed and discussed (Mazumder et al., 2010; Mohan and Fazel, 2010; Fornasier et al., 2011; Mohan and Fazel, 2012; Gu et al., 2017), largely motivated by the reweighting ideas from sparse recovery problems (Zou, 2006; Candes et al., 2008; Daubechies et al., 2010). Different weight formulas have been suggested to improve the statistical and computational efficiency. These are, however, beyond the scope of the current paper.

1.1 Contributions and Outline

The main contributions of our paper can be summarized as follows:

• We propose a computational framework for nonconvex penalized matrix completion problems of the form (3). Our algorithm: NC-Impute, may be thought of as a novel adaptation (with important enhancements and modifications) of the EM-stylized procedure Soft-Impute

• We present an in-depth investigation of nonconvex spectral thresholding operators, which form the main building block of our algorithm. We also study their effective degrees of freedom (df ),

which provide a simple and intuitive way to calibrate the two-dimensional grid of tuning parameters, extending the scope of the method proposed in nonconvex penalized (least squares) regression by Mazumder et al. (2011) to spectral thresholding operators. We propose computationally efficient methods to approximate the df using tools from random matrix theory.

• We provide comprehensive computational guarantees of our algorithm — this includes the number of iterations needed to reach a first order stationary point and the asymptotic convergence of the sequence of estimates produced by NC-Impute.

• Every iteration of NC-Impute requires the computation of a low-rank SVD of a structured matrix, for which we propose new methods. Using efficient warm-start tricks to speed up the low-rank computations, we demonstrate the effectiveness of our proposal to large scale instances up to the Netflix size in reasonable computation times.

• Over a wide range of synthetic and real-data examples, we show that our proposed nonconvex penalized framework leads to high quality solutions with excellent statistical properties, which are often found to be significantly better than nuclear norm regularized solutions in terms of producing low-rank solutions with good predictive performances.

• Implementations of our algorithms in the R programming language have been made publicly available on github at: https://github.com/diegofrasal/ncImpute.

The remainder of the paper is organized as follows. Section 2 studies several properties of nonconvex spectral penalties and associated spectral thresholding operators, including their effective degrees of freedom. Section 3 describes our algorithmic framework NC-Impute and studies the convergence properties of the algorithm. Section 4 presents numerical experiments demonstrating the usefulness of nonconvex penalized estimation procedures in terms of superior statistical properties on several synthetic datasets — we also show the usefulness of these estimators on several real data instances. Section 5 contains the conclusions and discusses several important future research directions. To improve readability, some technical materials and empirical results are relegated to Section 6.

Notation: For a matrix , we denote its (i, j)th entry by ) is a matrix with its (i, j)th entry given by Ω and zero otherwise, with Ω use the notation ) to denote the projection of A onto the complement of Ω. Let denote the singular values of ) – we will use the notation ) to denote the vector of singular values. When clear from the context, we will simply write instead of ). For a vector , we will use the notation diag(a) to denote an diagonal matrix with ith diagonal entry being

2 Spectral Thresholding Operators

We begin our analysis by considering the fully observed version of Problem (3), given by:

where, for a given matrix Z, a minimizer of the function g(X), denoted by thresholding operator induced by the spectral penalty the SVD of Z. For the nuclear norm regularized problem with the penalty function the corresponding thresholding operator, denoted by ) (say), is given by the familiar soft-thresholding operator (Cai et al., 2010; Mazumder et al., 2010):

where, th entry of separability of the thresholding operator). Here, ) is the the soft-thresholding operator on the singular values of Z and plays a crucial role in the Soft-Impute algorithm (Mazumder et al., 2010). For the rank regularized problem, with

the thresholding operator denoted by ) is given by the hard-thresholding operator (Mazumder et al., 2010):

with A closely related thresholding operator that retains the top r singular values and sets the remaining to zero formed the basis of the Hard-Impute algorithm in Mazumder et al. (2010); Troyanskaya et al. (2001). The results in (5) and (6) suggest a curious link — the spectral thresholding operators (for the two specific choices of the spectral penalty functions given above) are tied to the corresponding thresholding functions that operate only on the singular values of the matrix — in other words, the operators change the singular vectors of the matrix Z. It turns out that a similar result holds true for more general spectral penalty functions as the following proposition illustrates.

denote the SVD of denote the following thresholding operator on the singular values of Z:

Proof. Note that by the Wielandt-Hoffman inequality (Horn and Johnson, 2012) we have that: where, for a vector denotes the standard Euclidean norm. Equality holds when X and Z share the same left and right singular vectors. This leads to:

In the above inequality, note that the left hand side is g(X) (defined in (4)) and right hand side is ¯)) (defined in (7)). It follows that

where, we used the observation that as defined in (7) minimizes ¯addition, this minimum is attained by the function g(X), at the choice completes the proof of the proposition.

Due to the separability of the optimization Problem (7) across the coordinates, i.e., ¯) (where, ¯) is defined in (9)), it suffices to consider each of the subproblems separately. Let ) denote a minimizer of ¯

It is easy to see that the ith coordinate of ) is given by ). This discussion suggests that our understanding of the spectral thresholding operator ) is intimately tied to the univariate thresholding operator (9). Thusly motivated, in the following, we present a concise discussion about univariate penalty functions and the resultant thresholding operators. We begin with some examples of concave penalties that are popularly used in statistics in the context of sparse linear modeling.

Families of Nonconvex Penalty Functions: Several types of nonconvex penalties are popularly used in high-dimensional regression frameworks—see for example, Nikolova (2000); Lv and Fan (2009); Zhang and Zhang (2012). For our setup, since these penalty functions operate on the singular values of a matrix, it suffices to consider nonconvex functions that are defined only on the nonnegative real numbers. We present a few examples below:

• The penalty (Frank and Friedman, 1993) given by

• The SCAD penalty (Fan and Li, 2001) is defined via:

where ) denotes the derivative of

• The MC+ penalty (Zhang, 2010; Mazumder et al., 2011) defined as

Figure 1 shows some members of the above nonconvex penalty families. The penalty function is non differentiable at = 0, due to the unboundedness of 0+. The nonzero derivative at = 0+ encourages sparsity. The penalty functions show a clear transition from the penalty to the penalty — similarly, the resultant thresholding operators show a passage from the soft-thresholding to the hard-thresholding operator. Let us examine the analytic form of the thresholding function induced by the MC+ penalty (for any

Figure 1: [Top panel] Examples of nonconvex penalties = 1 for different values of . [Bottom Panel] The corresponding scalar thresholding operators: = 1, some of the thresholding operators corresponding to the penalty function are discontinuous, and some of the other thresholding functions are “close” to being so.

It is interesting to note that for the MC+ penalty, the derivatives are all bounded and the thresholding functions are continuous for all , the threshold operator (10) coincides with the soft-thresholding operator. However, as 1+ the threshold operator approaches the discontinuous hard-thresholding operator ) — this is illustrated in Figure 1 and can also be observed by inspecting (10). Note that the penalty penalizes small and large singular values in a similar fashion, thereby incurring an increased bias in estimating the larger coefficients. For the MC+ and SCAD penalties, we observe that they penalize the larger coefficients less severely than the simultaneously, they penalize the smaller coefficients in a manner similar to that of the On the other hand, the penalty (for small values of ) imposes a more severe penalty for values of 0, quite different from the behavior of other penalty functions. In general, for a given family of nonconvex penalties ), the effect of () on the nonconvexity can be characterized through the general concavity quantity that is to be introduced in (11).

2.1 Properties of Spectral Thresholding Operators

The nonconvex penalty functions described in the previous section are concave functions on the nonnegative real line. We will now discuss measures that may be thought (loosely speaking) to measure the amount of concavity in the functions. For a univariate penalty function assumed to be differentiable on (0), we introduce the following quantity () that measures the amount of concavity (see also, Zhang (2010)) of

where ) denotes the derivative of

holds:

for some . In inequality (12), subgradient (assuming it exists) of = 0 then the function is simply convex. Using standard properties of spectral functions (Borwein and Lewis, 2006; Lewis, 1995), it follows that -strongly convex iff the vector function:

is -strongly convex on ) denotes the singular values of Z. Let us recall the separable decomposition of ¯) as defined in (9). Clearly, the function -strongly convex (on the nonnegative reals) iff each summand ¯-strongly convex on 0. Towards this end, notice that ¯) is convex on 0 — in particular, ¯is -strongly convex with parameter , provided this number is nonnegative. In this vein, we have the following proposition:

, then the function -strongly convex with

For the MC+ penalty, the condition 0 is equivalent to 1. For the function, with 1, the parameter , and thus the function g(X) is not strongly convex.

is Lipschitz continuous with constant i.e, for all

Proof. We rewrite g(X) as:

), and rearranging the terms in (15), it follows that ), a minimizer of g(X), is given by:

If 0, the function ) is convex for every 0, then the first term appearing in the objective function in (16) is convex. Thus, assuming 0 both summands in the above objective function are convex. In particular, the optimization problem (16) is convex and ) can be viewed as a convex proximal map (Rockafellar, 1970). Using standard contraction properties of proximal maps, we have that:

Since the above holds true for any as chosen above, optimizing over the value of Problem (16) remains convex gives us ˆ), thereby leading to (14).

2.2 Effective Degrees of Freedom for Spectral Thresholding Operators

In this section, to better understand the statistical properties of spectral thresholding operators, we study their degrees of freedom. The effective degrees of freedom or df is a popularly used statistical notion that measures the amount of “fitting” performed by an estimator (Efron et al., 2004; Hastie et al., 2009; Stein, 1981). In the case of classical linear regression, for example, df is simply given by the number of features used in the linear model. This notion applies more generally to additive fitting procedures. Following Efron et al. (2004); Stein (1981), let us consider an additive model of the form:

for ), for the fully observed model above, i.e., (17) is given by:

where denotes the (i, j)th entry of the matrix . For the particular case of a spectral thresholding operator we have ˆ) satisfies a weak differentiability condition, the df may be computed via a divergence formula (Stein, 1981; Efron et al., 2004):

where (For the spectral thresholding operator ), expression (18) holds if the map ) is Lipschitz and hence weakly differentiable — see for example, Cand`es et al. (2013). In the light of Proposition 3, the map ) is Lipschitz when Under the model (17), the singular values of Z will have a multiplicity of one with probability one. We assume that the univariate thresholding operators are differentiable, i.e., ) exists. With these assumptions in place, the divergence formula for ) can be obtained following Cand`es et al. (2013); Mazumder and Weng (2020), as presented in the following proposition.

Proposition 4. Assume that 1 + and the model (17) is in place. Then the degrees of freedom of the estimator is given by:

where the ’s are the singular values of Z.

We note that the above expression is true for any value of 1 + 0. For the MC+ penalty function, expression (19) holds for 1. As soon as 1, the above method of deriving df does not apply due to the discontinuity in the map ). Values of close to one (but larger),

Figure 2: Figure showing the df for the MC+ thresholding operator for a matrix with and v = 1. The df profile as a function of (in the log scale) is shown for three values of . The dashed lines correspond to the df of the spectral soft-thresholding operator, corresponding to . We propose calibrating the () grid to a () grid such that the df corresponding to every value of matches the df of the soft-thresholding operator — as shown in Figure 3.

however, give an expression for the df near the hard-thresholding spectral operator, which corresponds to

To understand the behavior of the df as a function of (), let us consider the null model with = 0 and the MC+ penalty function. In this case, for a fixed (see Figure 2 with a fixed df is seen to increase with smaller values: the soft-thresholding function shrinks the large coefficients and sets all coefficients smaller than to be zero; the more aggressive (closer to the hard thresholding operator) shrinkage operators ()) shrink less for larger values of and set all coefficients smaller than to zero. Thus, intuitively, the more aggressive thresholding operators should have larger df since they do more “fitting” — this is indeed observed in Figure 2. Mazumder et al. (2011) studied the df of the univariate thresholding operators in the linear regression problem, and observed a similar pattern in the behavior of the ) values. For the linear regression problem, Mazumder et al. (2011) argued that it is desirable to choose a parametrization for () such that for a fixed , as one moves across should be the same. We follow the same strategy for the spectral regularization problem considered herein — we reparametrize a two-dimensional grid of (to a two-dimensional grid of () values, such that the df remain calibrated in the sense described above — this is illustrated in Figure 2 (see the horizontal dashed lines corresponding to the constant df values, after calibration). Figure 3 shows the lattice of () after calibration. The values of (˜on each curve induce the same moves down (the penalty becomes more “nonconvex”), the corresponding ˜(the shrinkage) has to increase to maintain the same df.

The study of df presented herein provides a simple and intuitive explanation about the roles played by the parameters () for the fully observed problem. The notion of calibration provides a new parametrization of the family of penalties. From a computational viewpoint, since the general algorithmic framework presented in this paper (see Section 3) computes a regularization surface using warm-starts across adjacent () values on a two-dimensional grid; it is desirable for the adjacent values to be close — the df calibration also ensures this in a simple and intuitive manner.

Computation of df : The df estimate as implied by Proposition 4 depends only upon singular values (and not the singular vectors) of a matrix and can hence be computed with cost The expectation can be approximated via Monte-Carlo simulation — these computations are easy to

Figure 3: Figure showing the calibrated () lattice — for every fixed value of of the MC+ spectral threshold operators are the same across different values. The df computations have been performed on a null model using Proposition 5.

parallelize and can be done offline. Since we compute the df for the null model, for larger values of m, n we recommend using the Marchenko-Pastur law for iid Gaussian matrix ensembles to approximate the df expression (19). We illustrate the method using the MC+ penalty for 1. Towards this end, let us define a function on

For the following proposition, we will assume (for simplicity) that

, then under the model

where is the thresholding operator corresponding to the MC+ penalty with the expectation is taken with respect to independently generated from the Marchenko-Pastur distribution (see Lemma 1, Section 6.1).

Proof. For a proof, see Section 6.1.1.

Note that the variance vin model (17) can always be assumed to be one (by adjusting the value of the tuning parameter accordingly

3 The NC-Impute Algorithm

In this section, we present algorithm NC-Impute. The algorithm is inspired by an EM-stylized procedure, similar to Soft-Impute (Mazumder et al., 2010), but has important innovations, as we will discuss shortly. It is helpful to recall that, for observed data: ), the algorithm SoftImpute relies on the following update sequence

which can be interpreted as computing the nuclear norm regularized spectral thresholding operator for the following “fully observed” problem:

where, the missing entries are filled in by the current estimate, i.e., ). We refer the reader to Mazumder et al. (2010) for a detailed study of the algorithm. Mazumder et al. (2010) suggest, in passing, the notion of extending Soft-Impute to more general thresholding operators; however, such generalizations were not pursued by the authors. In this paper, we present a thorough investigation about nonconvex generalized thresholding operators — we study their convergence properties, scalability aspects and demonstrate their superior statistical performance across a wide range of numerical experiments. Update (20) suggests a natural generalization to more general nonconvex penalty functions, by simply replacing the spectral soft thresholding operator ) with more general spectral operators

While the above update rule works quite well in our numerical experiments, it enjoys limited computational guarantees, as suggested by our convergence analysis in Section 3.1. We thus propose and study a seemingly minor generalization of the rule (21) — this modified rule enjoys superior finite time convergence rates to a first order stationary point. We develop our algorithmic framework below.

Let us define the following function:

for 0. Note that ) majorizes the objective function f(X) defined in (3), i.e., , with equality holding at . In an attempt to obtain a minimum of Problem (3), we propose to iteratively minimize ), an upper bound to f(X), to obtain — more formally, this leads to the following update sequence:

Note that is easy to compute; by some rearrangement of (22) we see:

where + 1). Note that (24) is a minor modification of (21) — in particular, if = 0, then these two update rules coincide.

The sequence defined via (24) has desirable convergence properties, as we discuss in Section 3.1. In particular, as , the sequence reaches (in a sense that will be made more precise later) a first order stationary point for Problem (3). We also provide a finite time convergence analysis of the update sequence (24).

We intend to compute an entire regularization surface of solutions to Problem (3) over a two-dimensional grid of ()-values, using warm-starts. We take the MC+ family of functions as a running example, with (beginning, we compute a path of solutions for the nuclear norm penalized problem, i.e., Problem (3) with on a grid of values. For a fixed value of , we compute solutions to Problem (3) for smaller values of , gradually moving away from the convex problems. In this continuation scheme, we found the following strategies useful:

• For every value of (), we apply two copies of the iterative scheme (23) initialized with solutions obtained from its two neighboring points (). From these two candidates, we select the one that leads to a smaller value of the objective function (

• Instead of using a two-dimensional rectangular lattice, one can also use the recalibrated lattice, suggested in Section 2.2, as the two-dimensional grid of tuning parameters.

The algorithm outlined above, called NC-Impute is summarized as Algorithm 1.

We now present an elementary convergence analysis of the update sequence (24). Since the problems under investigation herein are nonconvex, our analysis requires new ideas and techniques beyond those used in Mazumder et al. (2010) for the convex nuclear norm regularized problem.

3.1 Convergence Analysis

By the definition of we have that:

Let us define the quantities:

where, if 0, then the function )-strongly convex. In particular, from (23), it follows that ), a subgradient of the map ) (evaluated at zero. We thus have:

Now note that, by the definition of , we always have: which combined with (25) leads to (replacing

In addition, we have:

Combining (26) and (27), and observing that ), we have:

Since ∆0, the above inequality immediately implies that and the improvement in objective values is at least as large as the quantity ∆). The term ∆) is a measure of progress of the algorithm, as formalized by the following proposition.

and for any , let us consider the update Then the following are equivalent:

(iii) is a fixed point, i.e.,

= 0, we have that . The condition implies that

where the term on the right equals is a fixed point.

say — this implies that ∆. Let us now consider two cases, depending upon the value of 0, then we have . On the other hand, if the quantities = 0, the conclusion needs to be modified: ∆0 implies that

Motivated by the above discussion, we make the following definition of a first order stationary point for Problem (3).

is said to be a first order stationary point for Problem (3) if ∆is said to be an -accurate first order stationary point for Problem (3) if ∆

Proposition 7. The sequence is decreasing and suppose it converges to ˆf. Then the rate of convergence of to this first order stationary point is given by:

Proof. The arguments presented preceding Proposition 7 establish that the sequence