MM for Penalized Estimation

2019·Arxiv

Abstract

Abstract

Penalized estimation can conduct variable selection and parameter estimation simultaneously. The general framework is to minimize a loss function subject to a penalty designed to generate sparse variable selection. The majorization-minimization (MM) algorithm is a computational scheme for stability and simplicity, and the MM algorithm has been widely applied in penalized estimation. Much of the previous work has focused on convex loss functions such as generalized linear models. When data are contaminated with outliers, robust loss functions can generate more reliable estimates. Recent literature has witnessed a growing impact of nonconvex loss-based methods, which can generate robust estimation for data contaminated with outliers. This article investigates MM algorithm for penalized estimation, provides innovative optimality conditions and establishes convergence theory with both convex and nonconvex loss functions. With respect to applications, we focus on several nonconvex loss functions, which were formerly studied in machine learning for regression and classification problems. Performance of the proposed algorithms are evaluated on simulated and real data including cancer clinical status. Efficient implementations of the algorithms are available in the R package mpath in CRAN.

Keywords classification MM algorithm nonconvex quadratic majorization regression robust estimation variable selection

1 Introduction

In predictive modeling, the data may contain a large number of predictors such as clinical risk factors of hospital patients, or gene expression profiles of cancer patients. It is a common interest to identify effective predictors and estimate predictor parameters simultaneously. A popular approach is to optimize an objective function, log-likelihood for instance, subject to a penalty. Penalty functions including the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP) have been widely applied to many statistical problems (Tib- shirani, 1996; Fan and Li, 2001; Zhang, 2010).

To simplify computations, the majorization-minimization (MM) algorithm provides a useful framework to decomposing a complex minimization problem into a sequence of relatively elementary problems. The MM algorithm has been an interesting research topic in statistics, mathematics and engineering (Lange, 2016; Byrne, 2018; Razaviyayn et al., 2013). In particular, penalized regression with MM has attracted some attention, see Schifano et al. (2010) and references therein. The article emphasizes the quadratic majorization (QM) for the generalized linear models. Despite convergence analysis of the QM algorithm, the loss function was convex. Chi and Scott (2014) developed a QM algorithm to a LASSO type minimum distance estimator for classification, where the limiting behavior relies on more strict assumptions on the penalties: it is an elastic net type of penalty, not the original LASSO, nor the nonconvex penalty SCAD.

In this article, we provide new results for the MM algorithm. The major contributions are three-fold: First, we construct optimality conditions for penalized estimation and penalized surrogate estimation in the MM algorithm. The connection between the two optimization problems are established such that penalty tuning parameters can be conveniently determined from the latter. Second, we establish contemporary theory that the QM algorithm attains the minimum value with convex loss and penalty. Furthermore, we develop the limiting behavior of the MM algorithm for nonconvex loss functions with the LASSO, SCAD or MCP penalty. Finally, several penalized nonconvex loss functions, for the first time, are applied with QM algorithm for robust regression and classification.

Nonconvex loss functions play a critical role in robust estimation. Compared to traditional convex loss functions, a nonconvex function may provide more resistant estimation when data are contaminated by outliers. For instance, the least squares loss is not robust to outliers (Alfons et al., 2013). Furthermore, variable selection is also strongly influenced by outliers. Robust estimation can be obtained by downweighting the impact of the extreme values. Examples include the Huberized LASSO for regression and classification (Rosset and Zhu, 2007) and penalized least absolute deviations (LAD) (Wang et al., 2007). However, a breakdown point analysis shows that convex loss functions such as the LAD-LASSO or penalized M-estimators are not robust enough and even a single outlier can send the regression coefficients to infinity (Alfons et al., 2013). On the other hand, bounded nonconvex loss functions have shown more robust to outliers than traditional convex loss functions in regression and classification (Wu and Liu, 2007; Chi and Scott, 2014; Li and Bradic, 2018; Wang, 2018a,b). Therefore, this article focuses on applications of penalized nonconvex loss functions.

The rest of this article is organized as follows. In Section 2, we study optimality conditions of penalized loss functions. In Section 3, we describe the MM algorithm, provide optimality conditions of penalized surrogate loss functions in the MM algorithm, and establish connections of minimizers between the original and surrogate optimization problems. We provide details with applications to nonconvex loss functions. In Section 4, convergence analysis is conducted. In Section 5, a simulation study in regression and classification is presented. In Section 6, the proposed methods are evaluated on data application predicting breast cancer clinical status. We conclude with a discussion in Section 7. Some numerical results, technical details such as Clarke subd-ifferential and regular at a point, and proofs are presented in the online supplement.

2 Penalized estimation

For training data (= 1, ..., n, j = 0, 1, ..., p, let = (denote a (p+1)-dimensional predictor with the first entry 1, (is the response vector, = (is a (p + 1)-dimensional coefficient vector with the intercept. Consider a loss function ), for instance, ) = in linear regression, where f is the prediction of y. For some link function ) define

The goal is to minimize the empirical loss ). For the purpose of variable selection, define a penalized loss function

where 1, and ) is the penalty function satisfying the following assumptions (Schifano et al., 2010):

) is continuously differentiable, nondecreasing and concave on (0) with (0) = 0 and 0 (0+) .

Minimizing the penalized loss function ) can provide shrinkage estimates thus leading to zero values for small coefficients in magnitude. The following penalty functions have been widely studied:

(i) the LASSO penalty (Tibshirani, 1996), for 0

(ii) the SCAD penalty (Fan and Li, 2001)

The SCAD and MCP penalty functions with large value a are comparable to the LASSO since if . These penalty functions satisfy Assumption 1 and Assumption 2 (more restrictive than Assumption 1).

) is continuously differentiable, nondecreasing and concave on (0) with (0) = 0, 0 (0+) and directional derivative at 0 in the direction given by (0; ) = for some 0.

In particular, = 1 for the LASSO, SCAD and MCP penalty. Assumption 2 also covers hard thresholding penalty.

Assumption 1 ensures the penalty function is locally Lipschitz continuous at every point including the origin due to a bounded derivative. A penalty function may violate Assumption 1. For instance, the bridge penalty ) = for (0, 1) has (0+) = . In other words, the bridge penalty is not Lipschitz continuous near = 0.

We establish conditions for minimizing ). Let = (de- note a minimizer and corresponding value = () without the in- tercept. Since is not penalized, we focus on optimality conditions on , since an optimality condition on is a trivial problem in the current setting. Whether is a local or global minimizer is specified in the context. If is a minimizer, then usually a set of conditions constraining must be satisfied. If these conditions are comprehensively constructed, we call them complete optimality conditions, which is the typical case when dealing with convex optimization problems. On the other hand, if only a subset of complete optimality conditions are known, we call them incomplete optimality conditions. This would occur in nonconvex optimization problems.

Theorem 1 Suppose is a local minimizer of ), where ) is continuously differentiable. Then:

(a) Under Assumption 1, optimality conditions for are given by

(b) Under Assumption 2, an optimality condition for = 0, j = 1, ..., p is given by

For the LASSO, SCAD and MCP penalty functions with = 1, from (7) we define

In real data analysis, we often need to determine a good value , which can be chosen from select values closely related to the quantity . From Theorem 1, if = 0, j = 1, ..., p, we deduce that . On the other hand, the converse is not valid, unless stronger conditions are posed as in the next theorem.

Theorem 2 Assume that ) is differentiable convex, and ) is the LASSO penalty function. Then is a global minimizer of ) if and only if

where ) is the subdifferential of ) at . Or equivalently

for j = 1, ..., p.

The optimality condition in Theorem 2 is standard for convex functions, yet helpful when investigating the relationship between the function ) and its surrogate function in the MM algorithm described below. The value has much stronger implications under Theorem 2 due to the convexity assumption. From(10) Theorem 2 concludes that is the smallest such that = 0, j = 1, ..., p. Without convexity assumption, since (7) is not sufficient and/or incomplete, the value may neither guarantee = 0 nor produce the smallest with = 0.

3 MM algorithms

3.1 Penalized estimation using MM algorithms

The MM algorithm is an iterative procedure. Given an estimate

in the kth iteration, ) is minimized at the k + 1 iteration to obtain an updated minimizer . The MM algorithm then updates z with the new estimate . This process is repeated until convergence. From (11) the MM algorithm generates a descent sequence of estimates:

This relationship leads to the following result.

Lemma 1 Suppose ) majorizes ) at z. If is a local (global) minimizer of ), is a local (global) minimizer of ).

The converse of Lemma 1 is not true. Consider ) = = 0. Let ) = . Therefore ) majorizes = argmin ) and ) = ) = 0. However, the converse is invalid. To overcome this problem, for differentiable convex functions, we develop a connection between a function and its “surrogate” function in a broader sense, including but not limited to the induced functions by majorization.

Lemma 2 Suppose ) and ) are differentiable convex functions. Then is a global minimizer of ) satisfying ) = ) if and only if is a global minimizer of ).

To accommodate predictors, define

Obviously the transformation ) preserves the relationship in (11) such that ) majorizes ) at the current estimate = (). Along the line of Lemma 2, we characterize a class of surrogate functions:

Consider the following surrogate penalized loss:

From Lemma 1 we deduce that if is a minimizer of ), it must be a minimizer of ). The following results lay out not only the optimality conditions of the surrogate function, but also the connection between the surrogate and its original function deduced from.

Theorem 3 Assume that ) and ) are continuously differentiable functions, and ) majorizes ) at . Suppose is a local minimizer of ), and Assumption 3 holds at . Then:

(a) Under Assumption 1, an optimality condition is

(b) Under Assumption 2, an optimality condition for = 0, j = 1, ..., p is

For the LASSO, SCAD and MCP penalty functions, from (14) we define

Notice ˜is the same as in (8) under the specified assumptions. Theorem 3 implies that if = 0, j = 1, ..., p, then ˜. Like before, the converse is generally invalid unless stronger conditions are provided as in the next theorem.

Theorem 4 Assume that ) and ) are differentiable convex functions, and ) is the LASSO penalty function, Then is a global minimizer of ), where Assumption 3 holds at , if and only if is a global minimizer of ) or equivalently

Theorem 4 on the penalized surrogate loss ) is a mirror of Theorem 2 on the penalized loss ). Furthermore, it is reasonable to deduce that similar results hold for the EM algorithm (a special case of the MM algorithm) in penalized estimation. Such a connection, due to the convexity assumption, is stronger than that between Theorem 1 and Theorem 3. These results suggest that instead of minimizing the original objective function ), we can minimize the surrogate function ). The iterative MM procedure is summarized in Algorithm 1.

3.2 Quadratic majorization

The quadratic majorization (QM) is based on the second-order Taylor expansion for a twice differentiable function ) such that

for some on the line segment connecting z and u. Suppose ˜B is positive definite and ˜) is semipositive definite for all arguments w, then

majorizes ) at z. Since ) = ), Assumption 3 holds at z for functions ) and ). The QM is a special application of the more general MM algorithm when a suitable bound ˜B exists (Lange, 2016). The generic terms defined by ) in (12) can be replaced with the express formula (18).

While an upper bound ˜B may not be unique, to simplify computation as often possible, assume ˜B is a square diagonal matrix with all its main diagonal entries equal, i.e., ˜B = BI, where B is a scalar and I is the identity matrix. Furthermore the link function ) takes different forms:

Define in regression problems. By (18) and the chain rule, we have

For classification problems with margin , we have

With , minimizing unpenalized loss ) in (1) amounts to minimizing

where

The constant weight B has no impact on the estimation given ), in contrast to the penalized estimation described below. Putting together, with iterated least squares estimation for the updated response variable, Algorithm 2 can be used to minimize ) in (1) and estimate parameters.

For penalized estimation, the surrogate loss function (13) simplifies to:

Since the surrogate function ) satisfies Assumption 3, Theorem 3 is applicable. Theorem 4 is effective for convex loss functions such as those in generalized linear models (GLM), where quadratic majorization may result in possibly different form than (21). Minimizing ) in (2) can be achieved with Algorithm 3, and technical details are discussed in the subsequent context and supplement.

3.3 Penalized least squares

Minimizing the penalized least squares (22) is hampered by the fact that the penalty function ) is not differentiable at 0. Let ) be a function that approximates the penalty function ), the following methods have been proposed in the literature.

(iii) Fan and Li (2001); Hunter and Li (2005) considered a local quadratic approximation (LQA):

In these proposals ) essentially majorizes ) at . We implement method (i) in the algorithm since the efficient coordinate decent algorithm can easily provide penalized least squares estimates for a variety of penalty functions including the LASSO, SCAD and MCP (Friedman et al., 2010; Breheny and Huang, 2011). We nevertheless investigate properties of all three approaches in Section 4.

3.4 Applications to nonconvex loss functions

Traditional regression is based on minimizing a loss function ), where ) = . The loss is not robust to outliers. Robust regression has utilized absolute difference ) = loss) and Huber loss, both are convex loss functions. The nonconvex loss ClossR with a parameter is described in Table 1 (Wang, 2018a). The ClossR with = 0.9 is plotted in Figure 1, along with the aforementioned loss functions in regression. It is clear that bounded ClossR downweights large values more than other convex loss functions, effectively having better resistance to outliers.

For a binary outcome y taking values +1 and 1, denote by u = yf the margin of a classifier f. Used in a logistic regression, ) = log(1+exp()) is a popular choice in statistics. However, the logistic loss is unbounded and can’t control outliers well. Nonconvex loss functions have been proposed to address the robustness of classification, and three examples are listed in Table 1. These loss functions approximate the 0-1 loss, as shown in Figure 2. The value controls robustness of a loss function. For Closs, a smaller value is more robust to outliers while for other loss functions, a larger value is more robust. A typical strategy in practice is working from less robust to more robust and selecting a proper value of according to prediction.

For the loss functions in Table 1, we can obtain penalized estimates with Algorithm 3, where factor B has been derived by Wang (2018a). Since the functions are continuously differentiable, Theorem 1 and Theorem 3 are applicable. We list in the supplement Table 5. The starting estimates of are based on a functional gradient decent boosting algorithm (Wang, 2018b).

4 Convergence analysis

To conduct a convergence analysis of Algorithm 1 for penalized estimation in convex and nonconvex problems, we construct a unified form of majorizers for the loss functions in Section 3.4 as well as those in the GLM. Consider a twice differentiable function ) in (1). The Taylor expansion at a given value

is

for some ˜on the line segment connecting and , where

For ) defined in (19), we have

which can be applied to functions including but not limited to the nonconvex functions in Section 3.4. For the GLM, the negative log-likelihood function is

where ) and ) are known functions, is the canonical parameter with ) (if the link is canonical, ), and is the known dispersion parameter. The negative log-likelihood function has the form:

We have

() = 1)

For the canonical link ) = z, we have

Let X denote the + 1) matrix of values, diagonal matrix of weights with ith diagonal element ). Then we have

A matrix W is constructed to majorize ). Let ˜= 0, 1, ..., p denote the eigenvalues of ). Assume that

and let W = I, where I is the identity matrix. If X is full rank, we may define

Obviously the (p + 1) + 1) matrix W is positive definite, and all the eigenvalues of W are positive. Let We have

where the inequality means that is positive semidefinite. Denote

For ) defined in Section 3.3 and = (, denote

Finally we define the surrogate loss

A similar result as in Schifano et al. (2010) is obtained.

Theorem 5 Assume twice differentiable function ) satisfies (26) or (27), ) is given by (30), and penalty ) satisfies Assumption 1. Then: ) majorizes ) at :

The inequality holds if .

Theorem 5 implies that Algorithm 1 provides a nonincreasing sequence ). We analyze Algorithm 1 under different assumptions in the sequel. The following convergence analysis does not cover (25) which is not defined when = 0. See Hunter and Li (2005) for a modified majorization to avoid such a degenerate case. We first focus on convex optimization problems.

Theorem 6 Assume twice differentiable convex function ) satisfies (26) or (27), ) is given by (30), the surrogate penalty ) = 0, and the iterates are generated by Algorithm 1.Then the loss values ) are monotonically decreasing and liminf ). Furthermore converges to a minimum point of ) if ) is either (i) coercive or (ii) bounded below with bounded intercept .

The bounded intercept in Theorem 6 ensures that penalty term ) is coercive for . Otherwise ) is not coercive for despite that ) is coercive for meaning implies . To see this, denote

Then . Denote ˜) and let = (0, ..., 0), a p-dimensional vector, then we have = (and ˜) = ) = 0 by definitions. Thus ) is a constant as or . In conclusion, as , ˜doesn’t always hold. This proves that ) is not coercive for if the intercept is not bounded.

The next two theorems are particularly relevant for nonconvex functions. We consider a larger class of surrogate loss ) which covers the QM (29) as a special case while the surrogate loss function ) has the same form as in (30).

Notice that ) given by (29) satisfies assumptions of Theorem 7. We can summarize the implications of Theorem 7: applying the QM to the loss functions in the GLM or Table 1, and applying an appropriate majorization such as (23) or (24) to the penalty functions including the LASSO, SCAD and MCP, then every limit point of the iterates generated by the MM algorithm is a Dini stationary point. While a Dini stationary point is always a Clarke stationary point, the converse is not valid. Under different conditions, the iterates of the MM algorithm converges to a Clarke stationary point.

Theorem 8 Assume continuous function ) is locally Lipschitz continuous, ) is coercive or bounded below with bounded intercept , the set of Clarke stationary points S of ) is a finite set, the surrogate function ) is strongly convex and strictly majorizes ) at , the surrogate loss ) is given by (30), the surrogate penalty ) is convex and majorizes ) at , and the iterates are generated by Algorithm 1. Then converges to a Clarke stationary point of the problem minimizing ).

Theorem 8 can be applied to a broad loss functions including those in the GLM and Table 1. These functions are continuously differentiable, thus locally Lipschitz continuous. Construct a surrogate loss ) given by (30) with . Then ) strictly majorizes ). From ) = we know ) is positive definite. Suppose Assumption 1 holds for ), then ) is convex and majorizes ) at ) is given by (23) for the LASSO or (24). This implies some of the assumptions of Theorem 8 have been met. If additional assumptions in the theorem are valid, then we deduce that Algorithm 1 iterates converge to a Clarke stationary point. Unlike Theorem 7, Theorem 8 doesn’t assume equivalence of the first order Dini derivative between the penalized surrogate loss ) and objective function ). When this does hold, as the QM in (29) and the surrogate penalty function in (23) or (24), Theorem 7 provides stronger conclusion than Theorem 8.

5 Simulation study

In the simulated examples, we evaluate performance of the proposed robust variable selection methods, i.e., MM for penalized estimation with Algorithm 3, and compare with the standard non-robust algorithms. The parameter of SCAD and MCP is fixed at a = 3. This choice works well for the simulation study. The responses in each data set are randomly contaminated with percentage of v=0, 5, 10 and 20. We generate random samples of training/tuning/test data. Training data are used for model estimation. Tuning samples are used to select optimal penalty parameters with the smallest loss values and classifica-tion errors in regression and classification problems, respectively. Test data are used to evaluate prediction accuracy except for example 1 whose prediction accuracy is not from the test data. To evaluate variable selection, we compute sensitivity and specificity, where sensitivity is the proportion of number of correctly selected predictors among truly effective predictors, and specificity is the proportion of number of correctly non-selected predictors among truly ineffective predictors. The intercept is not counted as a variable. A good variable selection method should obtain both sensitivity and specificity close to 1. We repeat Monte Carlo simulation 100 times in the following examples.

5.1 Example in regression

Example 1: Let , where = 1= (3, 1.5, 0, 0, 2, 0, 0, 0)N(0, 3)) with = 0for i, j = 1, ..., 8.

Without the intercept, similar data was generated in Fan and Li (2001). The response variable y is contaminated by randomly multiplying 1, 000 for v = 0, 5, 10 and 20 percent of training (n=200) and tuning data (n=200). Different estimation methods are evaluated based on the mean squared prediction error E[(ˆ] (Hunter and Li, 2005). We fix = 10 in ClossR for which larger value is more resistant to outliers. Table 2 and Table 6-8 in the supplement display the results for penalized ClossR methods along with standard least squares (LS), LS-based penalized methods and penalized Huber regression proposed by Yi and Huang (2017). The best prediction results with the associated variable selection are highlighted in bold.

The ClossR-based methods have the best prediction and outperform the HuberLASSO in all four settings of outliers. For the LASSO penalty, ClossR has better prediction than Huber but also selects more variables. For the clean data, ClossR-based methods are slightly better or comparable to LS-based methods when the penalty function is the same. If one believes that prediction accuracy from the penalized LS methods should be superior to the penalized ClossR in uncontaminated cases, the tiny differences in Table 2 perhaps can be explained by random variation in finite samples and selection of penalty parameters. An interesting research topic is to theoretically compare prediction between penalized LS and penalized ClossR.

For clean data, all effective variables are selected, and ClossR-based SCAD or MCP has better selection than LASSO, consistent with LS-based penalty functions. For contaminated data, all correct variables are selected in robust estimation, but not in the penalized LS estimation. In addition, ClossR-based SCAD or MCP determines fewer ineffective variables than LASSO.

For nonconvex optimization, Algorithm 3 can generate different local estimates depending on the initial values. To illustrate, two methods of initial values elaborated in supplement A.1 are compared in Figure 3 for loss ClossR and nonconvex penalty SCAD with v=20. Prediction from boosting-supported backward solution path is more reliable than the forward path. The two largest errors from the latter ultimately generates a greater mean prediction error despite a smaller median value.

5.2 Examples in classification

The following data generations are from Wu and Liu (2007) and have also been investigated in Wang (2018a,b) for robust classification. These examples include both linear and nonlinear classifiers.

Example 2: Predictor variables () are uniformly sampled from a unit disk + 1 and y = 1 if and -1 otherwise.

Example 3: Predictor variables () are uniformly sampled from a unit disk + 1 and y = 1 if (+ 0 and -1 otherwise.

In each example, we also generate 18 noise variables from uniform[-1, 1]. To add outliers, we randomly select v percent of the data and switch their class labels. With the simulated data, the performance of an algorithm can be compared with the optimal Bayes errors. The training/tuning/test sample sizes are n = 200/200/10, 000 in Example 2 and 1, 000/1, 000/10, 000 in Example 3. To accommodate nonlinear classifiers in Example 3, the predictor space is doubled to p = 40 including quadratic terms for i = 1, 2, ..., 20. In Example 2 and 3, no-intercept models have better prediction accuracy from logistic-based and nonconvex loss-based variable selection methods. Since the data are not generated from logistic regression models, estimation from a logistic regression model is not a clear-cut winner even for the clean data. We fix Closs = 0.9, Gloss = 1.1 and Qloss = 0.2 in the estimation.

The results for Example 2 are demonstrated in Table 3 and Table 9 in the supplement. It is clear that the performance of logistic-based methods relies on the cleanliness of the data, and is not robust with outliers. On the other hand, the nonconvex loss-based variable selection methods are better resistant to outliers. In particular, with data contaminated by v=0, 5, 10 and 20 percentage, the best prediction results 0.0047(QlossSCAD), 0.0548(GlossSCAD), 0.1055(GlossSCAD), 0.2067(QlossSCAD) are the closest to the corresponding Bayes errors. The corresponding variable selections in highlighted numbers in Table 9 almost perfectly match the Bayes optimal classifier. Since all effective variables are correctly selected, specificity is not reported in lieu of supplement Table 9. It can be interesting to compare the LASSO and boosting due to its similarity (Efron et al., 2004). The same random data was generated in Wang (2018a), where the robust boosting algorithm provided similar prediction results and slightly more sparse variable selection. For instance, the boosting errors from the Gloss are 0.0108, 0.0628, 0.1178, 0.2217 for outlier percentage v=0, 5, 10, 20, respectively (Wang, 2018a), while the corresponding QlossLASSO errors are 0.0154, 0.0672, 0.1187, 0.2223 in Table 3. As the percentage of outliers changes, the average number of variables selected in the boosting algorithm ranges from 2.5 to 2.8 (Wang, 2018a) while the QlossLASSO ranges from 4.1 to 4.9 in supplement Table 9.

The results for Example 3 are summarized in the supplement, and similar conclusions like Example 2 can be drawn. In all outlier scenarios, the nonconvex loss-based methods triumph over the logistic-based methods on prediction. In particular, the best average test errors are 0.0016 (ClossMCP), 0.0520 (QlossSCAD), 0.1018 (QlossSCAD) and 0.2019 (QlossSCAD) for percentage of outliers v=0, 5, 10 and 20, respectively. In addition, the corresponding variable selections in featured numbers in Table 11 resemble the Bayes optimal classifier. Except for v=0, all variables are more accurately identified by the nonconvex loss-based methods than the logistic-based methods. Both and are selected in all estimation methods in each simulation run. Supplement Table 12 counts variable selection for or in 100 simulations. Across all methods, the average specificity excluding is 98. Hence, the details are unreported. In conclusion, all methods correctly select and , and the remaining variables including and are typically eliminated by the variable selection procedures.

6 Predicting breast cancer clinical status

Shi et al. (2010) used gene expressions to predict breast cancer clinical status with 164 estrogen receptor positive cases and 114 negative cases. The same data set has been evaluated for a variety of robust classification algorithms (Li and Bradic, 2018; Wang, 2018a,b). Applying MM for penalized estimation, the analysis described below is reproduced in a vignette distributed with the R package mpath. Following the same pre-selection procedure, the data set is pre-screened to keep 3000 most significant genes from 22,282 genes with twosample t-test. These genes are then standardized. The data set is randomly split to training data and test data, and the training data contains 50/50 positive/negative cases and the test data has 114/64 positive/negative cases. The positive/negative status in the training data is randomly flipped with percentage v=0, 5, 10, 15. The parameter of SCAD and MCP is fixed at a = 3.7 and a = 12, respectively. The predictive models are aided with a five-fold cross-validation to select the optimal penalty parameters with the smallest classification errors. The analysis is conducted with Closs = 0.9, Gloss = 1.01 and Qloss = 0.5 (Wang, 2018a). This process is repeated 100 times and the results are summarized in Table 4 and supplement Table 13.

Prediction accuracy is more similar in the clean data than those contaminated, while the logistic-based methods suffer from outliers. The best prediction with the Closs-based methods are 0.0818, 0.0839, 0.0910, 0.1143 for v=0, 5, 10, 15, respectively. This is the best prediction for this data set in the robust prediction literature. See Li and Bradic (2018); Wang (2018a,b) and the references therein. With more data contamination, more variables are selected. Loss function and penalty method also contribute to different variable selection results. In addition, sparse variable selection is maintained even with outliers. For instance, with 15% outliers, ClossSCAD not only has more accurate prediction than LogisSCAD (0.1143 vs 0.1566), but also selected only 2.8 genes on average, compared with 19.1 genes by LogisSCAD.

7 Discussion

Outliers can substantially cause bias in traditional estimation methods and produce inaccurate variable selection problem. This problem has been tackled by robust estimation, which has a long history in statistics methodology research and applications. The previous methods are largely based on convexbased loss functions, which may not be sufficient to guard the impact of outliers. Adding to this is the variable selection challenge in predictive modeling, specifically for the high-dimensional data.

In this article, we propose penalized nonconvex loss functions for several popular penalty functions. We have illustrated that the proposed methods can improve estimation and variable selection compared to conventional estimation and robust estimation. The estimation problem is decoupled into an iterative MM algorithm with each iteration conveniently conducted by existing optimization algorithms. The QM embedded in the MM algorithm is particularly appealing since the induced penalized least squares problem has been well studied. For tuning parameter identification, it is a typical practice to construct a solution path along with penalty parameters. A popular procedure is to start with a penalty parameter such that all predictor coefficients are zero. As is well-known in nonconvex optimization problems, different starting values may lead to different solutions if local minimizers are achieved. We propose an alternative yet effective approach to compute a solution path. The so-called backward path generates starting values utilizing the boosting algorithm and find solutions backwards towards zero coefficients.

This article evaluates optimality conditions for both convex and nonconvex problems in the context of penalized estimation. As expected, we could only partially characterize the optimality conditions for nonconvex problems. The link between the original optimization problem and the surrogate optimization problem, nevertheless, provides a useful utility to compute . The convergence analysis provides new results for both convex and nonconvex problems. For convex problems, the iterated estimates of the MM algorithm with QM converges to a minimum point. For nonconvex problems, the iterates of the MM algorithm with QM converges to a local stationary point in the framework of the Dini or Clarke derivative.

Future direction of research includes expanded penalty functions including those not locally Lipschitz continuous. For instance, the bridge penalty for (0, 1). Since the generalized derivatives like Dini and Clarke derivative are both restricted to locally Lipschitz continuous functions, additional mathematical tools are required to assess properties of the penalized loss functions.

References

Alfons, A., Croux, C., Gelper, S., et al. (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1):226–248.

Bertsekas, D. P. (1999). Nonlinear Programming. Athena Scientific, second edition.

Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for noncon- vex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232–253.

Byrne, C. L. (2018). Auxiliary-function minimization algorithms. Applied Analysis and Optimization, 2(2):171–198.

Chi, E. C. and Scott, D. W. (2014). Robust parametric classification and vari- able selection by a minimum distance criterion. Journal of Computational and Graphical Statistics, 23(1):111–128.

Clarke, F. (2013). Functional Analysis, Calculus of Variations and Optimal Control, volume 264. Springer Science & Business Media.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2):407–499.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli- hood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33:1–22.

Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Annals of Statistics, 33(4):1617–1642.

Lange, K. (2016). MM Optimization Algorithms. Philadelphia: SIAM.

Li, A. H. and Bradic, J. (2018). Boosting in the presence of outliers: adap- tive classification with nonconvex loss functions. Journal of the American Statistical Association, pages 1–15.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media.

Razaviyayn, M., Hong, M., and Luo, Z.-Q. (2013). A unified convergence anal- ysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153.

Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030.

Schifano, E. D., Strawderman, R. L., Wells, M. T., et al. (2010). Majorization- minimization algorithms for nonsmoothly penalized objective functions. Electronic Journal of Statistics, 4:1258–1299.

Shi, L., Campbell, G., Jones, W., Campagne, F., et al. (2010). The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology, 28(8):827–838. https://goo.gl/8bdBDE.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.

Wang, H., Li, G., and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Business & Economic Statistics, 25(3):347–355.

Wang, Z. (2018a). Quadratic majorization for nonconvex loss with applica- tions to the boosting algorithm. Journal of Computational and Graphical Statistics, 27(3):491–502.

Wang, Z. (2018b). Robust boosting with truncated loss functions. Electronic Journal of Statistics, 12(1):599–650.

Wu, Y. and Liu, Y. (2007). Robust truncated hinge loss support vector ma- chines. Journal of the American Statistical Association, 102(479):974–983.

Yi, C. and Huang, J. (2017). Semismooth Newton coordinate descent algo- rithm for elastic-net penalized Huber loss regression and quantile regression. Journal of Computational and Graphical Statistics, 26(3):547–557.

Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax con- cave penalty. Annals of Statistics, 38(2):894–942.

Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509.

Fig. 1 Loss functions for regression

Fig. 2 Loss functions for classification

Fig. 3 Prediction of ClossRSCAD from different solution paths in Example 1 and v=20

Table 1 Nonconvex loss function

Table 2 Mean squared prediction errors and standard deviations (in parentheses) for v percentage of outliers in Example 1. Numbers > 1e + 4 are suppressed.

Table 3 Mean test errors and standard deviations (in parentheses) for v pencentage of outliers in Example 2.

Table 4 Mean test errors and standard deviations (in parentheses) for v pencentage of outliers in the breast cancer data.

Appendix A Technical details of Algorithm 3

A.1 Solution paths

To optimize ), tuning parameters such as can be chosen by cross-

validation or other data driven methods from some prespecified values. We

make use of in (8) or equivalently ˜in (15). The latter can be handy

with existing computer codes for penalized least squares estimation. We com-

pute the solutions along a regularized path for a sequence of . A forward path

begins with all initial non-intercept parameters zero for a fixed value, then

compute solutions for a decreasing sequence of = 1, ..., K, ranging

from to for a small number 0 1 (Friedman et al.,

2010). For nonconvex estimation in Section 3.4, Algorithm 3 typically seeks

a local optimization solution. It is well-known that there could be potentially

different estimates depending on the initial values. In practice, different initial

values can be utilized to obtain the optimal result. This article emphasizes

boosting as an innovation for initial value estimation, coupled with an alter-

native backward path, an increasing sequence of 1 from to

. This would provide different initial values for nonconvex optimization.

Most importantly, the numerical results in the simulation and data analysis

warrant this choice.

A.2 Update on active set

To speed up computations, we may implement the active set strategy in Al-

gorithm 3. First, we determine the active set, a set of variables with non-zero

coefficients, from the initial value in step 1. Run step 2 to 5 with variables in

the active set. Then run step 4 once with all variables in the full set to update

the active set. If there is no change, the computation finishes. Otherwise, re-

peat the above procedure. Second, we employ the active set internally within

step 4 when minimizing penalized least squares. We begin with an initial active

set containing all variables, cycle through all coefficients in the active set with

coordinate descent algorithm until convergence, then cycle through the full set

with all variables, but only once. This generates a new active set. Compared

with the previous active set, if no change, then step 4 converges. Otherwise,

we repeat the above process until convergence. The first two procedures invoke

the full set so that variable selection is not predominated or trapped by the

initial estimation including boosting. Indeed, if all the initial parameters are

zero (forward path) in ste