In predictive modeling, the data may contain a large number of predictors such as clinical risk factors of hospital patients, or gene expression profiles of cancer patients. It is a common interest to identify effective predictors and estimate predictor parameters simultaneously. A popular approach is to optimize an objective function, log-likelihood for instance, subject to a penalty. Penalty functions including the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP) have been widely applied to many statistical problems (Tib- shirani, 1996; Fan and Li, 2001; Zhang, 2010).
To simplify computations, the majorization-minimization (MM) algorithm provides a useful framework to decomposing a complex minimization problem into a sequence of relatively elementary problems. The MM algorithm has been an interesting research topic in statistics, mathematics and engineering (Lange, 2016; Byrne, 2018; Razaviyayn et al., 2013). In particular, penalized regression with MM has attracted some attention, see Schifano et al. (2010) and references therein. The article emphasizes the quadratic majorization (QM) for the generalized linear models. Despite convergence analysis of the QM algorithm, the loss function was convex. Chi and Scott (2014) developed a QM algorithm to a LASSO type minimum distance estimator for classification, where the limiting behavior relies on more strict assumptions on the penalties: it is an elastic net type of penalty, not the original LASSO, nor the nonconvex penalty SCAD.
In this article, we provide new results for the MM algorithm. The major contributions are three-fold: First, we construct optimality conditions for penalized estimation and penalized surrogate estimation in the MM algorithm. The connection between the two optimization problems are established such that penalty tuning parameters can be conveniently determined from the latter. Second, we establish contemporary theory that the QM algorithm attains the minimum value with convex loss and penalty. Furthermore, we develop the limiting behavior of the MM algorithm for nonconvex loss functions with the LASSO, SCAD or MCP penalty. Finally, several penalized nonconvex loss functions, for the first time, are applied with QM algorithm for robust regression and classification.
Nonconvex loss functions play a critical role in robust estimation. Compared to traditional convex loss functions, a nonconvex function may provide more resistant estimation when data are contaminated by outliers. For instance, the least squares loss is not robust to outliers (Alfons et al., 2013). Furthermore, variable selection is also strongly influenced by outliers. Robust estimation can be obtained by downweighting the impact of the extreme values. Examples include the Huberized LASSO for regression and classification (Rosset and Zhu, 2007) and penalized least absolute deviations (LAD) (Wang et al., 2007). However, a breakdown point analysis shows that convex loss functions such as the LAD-LASSO or penalized M-estimators are not robust enough and even a single outlier can send the regression coefficients to infinity (Alfons et al., 2013). On the other hand, bounded nonconvex loss functions have shown more robust to outliers than traditional convex loss functions in regression and classification (Wu and Liu, 2007; Chi and Scott, 2014; Li and Bradic, 2018; Wang, 2018a,b). Therefore, this article focuses on applications of penalized nonconvex loss functions.
The rest of this article is organized as follows. In Section 2, we study optimality conditions of penalized loss functions. In Section 3, we describe the MM algorithm, provide optimality conditions of penalized surrogate loss functions in the MM algorithm, and establish connections of minimizers between the original and surrogate optimization problems. We provide details with applications to nonconvex loss functions. In Section 4, convergence analysis is conducted. In Section 5, a simulation study in regression and classification is presented. In Section 6, the proposed methods are evaluated on data application predicting breast cancer clinical status. We conclude with a discussion in Section 7. Some numerical results, technical details such as Clarke subd-ifferential and regular at a point, and proofs are presented in the online supplement.
For training data (= 1, ..., n, j = 0, 1, ..., p, let
= (
denote a (p+1)-dimensional predictor with the first entry 1, (
is the response vector,
= (
is a (p + 1)-dimensional coefficient vector with
the intercept. Consider a loss function
), for instance,
) =
in linear regression, where f is the prediction of y. For some link function
) define
The goal is to minimize the empirical loss ). For the purpose of variable selection, define a penalized loss function
where 1, and
) is the penalty function satisfying the following assumptions (Schifano et al., 2010):
) is continuously differentiable, nondecreasing and concave on (0
) with
(0) = 0 and 0
(0+)
.
Minimizing the penalized loss function ) can provide shrinkage estimates thus leading to zero values for small coefficients in magnitude. The following penalty functions have been widely studied:
(i) the LASSO penalty (Tibshirani, 1996), for 0
(ii) the SCAD penalty (Fan and Li, 2001)
The SCAD and MCP penalty functions with large value a are comparable to the LASSO since if
. These penalty functions satisfy Assumption 1 and Assumption 2 (more restrictive than Assumption 1).
) is continuously differentiable, nondecreasing and concave on (0
) with
(0) = 0, 0
(0+)
and directional derivative at 0 in the direction
given by
(0;
) =
for some
0.
In particular, = 1 for the LASSO, SCAD and MCP penalty. Assumption 2 also covers hard thresholding penalty.
Assumption 1 ensures the penalty function is locally Lipschitz continuous at every point including the origin due to a bounded derivative. A penalty function may violate Assumption 1. For instance, the bridge penalty ) =
for
(0, 1) has
(0+) =
. In other words, the bridge penalty is not Lipschitz continuous near
= 0.
We establish conditions for minimizing ). Let
= (
de- note a minimizer and corresponding value
= (
) without the in- tercept. Since
is not penalized, we focus on optimality conditions on
, since an optimality condition on
is a trivial problem in the current setting. Whether
is a local or global minimizer is specified in the context. If
is a minimizer, then usually a set of conditions constraining
must be satisfied. If these conditions are comprehensively constructed, we call them complete optimality conditions, which is the typical case when dealing with convex optimization problems. On the other hand, if only a subset of complete optimality conditions are known, we call them incomplete optimality conditions. This would occur in nonconvex optimization problems.
Theorem 1 Suppose is a local minimizer of
), where
) is continuously differentiable. Then:
(a) Under Assumption 1, optimality conditions for are given by
(b) Under Assumption 2, an optimality condition for = 0, j = 1, ..., p is given by
For the LASSO, SCAD and MCP penalty functions with = 1, from (7) we define
In real data analysis, we often need to determine a good value , which can be chosen from select values closely related to the quantity
. From Theorem 1, if
= 0, j = 1, ..., p, we deduce that
. On the other hand, the converse is not valid, unless stronger conditions are posed as in the next theorem.
Theorem 2 Assume that ) is differentiable convex, and
) is the LASSO penalty function. Then
is a global minimizer of
) if and only if
where ) is the subdifferential of
) at
. Or equivalently
for j = 1, ..., p.
The optimality condition in Theorem 2 is standard for convex functions, yet helpful when investigating the relationship between the function ) and its surrogate function in the MM algorithm described below. The value
has much stronger implications under Theorem 2 due to the convexity assumption. From(10) Theorem 2 concludes that
is the smallest
such that
= 0, j = 1, ..., p. Without convexity assumption, since (7) is not sufficient and/or incomplete, the value
may neither guarantee
= 0 nor produce the smallest
with
= 0.
3.1 Penalized estimation using MM algorithms
The MM algorithm is an iterative procedure. Given an estimate
in the kth iteration, ) is minimized at the k + 1 iteration to obtain an updated minimizer
. The MM algorithm then updates z with the new estimate
. This process is repeated until convergence. From (11) the MM algorithm generates a descent sequence of estimates:
This relationship leads to the following result.
Lemma 1 Suppose ) majorizes
) at z. If
is a local (global) minimizer of
),
is a local (global) minimizer of
).
The converse of Lemma 1 is not true. Consider ) =
= 0. Let
) =
. Therefore
) majorizes
= argmin
) and
) =
) = 0. However, the converse is invalid. To overcome this problem, for differentiable convex functions, we develop a connection between a function and its “surrogate” function in a broader sense, including but not limited to the induced functions by majorization.
Lemma 2 Suppose ) and
) are differentiable convex functions. Then
is a global minimizer of
) satisfying
) =
) if and only if
is a global minimizer of
).
To accommodate predictors, define
Obviously the transformation ) preserves the relationship in (11) such that
) majorizes
) at the current estimate
= (
)
. Along the line of Lemma 2, we characterize a class of surrogate functions:
Consider the following surrogate penalized loss:
From Lemma 1 we deduce that if is a minimizer of
), it must be a minimizer of
). The following results lay out not only the optimality conditions of the surrogate function, but also the connection between the surrogate and its original function deduced from.
Theorem 3 Assume that ) and
) are continuously differentiable functions, and
) majorizes
) at
. Suppose
is a local minimizer of
), and Assumption 3 holds at
. Then:
(a) Under Assumption 1, an optimality condition is
(b) Under Assumption 2, an optimality condition for = 0, j = 1, ..., p is
For the LASSO, SCAD and MCP penalty functions, from (14) we define
Notice ˜is the same as
in (8) under the specified assumptions. Theorem 3 implies that if
= 0, j = 1, ..., p, then
˜
. Like before, the converse is generally invalid unless stronger conditions are provided as in the next theorem.
Theorem 4 Assume that ) and
) are differentiable convex functions, and
) is the LASSO penalty function, Then
is a global minimizer of
), where Assumption 3 holds at
, if and only if
is a global minimizer of
) or equivalently
Theorem 4 on the penalized surrogate loss ) is a mirror of Theorem 2 on the penalized loss
). Furthermore, it is reasonable to deduce that similar results hold for the EM algorithm (a special case of the MM algorithm) in penalized estimation. Such a connection, due to the convexity assumption, is stronger than that between Theorem 1 and Theorem 3. These results suggest that instead of minimizing the original objective function
), we can minimize the surrogate function
). The iterative MM procedure is summarized in Algorithm 1.
3.2 Quadratic majorization
The quadratic majorization (QM) is based on the second-order Taylor expansion for a twice differentiable function ) such that
for some on the line segment connecting z and u. Suppose ˜B is positive definite and ˜
) is semipositive definite for all arguments w, then
majorizes ) at z. Since
) =
), Assumption 3 holds at z for functions
) and
). The QM is a special application of the more general MM algorithm when a suitable bound ˜B exists (Lange, 2016). The generic terms defined by
) in (12) can be replaced with the express formula (18).
While an upper bound ˜B may not be unique, to simplify computation as often possible, assume ˜B is a square diagonal matrix with all its main diagonal entries equal, i.e., ˜B = BI, where B is a scalar and I is the identity matrix. Furthermore the link function ) takes different forms:
Define in regression problems. By (18) and the chain rule, we have
For classification problems with margin , we have
With , minimizing unpenalized loss
) in (1) amounts to minimizing
where
The constant weight B has no impact on the estimation given ), in contrast to the penalized estimation described below. Putting together, with iterated least squares estimation for the updated response variable, Algorithm 2 can be used to minimize
) in (1) and estimate parameters.
For penalized estimation, the surrogate loss function (13) simplifies to:
Since the surrogate function ) satisfies Assumption 3, Theorem 3 is applicable. Theorem 4 is effective for convex loss functions such as those in generalized linear models (GLM), where quadratic majorization may result in possibly different form than (21). Minimizing
) in (2) can be achieved with Algorithm 3, and technical details are discussed in the subsequent context and supplement.
3.3 Penalized least squares
Minimizing the penalized least squares (22) is hampered by the fact that the penalty function ) is not differentiable at 0. Let
) be a function that approximates the penalty function
), the following methods have been proposed in the literature.
(iii) Fan and Li (2001); Hunter and Li (2005) considered a local quadratic approximation (LQA):
In these proposals ) essentially majorizes
) at
. We implement method (i) in the algorithm since the efficient coordinate decent algorithm can easily provide penalized least squares estimates for a variety of penalty functions including the LASSO, SCAD and MCP (Friedman et al., 2010; Breheny and Huang, 2011). We nevertheless investigate properties of all three approaches in Section 4.
3.4 Applications to nonconvex loss functions
Traditional regression is based on minimizing a loss function ), where
) =
. The
loss is not robust to outliers. Robust regression has utilized absolute difference
) =
loss) and Huber loss, both are convex loss functions. The nonconvex loss ClossR with a parameter
is described in Table 1 (Wang, 2018a). The ClossR with
= 0.9 is plotted in Figure 1, along with the aforementioned loss functions in regression. It is clear that bounded ClossR downweights large values more than other convex loss functions, effectively having better resistance to outliers.
For a binary outcome y taking values +1 and 1, denote by u = yf the margin of a classifier f. Used in a logistic regression,
) = log
(1+exp(
)) is a popular choice in statistics. However, the logistic loss is unbounded and can’t control outliers well. Nonconvex loss functions have been proposed to address the robustness of classification, and three examples are listed in Table 1. These loss functions approximate the 0-1 loss, as shown in Figure 2. The
value controls robustness of a loss function. For Closs, a smaller value is more robust to outliers while for other loss functions, a larger value is more robust. A typical strategy in practice is working from less robust to more robust and selecting a proper value of
according to prediction.
For the loss functions in Table 1, we can obtain penalized estimates with Algorithm 3, where factor B has been derived by Wang (2018a). Since the functions are continuously differentiable, Theorem 1 and Theorem 3 are applicable. We list in the supplement Table 5. The starting estimates of
are based on a functional gradient decent boosting algorithm (Wang, 2018b).
To conduct a convergence analysis of Algorithm 1 for penalized estimation in convex and nonconvex problems, we construct a unified form of majorizers for the loss functions in Section 3.4 as well as those in the GLM. Consider a twice differentiable function ) in (1). The Taylor expansion at a given value
is
for some ˜on the line segment connecting
and
, where
For ) defined in (19), we have
which can be applied to functions including but not limited to the nonconvex functions in Section 3.4. For the GLM, the negative log-likelihood function is
where ) and
) are known functions,
is the canonical parameter with
) (if the link is canonical,
), and
is the known dispersion parameter. The negative log-likelihood function has the form:
We have
(
) = 1
)
For the canonical link ) = z, we have
Let X denote the + 1) matrix of
values,
diagonal matrix of weights with ith diagonal element
). Then we have
A matrix W is constructed to majorize ). Let ˜
= 0, 1, ..., p denote the eigenvalues of
). Assume that
and let W = I, where I is the identity matrix. If X is full rank, we may define
Obviously the (p + 1) + 1) matrix W is positive definite, and all the eigenvalues of W are positive. Let
We have
where the inequality means that
is positive semidefinite. Denote
For ) defined in Section 3.3 and
= (
, denote
Finally we define the surrogate loss
A similar result as in Schifano et al. (2010) is obtained.
Theorem 5 Assume twice differentiable function ) satisfies (26) or (27),
) is given by (30), and penalty
) satisfies Assumption 1. Then:
) majorizes
) at
:
The inequality holds if .
Theorem 5 implies that Algorithm 1 provides a nonincreasing sequence ). We analyze Algorithm 1 under different assumptions in the sequel. The following convergence analysis does not cover (25) which is not defined when
= 0. See Hunter and Li (2005) for a modified majorization to avoid such a degenerate case. We first focus on convex optimization problems.
Theorem 6 Assume twice differentiable convex function ) satisfies (26) or (27),
) is given by (30), the surrogate penalty
) =
0, and the iterates
are generated by Algorithm 1.Then the loss values
) are monotonically decreasing and lim
inf
). Furthermore
converges to a minimum point of
) if
) is either (i) coercive or (ii) bounded below with bounded intercept
.
The bounded intercept in Theorem 6 ensures that penalty term ) is coercive for
. Otherwise
) is not coercive for
despite that
) is coercive for
meaning
implies
. To see this, denote
Then . Denote ˜
) and let
= (0, ..., 0)
, a p-dimensional vector, then we have
= (
and ˜
) =
) = 0 by definitions. Thus
) is a constant as
or
. In conclusion, as
, ˜
doesn’t always hold. This proves that
) is not coercive for
if the intercept is not bounded.
The next two theorems are particularly relevant for nonconvex functions. We consider a larger class of surrogate loss ) which covers the QM (29) as a special case while the surrogate loss function
) has the same form as in (30).
Notice that ) given by (29) satisfies assumptions of Theorem 7. We can summarize the implications of Theorem 7: applying the QM to the loss functions in the GLM or Table 1, and applying an appropriate majorization such as (23) or (24) to the penalty functions including the LASSO, SCAD and MCP, then every limit point of the iterates generated by the MM algorithm is a Dini stationary point. While a Dini stationary point is always a Clarke stationary point, the converse is not valid. Under different conditions, the iterates of the MM algorithm converges to a Clarke stationary point.
Theorem 8 Assume continuous function ) is locally Lipschitz continuous,
) is coercive or bounded below with bounded intercept
, the set of Clarke stationary points S of
) is a finite set, the surrogate function
) is strongly convex and strictly majorizes
) at
, the surrogate loss
) is given by (30), the surrogate penalty
) is convex and majorizes
) at
, and the iterates
are generated by Algorithm 1. Then
converges to a Clarke stationary point of the problem minimizing
).
Theorem 8 can be applied to a broad loss functions including those in the GLM and Table 1. These functions are continuously differentiable, thus locally Lipschitz continuous. Construct a surrogate loss ) given by (30) with
. Then
) strictly majorizes
). From
) =
we know
) is positive definite. Suppose Assumption 1 holds for
), then
) is convex and majorizes
) at
) is given by (23) for the LASSO or (24). This implies some of the assumptions of Theorem 8 have been met. If additional assumptions in the theorem are valid, then we deduce that Algorithm 1 iterates converge to a Clarke stationary point. Unlike Theorem 7, Theorem 8 doesn’t assume equivalence of the first order Dini derivative between the penalized surrogate loss
) and objective function
). When this does hold, as the QM in (29) and the surrogate penalty function in (23) or (24), Theorem 7 provides stronger conclusion than Theorem 8.
In the simulated examples, we evaluate performance of the proposed robust variable selection methods, i.e., MM for penalized estimation with Algorithm 3, and compare with the standard non-robust algorithms. The parameter of SCAD and MCP is fixed at a = 3. This choice works well for the simulation study. The responses in each data set are randomly contaminated with percentage of v=0, 5, 10 and 20. We generate random samples of training/tuning/test data. Training data are used for model estimation. Tuning samples are used to select optimal penalty parameters with the smallest loss values and classifica-tion errors in regression and classification problems, respectively. Test data are used to evaluate prediction accuracy except for example 1 whose prediction accuracy is not from the test data. To evaluate variable selection, we compute sensitivity and specificity, where sensitivity is the proportion of number of correctly selected predictors among truly effective predictors, and specificity is the proportion of number of correctly non-selected predictors among truly ineffective predictors. The intercept is not counted as a variable. A good variable selection method should obtain both sensitivity and specificity close to 1. We repeat Monte Carlo simulation 100 times in the following examples.
5.1 Example in regression
Example 1: Let , where
= 1
= (3, 1.5, 0, 0, 2, 0, 0, 0)
N(0, 3)
) with
= 0
for i, j = 1, ..., 8.
Without the intercept, similar data was generated in Fan and Li (2001). The response variable y is contaminated by randomly multiplying 1, 000 for v = 0, 5, 10 and 20 percent of training (n=200) and tuning data (n=200). Different estimation methods are evaluated based on the mean squared prediction error E[(ˆ] (Hunter and Li, 2005). We fix
= 10 in ClossR for which larger
value is more resistant to outliers. Table 2 and Table 6-8 in the supplement display the results for penalized ClossR methods along with standard least squares (LS), LS-based penalized methods and penalized Huber regression proposed by Yi and Huang (2017). The best prediction results with the associated variable selection are highlighted in bold.
The ClossR-based methods have the best prediction and outperform the HuberLASSO in all four settings of outliers. For the LASSO penalty, ClossR has better prediction than Huber but also selects more variables. For the clean data, ClossR-based methods are slightly better or comparable to LS-based methods when the penalty function is the same. If one believes that prediction accuracy from the penalized LS methods should be superior to the penalized ClossR in uncontaminated cases, the tiny differences in Table 2 perhaps can be explained by random variation in finite samples and selection of penalty parameters. An interesting research topic is to theoretically compare prediction between penalized LS and penalized ClossR.
For clean data, all effective variables are selected, and ClossR-based SCAD or MCP has better selection than LASSO, consistent with LS-based penalty functions. For contaminated data, all correct variables are selected in robust estimation, but not in the penalized LS estimation. In addition, ClossR-based SCAD or MCP determines fewer ineffective variables than LASSO.
For nonconvex optimization, Algorithm 3 can generate different local estimates depending on the initial values. To illustrate, two methods of initial values elaborated in supplement A.1 are compared in Figure 3 for loss ClossR and nonconvex penalty SCAD with v=20. Prediction from boosting-supported backward solution path is more reliable than the forward path. The two largest errors from the latter ultimately generates a greater mean prediction error despite a smaller median value.
5.2 Examples in classification
The following data generations are from Wu and Liu (2007) and have also been investigated in Wang (2018a,b) for robust classification. These examples include both linear and nonlinear classifiers.
Example 2: Predictor variables () are uniformly sampled from a unit disk
+
1 and y = 1 if
and -1 otherwise.
Example 3: Predictor variables () are uniformly sampled from a unit disk
+
1 and y = 1 if (
+
0 and -1 otherwise.
In each example, we also generate 18 noise variables from uniform[-1, 1]. To add outliers, we randomly select v percent of the data and switch their class labels. With the simulated data, the performance of an algorithm can be compared with the optimal Bayes errors. The training/tuning/test sample sizes are n = 200/200/10, 000 in Example 2 and 1, 000/1, 000/10, 000 in Example 3. To accommodate nonlinear classifiers in Example 3, the predictor space is doubled to p = 40 including quadratic terms for i = 1, 2, ..., 20. In Example 2 and 3, no-intercept models have better prediction accuracy from logistic-based and nonconvex loss-based variable selection methods. Since the data are not generated from logistic regression models, estimation from a logistic regression model is not a clear-cut winner even for the clean data. We fix Closs
= 0.9, Gloss
= 1.1 and Qloss
= 0.2 in the estimation.
The results for Example 2 are demonstrated in Table 3 and Table 9 in the supplement. It is clear that the performance of logistic-based methods relies on the cleanliness of the data, and is not robust with outliers. On the other hand, the nonconvex loss-based variable selection methods are better resistant to outliers. In particular, with data contaminated by v=0, 5, 10 and 20 percentage, the best prediction results 0.0047(QlossSCAD), 0.0548(GlossSCAD), 0.1055(GlossSCAD), 0.2067(QlossSCAD) are the closest to the corresponding Bayes errors. The corresponding variable selections in highlighted numbers in Table 9 almost perfectly match the Bayes optimal classifier. Since all effective variables are correctly selected, specificity is not reported in lieu of supplement Table 9. It can be interesting to compare the LASSO and boosting due to its similarity (Efron et al., 2004). The same random data was generated in Wang (2018a), where the robust boosting algorithm provided similar prediction results and slightly more sparse variable selection. For instance, the boosting errors from the Gloss are 0.0108, 0.0628, 0.1178, 0.2217 for outlier percentage v=0, 5, 10, 20, respectively (Wang, 2018a), while the corresponding QlossLASSO errors are 0.0154, 0.0672, 0.1187, 0.2223 in Table 3. As the percentage of outliers changes, the average number of variables selected in the boosting algorithm ranges from 2.5 to 2.8 (Wang, 2018a) while the QlossLASSO ranges from 4.1 to 4.9 in supplement Table 9.
The results for Example 3 are summarized in the supplement, and similar conclusions like Example 2 can be drawn. In all outlier scenarios, the nonconvex loss-based methods triumph over the logistic-based methods on prediction. In particular, the best average test errors are 0.0016 (ClossMCP), 0.0520 (QlossSCAD), 0.1018 (QlossSCAD) and 0.2019 (QlossSCAD) for percentage of outliers v=0, 5, 10 and 20, respectively. In addition, the corresponding variable selections in featured numbers in Table 11 resemble the Bayes optimal classifier. Except for v=0, all variables are more accurately identified by the nonconvex loss-based methods than the logistic-based methods. Both and
are selected in all estimation methods in each simulation run. Supplement Table 12 counts variable selection for
or
in 100 simulations. Across all methods, the average specificity excluding
is
98. Hence, the details are unreported. In conclusion, all methods correctly select
and
, and the remaining variables including
and
are typically eliminated by the variable selection procedures.
Shi et al. (2010) used gene expressions to predict breast cancer clinical status with 164 estrogen receptor positive cases and 114 negative cases. The same data set has been evaluated for a variety of robust classification algorithms (Li and Bradic, 2018; Wang, 2018a,b). Applying MM for penalized estimation, the analysis described below is reproduced in a vignette distributed with the R package mpath. Following the same pre-selection procedure, the data set is pre-screened to keep 3000 most significant genes from 22,282 genes with twosample t-test. These genes are then standardized. The data set is randomly split to training data and test data, and the training data contains 50/50 positive/negative cases and the test data has 114/64 positive/negative cases. The positive/negative status in the training data is randomly flipped with percentage v=0, 5, 10, 15. The parameter of SCAD and MCP is fixed at a = 3.7 and a = 12, respectively. The predictive models are aided with a five-fold cross-validation to select the optimal penalty parameters with the smallest classification errors. The analysis is conducted with Closs = 0.9, Gloss
= 1.01 and Qloss
= 0.5 (Wang, 2018a). This process is repeated 100 times and the results are summarized in Table 4 and supplement Table 13.
Prediction accuracy is more similar in the clean data than those contaminated, while the logistic-based methods suffer from outliers. The best prediction with the Closs-based methods are 0.0818, 0.0839, 0.0910, 0.1143 for v=0, 5, 10, 15, respectively. This is the best prediction for this data set in the robust prediction literature. See Li and Bradic (2018); Wang (2018a,b) and the references therein. With more data contamination, more variables are selected. Loss function and penalty method also contribute to different variable selection results. In addition, sparse variable selection is maintained even with outliers. For instance, with 15% outliers, ClossSCAD not only has more accurate prediction than LogisSCAD (0.1143 vs 0.1566), but also selected only 2.8 genes on average, compared with 19.1 genes by LogisSCAD.
Outliers can substantially cause bias in traditional estimation methods and produce inaccurate variable selection problem. This problem has been tackled by robust estimation, which has a long history in statistics methodology research and applications. The previous methods are largely based on convexbased loss functions, which may not be sufficient to guard the impact of outliers. Adding to this is the variable selection challenge in predictive modeling, specifically for the high-dimensional data.
In this article, we propose penalized nonconvex loss functions for several popular penalty functions. We have illustrated that the proposed methods can improve estimation and variable selection compared to conventional estimation and robust estimation. The estimation problem is decoupled into an iterative MM algorithm with each iteration conveniently conducted by existing optimization algorithms. The QM embedded in the MM algorithm is particularly appealing since the induced penalized least squares problem has been well studied. For tuning parameter identification, it is a typical practice to construct a solution path along with penalty parameters. A popular procedure is to start with a penalty parameter such that all predictor coefficients are zero. As is well-known in nonconvex optimization problems, different starting values may lead to different solutions if local minimizers are achieved. We propose an alternative yet effective approach to compute a solution path. The so-called backward path generates starting values utilizing the boosting algorithm and find solutions backwards towards zero coefficients.
This article evaluates optimality conditions for both convex and nonconvex problems in the context of penalized estimation. As expected, we could only partially characterize the optimality conditions for nonconvex problems. The link between the original optimization problem and the surrogate optimization problem, nevertheless, provides a useful utility to compute . The convergence analysis provides new results for both convex and nonconvex problems. For convex problems, the iterated estimates of the MM algorithm with QM converges to a minimum point. For nonconvex problems, the iterates of the MM algorithm with QM converges to a local stationary point in the framework of the Dini or Clarke derivative.
Future direction of research includes expanded penalty functions including those not locally Lipschitz continuous. For instance, the bridge penalty for (0, 1). Since the generalized derivatives like Dini and Clarke derivative are both restricted to locally Lipschitz continuous functions, additional mathematical tools are required to assess properties of the penalized loss functions.
Alfons, A., Croux, C., Gelper, S., et al. (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1):226–248.
Bertsekas, D. P. (1999). Nonlinear Programming. Athena Scientific, second edition.
Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for noncon- vex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232–253.
Byrne, C. L. (2018). Auxiliary-function minimization algorithms. Applied Analysis and Optimization, 2(2):171–198.
Chi, E. C. and Scott, D. W. (2014). Robust parametric classification and vari- able selection by a minimum distance criterion. Journal of Computational and Graphical Statistics, 23(1):111–128.
Clarke, F. (2013). Functional Analysis, Calculus of Variations and Optimal Control, volume 264. Springer Science & Business Media.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2):407–499.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli- hood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33:1–22.
Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Annals of Statistics, 33(4):1617–1642.
Lange, K. (2016). MM Optimization Algorithms. Philadelphia: SIAM.
Li, A. H. and Bradic, J. (2018). Boosting in the presence of outliers: adap- tive classification with nonconvex loss functions. Journal of the American Statistical Association, pages 1–15.
Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media.
Razaviyayn, M., Hong, M., and Luo, Z.-Q. (2013). A unified convergence anal- ysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153.
Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030.
Schifano, E. D., Strawderman, R. L., Wells, M. T., et al. (2010). Majorization- minimization algorithms for nonsmoothly penalized objective functions. Electronic Journal of Statistics, 4:1258–1299.
Shi, L., Campbell, G., Jones, W., Campagne, F., et al. (2010). The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology, 28(8):827–838. https://goo.gl/8bdBDE.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.
Wang, H., Li, G., and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Business & Economic Statistics, 25(3):347–355.
Wang, Z. (2018a). Quadratic majorization for nonconvex loss with applica- tions to the boosting algorithm. Journal of Computational and Graphical Statistics, 27(3):491–502.
Wang, Z. (2018b). Robust boosting with truncated loss functions. Electronic Journal of Statistics, 12(1):599–650.
Wu, Y. and Liu, Y. (2007). Robust truncated hinge loss support vector ma- chines. Journal of the American Statistical Association, 102(479):974–983.
Yi, C. and Huang, J. (2017). Semismooth Newton coordinate descent algo- rithm for elastic-net penalized Huber loss regression and quantile regression. Journal of Computational and Graphical Statistics, 26(3):547–557.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax con- cave penalty. Annals of Statistics, 38(2):894–942.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509.
Fig. 1 Loss functions for regression
Fig. 2 Loss functions for classification
Fig. 3 Prediction of ClossRSCAD from different solution paths in Example 1 and v=20
Table 1 Nonconvex loss function
Table 2 Mean squared prediction errors and standard deviations (in parentheses) for v percentage of outliers in Example 1. Numbers > 1e + 4 are suppressed.
Table 3 Mean test errors and standard deviations (in parentheses) for v pencentage of outliers in Example 2.
Table 4 Mean test errors and standard deviations (in parentheses) for v pencentage of outliers in the breast cancer data.
Appendix A Technical details of Algorithm 3
A.1 Solution paths
To optimize ), tuning parameters such as
can be chosen by cross-
validation or other data driven methods from some prespecified values. We
make use of in (8) or equivalently ˜
in (15). The latter can be handy
with existing computer codes for penalized least squares estimation. We com-
pute the solutions along a regularized path for a sequence of . A forward path
begins with all initial non-intercept parameters zero for a fixed value, then
compute solutions for a decreasing sequence of = 1, ..., K, ranging
from to
for a small number 0
1 (Friedman et al.,
2010). For nonconvex estimation in Section 3.4, Algorithm 3 typically seeks
a local optimization solution. It is well-known that there could be potentially
different estimates depending on the initial values. In practice, different initial
values can be utilized to obtain the optimal result. This article emphasizes
boosting as an innovation for initial value estimation, coupled with an alter-
native backward path, an increasing sequence of 1 from
to
. This would provide different initial values for nonconvex optimization.
Most importantly, the numerical results in the simulation and data analysis
warrant this choice.
A.2 Update on active set
To speed up computations, we may implement the active set strategy in Al-
gorithm 3. First, we determine the active set, a set of variables with non-zero
coefficients, from the initial value in step 1. Run step 2 to 5 with variables in
the active set. Then run step 4 once with all variables in the full set to update
the active set. If there is no change, the computation finishes. Otherwise, re-
peat the above procedure. Second, we employ the active set internally within
step 4 when minimizing penalized least squares. We begin with an initial active
set containing all variables, cycle through all coefficients in the active set with
coordinate descent algorithm until convergence, then cycle through the full set
with all variables, but only once. This generates a new active set. Compared
with the previous active set, if no change, then step 4 converges. Otherwise,
we repeat the above process until convergence. The first two procedures invoke
the full set so that variable selection is not predominated or trapped by the
initial estimation including boosting. Indeed, if all the initial parameters are
zero (forward path) in step 1, the above full set procedures still apply. That
is, the variable selection is not completely relying on step 1. Third, for large
data problems, the active set principle can be aggressively followed through to
develop a solution path for a desired sequence. For an increasing sequence
of , starting from a small
) value, Algorithm 3 generates estimates
and an active set. For ), only the coefficients of the active set are
cycled through while the coefficients of the non-active set remain zero. The
process is repeated until ) is taken.
Employing active set can substantially reduce computing burden especially for large data problems as in Section 6. The results, however, are not always preferred compared to the conventional non-active set, or full set strategy. Therefore, we focus on results from the non-active set strategy in the numerical studies.
Appendix B Generalized derivatives
Generalized derivatives for nonsmooth convex and nonsmooth nonconvex func-
tions have been developed (Nesterov, 2004; Clarke, 2013). Throughout this
paper, we only consider , a locally Lipschitz continuous function.
It is said to be Lipschitz near x if, for some neighborhood of x and some
constant K such that . And f is called
locally Lipschitz continuous on if it is Lipschitz near x for every
.
If for all , then g is called a subgradient of f at x.
The ordinary directional derivative at x in the direction is defined by:
The Clarke directional derivative of f at x in the direction is defined below:
The Clarke subdifferential of f at x is the set (Clarke, 2013, 10.3)
We say that f is regular at x provided f is Lipschitz near x and admits direc-
tional derivatives ) satisfying
) =
. If f is differentiable
at x, then all three directional derivatives at x are the same. However, unlike
the directional derivative ), the Dini and Clarke derivatives always exist.
The point x is a stationary point of ) if
. The point
x is a Dini stationary point of ) if
. The point x is a
Clarke stationary point of ) if
. When a directional
derivative exists, a stationary point is always a Dini stationary point but not
vice versa. A Dini stationary point is always a Clarke stationary point but not
vice versa since the following relationships hold for all x and :
We say that a vector is a limit point of a sequence
in
if there
exists a subsequence of that converges to x.
Appendix C Proofs
Lemma 3 If Assumption 2 holds, then:
where (0) is the Clarke subdifferential of
) at
= 0. Equality holds
if ) is also regular at
= 0.
Remark 1 Lemma 3 can be applied to the LASSO penalty with equality in
(32) since the penalty is convex, thus a regular function (Clarke, 2013, 10.8).
A regular function is either smooth everywhere or else has a corner of convex
type (Clarke, 2013, p. 200). Under Assumption 1 or Assumption 2, )
is not differentiable at = 0, and indeed 0 is a corner. Therefore
) is
regular if and only if ) is convex with 0 corner, i.e., the LASSO penalty.
On the other hand, penalty functions SCAD and MCP are not regular as they
are folded concave functions on R with a corner at the origin. Hence (32) may
not hold for these penalty functions.
Proof of Lemma 3
Let (0;
) denote the Clarke directional derivative at 0 in the direction
, and
(0;
) denote the ordinary directional derivative. Their relationship and Assumption 2 lead to
Let ], i.e.
, then we get
By definition (Clarke, 2013, 10.3), (0). Thus we have shown
If ) is also regular at 0, by its definition and Assumption 2, we then have
Let (0), then according to the definition (Clarke, 2013, 10.3), a
necessary and sufficient condition is
Along with (34), then , which leads to
]. Hence
(0)
]. Combining with (33), we have
(0) = [
Proof of Theorem 1
Part (a): First, ) is locally Lipschitz continuous as
) is a sum of locally Lipschitz continuous functions. Since the first summand
) can be nonconvex and the second summand is nonsmooth, we invoke the Clarke sub-differential. Suppose
is a local minimizer of
), then 0
), where
) is the Clarke subdifferential (Clarke, 2013, 10.7(a)). We have
) continuously differentiable,
continuously differentiable and
) Lipschitz continuous near
, then the calculus of Clarke subdifferential holds (Clarke, 2013, 10.16):
Once is plugged into (35), we then obtain
If = 0,
) is continuously differentiable by Assumption 1, thus
) =
{sign()
)} (Clarke, 2013, 10.8). Hence (5) follows. Otherwise we obtain
(6).
Part (b): First, part (a) holds since Assumption 2 contains Assumption 1. In particular, since (6) holds, it is left to compute (0) utilized in (6). For
the first statement, Lemma 3 implies (0)
]. Thus (6) induces
conditions including but potentially not limited to (7). In other words, it is
possible (7) is an incomplete optimality condition. For the second statement,
we provide two proofs. (i) Lemma 3 implies (0) = [
]. Hence (6)
is equivalent to (7), which becomes the complete optimality condition. (ii) As
argued before, if ) satisfies Assumption 2 and is also regular at 0, then
) is the LASSO penalty. For the continuous convex LASSO penalty, the
Clarke subdifferential is equivalent to the ordinary subdifferential (Clarke,
2013, 10.8). Therefore ) =
) = [
1]. From (6) we again obtain
complete optimal condition (7).
Proof of Theorem 2
We know is a global minimizer if and only if 0
) (Nesterov, 2004). Since
) is a sum of convex functions, we have
where the last equality is obtained since ) and
are differentiable convex
functions. Deduce from this that (9) and (10) are valid by simple calculations.
holds:
where the second inequality holds locally or globally depending on whether
is a local or global minimizer. Thus is a local (global) minimizer of
).
a global minimizer of ) if and only if
) = 0 by the optimality
condition. Thus ) =
) = 0 by assumption. This happens if and
only if is a global minimizer of
) since
) is convex.
minimizer of ). Therefore 0
). As in the proof of Theorem 1
(a), we have
where the second equality is obtained by Assumption 3.
The convexity assumption leads to
where the third equality is obtained by Assumption 3. Therefore 0 )
if and only if 0 ). Deduce from this that
is a global minimizer of
) if and only if
is a global minimizer of
). Thus (16) and (17)
can be obtained from (36) by direct computation.
be:
Suppose the penalty doesn’t change as in (23). Let ) denote the
Dini derivative of ) for direction
at
. It is trivially true that
If the penalty is majorized as in (24), we have
Hence we have
Again (38) is obtained.
Proof of Theorem 5
Consider the second order Taylor expansion
) =
) + (
) + 12(
By the construction of (29), ) majorizes
) and
) strictly
majorizes ) if
Thus to prove that
) majorizes or strictly
majorizes ), we only need to show that
) majorizes
. We consider three cases depending on the construction of
):
c) If ) is defined by (25), by Proposition 3.1 in Hunter and Li (2005),
) majorizes
).
In summary, we conclude that ) majorizes
), and the inequality
in (31) holds if
Proof of Theorem 6
From Theorem 5 we have
We restate (29) below for convenience:
We prove through an adaption of similar arguments in Lange (2016) that
Algorithm 1 with the quadratic upper bound surrogate function satisfies the
following property:
With (28) we deduce that ) is positive semidefinite. Thus
) is convex,
see for instance, Theorem 2.1.4 in Nesterov (2004). From (2), (30), (40) and
) =
) for some convex function
), we rewrite
By convexity we have
where ) and
) are subdifferentials. Since
minimizes
),
then 0 ). Equivalently, for some
) we have
From the definition of subgradient we have
Combining (42) and (43), we have
The last inequality and (41) means that (39) holds. Hence Algorithm 1 is
a sequential unconstrained minimization algorithm (SUMMA), and )
converges to inf ) as shown in Byrne (2018).
We prove the second assertion utilizing Proposition 7.4.1 in Lange (2016) that states if the SUMMA condition holds, ) and
) are continuous,
) is uniformly strongly convex, and
) achieves its minimum, then every sequence
converges to a minimum point of
).
Since the objective function ) is convex, thus continuous. The surrogate
) is also continuous. From (29) we have
where is the smallest eigenvalue of W. Hence
) is uniformly strongly
convex with parameter which is independent of
. Note that
) is
convex provided ) =
) for some convex penalty
). Denote
Then . Denote ˜
). Since the composition of an affine
function A preserves convexity of ), ˜
) is convex. Therefore
) =
) + ˜
) =
) +
) is uniformly strongly convex in
.
We need to show that ) is coercive. First, consider condition (i). Since
) is coercive meaning
implies
, then
) is coercive. Next, consider condition (ii). Since
is bounded, for instance,
= 0, i.e., no intercept, we deduce from
that
, which in turn implies
by Assumption 1. Additionally, since
) is bounded below, then
) is coercive. In conclusion, with either (i) or (ii),
) is coercive. Combining with continuity, hence
) achieves its minimum (Bertsekas, 1999, Proposition A.8).
In summary, all conditions hold for Proposition 7.4.1 in Lange (2016), thus every sequence converges to a minimum point of
Lemma 4 Suppose Assumption 1 holds for ), and
) is given by
(23) or (24). Then the Dini derivative of ) for direction
at
is
the same as that of ) given by:
Proof of Theorem 7
To show convergence, we utilize the following assertions:
The first term ) is jointly continuous in (
) by assumption, and the
remaining terms in (30) are also jointly continuous. Hence assertion (i) holds.
We prove the second assertion. Denote = (
, we then compute
where ) is the Dini directional derivative of
) for direction
at
. On the other hand, we have
= lim inf
+
By Lemma 4 we have
By Assumption 3 we have
We deduce from (45) and (46) that (44) holds. Theorem 5 and assertions (i)-(ii) together imply that Assumption 1 in
Razaviyayn et al. (2013) is satisfied. Thus invoking Theorem 1 in Razaviyayn
et al. (2013) concludes that every limit point of the iterates generated by
Algorithm 1 is a Dini stationary point of the problem minimizing
Proof of Theorem 8
Consider the following regularity conditions:
The above conditions lead to the desired result (Schifano et al., 2010; Chi and
Scott, 2014). In particular, each limit point ˜of
is a fixed point such that
M(˜) = ˜
(Schifano et al., 2010, Theorem A.3), and the set of fixed points
coincides with the Clarke stationary points of ) (Schifano et al., 2010,
Proposition A.8). We now check each condition. By assumption ) is locally
Lipschitz continuous, Assumption 1 implies that ) is locally Lipschitz
continuous due to a bounded derivative at every point including the origin, and
is continuously differentiable thus locally Lipschitz continuous. As a result,
) is locally Lipschitz continuous. With the same arguments in the proof
of Theorem 6, ) is coercive since
) is coercive or bounded below with
bounded. With the additional assumption that the set of Clarke stationary
points S of ) is a finite set, R1 is satisfied. Since
) strictly ma-
jorizes ) at
, and
) majorizes
) at
, then
)
strictly majorizes ) at
. R2 is met. As a sum of continuous functions
on (), the penalized majorization
) is continuous. The surro-
gate penalty ˜) is convex and finite, thus locally Lipschitz continuous. Since
) is also locally Lipschitz,
) defined in (30) is locally Lipschitz
continuous. Thus R3 is met. By assumption ) is strongly convex. Then
) is strongly convex because it is a sum of strongly convex function
) and convex function ˜
). Hence
) has at most one global
minimizer. Since ) and
) is coercive as shown above, then
) is coercive. Combining with continuity,
) achieves its mini-
mum. Together, ) has one and only one global minimizer. R4 is met.
C.1 Additional results in simulation and data analysis
Table 6 Mean number of selected variables and standard deviations (in parentheses) for v percentage of outliers in Example 1.
Table 7 Mean sensitivity and standard deviations (in parentheses) for v percentage of outliers in Example 1.
Table 8 Mean specificity and standard deviations (in parentheses) for v percentage of outliers in Example 1.
Table 9 Mean number of selected variables and standard deviations (in parentheses) for v pencentage of outliers in Example 2.
Table 10 Mean test errors and standard deviations (in parentheses) for v pencentage of outliers in Example 3.
Table 11 Mean number of selected variables and standard deviations (in parentheses) for v pencentage of outliers in Example 3.
Table 12 Variable selection for for v percentage of outliers in Example 3.
Table 13 Mean number of selected variables and standard deviations (in parentheses) for v pencentage of outliers in the breast cancer data.