b

DiscoverSearch
About
My stuff
MM for Penalized Estimation
2019·arXiv
Abstract
Abstract

Penalized estimation can conduct variable selection and parameter estimation simultaneously. The general framework is to minimize a loss function subject to a penalty designed to generate sparse variable selection. The majorization-minimization (MM) algorithm is a computational scheme for stability and simplicity, and the MM algorithm has been widely applied in penalized estimation. Much of the previous work has focused on convex loss functions such as generalized linear models. When data are contaminated with outliers, robust loss functions can generate more reliable estimates. Recent literature has witnessed a growing impact of nonconvex loss-based methods, which can generate robust estimation for data contaminated with outliers. This article investigates MM algorithm for penalized estimation, provides innovative optimality conditions and establishes convergence theory with both convex and nonconvex loss functions. With respect to applications, we focus on several nonconvex loss functions, which were formerly studied in machine learning for regression and classification problems. Performance of the proposed algorithms are evaluated on simulated and real data including cancer clinical status. Efficient implementations of the algorithms are available in the R package mpath in CRAN.

Keywords classification  ·MM algorithm  ·nonconvex  ·quadratic majorization  ·regression  ·robust estimation  ·variable selection

In predictive modeling, the data may contain a large number of predictors such as clinical risk factors of hospital patients, or gene expression profiles of cancer patients. It is a common interest to identify effective predictors and estimate predictor parameters simultaneously. A popular approach is to optimize an objective function, log-likelihood for instance, subject to a penalty. Penalty functions including the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP) have been widely applied to many statistical problems (Tib- shirani, 1996; Fan and Li, 2001; Zhang, 2010).

To simplify computations, the majorization-minimization (MM) algorithm provides a useful framework to decomposing a complex minimization problem into a sequence of relatively elementary problems. The MM algorithm has been an interesting research topic in statistics, mathematics and engineering (Lange, 2016; Byrne, 2018; Razaviyayn et al., 2013). In particular, penalized regression with MM has attracted some attention, see Schifano et al. (2010) and references therein. The article emphasizes the quadratic majorization (QM) for the generalized linear models. Despite convergence analysis of the QM algorithm, the loss function was convex. Chi and Scott (2014) developed a QM algorithm to a LASSO type minimum distance estimator for classification, where the limiting behavior relies on more strict assumptions on the penalties: it is an elastic net type of penalty, not the original LASSO, nor the nonconvex penalty SCAD.

In this article, we provide new results for the MM algorithm. The major contributions are three-fold: First, we construct optimality conditions for penalized estimation and penalized surrogate estimation in the MM algorithm. The connection between the two optimization problems are established such that penalty tuning parameters can be conveniently determined from the latter. Second, we establish contemporary theory that the QM algorithm attains the minimum value with convex loss and penalty. Furthermore, we develop the limiting behavior of the MM algorithm for nonconvex loss functions with the LASSO, SCAD or MCP penalty. Finally, several penalized nonconvex loss functions, for the first time, are applied with QM algorithm for robust regression and classification.

Nonconvex loss functions play a critical role in robust estimation. Compared to traditional convex loss functions, a nonconvex function may provide more resistant estimation when data are contaminated by outliers. For instance, the least squares loss is not robust to outliers (Alfons et al., 2013). Furthermore, variable selection is also strongly influenced by outliers. Robust estimation can be obtained by downweighting the impact of the extreme values. Examples include the Huberized LASSO for regression and classification (Rosset and Zhu, 2007) and penalized least absolute deviations (LAD) (Wang et al., 2007). However, a breakdown point analysis shows that convex loss functions such as the LAD-LASSO or penalized M-estimators are not robust enough and even a single outlier can send the regression coefficients to infinity (Alfons et al., 2013). On the other hand, bounded nonconvex loss functions have shown more robust to outliers than traditional convex loss functions in regression and classification (Wu and Liu, 2007; Chi and Scott, 2014; Li and Bradic, 2018; Wang, 2018a,b). Therefore, this article focuses on applications of penalized nonconvex loss functions.

The rest of this article is organized as follows. In Section 2, we study optimality conditions of penalized loss functions. In Section 3, we describe the MM algorithm, provide optimality conditions of penalized surrogate loss functions in the MM algorithm, and establish connections of minimizers between the original and surrogate optimization problems. We provide details with applications to nonconvex loss functions. In Section 4, convergence analysis is conducted. In Section 5, a simulation study in regression and classification is presented. In Section 6, the proposed methods are evaluated on data application predicting breast cancer clinical status. We conclude with a discussion in Section 7. Some numerical results, technical details such as Clarke subd-ifferential  ∂Cand regular at a point, and proofs are presented in the online supplement.

For training data (xij, yi), i= 1, ..., n, j = 0, 1, ..., p, let  xi= (xi0, ..., xip)⊺denote a (p+1)-dimensional predictor with the first entry 1, (y1, ..., yn)⊺is the response vector,  φ= (β0, β1, ..., βp)⊺is a (p + 1)-dimensional coefficient vector with  β0the intercept. Consider a loss function  Γ(u), for instance,  Γ(u) = ∥u∥2, u = y − fin linear regression, where f is the prediction of y. For some link function  ψ(·) define

image

The goal is to minimize the empirical loss  L(φ). For the purpose of variable selection, define a penalized loss function  F : Rp+1 → R

image

where  λ > 0, 0 ≤ α ≤1, and  pλ(|βj|) is the penalty function satisfying the following assumptions (Schifano et al., 2010):

Assumption 1 pλ(θ) is continuously differentiable, nondecreasing and concave on (0, ∞) with  pλ(0) = 0 and 0  < p′λ(0+)  < ∞.

Minimizing the penalized loss function  F(φ) can provide shrinkage estimates thus leading to zero values for small coefficients in magnitude. The following penalty functions have been widely studied:

(i) the LASSO penalty (Tibshirani, 1996), for  λ ≥0

image

(ii) the SCAD penalty (Fan and Li, 2001)

image

The SCAD and MCP penalty functions with large value a are comparable to the LASSO since  p′λ(θ) → λif  a → ∞. These penalty functions satisfy Assumption 1 and Assumption 2 (more restrictive than Assumption 1).

Assumption 2 pλ(θ) is continuously differentiable, nondecreasing and concave on (0, ∞) with  pλ(0) = 0, 0  < p′λ(0+)  < ∞and directional derivative at 0 in the direction  εgiven by  p′λ(0;  ε) =  ζλ|ε|for some  ζ >0.

In particular,  ζ= 1 for the LASSO, SCAD and MCP penalty. Assumption 2 also covers hard thresholding penalty.

Assumption 1 ensures the penalty function is locally Lipschitz continuous at every point including the origin due to a bounded derivative. A penalty function may violate Assumption 1. For instance, the bridge penalty  p(θ) =  θafor  a ∈(0, 1) has  p′λ(0+) =  ∞. In other words, the bridge penalty is not Lipschitz continuous near  θ= 0.

We establish conditions for minimizing  F(φ). Let  φ∗= (β∗0, β∗1..., β∗p)⊺de- note a minimizer and corresponding value  β∗= (β∗1, ..., β∗p) without the in- tercept. Since  β0is not penalized, we focus on optimality conditions on  β∗, since an optimality condition on  β0is a trivial problem in the current setting. Whether  φ∗is a local or global minimizer is specified in the context. If φ∗is a minimizer, then usually a set of conditions constraining  φ∗must be satisfied. If these conditions are comprehensively constructed, we call them complete optimality conditions, which is the typical case when dealing with convex optimization problems. On the other hand, if only a subset of complete optimality conditions are known, we call them incomplete optimality conditions. This would occur in nonconvex optimization problems.

Theorem 1 Suppose  φ∗is a local minimizer of  F(φ), where  L(φ) is continuously differentiable. Then:

(a) Under Assumption 1, optimality conditions for  β∗are given by

image

(b) Under Assumption 2, an optimality condition for  β∗j= 0, j = 1, ..., p is given by

image

For the LASSO, SCAD and MCP penalty functions with  ζ= 1, from (7) we define

image

In real data analysis, we often need to determine a good value  λ, which can be chosen from select values closely related to the quantity  λmax. From Theorem 1, if  βj= 0, j = 1, ..., p, we deduce that  λ ≥ λmax. On the other hand, the converse is not valid, unless stronger conditions are posed as in the next theorem.

Theorem 2 Assume that  L(φ) is differentiable convex, and  pλ(|θ|) is the LASSO penalty function. Then  φ∗is a global minimizer of  F(φ) if and only if

image

where  ∂F(φ∗) is the subdifferential of  F(φ) at  φ = φ∗. Or equivalently

image

for j = 1, ..., p.

The optimality condition in Theorem 2 is standard for convex functions, yet helpful when investigating the relationship between the function  F(φ) and its surrogate function in the MM algorithm described below. The value  λmaxhas much stronger implications under Theorem 2 due to the convexity assumption. From(10) Theorem 2 concludes that  λmaxis the smallest  λsuch that  βj= 0, j = 1, ..., p. Without convexity assumption, since (7) is not sufficient and/or incomplete, the value  λmaxmay neither guarantee  βj= 0 nor produce the smallest  λwith  βj= 0.

3.1 Penalized estimation using MM algorithms

image

The MM algorithm is an iterative procedure. Given an estimate  z = u(k)

in the kth iteration,  γ(u|z) is minimized at the k + 1 iteration to obtain an updated minimizer  u(k+1). The MM algorithm then updates z with the new estimate  u(k+1). This process is repeated until convergence. From (11) the MM algorithm generates a descent sequence of estimates:

image

This relationship leads to the following result.

Lemma 1 Suppose  γ(u|z) majorizes  Γ(u) at z. If  u∗is a local (global) minimizer of  Γ(u),  u∗is a local (global) minimizer of  γ(u|u∗).

The converse of Lemma 1 is not true. Consider  γ(u|u∗) =  u2, u∗= 0. Let  Γ(u) =  −u2. Therefore  γ(u|z) majorizes  Γ(u), u∗= argmin  γ(u|u∗) and ∇Γ(u∗) =  ∇γ(u∗|u∗) = 0. However, the converse is invalid. To overcome this problem, for differentiable convex functions, we develop a connection between a function and its “surrogate” function in a broader sense, including but not limited to the induced functions by majorization.

Lemma 2 Suppose  Γ(u) and  γ(u|z) are differentiable convex functions. Then φ∗is a global minimizer of  Γ(u) satisfying  ∇Γ(u∗) =  ∇γ(u∗|u∗) if and only if  φ∗is a global minimizer of  γ(u|u∗).

To accommodate predictors, define

image

Obviously the transformation  ψ(·) preserves the relationship in (11) such that ℓi(φ|φ(0)) majorizes  Li(φ) at the current estimate  φ(0)= (β(0)0 , β(0)1 , ..., β(0)p)⊺. Along the line of Lemma 2, we characterize a class of surrogate functions:

image

Consider the following surrogate penalized loss:

image

From Lemma 1 we deduce that if  φ∗is a minimizer of  F(φ), it must be a minimizer of  Q(φ|φ∗). The following results lay out not only the optimality conditions of the surrogate function, but also the connection between the surrogate and its original function deduced from.

Theorem 3 Assume that  L(φ) and  ℓ(φ|φ(0)) are continuously differentiable functions, and  ℓ(φ|φ(0)) majorizes  L(φ) at  φ(0). Suppose  φ∗is a local minimizer of  F(φ), and Assumption 3 holds at  φ∗. Then:

(a) Under Assumption 1, an optimality condition is

image

(b) Under Assumption 2, an optimality condition for  β∗j= 0, j = 1, ..., p is

image

For the LASSO, SCAD and MCP penalty functions, from (14) we define

image

Notice ˜λmaxis the same as  λmaxin (8) under the specified assumptions. Theorem 3 implies that if  βj= 0, j = 1, ..., p, then  λ ≥˜λmax. Like before, the converse is generally invalid unless stronger conditions are provided as in the next theorem.

Theorem 4 Assume that  L(φ) and  ℓ(φ|φ(0)) are differentiable convex functions, and  pλ(|θ|) is the LASSO penalty function, Then  φ∗is a global minimizer of  F(φ), where Assumption 3 holds at  φ∗, if and only if  φ∗is a global minimizer of  Q(φ|φ∗) or equivalently

image

Theorem 4 on the penalized surrogate loss  Q(φ|φ(0)) is a mirror of Theorem 2 on the penalized loss  F(φ). Furthermore, it is reasonable to deduce that similar results hold for the EM algorithm (a special case of the MM algorithm) in penalized estimation. Such a connection, due to the convexity assumption, is stronger than that between Theorem 1 and Theorem 3. These results suggest that instead of minimizing the original objective function  F(φ), we can minimize the surrogate function  Q(φ|φ(0)). The iterative MM procedure is summarized in Algorithm 1.

image

3.2 Quadratic majorization

The quadratic majorization (QM) is based on the second-order Taylor expansion for a twice differentiable function  Γ(u) such that

image

for some  wion the line segment connecting z and u. Suppose ˜B is positive definite and ˜B − ∇2Γ(w) is semipositive definite for all arguments w, then

image

majorizes  Γ(u) at z. Since  ∇γ(z|z) =  ∇Γ(z), Assumption 3 holds at z for functions  γ(u|z) and  Γ(u). The QM is a special application of the more general MM algorithm when a suitable bound ˜B exists (Lange, 2016). The generic terms defined by  ℓ(φ|φ(0)) in (12) can be replaced with the express formula (18).

While an upper bound ˜B may not be unique, to simplify computation as often possible, assume ˜B is a square diagonal matrix with all its main diagonal entries equal, i.e., ˜B = BI, where B is a scalar and I is the identity matrix. Furthermore the link function  ψ(·) takes different forms:

image

Define  u = y − fin regression problems. By (18) and the chain rule, we have

image

For classification problems with margin  u = yf, y ∈ {−1, 1}, we have

image

With  fi = x⊺i φ, zi = x⊺i φ(0), minimizing unpenalized loss  L(φ) in (1) amounts to minimizing

image

where

image

The constant weight B has no impact on the estimation given  hi(φ(0)), in contrast to the penalized estimation described below. Putting together, with iterated least squares estimation for the updated response variable, Algorithm 2 can be used to minimize  L(φ) in (1) and estimate parameters.

image

For penalized estimation, the surrogate loss function (13) simplifies to:

image

Since the surrogate function  ℓ(φ|φ(0)) satisfies Assumption 3, Theorem 3 is applicable. Theorem 4 is effective for convex loss functions such as those in generalized linear models (GLM), where quadratic majorization may result in possibly different form than (21). Minimizing  F(φ) in (2) can be achieved with Algorithm 3, and technical details are discussed in the subsequent context and supplement.

image

3.3 Penalized least squares

Minimizing the penalized least squares (22) is hampered by the fact that the penalty function  pλ(|·|) is not differentiable at 0. Let  G(·|·) be a function that approximates the penalty function  pλ(| · |), the following methods have been proposed in the literature.

image

(iii) Fan and Li (2001); Hunter and Li (2005) considered a local quadratic approximation (LQA):

image

In these proposals  G(θ|θ(0)) essentially majorizes  pλ(|θ|) at  θ(0). We implement method (i) in the algorithm since the efficient coordinate decent algorithm can easily provide penalized least squares estimates for a variety of penalty functions including the LASSO, SCAD and MCP (Friedman et al., 2010; Breheny and Huang, 2011). We nevertheless investigate properties of all three approaches in Section 4.

3.4 Applications to nonconvex loss functions

Traditional regression is based on minimizing a loss function  Γ(u), where Γ(u) =  ∥u∥2, u = y − f. The  L2loss is not robust to outliers. Robust regression has utilized absolute difference  Γ(u) =  |u| (L1loss) and Huber loss, both are convex loss functions. The nonconvex loss ClossR with a parameter σis described in Table 1 (Wang, 2018a). The ClossR with  σ= 0.9 is plotted in Figure 1, along with the aforementioned loss functions in regression. It is clear that bounded ClossR downweights large values more than other convex loss functions, effectively having better resistance to outliers.

For a binary outcome y taking values +1 and  −1, denote by u = yf the margin of a classifier f. Used in a logistic regression,  Γ(u) = log2(1+exp(−u)) is a popular choice in statistics. However, the logistic loss is unbounded and can’t control outliers well. Nonconvex loss functions have been proposed to address the robustness of classification, and three examples are listed in Table 1. These loss functions approximate the 0-1 loss, as shown in Figure 2. The  σvalue controls robustness of a loss function. For Closs, a smaller value is more robust to outliers while for other loss functions, a larger value is more robust. A typical strategy in practice is working from less robust to more robust and selecting a proper value of  σaccording to prediction.

For the loss functions in Table 1, we can obtain penalized estimates with Algorithm 3, where factor B has been derived by Wang (2018a). Since the functions are continuously differentiable, Theorem 1 and Theorem 3 are applicable. We list  λmaxin the supplement Table 5. The starting estimates of  φare based on a functional gradient decent boosting algorithm (Wang, 2018b).

To conduct a convergence analysis of Algorithm 1 for penalized estimation in convex and nonconvex problems, we construct a unified form of majorizers for the loss functions in Section 3.4 as well as those in the GLM. Consider a twice differentiable function  Li(φ) in (1). The Taylor expansion at a given value  φ(k)

is

image

for some ˜φon the line segment connecting  φand  φ(k), where

image

For  ψ(f) defined in (19), we have

image

which can be applied to functions including but not limited to the nonconvex functions in Section 3.4. For the GLM, the negative log-likelihood function is

image

where  a(·), b(·) and  c(·) are known functions,  δis the canonical parameter with δi = ψ(x⊺i φ) (if the link is canonical,  δi = x⊺i φ), and  κis the known dispersion parameter. The negative log-likelihood function has the form:

image

We have

image

∇L2i(φ) = 1a(κ)

image

For the canonical link  ψ(z) = z, we have

image

Let X denote the  n × (p+ 1) matrix of  xivalues,  Ω a n × ndiagonal matrix of weights with ith diagonal element  di(φ). Then we have

image

A matrix W is constructed to majorize  L(φ). Let ˜dj(φ), j= 0, 1, ..., p denote the eigenvalues of  ∇2L(φ). Assume that

image

and let W = I, where I is the identity matrix. If X is full rank, we may define

image

Obviously the (p + 1)  × (p+ 1) matrix W is positive definite, and all the eigenvalues of W are positive. Let  ω ≥ ϱ, ω > 0.We have

image

where the inequality  ⪰means that  ωW −∇2L(φ)−0Iis positive semidefinite. Denote

image

For  G(·|·) defined in Section 3.3 and  β= (β1, ..., βp)⊺, denote

image

Finally we define the surrogate loss

image

A similar result as in Schifano et al. (2010) is obtained.

Theorem 5 Assume twice differentiable function  F(φ) satisfies (26) or (27), Q(φ|φ(k)) is given by (30), and penalty  pλ(| · |) satisfies Assumption 1. Then: Q(φ|φ(k)) majorizes  F(φ) at  φ(k):

image

The inequality holds if  ω > ϱ.

Theorem 5 implies that Algorithm 1 provides a nonincreasing sequence F(φ(k)). We analyze Algorithm 1 under different assumptions in the sequel. The following convergence analysis does not cover (25) which is not defined when  θ= 0. See Hunter and Li (2005) for a modified majorization to avoid such a degenerate case. We first focus on convex optimization problems.

Theorem 6 Assume twice differentiable convex function  F(φ) satisfies (26) or (27),  Q(φ) is given by (30), the surrogate penalty  G(βj|β(k)j) =  λ|βj|, λ >0, and the iterates  φ(k)are generated by Algorithm 1.Then the loss values  F(φ(k)) are monotonically decreasing and limk→∞ F(φ(k)) →inf  F(φ). Furthermore φ(k)converges to a minimum point of  F(φ) if  L(·) is either (i) coercive or (ii) bounded below with bounded intercept  β0.

The bounded intercept in Theorem 6 ensures that penalty term  Λ(β) is coercive for  φ. Otherwise  Λ(β) is not coercive for  φdespite that  Λ(β) is coercive for  βmeaning  ∥β∥ → ∞implies  Λ(β) → ∞. To see this, denote

image

Then  Aφ = β. Denote ˜Λ(φ) ≜ Λ(Aφ) and let  β= (0, ..., 0)⊺, a p-dimensional vector, then we have  φ= (β0, β)⊺and ˜Λ(φ) =  Λ(β) = 0 by definitions. Thus Λ(β) is a constant as  |β0| → ∞or  ∥φ∥ → ∞. In conclusion, as  ∥φ∥ → ∞, ˜Λ(φ) → ∞doesn’t always hold. This proves that  Λ(β) is not coercive for  φif the intercept is not bounded.

The next two theorems are particularly relevant for nonconvex functions. We consider a larger class of surrogate loss  ℓ(φ|φ(0)) which covers the QM (29) as a special case while the surrogate loss function  Q(φ|φ(k)) has the same form as in (30).

image

Notice that  ℓ(φ|φ(k)) given by (29) satisfies assumptions of Theorem 7. We can summarize the implications of Theorem 7: applying the QM to the loss functions in the GLM or Table 1, and applying an appropriate majorization such as (23) or (24) to the penalty functions including the LASSO, SCAD and MCP, then every limit point of the iterates generated by the MM algorithm is a Dini stationary point. While a Dini stationary point is always a Clarke stationary point, the converse is not valid. Under different conditions, the iterates of the MM algorithm converges to a Clarke stationary point.

Theorem 8 Assume continuous function  L(φ) is locally Lipschitz continuous, L(φ) is coercive or bounded below with bounded intercept  β0, the set of Clarke stationary points S of  F(φ) is a finite set, the surrogate function  ℓ(φ|φ(k)) is strongly convex and strictly majorizes  L(φ) at  φ(k), the surrogate loss  Q(φ|φ(k)) is given by (30), the surrogate penalty  G(βj|β(k)j) is convex and majorizes pλ(|βj|) at  β(k)j, and the iterates  φ(k)are generated by Algorithm 1. Then  φ(k)converges to a Clarke stationary point of the problem minimizing  F(φ).

Theorem 8 can be applied to a broad loss functions including those in the GLM and Table 1. These functions are continuously differentiable, thus locally Lipschitz continuous. Construct a surrogate loss  Q(φ|φ(k)) given by (30) with  ω > ϱ. Then  ℓ(φ|φ(k)) strictly majorizes  L(φ). From  ∇2ℓ(φ|φ(k)) = ωWwe know  ∇2ℓ(φ|φ(k)) is positive definite. Suppose Assumption 1 holds for pλ(| · |), then  G(φ|φ(k)) is convex and majorizes  pλ(|βj|) at  β(k)j if G(φ|φ(k)) is given by (23) for the LASSO or (24). This implies some of the assumptions of Theorem 8 have been met. If additional assumptions in the theorem are valid, then we deduce that Algorithm 1 iterates converge to a Clarke stationary point. Unlike Theorem 7, Theorem 8 doesn’t assume equivalence of the first order Dini derivative between the penalized surrogate loss  Q(φ|φ(0)) and objective function  F(φ). When this does hold, as the QM in (29) and the surrogate penalty function in (23) or (24), Theorem 7 provides stronger conclusion than Theorem 8.

In the simulated examples, we evaluate performance of the proposed robust variable selection methods, i.e., MM for penalized estimation with Algorithm 3, and compare with the standard non-robust algorithms. The parameter of SCAD and MCP is fixed at a = 3. This choice works well for the simulation study. The responses in each data set are randomly contaminated with percentage of v=0, 5, 10 and 20. We generate random samples of training/tuning/test data. Training data are used for model estimation. Tuning samples are used to select optimal penalty parameters with the smallest loss values and classifica-tion errors in regression and classification problems, respectively. Test data are used to evaluate prediction accuracy except for example 1 whose prediction accuracy is not from the test data. To evaluate variable selection, we compute sensitivity and specificity, where sensitivity is the proportion of number of correctly selected predictors among truly effective predictors, and specificity is the proportion of number of correctly non-selected predictors among truly ineffective predictors. The intercept is not counted as a variable. A good variable selection method should obtain both sensitivity and specificity close to 1. We repeat Monte Carlo simulation 100 times in the following examples.

5.1 Example in regression

Example 1: Let  y = β0 + X⊺β + ϵ, where  β0= 1, β= (3, 1.5, 0, 0, 2, 0, 0, 0), ϵ ∼N(0, 3), X ∼ N8(0, Σ) with  Σij= 0.5|i−j|for i, j = 1, ..., 8.

Without the intercept, similar data was generated in Fan and Li (2001). The response variable y is contaminated by randomly multiplying 1, 000 for v = 0, 5, 10 and 20 percent of training (n=200) and tuning data (n=200). Different estimation methods are evaluated based on the mean squared prediction error E[(ˆy − β0 − X⊺β)2] (Hunter and Li, 2005). We fix  σ= 10 in ClossR for which larger  σvalue is more resistant to outliers. Table 2 and Table 6-8 in the supplement display the results for penalized ClossR methods along with standard least squares (LS), LS-based penalized methods and penalized Huber regression proposed by Yi and Huang (2017). The best prediction results with the associated variable selection are highlighted in bold.

The ClossR-based methods have the best prediction and outperform the HuberLASSO in all four settings of outliers. For the LASSO penalty, ClossR has better prediction than Huber but also selects more variables. For the clean data, ClossR-based methods are slightly better or comparable to LS-based methods when the penalty function is the same. If one believes that prediction accuracy from the penalized LS methods should be superior to the penalized ClossR in uncontaminated cases, the tiny differences in Table 2 perhaps can be explained by random variation in finite samples and selection of penalty parameters. An interesting research topic is to theoretically compare prediction between penalized LS and penalized ClossR.

For clean data, all effective variables are selected, and ClossR-based SCAD or MCP has better selection than LASSO, consistent with LS-based penalty functions. For contaminated data, all correct variables are selected in robust estimation, but not in the penalized LS estimation. In addition, ClossR-based SCAD or MCP determines fewer ineffective variables than LASSO.

For nonconvex optimization, Algorithm 3 can generate different local estimates depending on the initial values. To illustrate, two methods of initial values elaborated in supplement A.1 are compared in Figure 3 for loss ClossR and nonconvex penalty SCAD with v=20. Prediction from boosting-supported backward solution path is more reliable than the forward path. The two largest errors from the latter ultimately generates a greater mean prediction error despite a smaller median value.

5.2 Examples in classification

The following data generations are from Wu and Liu (2007) and have also been investigated in Wang (2018a,b) for robust classification. These examples include both linear and nonlinear classifiers.

Example 2: Predictor variables (x1, x2) are uniformly sampled from a unit disk  x21+  x22 ≤1 and y = 1 if  x1 ≥ x2and -1 otherwise.

Example 3: Predictor variables (x1, x2) are uniformly sampled from a unit disk  x21+  x22 ≤1 and y = 1 if (x1 − x2)(x1+  x2) <0 and -1 otherwise.

In each example, we also generate 18 noise variables from uniform[-1, 1]. To add outliers, we randomly select v percent of the data and switch their class labels. With the simulated data, the performance of an algorithm can be compared with the optimal Bayes errors. The training/tuning/test sample sizes are n = 200/200/10, 000 in Example 2 and 1, 000/1, 000/10, 000 in Example 3. To accommodate nonlinear classifiers in Example 3, the predictor space is doubled to p = 40 including quadratic terms  x2ifor i = 1, 2, ..., 20. In Example 2 and 3, no-intercept models have better prediction accuracy from logistic-based and nonconvex loss-based variable selection methods. Since the data are not generated from logistic regression models, estimation from a logistic regression model is not a clear-cut winner even for the clean data. We fix Closs σ= 0.9, Gloss  σ= 1.1 and Qloss  σ= 0.2 in the estimation.

The results for Example 2 are demonstrated in Table 3 and Table 9 in the supplement. It is clear that the performance of logistic-based methods relies on the cleanliness of the data, and is not robust with outliers. On the other hand, the nonconvex loss-based variable selection methods are better resistant to outliers. In particular, with data contaminated by v=0, 5, 10 and 20 percentage, the best prediction results 0.0047(QlossSCAD), 0.0548(GlossSCAD), 0.1055(GlossSCAD), 0.2067(QlossSCAD) are the closest to the corresponding Bayes errors. The corresponding variable selections in highlighted numbers in Table 9 almost perfectly match the Bayes optimal classifier. Since all effective variables are correctly selected, specificity is not reported in lieu of supplement Table 9. It can be interesting to compare the LASSO and boosting due to its similarity (Efron et al., 2004). The same random data was generated in Wang (2018a), where the robust boosting algorithm provided similar prediction results and slightly more sparse variable selection. For instance, the boosting errors from the Gloss are 0.0108, 0.0628, 0.1178, 0.2217 for outlier percentage v=0, 5, 10, 20, respectively (Wang, 2018a), while the corresponding QlossLASSO errors are 0.0154, 0.0672, 0.1187, 0.2223 in Table 3. As the percentage of outliers changes, the average number of variables selected in the boosting algorithm ranges from 2.5 to 2.8 (Wang, 2018a) while the QlossLASSO ranges from 4.1 to 4.9 in supplement Table 9.

The results for Example 3 are summarized in the supplement, and similar conclusions like Example 2 can be drawn. In all outlier scenarios, the nonconvex loss-based methods triumph over the logistic-based methods on prediction. In particular, the best average test errors are 0.0016 (ClossMCP), 0.0520 (QlossSCAD), 0.1018 (QlossSCAD) and 0.2019 (QlossSCAD) for percentage of outliers v=0, 5, 10 and 20, respectively. In addition, the corresponding variable selections in featured numbers in Table 11 resemble the Bayes optimal classifier. Except for v=0, all variables are more accurately identified by the nonconvex loss-based methods than the logistic-based methods. Both  x21and x22are selected in all estimation methods in each simulation run. Supplement Table 12 counts variable selection for  x1or  x2in 100 simulations. Across all methods, the average specificity excluding  x1, x21, x2, x22is  ≥ 0.98. Hence, the details are unreported. In conclusion, all methods correctly select  x21and  x22, and the remaining variables including  x1and  x2are typically eliminated by the variable selection procedures.

Shi et al. (2010) used gene expressions to predict breast cancer clinical status with 164 estrogen receptor positive cases and 114 negative cases. The same data set has been evaluated for a variety of robust classification algorithms (Li and Bradic, 2018; Wang, 2018a,b). Applying MM for penalized estimation, the analysis described below is reproduced in a vignette distributed with the R package mpath. Following the same pre-selection procedure, the data set is pre-screened to keep 3000 most significant genes from 22,282 genes with twosample t-test. These genes are then standardized. The data set is randomly split to training data and test data, and the training data contains 50/50 positive/negative cases and the test data has 114/64 positive/negative cases. The positive/negative status in the training data is randomly flipped with percentage v=0, 5, 10, 15. The parameter of SCAD and MCP is fixed at a = 3.7 and a = 12, respectively. The predictive models are aided with a five-fold cross-validation to select the optimal penalty parameters with the smallest classification errors. The analysis is conducted with Closs  σ= 0.9, Gloss  σ= 1.01 and Qloss  σ= 0.5 (Wang, 2018a). This process is repeated 100 times and the results are summarized in Table 4 and supplement Table 13.

Prediction accuracy is more similar in the clean data than those contaminated, while the logistic-based methods suffer from outliers. The best prediction with the Closs-based methods are 0.0818, 0.0839, 0.0910, 0.1143 for v=0, 5, 10, 15, respectively. This is the best prediction for this data set in the robust prediction literature. See Li and Bradic (2018); Wang (2018a,b) and the references therein. With more data contamination, more variables are selected. Loss function and penalty method also contribute to different variable selection results. In addition, sparse variable selection is maintained even with outliers. For instance, with 15% outliers, ClossSCAD not only has more accurate prediction than LogisSCAD (0.1143 vs 0.1566), but also selected only 2.8 genes on average, compared with 19.1 genes by LogisSCAD.

Outliers can substantially cause bias in traditional estimation methods and produce inaccurate variable selection problem. This problem has been tackled by robust estimation, which has a long history in statistics methodology research and applications. The previous methods are largely based on convexbased loss functions, which may not be sufficient to guard the impact of outliers. Adding to this is the variable selection challenge in predictive modeling, specifically for the high-dimensional data.

In this article, we propose penalized nonconvex loss functions for several popular penalty functions. We have illustrated that the proposed methods can improve estimation and variable selection compared to conventional estimation and robust estimation. The estimation problem is decoupled into an iterative MM algorithm with each iteration conveniently conducted by existing optimization algorithms. The QM embedded in the MM algorithm is particularly appealing since the induced penalized least squares problem has been well studied. For tuning parameter identification, it is a typical practice to construct a solution path along with penalty parameters. A popular procedure is to start with a penalty parameter  λmaxsuch that all predictor coefficients are zero. As is well-known in nonconvex optimization problems, different starting values may lead to different solutions if local minimizers are achieved. We propose an alternative yet effective approach to compute a solution path. The so-called backward path generates starting values utilizing the boosting algorithm and find solutions backwards towards zero coefficients.

This article evaluates optimality conditions for both convex and nonconvex problems in the context of penalized estimation. As expected, we could only partially characterize the optimality conditions for nonconvex problems. The link between the original optimization problem and the surrogate optimization problem, nevertheless, provides a useful utility to compute  λmax. The convergence analysis provides new results for both convex and nonconvex problems. For convex problems, the iterated estimates of the MM algorithm with QM converges to a minimum point. For nonconvex problems, the iterates of the MM algorithm with QM converges to a local stationary point in the framework of the Dini or Clarke derivative.

Future direction of research includes expanded penalty functions including those not locally Lipschitz continuous. For instance, the bridge penalty for  a ∈(0, 1). Since the generalized derivatives like Dini and Clarke derivative are both restricted to locally Lipschitz continuous functions, additional mathematical tools are required to assess properties of the penalized loss functions.

Alfons, A., Croux, C., Gelper, S., et al. (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1):226–248.

Bertsekas, D. P. (1999). Nonlinear Programming. Athena Scientific, second edition.

Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for noncon- vex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232–253.

Byrne, C. L. (2018). Auxiliary-function minimization algorithms. Applied Analysis and Optimization, 2(2):171–198.

Chi, E. C. and Scott, D. W. (2014). Robust parametric classification and vari- able selection by a minimum distance criterion. Journal of Computational and Graphical Statistics, 23(1):111–128.

Clarke, F. (2013). Functional Analysis, Calculus of Variations and Optimal Control, volume 264. Springer Science & Business Media.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2):407–499.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli- hood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33:1–22.

Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Annals of Statistics, 33(4):1617–1642.

Lange, K. (2016). MM Optimization Algorithms. Philadelphia: SIAM.

Li, A. H. and Bradic, J. (2018). Boosting in the presence of outliers: adap- tive classification with nonconvex loss functions. Journal of the American Statistical Association, pages 1–15.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media.

Razaviyayn, M., Hong, M., and Luo, Z.-Q. (2013). A unified convergence anal- ysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153.

Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030.

Schifano, E. D., Strawderman, R. L., Wells, M. T., et al. (2010). Majorization- minimization algorithms for nonsmoothly penalized objective functions. Electronic Journal of Statistics, 4:1258–1299.

Shi, L., Campbell, G., Jones, W., Campagne, F., et al. (2010). The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology, 28(8):827–838. https://goo.gl/8bdBDE.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.

Wang, H., Li, G., and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Business & Economic Statistics, 25(3):347–355.

Wang, Z. (2018a). Quadratic majorization for nonconvex loss with applica- tions to the boosting algorithm. Journal of Computational and Graphical Statistics, 27(3):491–502.

Wang, Z. (2018b). Robust boosting with truncated loss functions. Electronic Journal of Statistics, 12(1):599–650.

Wu, Y. and Liu, Y. (2007). Robust truncated hinge loss support vector ma- chines. Journal of the American Statistical Association, 102(479):974–983.

Yi, C. and Huang, J. (2017). Semismooth Newton coordinate descent algo- rithm for elastic-net penalized Huber loss regression and quantile regression. Journal of Computational and Graphical Statistics, 26(3):547–557.

Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax con- cave penalty. Annals of Statistics, 38(2):894–942.

Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509.

image

Fig. 1 Loss functions for regression

image

Fig. 2 Loss functions for classification

image

Fig. 3 Prediction of ClossRSCAD from different solution paths in Example 1 and v=20

image

Table 1 Nonconvex loss function

Table 2 Mean squared prediction errors and standard deviations (in parentheses) for v percentage of outliers in Example 1. Numbers > 1e + 4 are suppressed.

image

Table 3 Mean test errors and standard deviations (in parentheses) for v pencentage of outliers in Example 2.

image

Table 4 Mean test errors and standard deviations (in parentheses) for v pencentage of outliers in the breast cancer data.

image

image

Appendix A Technical details of Algorithm 3

A.1 Solution paths

To optimize  F(φ), tuning parameters such as  λcan be chosen by cross-

validation or other data driven methods from some prespecified values. We

make use of  λmaxin (8) or equivalently ˜λmaxin (15). The latter can be handy

with existing computer codes for penalized least squares estimation. We com-

pute the solutions along a regularized path for a sequence of  λ. A forward path

begins with all initial non-intercept parameters zero for a fixed  αvalue, then

compute solutions for a decreasing sequence of  λ, λk, k= 1, ..., K, ranging

from  λmaxto  λmin = ϵλmaxfor a small number 0  < ϵ <1 (Friedman et al.,

2010). For nonconvex estimation in Section 3.4, Algorithm 3 typically seeks

a local optimization solution. It is well-known that there could be potentially

different estimates depending on the initial values. In practice, different initial

values can be utilized to obtain the optimal result. This article emphasizes

boosting as an innovation for initial value estimation, coupled with an alter-

native backward path, an increasing sequence of  λk, k = K, ...,1 from  λminto

λmax. This would provide different initial values for nonconvex optimization.

Most importantly, the numerical results in the simulation and data analysis

warrant this choice.

image

Table 5 The λmax values

A.2 Update on active set

To speed up computations, we may implement the active set strategy in Al-

gorithm 3. First, we determine the active set, a set of variables with non-zero

coefficients, from the initial value  φin step 1. Run step 2 to 5 with variables in

the active set. Then run step 4 once with all variables in the full set to update

the active set. If there is no change, the computation finishes. Otherwise, re-

peat the above procedure. Second, we employ the active set internally within

step 4 when minimizing penalized least squares. We begin with an initial active

set containing all variables, cycle through all coefficients in the active set with

coordinate descent algorithm until convergence, then cycle through the full set

with all variables, but only once. This generates a new active set. Compared

with the previous active set, if no change, then step 4 converges. Otherwise,

we repeat the above process until convergence. The first two procedures invoke

the full set so that variable selection is not predominated or trapped by the

initial estimation including boosting. Indeed, if all the initial parameters are

zero (forward path) in step 1, the above full set procedures still apply. That

is, the variable selection is not completely relying on step 1. Third, for large

data problems, the active set principle can be aggressively followed through to

develop a solution path for a desired  λsequence. For an increasing sequence

of  λ, starting from a small  λK(= λmin) value, Algorithm 3 generates estimates

and an active set. For  λK−1(> λK), only the coefficients of the active set are

cycled through while the coefficients of the non-active set remain zero. The

process is repeated until  λ1(= λmax) is taken.

Employing active set can substantially reduce computing burden especially for large data problems as in Section 6. The results, however, are not always preferred compared to the conventional non-active set, or full set strategy. Therefore, we focus on results from the non-active set strategy in the numerical studies.

Appendix B Generalized derivatives

Generalized derivatives for nonsmooth convex and nonsmooth nonconvex func-

tions have been developed (Nesterov, 2004; Clarke, 2013). Throughout this

paper, we only consider  f : Rm → R, a locally Lipschitz continuous function.

It is said to be Lipschitz near x if, for some neighborhood  Vxof x and some

constant K such that  |f(x) − f(y)| ≤ K∥x − y∥ ∀x, y ∈ Vx. And f is called

locally Lipschitz continuous on  Rmif it is Lipschitz near x for every  x ∈ Rm.

If for all  y, f(y) − f(x) ≥ ⟨g, y − x⟩, then g is called a subgradient of f at x.

The ordinary directional derivative at x in the direction  εis defined by:

image

The Clarke directional derivative of f at x in the direction  εis defined below:

image

The Clarke subdifferential of f at x is the set (Clarke, 2013, 10.3)

image

We say that f is regular at x provided f is Lipschitz near x and admits direc-

tional derivatives  f ′(x; ε) satisfying  f o(x; ε) =  f ′(x; ε) ∀ε. If f is differentiable

at x, then all three directional derivatives at x are the same. However, unlike

the directional derivative  f ′(x; ε), the Dini and Clarke derivatives always exist.

The point x is a stationary point of  f(·) if  f ′(x; ε) ≥ 0, ∀ ε ∈ Rm. The point

x is a Dini stationary point of  f(·) if  f ′D(x; ε) ≥ 0, ∀ ε ∈ Rm. The point x is a

Clarke stationary point of  f(·) if  f o(x; ε) ≥ 0, ∀ ε ∈ Rm. When a directional

derivative exists, a stationary point is always a Dini stationary point but not

vice versa. A Dini stationary point is always a Clarke stationary point but not

vice versa since the following relationships hold for all x and  ε:

image

We say that a vector  x ∈ Rmis a limit point of a sequence  {xn}in  Rmif there

exists a subsequence of  {xn}that converges to x.

Appendix C Proofs

Lemma 3 If Assumption 2 holds, then:

image

where  ∂Cpλ(0) is the Clarke subdifferential of  pλ(|θ|) at  θ= 0. Equality holds

if  pλ(|θ|) is also regular at  θ= 0.

Remark 1 Lemma 3 can be applied to the LASSO penalty with equality in

(32) since the penalty is convex, thus a regular function (Clarke, 2013, 10.8).

A regular function is either smooth everywhere or else has a corner of convex

type (Clarke, 2013, p. 200). Under Assumption 1 or Assumption 2,  pλ(|θ|)

is not differentiable at  θ= 0, and indeed 0 is a corner. Therefore  pλ(|θ|) is

regular if and only if  pλ(|θ|) is convex with 0 corner, i.e., the LASSO penalty.

On the other hand, penalty functions SCAD and MCP are not regular as they

are folded concave functions on R with a corner at the origin. Hence (32) may

not hold for these penalty functions.

Proof of Lemma 3

Let  poλ(0;  ε) denote the Clarke directional derivative at 0 in the direction ε, and  p′λ(0;  ε) denote the ordinary directional derivative. Their relationship and Assumption 2 lead to

image

Let  τ ∈ [−ζλ, ζλ], i.e.  ζλ ≥ |τ|, then we get

image

By definition (Clarke, 2013, 10.3),  τ ∈ ∂Cpλ(0). Thus we have shown

image

If  pλ(·) is also regular at 0, by its definition and Assumption 2, we then have

image

Let  τ ∈ ∂Cpλ(0), then according to the definition (Clarke, 2013, 10.3), a

necessary and sufficient condition is

image

Along with (34), then  ζλ|ε| ≥ τε, which leads to  τ ∈ [−ζλ, ζλ]. Hence

∂Cpλ(0)  ⊆ [−ζλ, ζλ]. Combining with (33), we have  ∂Cpλ(0) = [−ζλ, ζλ]. ⊓⊔

Proof of Theorem 1

Part (a): First,  F(φ) is locally Lipschitz continuous as  F(φ) is a sum of locally Lipschitz continuous functions. Since the first summand  L(φ) can be nonconvex and the second summand is nonsmooth, we invoke the Clarke sub-differential. Suppose  φ∗is a local minimizer of  F(φ), then 0  ∈ ∂CF(φ∗), where F(φ∗) is the Clarke subdifferential (Clarke, 2013, 10.7(a)). We have  L(φ) continuously differentiable,  β2jcontinuously differentiable and  pλ(|βj|) Lipschitz continuous near  βj, then the calculus of Clarke subdifferential holds (Clarke, 2013, 10.16):

image

Once  φ∗is plugged into (35), we then obtain

image

If  β∗j ̸= 0,  pλ(|β∗j |) is continuously differentiable by Assumption 1, thus  ∂Cpλ(|β∗j |) =

{sign(β∗j)p′λ(β∗j)} (Clarke, 2013, 10.8). Hence (5) follows. Otherwise we obtain

(6).

Part (b): First, part (a) holds since Assumption 2 contains Assumption 1. In particular, since (6) holds, it is left to compute  ∂Cpλ(0) utilized in (6). For

the first statement, Lemma 3 implies  ∂Cpλ(0)  ⊇ [−ζλ, ζλ]. Thus (6) induces

conditions including but potentially not limited to (7). In other words, it is

possible (7) is an incomplete optimality condition. For the second statement,

we provide two proofs. (i) Lemma 3 implies  ∂Cpλ(0) = [−ζλ, ζλ]. Hence (6)

is equivalent to (7), which becomes the complete optimality condition. (ii) As

argued before, if  pλ(|θ|) satisfies Assumption 2 and is also regular at 0, then

pλ(|θ|) is the LASSO penalty. For the continuous convex LASSO penalty, the

Clarke subdifferential is equivalent to the ordinary subdifferential (Clarke,

2013, 10.8). Therefore  ∂Cpλ(·) =  ∂pλ(·) = [−1,1]. From (6) we again obtain

complete optimal condition (7). ⊓⊔

Proof of Theorem 2

We know  φ∗is a global minimizer if and only if 0  ∈ ∂F(φ∗) (Nesterov, 2004). Since  F(φ∗) is a sum of convex functions, we have

image

where the last equality is obtained since  L(φ) and  β2are differentiable convex

functions. Deduce from this that (9) and (10) are valid by simple calculations.

image

holds:

image

where the second inequality holds locally or globally depending on whether  u∗

is a local or global minimizer. Thus  u∗is a local (global) minimizer of  γ(u|u∗).

image

a global minimizer of  γ(u|u∗) if and only if  ∇γ(u∗|u∗) = 0 by the optimality

condition. Thus  ∇Γ(u∗) =  ∇γ(u∗|u∗) = 0 by assumption. This happens if and

only if  u∗is a global minimizer of  Γ(u) since  Γ(u) is convex. ⊓⊔

image

minimizer of  Q(φ|φ∗). Therefore 0  ∈ ∂CQ(φ∗|φ∗). As in the proof of Theorem 1

(a), we have

image

where the second equality is obtained by Assumption 3.

image

The convexity assumption leads to

image

where the third equality is obtained by Assumption 3. Therefore 0  ∈ Q(φ|φ∗)

if and only if 0  ∈ F(φ). Deduce from this that  φ∗is a global minimizer of

Q(φ|φ∗) if and only if  φ∗is a global minimizer of  F(φ). Thus (16) and (17)

can be obtained from (36) by direct computation. ⊓⊔

image

be:

image

Suppose the penalty doesn’t change as in (23). Let  G′D(θ(0)|θ(0); ε) denote the

Dini derivative of  G(θ|θ(0)) for direction  εat  θ(0). It is trivially true that

image

If the penalty is majorized as in (24), we have

image

Hence we have

image

Again (38) is obtained. ⊓⊔

Proof of Theorem 5

Consider the second order Taylor expansion

L(φ) =  L(φ(k)) + (φ − φ(k))⊺∇L(φ(k)) + 12(φ − φ(k))⊺∇2L(˜φ)(φ − φ(k)).

By the construction of (29),  ℓ(φ|φ(k)) majorizes  L(φ) and  ℓ(φ|φ(k)) strictly

majorizes  L(φ) if  ω > ϱ.Thus to prove that  Q(φ|φ(k)) majorizes or strictly

majorizes  F(φ), we only need to show that  G(βj|β(k)j) majorizes  pλ(|βj|), 1 ≤

j ≤ p. We consider three cases depending on the construction of  G(βj|β(k)j):

image

c) If  G(βj|β(k)j) is defined by (25), by Proposition 3.1 in Hunter and Li (2005), G(βj|β(k)j) majorizes  pλ(|βj|).

In summary, we conclude that  Q(φ|φ(k)) majorizes  F(φ), and the inequality

in (31) holds if  ω > ϱ. ⊓⊔

Proof of Theorem 6

From Theorem 5 we have

image

We restate (29) below for convenience:

image

We prove through an adaption of similar arguments in Lange (2016) that

Algorithm 1 with the quadratic upper bound surrogate function satisfies the

following property:

image

With (28) we deduce that  ∇2r(φ) is positive semidefinite. Thus  r(·) is convex,

see for instance, Theorem 2.1.4 in Nesterov (2004). From (2), (30), (40) and

G(β|β(k)) =  pλ(|β|) for some convex function  pλ(|β|), we rewrite

image

By convexity we have

image

where  ∂Q(φ|φ(k)) and  ∂F(φ) are subdifferentials. Since  φ(k+1)minimizes  Q(φ|φ(k)),

then 0  ∈ ∂Q(φ(k+1)|φ(k)). Equivalently, for some  q(k+1) ∈ ∂F(φ) we have

image

From the definition of subgradient we have

image

Combining (42) and (43), we have

image

The last inequality and (41) means that (39) holds. Hence Algorithm 1 is

a sequential unconstrained minimization algorithm (SUMMA), and  F(φ(k))

converges to inf  F(φ) as shown in Byrne (2018).

We prove the second assertion utilizing Proposition 7.4.1 in Lange (2016) that states if the SUMMA condition holds,  F(φ) and  Q(φ|φ(k)) are continuous, Q(φ|φ(k)) is uniformly strongly convex, and  F(φ) achieves its minimum, then every sequence  φ(k)converges to a minimum point of  F(φ).

Since the objective function  F(φ) is convex, thus continuous. The surrogate Q(φ|φ(k)) is also continuous. From (29) we have

image

where  τis the smallest eigenvalue of W. Hence  ℓ(φ|φ(k)) is uniformly strongly

convex with parameter  ωτwhich is independent of  φ(k). Note that  Λ(β) is

convex provided  G(βj|β(k)j) =  pλ(|βj|) for some convex penalty  pλ(|·|). Denote

image

Then  Aφ = β. Denote ˜Λ(φ) ≜ Λ(Aφ). Since the composition of an affine

function A preserves convexity of  Λ(·), ˜Λ(φ) is convex. Therefore  Q(φ|φ(k)) =

ℓ(φ|φ(k)) + ˜Λ(φ) =  ℓ(φ|φ(k)) +  Λ(β) is uniformly strongly convex in  φ.

We need to show that  F(φ) is coercive. First, consider condition (i). Since L(φ) is coercive meaning  ∥φ∥ → ∞implies  L(φ) → ∞, then  F(φ) is coercive. Next, consider condition (ii). Since  β0is bounded, for instance,  β0= 0, i.e., no intercept, we deduce from  ∥φ∥ → ∞that  ∥β∥ → ∞, which in turn implies Λ(β) → ∞by Assumption 1. Additionally, since  L(φ) is bounded below, then F(φ) is coercive. In conclusion, with either (i) or (ii),  F(φ) is coercive. Combining with continuity, hence  F(φ) achieves its minimum (Bertsekas, 1999, Proposition A.8).

In summary, all conditions hold for Proposition 7.4.1 in Lange (2016), thus every sequence  φ(k)converges to a minimum point of  F(φ). ⊓⊔

Lemma 4 Suppose Assumption 1 holds for  pλ(|θ|), and  G(θ|θ(0)) is given by

(23) or (24). Then the Dini derivative of  G(θ|θ(0)) for direction  εat  θ(0)is

the same as that of  pλ(|θ|) given by:

image

Proof of Theorem 7

To show convergence, we utilize the following assertions:

image

The first term  ℓ(φ|φ(k)) is jointly continuous in (φ, φ(k)) by assumption, and the

remaining terms in (30) are also jointly continuous. Hence assertion (i) holds.

We prove the second assertion. Denote  ε= (ε0, ..., εp)⊺, we then compute

image

where  p′λ,D(β(k)j ; εj) is the Dini directional derivative of  pλ(|βj|) for direction

εjat  β(k)j. On the other hand, we have

Q′D(φ|φ(k); ε)φ=φ(k)= lim infτ→0+ Q(φ(k)+  τε) − Q(φ(k))τ

image

By Lemma 4 we have

image

By Assumption 3 we have

image

We deduce from (45) and (46) that (44) holds. Theorem 5 and assertions (i)-(ii) together imply that Assumption 1 in

Razaviyayn et al. (2013) is satisfied. Thus invoking Theorem 1 in Razaviyayn

et al. (2013) concludes that every limit point of the iterates generated by

Algorithm 1 is a Dini stationary point of the problem minimizing  F(φ). ⊓⊔

Proof of Theorem 8

Consider the following regularity conditions:

image

The above conditions lead to the desired result (Schifano et al., 2010; Chi and

Scott, 2014). In particular, each limit point ˜φof  φ(k)is a fixed point such that

M(˜φ) = ˜φ(Schifano et al., 2010, Theorem A.3), and the set of fixed points

coincides with the Clarke stationary points of  F(φ) (Schifano et al., 2010,

Proposition A.8). We now check each condition. By assumption  L(φ) is locally

Lipschitz continuous, Assumption 1 implies that  pλ(|βj) is locally Lipschitz

continuous due to a bounded derivative at every point including the origin, and

β2jis continuously differentiable thus locally Lipschitz continuous. As a result,

F(φ) is locally Lipschitz continuous. With the same arguments in the proof

of Theorem 6,  F(φ) is coercive since  L(φ) is coercive or bounded below with

β0bounded. With the additional assumption that the set of Clarke stationary

points S of  F(φ) is a finite set, R1 is satisfied. Since  ℓ(φ|φ(k)) strictly ma-

jorizes  L(φ) at  φ(k), and  G(βj|β(k)j) majorizes  pλ(|βj|) at  β(k)j, then  Q(φ|φ(k))

strictly majorizes  F(φ) at  φ(k). R2 is met. As a sum of continuous functions

on (φ, φ(k)), the penalized majorization  Q(φ|φ(k)) is continuous. The surro-

gate penalty ˜Λ(φ) is convex and finite, thus locally Lipschitz continuous. Since

ℓ(φ|φ(k)) is also locally Lipschitz,  Q(φ|φ(k)) defined in (30) is locally Lipschitz

continuous. Thus R3 is met. By assumption  ℓ(φ|φ(k)) is strongly convex. Then

Q(φ|φ(k)) is strongly convex because it is a sum of strongly convex function

ℓ(φ|φ(k)) and convex function ˜Λ(φ). Hence  Q(φ|φ(k)) has at most one global

minimizer. Since  Q(φ|φ(k)) ≥ F(φ) and  F(φ) is coercive as shown above, then

Q(φ|φ(k)) is coercive. Combining with continuity,  Q(φ|φ(k)) achieves its mini-

mum. Together,  Q(φ|φ(k)) has one and only one global minimizer. R4 is met.

image

C.1 Additional results in simulation and data analysis

Table 6 Mean number of selected variables and standard deviations (in parentheses) for v percentage of outliers in Example 1.

image

Table 7 Mean sensitivity and standard deviations (in parentheses) for v percentage of outliers in Example 1.

image

Table 8 Mean specificity and standard deviations (in parentheses) for v percentage of outliers in Example 1.

image

Table 9 Mean number of selected variables and standard deviations (in parentheses) for v pencentage of outliers in Example 2.

image

Table 10 Mean test errors and standard deviations (in parentheses) for v pencentage of outliers in Example 3.

image

Table 11 Mean number of selected variables and standard deviations (in parentheses) for v pencentage of outliers in Example 3.

image

Table 12 Variable selection for  x1 or x2for v percentage of outliers in Example 3.

image

Table 13 Mean number of selected variables and standard deviations (in parentheses) for v pencentage of outliers in the breast cancer data.

image


Designed for Accessibility and to further Open Science