Since the mid 1990’s, high dimensional statistics has attracted considerable attention, especially in the context of linear regression with more explanatory variables than observations: the so-called case. In such a context, the least squares with
regularization, referred to as the Lasso (Tibshirani, 1996) in statistics, or Basis Pursuit (Chen et al., 1998) in signal processing, has been one of the most popular tools. It enjoys theoretical guarantees (Bickel et al., 2009), as well as practical benefits: it provides sparse solutions and fast convex solvers are available. This has made the Lasso a popular method in modern data-science toolkits. Among successful fields where it has been applied,
one can mention dictionary learning (Mairal, 2010), biostatistics (Haury et al., 2012) and medical imaging (Lustig et al., 2007; Gramfort et al., 2012) to name a few.
Many algorithms exist to approximate Lasso solutions, but it is still a burning issue to accelerate solvers in high dimensions. Indeed, although some other variable selection and prediction methods exist (Fan & Lv, 2008), the best performing methods usually rely on the Lasso. For stability selection methods (Meinshausen & B¨uhlmann, 2010; Bach, 2008; Varoquaux et al., 2012), hundreds of Lasso problems need to be solved. For non-convex approaches such as SCAD (Fan & Li, 2001) or MCP (Zhang, 2010), solving the Lasso is often a required preliminary step (Zou, 2006; Zhang & Zhang, 2012; Cand`es et al., 2008).
Among possible algorithmic candidates for solving the Lasso, one can mention homotopy methods (Osborne et al., 2000), LARS (Efron et al., 2004), and approximate homotopy (Mairal & Yu, 2012), that provide solutions for the full Lasso path, i.e., for all possible choices of tuning parameter . More recently, particularly for
, coordinate descent approaches (Friedman et al., 2007) have proved to be among the best methods to tackle large scale problems.
Following the seminal work by El Ghaoui et al. (2012), screening techniques have emerged as a way to exploit the known sparsity of the solution by discarding features prior to starting a Lasso solver. Such techniques are coined safe rules when they screen out coefficients guaranteed to be zero in the targeted optimal solution. Zeroing those coeffi-cients allows to focus more precisely on the non-zero ones (likely to represent signal) and helps reducing the computational burden. We refer to (Xiang et al., 2014) for a concise introduction on safe rules. Other alternatives have tried to screen the Lasso relaxing the “safety”. Potentially, some variables are wrongly disregarded and post-processing is needed to recover them. This is for instance the strategy adopted for the strong rules (Tibshirani et al., 2012).
The original basic safe rules operate as follows: one chooses a fixed tuning parameter , and before launching any solver, tests whether a coordinate can be zeroed or not (equivalently if the corresponding variable can be disregarded or not). We will refer to such safe rules as static safe rules. Note that the test is performed according to a safe region, i.e., a region containing a dual optimal solution of the Lasso problem. In the static case, the screening is performed only once, prior any optimization iteration. Two directions have emerged to improve on static strategies.
• The first direction is oriented towards the resolution of the Lasso for a large number of tuning parameters. Indeed, practitioners commonly compute the Lasso over a grid of parameters and select the best one in a data-driven manner, e.g., by cross-validation. As two consecutive in the grid lead to similar solutions, knowing the first solution may help improve screening for the second one. We call sequential safe rules such strategies, also referred to as recursive safe rules in (El Ghaoui et al., 2012). This road has been pursued in (Wang et al., 2013; Xu & Ramadge, 2013; Xi- ang et al., 2014), and can be thought of as a “warm start” of the screening (in addition to the warm start of the solution itself). When performing sequential safe rules, one should keep in mind that generally, only an approximation of the previous dual solution is computed. Though, the safety of the rule is guaranteed only if one uses the exact solution. Neglecting this issue, leads to “unsafe” rules: relevant variables might be wrongly disregarded.
• The second direction aims at improving the screening by interlacing it throughout the optimization algorithm itself: although screening might be useless at the beginning of the algorithm, it might become (more) efficient as the algorithm proceeds towards the optimal solution. We call these strategies dynamic safe rules following (Bonnefoy et al., 2014a;b).
Based on convex optimization arguments, we leverage duality gap computations to propose a simple strategy unifying both sequential and dynamic safe rules. We coined GAP SAFE rules such safe rules.
The main contributions of this paper are 1) the introduction of new safe rules which demonstrate a clear practical improvement compared to prior strategies 2) the definition of a theoretical framework for comparing safe rules by looking at the convergence of their associated safe regions.
In Section 2, we present the framework and the basic concepts which guarantee the soundness of static and dynamic screening rules. Then, in Section 3, we introduce the new concept of converging safe rules. Such rules identify in finite time the active variables of the optimal solution (or equivalently the inactive variables), and the tests become more and more precise as the optimization algorithm proceeds. We also show that our new GAP SAFE rules, built on dual gap computations, are converging safe rules since their associated safe regions have a diameter converging to zero. We also explain how our GAP SAFE tests are sequential by nature. Application of our GAP SAFE rules with a coordinate descent solver for the Lasso problem is proposed in Section 4. Using standard data-sets, we report the time improvement compared to prior safe rules.
1.1. Model and notation
We denote by rds the set t1, . . . , du for any integer d P N. Our observation vector is and the design matrix
has p explanatory variables (or features) column-wise. We aim at approximating y as a linear combination of few variables
’s, hence expressing y as
where
is a sparse vector. The standard Euclidean norm is written
, the
norm
, the
norm
, and the matrix transposition of a matrix Q is denoted by
. We denote
.
For such a task, the Lasso is often considered (see B¨uhlmann & van de Geer (2011) for an introduction). For a tuning parameter , controlling the trade-off between data fidelity and sparsity of the solutions, a Lasso estimator
is any solution of the primal optimization problem
Denoting the dual feasible set, a dual formulation of the Lasso reads (see for instance Kim et al. (2007) or Xiang et al. (2014)):
We can reinterpret Eq. (2) as , where
refers to the projection onto a closed convex set C. In particular, this ensures that the dual solution
is always unique, contrarily to the primal
.
1.2. A KKT detour
For the Lasso problem, a primal solution and the dual solution
are linked through the relation:
The Karush-Khun-Tucker (KKT) conditions state:
Table 1. Review of some common safe sphere tests.
See for instance (Xiang et al., 2014) for more details. The KKT conditions lead to the fact that for is a primal solution. It can be considered as the mother of all safe screening rules. So from now on, we assume that
for all the considered
’s.
Safe rules exploit the KKT condition (4). This equation implies that as soon as
. The main chal- lenge is that the dual optimal solution is unknown. Hence, a safe rule aims at constructing a set
containing
. We call such a set C a safe region. Safe regions are all the more helpful that for many j’s,
1, hence for many j’s,
.
Practical benefits are obtained if one can construct a region C for which it is easy to compute its support function, denoted by and defined for any
by:
Cast differently, for any safe region C, any j P rps, and any primal optimal solution , the following holds true:
We call safe test or safe rule, a test associated to C and screening out explanatory variables thanks to Eq. (6).
Remark 1. Reminding that the support function of a set is the same as the support function of its closed convex hull (Hiriart-Urruty & Lemar´echal, 1993)[Proposition V.2.2.1], we restrict our search to closed convex safe regions.
Based on a safe region C one can partition the explanatory variables into a safe active set and a safe zero set
where:
Note that for nested safe regions then
. Consequently, a natural goal is to find safe regions as small as possible: narrowing safe regions can only increase the number of screened out variables.
Remark 2. If , the safe active set is the equicorrelation set
(in most cases (Tibshirani, 2013) it is exactly the support of
). Even when the Lasso is not unique, the equicorrelation set contains all the solutions’ supports. The other extreme case is when
, and
. Here, no variable is screened out:
and the screening is useless.
We now consider common safe regions whose support functions are easy to obtain in closed form. For simplicity we focus only on balls and domes, though more complicated regions could be investigated (Xiang et al., 2014).
2.1. Sphere tests
Following previous work on safe rules, we call sphere tests, tests relying on balls as safe regions. For a sphere test, one chooses a ball containing with center c and radius r, i.e.,
. Due to their simplicity, safe spheres have been the most commonly investigated safe regions (see for instance Table 1 for a brief review). The corresponding test is defined as follows:
Note that for a fixed center, the smaller the radius, the better the safe screening strategy.
Example 1. The first introduced sphere test (El Ghaoui et al., 2012) consists in using the center and radius
. Given that
, this is a safe region since
and
. However, one can check that this static safe rule is useless as soon as
2.2. Dome tests
Other popular safe regions are domes, the intersection between a ball and a half-space. This kind of safe region has been considered for instance in (El Ghaoui et al., 2012; Xi- ang & Ramadge, 2012; Xiang et al., 2014; Bonnefoy et al.,
Figure 1. Representation of the dome (dark blue). In our case, note that
is positive.
2014b). We denote the dome with ball center c, ball radius r, oriented hyperplane with unit normal vector w and parameter
such that
is the projection of c on the hyperplane (see Figure 1 for an illustration in the interesting case
).
Remark 3. The dome is non-trivial whenever . When
, one gets simply a hemisphere.
For the dome test one needs to compute the support function for . Interestingly, as for balls, it can be obtained in a closed form. Due to its length though, the formula is deferred to the Appendix (see also (Xiang et al., 2014)[Lemma 3] for more details).
2.3. Dynamic safe rules
For approximating a solution of the Lasso primal problem
, iterative algorithms are commonly used. We denote
the current estimate after k iterations of any iterative algorithm (see Section 4 for a specific study on coordinate descent). Dynamic safe rules aim at discovering safe regions that become narrower as k increases. To do so, one first needs dual feasible points:
. Following El Ghaoui et al. (2012) (see also (Bonnefoy et al., 2014a)), this can be achieved by a simple transformation of the current residuals
, defining
as
Such dual feasible is proportional to
, and is the closest point (for the norm
) to
in
with such a property, i.e.,
. A reason for choosing this dual point is that the dual optimal solution
is the projection of
on the dual feasible set
, and the optimal
is proportional to
, cf. Equation (3).
Remark 4. Note that if (convergence of the primal) then with the previous display and (3), we can show that
. Moreover, the convergence of the primal is unaltered by safe rules: screening out unnecessary coefficients of
, can only decrease the distance between
and
.
Example 2. Note that any dual feasible point immediately provides a ball that contains
since
The ball corresponds to the simplest safe region introduced in (Bonnefoy et al., 2014a;b) (cf. Figure 2 for more insights). When the algorithm proceeds, one expects that
gets closer to
, so
should get closer to
. Similarly to Example 1, this dynamic rule becomes useless once
is too small. More precisely, this occurs as soon as
Noticing that (since
is a contraction and
) and proceeding as for (10), one can show that this dynamic safe rule is inefficient when:
This is a critical threshold, yet the screening might stop even at a larger thanks to Eq. (13). In practice the bound in Eq. (13) cannot be evaluated a priori due to the term
). Note also that the bound in Eq. (14) is close to the one in Eq. (10), explaining the similar behavior observed in our experiments (see Figure 3 for instance).
3.1. Support discovery in finite time
Let us first introduce the notions of converging safe regions and converging safe tests.
Definition 1. Let be a sequence of closed convex sets in
containing
. It is a converging sequence of safe regions for the Lasso with parameter
if the diameters of the sets converge to zero. The associated safe screening rules are referred to as converging safe tests.
Not only converging safe regions are crucial to speed up computation, but they are also helpful to reach exact active set identification in a finite number of steps. More precisely, we prove that one recovers the equicorrelation set of the Lasso (cf. Remark 2) in finite time with any converging strategy: after a finite number of steps, the equicorrelation set is exactly identified. Such a property is sometimes referred to as finite identification of the support (Liang et al., 2014). This is summarized in the following.
Figure 2. Our new GAP SAFE sphere and dome (in orange). The dual optimal solution must lie in the dark blue region;
point in
is any point in the dual feasible set
. Remark that the GAP SAFE dome is included in the GAP SAFE sphere, and that it is the convex hull of the dark blue region.
Theorem 1. Let be a sequence of converging safe regions. The estimated support provided by
,
, satisfies
, and there exists
such that
one gets
.
Proof. The main idea of the proof is to use that and that the set
is discrete. Details are delayed to the Appendix.
Remark 5. A more general result is proved for a spe-cific algorithm (Forward-Backward) in Liang et al. (2014). Interestingly, our scheme is independent of the algorithm considered (e.g., Forward-Backward (Beck & Teboulle, 2009), Primal Dual (Chambolle & Pock, 2011), coordinatedescent (Tseng, 2001; Friedman et al., 2007)) and relies only on the convergence of a sequence of safe regions.
3.2. GAP SAFE regions: leveraging the duality gap
Proof. The construction of the ball is based on the weak duality theorem (cf. (Rockafellar & Wets, 1998) for a reminder on weak and strong duality). Fix
and
, then it holds that
Hence,
In particular, this provides . Combining (12) and (16), asserts that
belongs to the annulus
(the light blue zone in Figure 2).
Remind that the dual feasible set is convex, hence
is also convex. Thanks to (16),
, and then
is convex too. Hence,
is inside the annulus
and so is
by convexity (see Figure 2,(a) and Figure 2,(b)). Moreover,
is the point of
which is closest to
. The farthest where
can be according to this information would be if
were tangent to the inner ball
and
. Let us denote
such a point. The tangency property reads
and
. Hence, with the later and the definition of
and
.
Since by construction cannot be further away from
than
(again, insights can be gleaned from Figure 2), we conclude that
.
Remark 6. Choosing and
, then one recovers the static safe rule given in Example 1.
With the definition of the primal (resp. dual) objective for , (resp.
), the duality gap reads as
. Remind that if
, then one has
, which is a standard stopping criterion for Lasso solvers. The next proposition establishes a connection between the radius
and the duality gap
.
Proposition 1. For any , the following holds
Proof. Use the fact that and
.
If we could choose the “oracle” and
in (15) then we would obtain a zero radius. Since those quantities are unknown, we rather pick dynamically the current available estimates given by an optimization algorithm:
and
as in Eq. (11). Introducing GAP SAFE spheres and domes as below, Proposition 2 ensures that they are converging safe regions.
GAP SAFE sphere:
GAP SAFE dome:
Proposition 2. For any converging primal sequence , and dual sequence
defined as in Eq. (11), then the GAP SAFE sphere and the GAP SAFE dome are converging safe regions.
Proof. For the GAP SAFE sphere the result follows from strong duality, Remark 4 and Proposition 1 yield , since
and
. For the GAP SAFE dome, one can check that it is included in the GAP SAFE sphere, therefore inherits the convergence (see also Figure 2,(c) and (d)).
Remark 7. The radius can be compared with the radius considered for the Dynamic Safe rule and Dynamic ST3 (Bonnefoy et al., 2014a) respectively:
and
, where
. We have proved that
, but a weaker property is satisfied by the two other radius:
and
.
3.3. GAP SAFE rules : sequential for free
As a byproduct, our dynamic screening tests provide a warm start strategy for the safe regions, making our GAP SAFE rules inherently sequential. The next proposition shows their efficiency when attacking a new tuning parameter, after having solved the Lasso for a previous , even only approximately. Handling approximate solutions is a critical issue to produce safe sequential strategies: without taking into account the approximation error, the screening might disregard relevant variables, especially the one near the safe regions boundaries. Except for
, it is unrealistic to assume that one can dispose of exact solutions.
Consider and a non-increasing sequence of
tuning parameters
in
. In practice, we choose the common grid (B¨uhlmann & van de Geer, 2011)[2.12.1]):
(for instance in Figure 3, we considered
). The next result controls how the duality gap, or equivalently, the diameter of our GAP SAFE regions, evolves from
to
.
Proposition 3. Suppose that and
. Reminding
, the following holds
rβ, θ
Proof. Details are given in the Appendix.
This proposition motivates to screen sequentially as follows: having such that
, then, we can screen using the GAP SAFE sphere with center
and radius
. The adaptation to the GAP SAFE dome is straightforward and consists in replacing
by
in the GAP SAFE dome definition.
Remark 8. The basic sphere test of (Wang et al., 2013) requires the exact dual solution for center, and has radius
, which is strictly larger than ours. Indeed, if one has access to dual and primal optimal solutions at
, i.e.,
, then
and
Note that contrarily to former sequential rules (Wang et al., 2013), our introduced GAP SAFE rules still work when one has only access to approximations of .
4.1. Coordinate Descent
Screening procedures can be used with any optimization algorithm. We chose coordinate descent because it is well suited for machine learning tasks, especially with sparse and/or unstructured design matrix X. Coordinate descent requires to extract efficiently columns of X which is typically not easy in signal processing applications where X is commonly an implicit operator (e.g. Fourier or wavelets).
We implemented the screening rules of Table 1 based on the coordinate descent in Scikit-learn (Pedregosa et al., 2011). This code is written in Python and Cython to generate low level C code, offering high performance. A low level language is necessary for this algorithm to scale. Two implementations were written to work efficiently with both dense data stored as Fortran ordered arrays and sparse data stored in the compressed sparse column (CSC) format. Our pseudo-code is presented in Algorithm 1. In practice, we perform the dynamic screening tests every passes through the entire (active) variables. Iterations are stopped when the duality gap is smaller than the target accuracy.
The naive computation of in (11) involves the computation of
(
being the current residual), which costs Opnpq operations. This can be avoided as one knows when using a safe rule that the index achieving the maximum for this norm is in
. Indeed, by construction
. In practice the evaluation of the dual gap is therefore not a Opnpq but Opnqq where q is the size of
. In other words, using screening also speeds up
Figure 3. Proportion of active variables as a function of number of iterations K on the Leukemia dataset. Better strategies have longer range of
with (red) small active sets.
the evaluation of the stopping criterion.
We did not compare our method against the strong rules of Tibshirani et al. (2012) because they are not safe and therefore need complex post-processing with parameters to tune. Also we did not compare against the sequential rule of Wang et al. (2013) (e.g., EDDP) because it requires the exact dual optimal solution of the previous Lasso problem, which is not available in practice and can prevent the solver from actually converging: this is a phenomenon we always observed on our experiments.
4.2. Number of screened variables
Figure 3 presents the proportion of variables screened by several safe rules on the standard Leukemia dataset. The screening proportion is presented as a function of the number of iterations K. As the SAFE screening rule of El Ghaoui et al. (2012) is sequential but not dynamic, for a given the proportion of screened variables does not depend on K. The rules of Bonnefoy et al. (2014a) are more efficient on this dataset but they do not benefit much from the dynamic framework. Our proposed GAP SAFE tests screen much more variables, especially when the tuning parameter
gets small, which is particularly relevant in practice. Moreover, even for very small
’s (notice the logarithmic scale) where no variable is screened at the beginning of the optimization procedure, the GAP SAFE rules manage to screen more variables, especially when K increases. Finally, the figure demonstrates that the GAP SAFE dome test only brings marginal improvement over the sphere.
Figure 4. Time to reach convergence using various screening rules on the Leukemia dataset (dense data:
4.3. Gains in the computation of Lasso paths
The main interest of variable screening is to reduce computation costs. Indeed, the time to compute the screening itself should not be larger than the gains it provides. Hence, we compared the time needed to compute Lasso paths to prescribed accuracy for different safe rules. Figures 4, 5 and 6 illustrate results on three datasets. Figure 4 presents results on the dense, small scale, Leukemia dataset. Figure 5 presents results on a medium scale sparse dataset obtained with bag of words features extracted from the 20newsgroup dataset (comp.graphics vs. talk.religion.misc with TF-IDF removing English stop words and words occurring only once or more than 95% of the time). Text feature extraction was done using Scikit-Learn. Figure 6 focuses on the large scale sparse RCV1 (Reuters Corpus Volume 1) dataset, cf. (Schmidt et al., 2013).
In all cases, Lasso paths are computed as required to estimate optimal regularization parameters in practice (when using cross-validation one path is computed for each fold). For each Lasso path, solutions are obtained for values of
’s, as detailed in Section 3.3. Remark that the grid used is the default one in both Scikit-Learn and the glmnet R package. With our proposed GAP SAFE screening we obtain on all datasets substantial gains in computational time. We can already get an up to 3x speedup when we require a duality gap smaller than
. The interest of the screening is even clearer for higher accuracies: GAP SAFE sphere is 11x faster than its competitors on the Leukemia dataset, at accuracy
. One can observe that with the parameter grid used here, the larger is p compared to n, the higher is the gain in computation time.
In our experiments, the other safe screening rules did not show much speed-up. As one can see on Figure 3, those screening rules keep all the active variables for a wide range of ’s. The algorithm is thus faster for large
’s but slower afterwards, since we still compute the screening tests. Even if one can avoid some of these useless computations thanks to formulas like (14) or (10), the corresponding speed-up
Figure 5. Time to reach convergence using various screening rules on bag of words from the 20newsgroup dataset (sparse data: with
Figure 6. Computation time to reach convergence using different screening strategies on the RCV1 (Reuters Corpus Volume 1) dataset (sparse data with
would not be significant.
We have presented new results on safe rules for accelerating algorithms solving the Lasso problem (see Appendix for extension to the Elastic Net). First, we have introduced the framework of converging safe rules, a key concept independent of the implementation chosen. Our second contribution was to leverage duality gap computations to create two safer rules satisfying the aforementioned convergence properties. Finally, we demonstrated the important practical benefits of those new rules by applying them to standard dense and sparse datasets using a coordinate descent solver. Future works will extend our framework to generalized linear model and group-Lasso.
The authors would like to thanks Jalal Fadili and Jingwei Liang for helping clarifying some misleading statements on the equicorrelation set. We acknowledge the support from Chair Machine Learning for Big Data at T´el´ecom ParisTech and from the Orange/T´el´ecom ParisTech think tank phiTAB. This work benefited from the support of the ”FMJH Program Gaspard Monge in optimization and operation re-
search”, and from the support to this program from EDF.
Bach, F. Bolasso: model consistent Lasso estimation through the bootstrap. In ICML, 2008.
Beck, A. and Teboulle, M. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37 (4):1705–1732, 2009.
Bonnefoy, A., Emiya, V., Ralaivola, L., and Gribonval, R. A dynamic screening principle for the lasso. In EUSIPCO, 2014a.
Bonnefoy, A., Emiya, V., Ralaivola, L., and Gribonval, R. Dynamic Screening: Accelerating First-Order Algorithms for the Lasso and Group-Lasso. ArXiv e-prints, 2014b.
B¨uhlmann, P. and van de Geer, S. Statistics for high-dimensional data. Springer Series in Statistics. Springer, Heidelberg, 2011. Methods, theory and applications.
Cand`es, E. J., Wakin, M. B., and Boyd, S. P. Enhancing sparsity by reweighted minimization. J. Fourier Anal. Applicat., 14(5-6):877–905, 2008.
Chambolle, A. and Pock, T. A first-order primal-dual algo- rithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120–145, 2011.
Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61 (electronic), 1998.
Efron, B., Hastie, T., Johnstone, I. M., and Tibshirani, R. Least angle regression. Ann. Statist., 32(2):407–499, 2004. With discussion, and a rejoinder by the authors.
El Ghaoui, L., Viallon, V., and Rabbani, T. Safe feature elimination in sparse supervised learning. J. Pacific Optim., 8(4):667–698, 2012.
Fan, J. and Li, R. Variable selection via nonconcave penal- ized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96(456):1348–1360, 2001.
Fan, J. and Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. Roy. Statist. Soc. Ser. B, 70 (5):849–911, 2008.
Friedman, J., Hastie, T., H¨ofling, H., and Tibshirani, R. Pathwise coordinate optimization. Ann. Appl. Stat., 1 (2):302–332, 2007.
Gramfort, A., Kowalski, M., and H¨am¨al¨ainen, M. Mixed- norm estimates for the M/EEG inverse problem using accelerated gradient methods. Phys. Med. Biol., 57(7): 1937–1961, 2012.
Haury, A.-C., Mordelet, F., Vera-Licona, P., and Vert, J.- P. TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. BMC systems biology, 6(1):145, 2012.
Hiriart-Urruty, J.-B. and Lemar´echal, C. Convex analysis and minimization algorithms. I, volume 305. SpringerVerlag, Berlin, 1993.
Kim, S.-J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale -regularized least squares. IEEE J. Sel. Topics Signal Process., 1(4):606–617, 2007.
Liang, J., Fadili, J., and Peyr´e, G. Local linear convergence of forward–backward under partial smoothness. In NIPS, pp. 1970–1978, 2014.
Lustig, M., Donoho, D. L., and Pauly, J. M. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182– 1195, 2007.
Mairal, J. Sparse coding for machine learning, image processing and computer vision. PhD thesis, ´Ecole normale sup´erieure de Cachan, 2010.
Mairal, J. and Yu, B. Complexity analysis of the lasso reg- ularization path. In ICML, 2012.
Meinshausen, N. and B¨uhlmann, P. Stability selection. J. Roy. Statist. Soc. Ser. B, 72(4):417–473, 2010.
Osborne, M. R., Presnell, B., and Turlach, B. A. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20(3):389–403, 2000.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825–2830, 2011.
Rockafellar, R. T. and Wets, R. J.-B. Variational analysis, volume 317 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 1998.
Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.
Tibshirani, R. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. Strong rules for discarding predictors in lasso-type problems. J. Roy. Statist. Soc. Ser. B, 74(2):245–266, 2012.
Tibshirani, R. J. The lasso problem and uniqueness. Electron. J. Stat., 7:1456–1490, 2013.
Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475–494, 2001.
Varoquaux, G., Gramfort, A., and Thirion, B. Smallsample brain mapping: sparse recovery on spatially correlated designs with randomization and clustering. In ICML, 2012.
Wang, J., Zhou, J., Wonka, P., and Ye, J. Lasso screening rules via dual polytope projection. In NIPS, pp. 1070– 1078, 2013.
Xiang, Z. J. and Ramadge, P. J. Fast lasso screening tests based on correlations. In ICASSP, pp. 2137–2140, 2012.
Xiang, Z. J., Xu, H., and Ramadge, P. J. Learning sparse representations of high dimensional data on large scale dictionaries. In NIPS, pp. 900–908, 2011.
Xiang, Z. J., Wang, Y., and Ramadge, P. J. Screening tests for lasso problems. arXiv preprint arXiv:1405.4897, 2014.
Xu, P. and Ramadge, P. J. Three structural results on the lasso problem. In ICASSP, pp. 3392–3396, 2013.
Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist., 38(2):894–942, 2010.
Zhang, C.-H. and Zhang, T. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576–593, 2012.
Zou, H. The adaptive lasso and its oracle properties. J. Am. Statist. Assoc., 101(476):1418–1429, 2006.
Zou, H. and Hastie, T. Regularization and variable selec- tion via the elastic net. J. Roy. Statist. Soc. Ser. B, 67(2): 301–320, 2005.
We provided in this Appendix some more details on the theoretical results given in the main part.
A.1. Dome test
Let us consider the case where the safe region C is the dome , with parameters: center c, radius r, relative distance ratio
and unit normal vector w.
The computation of the dome test formula proceeds as follows:
and so
Using the former notation:
Let us introduce the following dome parameters, for any :
• Center: .
• Radius: .
• Ratio: “ ´1 ` 2 p
{ q
.
• Normal vector: .
Reminding that the support function of a set is the same as the support function of its closed convex hull (Hiriart-Urruty & Lemar´echal, 1993)[Proposition V.2.2.1] means that we only need to optimize over the dome introduced. Therefore, one cannot improve our previous result by optimizing the problem on the intersection of the ball of radius and the complement of the ball of radius
(i.e., the blue region in Figure 2).
A.2. Proof of Theorem 1
Proof. Define . Fix
such that
. As
is a converging sequence containing
, its diameter is converging to zero, and there exists
such that
. Hence, for any
and any
. Using the triangle inequality, one gets
provided that . Thus, for any
and
.
For the reverse inclusion take , i.e.,
. Since for all
, then
and the result holds.
A.3. Proof of Proposition 3
We detail here the proof of Proposition 3.
Proof. We first use the fact that
to obtain
Then,
We deal with the dot product as
Hence,
We observe in the end that
A.4. Elastic-Net
The previously proposed tests can be adapted straightforwardly to the Elastic-Net estimator (Zou & Hastie, 2005). We provide here some more details for the interested reader.
One can reformulate this problem as a Lasso problem
and
With this modification all the tests introduced for the
Lasso can be adapted for the Elastic-Net.