b

DiscoverSearch
About
My stuff
Mind the duality gap: safer rules for the Lasso
2015·arXiv
Abstract
Abstract

Screening rules allow to early discard irrelevant variables from the optimization in Lasso problems, or its derivatives, making solvers faster. In this paper, we propose new versions of the so-called safe rules for the Lasso. Based on duality gap considerations, our new rules create safe test regions whose diameters converge to zero, provided that one relies on a converging solver. This property helps screening out more variables, for a wider range of regularization parameter values. In addition to faster convergence, we prove that we correctly identify the active sets (supports) of the solutions in finite time. While our proposed strategy can cope with any solver, its performance is demonstrated using a coordinate descent algorithm particularly adapted to machine learning use cases. Significant computing time reductions are obtained with respect to previous safe rules.

Since the mid 1990’s, high dimensional statistics has attracted considerable attention, especially in the context of linear regression with more explanatory variables than observations: the so-called  p ą ncase. In such a context, the least squares with  ℓ1regularization, referred to as the Lasso (Tibshirani, 1996) in statistics, or Basis Pursuit (Chen et al., 1998) in signal processing, has been one of the most popular tools. It enjoys theoretical guarantees (Bickel et al., 2009), as well as practical benefits: it provides sparse solutions and fast convex solvers are available. This has made the Lasso a popular method in modern data-science toolkits. Among successful fields where it has been applied,

image

one can mention dictionary learning (Mairal, 2010), biostatistics (Haury et al., 2012) and medical imaging (Lustig et al., 2007; Gramfort et al., 2012) to name a few.

Many algorithms exist to approximate Lasso solutions, but it is still a burning issue to accelerate solvers in high dimensions. Indeed, although some other variable selection and prediction methods exist (Fan & Lv, 2008), the best performing methods usually rely on the Lasso. For stability selection methods (Meinshausen & B¨uhlmann, 2010; Bach, 2008; Varoquaux et al., 2012), hundreds of Lasso problems need to be solved. For non-convex approaches such as SCAD (Fan & Li, 2001) or MCP (Zhang, 2010), solving the Lasso is often a required preliminary step (Zou, 2006; Zhang & Zhang, 2012; Cand`es et al., 2008).

Among possible algorithmic candidates for solving the Lasso, one can mention homotopy methods (Osborne et al., 2000), LARS (Efron et al., 2004), and approximate homotopy (Mairal & Yu, 2012), that provide solutions for the full Lasso path, i.e., for all possible choices of tuning parameter  λ. More recently, particularly for  p ą n, coordinate descent approaches (Friedman et al., 2007) have proved to be among the best methods to tackle large scale problems.

Following the seminal work by El Ghaoui et al. (2012), screening techniques have emerged as a way to exploit the known sparsity of the solution by discarding features prior to starting a Lasso solver. Such techniques are coined safe rules when they screen out coefficients guaranteed to be zero in the targeted optimal solution. Zeroing those coeffi-cients allows to focus more precisely on the non-zero ones (likely to represent signal) and helps reducing the computational burden. We refer to (Xiang et al., 2014) for a concise introduction on safe rules. Other alternatives have tried to screen the Lasso relaxing the “safety”. Potentially, some variables are wrongly disregarded and post-processing is needed to recover them. This is for instance the strategy adopted for the strong rules (Tibshirani et al., 2012).

The original basic safe rules operate as follows: one chooses a fixed tuning parameter  λ, and before launching any solver, tests whether a coordinate can be zeroed or not (equivalently if the corresponding variable can be disregarded or not). We will refer to such safe rules as static safe rules. Note that the test is performed according to a safe region, i.e., a region containing a dual optimal solution of the Lasso problem. In the static case, the screening is performed only once, prior any optimization iteration. Two directions have emerged to improve on static strategies.

The first direction is oriented towards the resolution of the Lasso for a large number of tuning parameters. Indeed, practitioners commonly compute the Lasso over a grid of parameters and select the best one in a data-driven manner, e.g., by cross-validation. As two consecutive  λ1sin the grid lead to similar solutions, knowing the first solution may help improve screening for the second one. We call sequential safe rules such strategies, also referred to as recursive safe rules in (El Ghaoui et al., 2012). This road has been pursued in (Wang et al., 2013; Xu & Ramadge, 2013; Xi- ang et al., 2014), and can be thought of as a “warm start” of the screening (in addition to the warm start of the solution itself). When performing sequential safe rules, one should keep in mind that generally, only an approximation of the previous dual solution is computed. Though, the safety of the rule is guaranteed only if one uses the exact solution. Neglecting this issue, leads to “unsafe” rules: relevant variables might be wrongly disregarded.

The second direction aims at improving the screening by interlacing it throughout the optimization algorithm itself: although screening might be useless at the beginning of the algorithm, it might become (more) efficient as the algorithm proceeds towards the optimal solution. We call these strategies dynamic safe rules following (Bonnefoy et al., 2014a;b).

Based on convex optimization arguments, we leverage duality gap computations to propose a simple strategy unifying both sequential and dynamic safe rules. We coined GAP SAFE rules such safe rules.

The main contributions of this paper are 1) the introduction of new safe rules which demonstrate a clear practical improvement compared to prior strategies 2) the definition of a theoretical framework for comparing safe rules by looking at the convergence of their associated safe regions.

In Section 2, we present the framework and the basic concepts which guarantee the soundness of static and dynamic screening rules. Then, in Section 3, we introduce the new concept of converging safe rules. Such rules identify in finite time the active variables of the optimal solution (or equivalently the inactive variables), and the tests become more and more precise as the optimization algorithm proceeds. We also show that our new GAP SAFE rules, built on dual gap computations, are converging safe rules since their associated safe regions have a diameter converging to zero. We also explain how our GAP SAFE tests are sequential by nature. Application of our GAP SAFE rules with a coordinate descent solver for the Lasso problem is proposed in Section 4. Using standard data-sets, we report the time improvement compared to prior safe rules.

1.1. Model and notation

We denote by rds the set t1, . . . , du for any integer d P N. Our observation vector is  y P Rnand the design matrix X “ rx1, ¨ ¨ ¨ , xps P Rnˆphas p explanatory variables (or features) column-wise. We aim at approximating y as a linear combination of few variables  xj’s, hence expressing y as  Xβwhere  β P Rpis a sparse vector. The standard Euclidean norm is written  } ¨ }, the  ℓ1norm  } ¨ }1, the  ℓ8norm  } ¨ }8, and the matrix transposition of a matrix Q is denoted by  QJ. We denote  ptq` “ maxp0, tq.

For such a task, the Lasso is often considered (see B¨uhlmann & van de Geer (2011) for an introduction). For a tuning parameter  λ ą 0, controlling the trade-off between data fidelity and sparsity of the solutions, a Lasso estimator ˆβpλqis any solution of the primal optimization problem

image

Denoting  ∆X “␣θ P Rn : ��xJj θ�� ď 1, @j P rps(the dual feasible set, a dual formulation of the Lasso reads (see for instance Kim et al. (2007) or Xiang et al. (2014)):

image

We can reinterpret Eq. (2) as ˆθpλq “ Π∆Xpy{λq, where ΠCrefers to the projection onto a closed convex set C. In particular, this ensures that the dual solution ˆθpλqis always unique, contrarily to the primal ˆβpλq.

1.2. A KKT detour

For the Lasso problem, a primal solution ˆβpλq P Rpand the dual solution ˆθpλq P Rnare linked through the relation:

image

The Karush-Khun-Tucker (KKT) conditions state:

image

image

Table 1. Review of some common safe sphere tests.

See for instance (Xiang et al., 2014) for more details. The KKT conditions lead to the fact that for  λ ě λmax “}XJy}8, 0 P Rpis a primal solution. It can be considered as the mother of all safe screening rules. So from now on, we assume that  λ ď λmaxfor all the considered  λ’s.

Safe rules exploit the KKT condition (4). This equation implies that ˆβpλqj “ 0as soon as  |xJj ˆθpλq| ă 1. The main chal- lenge is that the dual optimal solution is unknown. Hence, a safe rule aims at constructing a set  C Ă Rncontaining ˆθpλq. We call such a set C a safe region. Safe regions are all the more helpful that for many j’s,  µCpxjq :“ supθPC |xJj θ| ă1, hence for many j’s, ˆβpλqj “ 0.

Practical benefits are obtained if one can construct a region C for which it is easy to compute its support function, denoted by  σCand defined for any  x P Rnby:

image

Cast differently, for any safe region C, any j P rps, and any primal optimal solution ˆβpλq, the following holds true:

image

We call safe test or safe rule, a test associated to C and screening out explanatory variables thanks to Eq. (6).

Remark 1. Reminding that the support function of a set is the same as the support function of its closed convex hull (Hiriart-Urruty & Lemar´echal, 1993)[Proposition V.2.2.1], we restrict our search to closed convex safe regions.

Based on a safe region C one can partition the explanatory variables into a safe active set  AλpCqand a safe zero set ZλpCqwhere:

image

Note that for nested safe regions  C1 Ă C2then  ApλqpC1q ĂApλqpC2q. Consequently, a natural goal is to find safe regions as small as possible: narrowing safe regions can only increase the number of screened out variables.

Remark 2. If  C “ tˆθpλqu, the safe active set is the equicorrelation set  ApλqpCq “ Eλ :“ tj P rps : |xJj ˆθpλq| “ 1u(in most cases (Tibshirani, 2013) it is exactly the support of ˆβpλq). Even when the Lasso is not unique, the equicorrelation set contains all the solutions’ supports. The other extreme case is when  C “ ∆X, and  ApλqpCq “ rps. Here, no variable is screened out:  ZpλqpCq “ Hand the screening is useless.

We now consider common safe regions whose support functions are easy to obtain in closed form. For simplicity we focus only on balls and domes, though more complicated regions could be investigated (Xiang et al., 2014).

2.1. Sphere tests

Following previous work on safe rules, we call sphere tests, tests relying on balls as safe regions. For a sphere test, one chooses a ball containing ˆθpλqwith center c and radius r, i.e.,  C “ Bpc, rq. Due to their simplicity, safe spheres have been the most commonly investigated safe regions (see for instance Table 1 for a brief review). The corresponding test is defined as follows:

image

Note that for a fixed center, the smaller the radius, the better the safe screening strategy.

Example 1. The first introduced sphere test (El Ghaoui et al., 2012) consists in using the center  c “ y{λand radius r “ |1{λ ´ 1{λmax|}y}. Given that ˆθpλq “ Π∆Xpy{λq, this is a safe region since  y{λmax P ∆Xand  }y{λmax ´Π∆Xpy{λq} ď }y}|1{λ ´ 1{λmax|. However, one can check that this static safe rule is useless as soon as

image

2.2. Dome tests

Other popular safe regions are domes, the intersection between a ball and a half-space. This kind of safe region has been considered for instance in (El Ghaoui et al., 2012; Xi- ang & Ramadge, 2012; Xiang et al., 2014; Bonnefoy et al.,

image

Figure 1. Representation of the dome  Dpc, r, α, wq(dark blue). In our case, note that  αis positive.

2014b). We denote  Dpc, r, α, wqthe dome with ball center c, ball radius r, oriented hyperplane with unit normal vector w and parameter  αsuch that  c ´ αrwis the projection of c on the hyperplane (see Figure 1 for an illustration in the interesting case  α ą 0).

Remark 3. The dome is non-trivial whenever  α P r´1, 1s. When  α “ 0, one gets simply a hemisphere.

For the dome test one needs to compute the support function for  C “ Dpc, r, α, wq. Interestingly, as for balls, it can be obtained in a closed form. Due to its length though, the formula is deferred to the Appendix (see also (Xiang et al., 2014)[Lemma 3] for more details).

2.3. Dynamic safe rules

For approximating a solution ˆβpλqof the Lasso primal problem  Pλ, iterative algorithms are commonly used. We denote  βk P Rpthe current estimate after k iterations of any iterative algorithm (see Section 4 for a specific study on coordinate descent). Dynamic safe rules aim at discovering safe regions that become narrower as k increases. To do so, one first needs dual feasible points:  θk P ∆X. Following El Ghaoui et al. (2012) (see also (Bonnefoy et al., 2014a)), this can be achieved by a simple transformation of the current residuals  ρk “ y ´ Xβk, defining  θkas

image

Such dual feasible  θkis proportional to  ρk, and is the closest point (for the norm  }¨}) to  y{λin  ∆Xwith such a property, i.e.,  θk “ Π∆XXSpanpρkqpy{λq. A reason for choosing this dual point is that the dual optimal solution ˆθpλqis the projection of  y{λon the dual feasible set  ∆X, and the optimal ˆθpλqis proportional to  y ´ X ˆβpλq, cf. Equation (3).

Remark 4. Note that if  limkÑ`8 βk “ ˆβpλq(convergence of the primal) then with the previous display and (3), we can show that  limkÑ`8 θk “ ˆθpλq. Moreover, the convergence of the primal is unaltered by safe rules: screening out unnecessary coefficients of  βk, can only decrease the distance between  βkand ˆβpλq.

Example 2. Note that any dual feasible point  θ P ∆Ximmediately provides a ball that contains ˆθpλqsince

image

The ball  B`y{λ, qRλpθkq˘corresponds to the simplest safe region introduced in (Bonnefoy et al., 2014a;b) (cf. Figure 2 for more insights). When the algorithm proceeds, one expects that  θkgets closer to ˆθpλq, so  }θk ´ y{λ}should get closer to  }ˆθpλq ´ y{λ}. Similarly to Example 1, this dynamic rule becomes useless once  λis too small. More precisely, this occurs as soon as

image

Noticing that  }ˆθpλq} ď }y{λ}(since  Π∆Xis a contraction and  0 P ∆X) and proceeding as for (10), one can show that this dynamic safe rule is inefficient when:

image

This is a critical threshold, yet the screening might stop even at a larger  λthanks to Eq. (13). In practice the bound in Eq. (13) cannot be evaluated a priori due to the term }ˆθpλq}). Note also that the bound in Eq. (14) is close to the one in Eq. (10), explaining the similar behavior observed in our experiments (see Figure 3 for instance).

3.1. Support discovery in finite time

Let us first introduce the notions of converging safe regions and converging safe tests.

Definition 1. Let  pCkqkPNbe a sequence of closed convex sets in  Rncontaining ˆθpλq. It is a converging sequence of safe regions for the Lasso with parameter  λif the diameters of the sets converge to zero. The associated safe screening rules are referred to as converging safe tests.

Not only converging safe regions are crucial to speed up computation, but they are also helpful to reach exact active set identification in a finite number of steps. More precisely, we prove that one recovers the equicorrelation set of the Lasso (cf. Remark 2) in finite time with any converging strategy: after a finite number of steps, the equicorrelation set  Eλis exactly identified. Such a property is sometimes referred to as finite identification of the support (Liang et al., 2014). This is summarized in the following.

image

Figure 2. Our new GAP SAFE sphere and dome (in orange). The dual optimal solution ˆθpλq must lie in the dark blue region;  β is anypoint in  Rp, and θis any point in the dual feasible set  ∆X. Remark that the GAP SAFE dome is included in the GAP SAFE sphere, and that it is the convex hull of the dark blue region.

Theorem 1. Let  pCkqkPNbe a sequence of converging safe regions. The estimated support provided by  Ck, ApλqpCkq “ tj P rps : maxθPCk |θJxj| ě 1u, satisfies limkÑ8 ApλqpCkq “ Eλ, and there exists  k0 P Nsuch that @k ě k0one gets  ApλqpCkq “ Eλ.

Proof. The main idea of the proof is to use that limkÑ8 Ck “ tˆθpλqu, limkÑ8 µCkpxq “ µtˆθpλqupxq “|xJˆθpλq|and that the set  ApλqpCkqis discrete. Details are delayed to the Appendix.

Remark 5. A more general result is proved for a spe-cific algorithm (Forward-Backward) in Liang et al. (2014). Interestingly, our scheme is independent of the algorithm considered (e.g., Forward-Backward (Beck & Teboulle, 2009), Primal Dual (Chambolle & Pock, 2011), coordinatedescent (Tseng, 2001; Friedman et al., 2007)) and relies only on the convergence of a sequence of safe regions.

3.2. GAP SAFE regions: leveraging the duality gap

image

Proof. The construction of the ball  Bpθ, ˜rλpβ, θqqis based on the weak duality theorem (cf. (Rockafellar & Wets, 1998) for a reminder on weak and strong duality). Fix θ P ∆Xand  β P Rp, then it holds that

image

Hence,

image

In particular, this provides  }ˆθpλq ´ y{λ} ě pRλpβq. Combining (12) and (16), asserts that ˆθpλqbelongs to the annulus  Apy{λ, qRλpθq, pRλpβqq :“ tz P Rn : pRλpβq ď}z ´ y{λ} ď qRλpθqu(the light blue zone in Figure 2).

Remind that the dual feasible set  ∆Xis convex, hence ∆X X Bpy{λ, qRλpθqqis also convex. Thanks to (16), ∆X XBpy{λ, qRλpθqq “ ∆X XApy{λ, qRλpθq, pRλpβqq, and then  ∆X X Apy{λ, qRλpθq, pRλpβqqis convex too. Hence, ˆθpλqis inside the annulus  Apy{λ, qRλpθq, pRλpβqqand so is  rθ, ˆθpλqs Ď Apy{λ, qRλpθq, pRλpβqqby convexity (see Figure 2,(a) and Figure 2,(b)). Moreover, ˆθpλqis the point of  rθ, ˆθpλqswhich is closest to  y{λ. The farthest where ˆθpλqcan be according to this information would be if  rθ, ˆθpλqswere tangent to the inner ball  Bpy{λ, pRλpβqqand  }ˆθpλq ´ y{λ} “ pRλpβq. Let us denote  θintsuch a point. The tangency property reads  ∥θint ´ y{λ∥ “ pRλpβqand  pθ ´ θintqJpy{λ ´ θintq “ 0. Hence, with the later and the definition of qRλpθq, ∥θ ´ y{λ∥2 “ ∥θ ´ θint∥2 `∥θint ´ y{λ∥2and  ∥θ ´ θint∥2 “ qRλpθq2 ´ pRλpβq2.

Since by construction ˆθpλqcannot be further away from  θthan  θint(again, insights can be gleaned from Figure 2), we conclude that ˆθpλq P B`θ, p qRλpθq2 ´ pRλpβq2q1{2˘.

Remark 6. Choosing  β “ 0and  θ “ y{λmax, then one recovers the static safe rule given in Example 1.

With the definition of the primal (resp. dual) objective for Pλpβq, (resp.  Dλpθq), the duality gap reads as  Gλpβ, θq “Pλpβq´Dλ pθq. Remind that if  Gλpβ, θq ď ϵ, then one has Pλpβq ´ Pλpˆβpλqq ď ϵ, which is a standard stopping criterion for Lasso solvers. The next proposition establishes a connection between the radius  rλpβ, θqand the duality gap Gλpβ, θq.

Proposition 1. For any  pβ, θq P Rp ˆ ∆X, the following holds

image

Proof. Use the fact that qRλpθq2 “ ∥θ ´ y{λ∥2and pRλpβq2 ě p∥y∥2 ´ ∥Xβ ´ y∥2 ´ 2λ ∥β∥1q{λ2.

If we could choose the “oracle”  θ “ ˆθpλqand  β “ ˆβpλqin (15) then we would obtain a zero radius. Since those quantities are unknown, we rather pick dynamically the current available estimates given by an optimization algorithm:  β “ βkand  θ “ θkas in Eq. (11). Introducing GAP SAFE spheres and domes as below, Proposition 2 ensures that they are converging safe regions.

GAP SAFE sphere:

image

GAP SAFE dome:

image

Proposition 2. For any converging primal sequence pβkqkPN, and dual sequence  pθkqkPNdefined as in Eq. (11), then the GAP SAFE sphere and the GAP SAFE dome are converging safe regions.

Proof. For the GAP SAFE sphere the result follows from strong duality, Remark 4 and Proposition 1 yield limkÑ8 rλpβk, θkq “ 0, since  limkÑ8 θk “ ˆθpλqand limkÑ8 βk “ ˆβpλq. For the GAP SAFE dome, one can check that it is included in the GAP SAFE sphere, therefore inherits the convergence (see also Figure 2,(c) and (d)).

Remark 7. The radius  rλpβk, θkqcan be compared with the radius considered for the Dynamic Safe rule and Dynamic ST3 (Bonnefoy et al., 2014a) respectively: qRλpθkq “ }θk ´ y{λ}2and  p qRλpθkq2 ´ δ2q1{2, where δ “ pλmax{λ ´ 1q{ ∥xj‹∥. We have proved that limkÑ8 rλpβk, θkq “ 0, but a weaker property is satisfied by the two other radius:  limkÑ8 qRλpθkq “ qRλpˆθpλqq “} qRλpˆθpλqq ´ y{λ}2and  limkÑ8p qRλpθkq2 ´ δ2q1{2 “p qRλpˆθpλqq2 ´ δ2q1{2 ą 0.

3.3. GAP SAFE rules : sequential for free

As a byproduct, our dynamic screening tests provide a warm start strategy for the safe regions, making our GAP SAFE rules inherently sequential. The next proposition shows their efficiency when attacking a new tuning parameter, after having solved the Lasso for a previous  λ, even only approximately. Handling approximate solutions is a critical issue to produce safe sequential strategies: without taking into account the approximation error, the screening might disregard relevant variables, especially the one near the safe regions boundaries. Except for  λmax, it is unrealistic to assume that one can dispose of exact solutions.

Consider  λ0 “ λmaxand a non-increasing sequence of T ´ 1tuning parameters  pλtqtPrT ´1sin  p0, λmaxq. In practice, we choose the common grid (B¨uhlmann & van de Geer, 2011)[2.12.1]):  λt “ λ010´δt{pT ´1q(for instance in Figure 3, we considered  δ “ 3). The next result controls how the duality gap, or equivalently, the diameter of our GAP SAFE regions, evolves from  λt´1to  λt.

Proposition 3. Suppose that  t ě 1and  pβ, θq P Rp ˆ ∆X. Reminding  r2λtpβ, θq “ 2Gλtpβ, θq{λ2t, the following holds

r2λtpβ, θq “ˆλt´1λt

image

Proof. Details are given in the Appendix.

This proposition motivates to screen sequentially as follows: having  pβ, θq P Rp ˆ∆Xsuch that  Gλt´1pβ, θq ď ϵ, then, we can screen using the GAP SAFE sphere with center  θand radius  rλpβ, θq. The adaptation to the GAP SAFE dome is straightforward and consists in replacing  θk, βk, λby  θ, β, λtin the GAP SAFE dome definition.

Remark 8. The basic sphere test of (Wang et al., 2013) requires the exact dual solution  θ “ ˆθpλt´1qfor center, and has radius  |1{λt ´1{λt´1| ∥y∥, which is strictly larger than ours. Indeed, if one has access to dual and primal optimal solutions at  λt´1, i.e.,  pθ, βq “ pˆθpλt´1q, ˆβpλt´1qq, then r2λt´1pβ, θq “ 0, θ “ py ´ Xβq{λt´1and

image

Note that contrarily to former sequential rules (Wang et al., 2013), our introduced GAP SAFE rules still work when one has only access to approximations of ˆθpλt´1q.

4.1. Coordinate Descent

Screening procedures can be used with any optimization algorithm. We chose coordinate descent because it is well suited for machine learning tasks, especially with sparse and/or unstructured design matrix X. Coordinate descent requires to extract efficiently columns of X which is typically not easy in signal processing applications where X is commonly an implicit operator (e.g. Fourier or wavelets).

image

We implemented the screening rules of Table 1 based on the coordinate descent in Scikit-learn (Pedregosa et al., 2011). This code is written in Python and Cython to generate low level C code, offering high performance. A low level language is necessary for this algorithm to scale. Two implementations were written to work efficiently with both dense data stored as Fortran ordered arrays and sparse data stored in the compressed sparse column (CSC) format. Our pseudo-code is presented in Algorithm 1. In practice, we perform the dynamic screening tests every  f “ 10passes through the entire (active) variables. Iterations are stopped when the duality gap is smaller than the target accuracy.

The naive computation of  θkin (11) involves the computation of��XJρk��8(ρkbeing the current residual), which costs Opnpq operations. This can be avoided as one knows when using a safe rule that the index achieving the maximum for this norm is in  AλtpCq. Indeed, by construction  arg maxjPAλtpCq |xJj θk| “ arg maxjPrps |xJj θk| “arg maxjPrps |xJj ρk|. In practice the evaluation of the dual gap is therefore not a Opnpq but Opnqq where q is the size of  AλtpCq. In other words, using screening also speeds up

image

Figure 3. Proportion of active variables as a function of  λ and thenumber of iterations K on the Leukemia dataset. Better strategies have longer range of  λwith (red) small active sets.

the evaluation of the stopping criterion.

We did not compare our method against the strong rules of Tibshirani et al. (2012) because they are not safe and therefore need complex post-processing with parameters to tune. Also we did not compare against the sequential rule of Wang et al. (2013) (e.g., EDDP) because it requires the exact dual optimal solution of the previous Lasso problem, which is not available in practice and can prevent the solver from actually converging: this is a phenomenon we always observed on our experiments.

4.2. Number of screened variables

Figure 3 presents the proportion of variables screened by several safe rules on the standard Leukemia dataset. The screening proportion is presented as a function of the number of iterations K. As the SAFE screening rule of El Ghaoui et al. (2012) is sequential but not dynamic, for a given  λthe proportion of screened variables does not depend on K. The rules of Bonnefoy et al. (2014a) are more efficient on this dataset but they do not benefit much from the dynamic framework. Our proposed GAP SAFE tests screen much more variables, especially when the tuning parameter  λgets small, which is particularly relevant in practice. Moreover, even for very small  λ’s (notice the logarithmic scale) where no variable is screened at the beginning of the optimization procedure, the GAP SAFE rules manage to screen more variables, especially when K increases. Finally, the figure demonstrates that the GAP SAFE dome test only brings marginal improvement over the sphere.

image

Figure 4. Time to reach convergence using various screening rules on the Leukemia dataset (dense data:  n “ 72, p “ 7129).

4.3. Gains in the computation of Lasso paths

The main interest of variable screening is to reduce computation costs. Indeed, the time to compute the screening itself should not be larger than the gains it provides. Hence, we compared the time needed to compute Lasso paths to prescribed accuracy for different safe rules. Figures 4, 5 and 6 illustrate results on three datasets. Figure 4 presents results on the dense, small scale, Leukemia dataset. Figure 5 presents results on a medium scale sparse dataset obtained with bag of words features extracted from the 20newsgroup dataset (comp.graphics vs. talk.religion.misc with TF-IDF removing English stop words and words occurring only once or more than 95% of the time). Text feature extraction was done using Scikit-Learn. Figure 6 focuses on the large scale sparse RCV1 (Reuters Corpus Volume 1) dataset, cf. (Schmidt et al., 2013).

In all cases, Lasso paths are computed as required to estimate optimal regularization parameters in practice (when using cross-validation one path is computed for each fold). For each Lasso path, solutions are obtained for  T “ 100values of  λ’s, as detailed in Section 3.3. Remark that the grid used is the default one in both Scikit-Learn and the glmnet R package. With our proposed GAP SAFE screening we obtain on all datasets substantial gains in computational time. We can already get an up to 3x speedup when we require a duality gap smaller than  10´4. The interest of the screening is even clearer for higher accuracies: GAP SAFE sphere is 11x faster than its competitors on the Leukemia dataset, at accuracy  10´8. One can observe that with the parameter grid used here, the larger is p compared to n, the higher is the gain in computation time.

In our experiments, the other safe screening rules did not show much speed-up. As one can see on Figure 3, those screening rules keep all the active variables for a wide range of  λ’s. The algorithm is thus faster for large  λ’s but slower afterwards, since we still compute the screening tests. Even if one can avoid some of these useless computations thanks to formulas like (14) or (10), the corresponding speed-up

image

Figure 5. Time to reach convergence using various screening rules on bag of words from the 20newsgroup dataset (sparse data: with n “ 961, p “ 10094).

image

Figure 6. Computation time to reach convergence using different screening strategies on the RCV1 (Reuters Corpus Volume 1) dataset (sparse data with  n “ 20242 and p “ 47236).

would not be significant.

We have presented new results on safe rules for accelerating algorithms solving the Lasso problem (see Appendix for extension to the Elastic Net). First, we have introduced the framework of converging safe rules, a key concept independent of the implementation chosen. Our second contribution was to leverage duality gap computations to create two safer rules satisfying the aforementioned convergence properties. Finally, we demonstrated the important practical benefits of those new rules by applying them to standard dense and sparse datasets using a coordinate descent solver. Future works will extend our framework to generalized linear model and group-Lasso.

The authors would like to thanks Jalal Fadili and Jingwei Liang for helping clarifying some misleading statements on the equicorrelation set. We acknowledge the support from Chair Machine Learning for Big Data at T´el´ecom ParisTech and from the Orange/T´el´ecom ParisTech think tank phiTAB. This work benefited from the support of the ”FMJH Program Gaspard Monge in optimization and operation re-

search”, and from the support to this program from EDF.

Bach, F. Bolasso: model consistent Lasso estimation through the bootstrap. In ICML, 2008.

Beck, A. and Teboulle, M. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37 (4):1705–1732, 2009.

Bonnefoy, A., Emiya, V., Ralaivola, L., and Gribonval, R. A dynamic screening principle for the lasso. In EUSIPCO, 2014a.

Bonnefoy, A., Emiya, V., Ralaivola, L., and Gribonval, R. Dynamic Screening: Accelerating First-Order Algorithms for the Lasso and Group-Lasso. ArXiv e-prints, 2014b.

B¨uhlmann, P. and van de Geer, S. Statistics for high-dimensional data. Springer Series in Statistics. Springer, Heidelberg, 2011. Methods, theory and applications.

Cand`es, E. J., Wakin, M. B., and Boyd, S. P. Enhancing sparsity by reweighted  l1minimization. J. Fourier Anal. Applicat., 14(5-6):877–905, 2008.

Chambolle, A. and Pock, T. A first-order primal-dual algo- rithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120–145, 2011.

Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61 (electronic), 1998.

Efron, B., Hastie, T., Johnstone, I. M., and Tibshirani, R. Least angle regression. Ann. Statist., 32(2):407–499, 2004. With discussion, and a rejoinder by the authors.

El Ghaoui, L., Viallon, V., and Rabbani, T. Safe feature elimination in sparse supervised learning. J. Pacific Optim., 8(4):667–698, 2012.

Fan, J. and Li, R. Variable selection via nonconcave penal- ized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96(456):1348–1360, 2001.

Fan, J. and Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. Roy. Statist. Soc. Ser. B, 70 (5):849–911, 2008.

Friedman, J., Hastie, T., H¨ofling, H., and Tibshirani, R. Pathwise coordinate optimization. Ann. Appl. Stat., 1 (2):302–332, 2007.

Gramfort, A., Kowalski, M., and H¨am¨al¨ainen, M. Mixed- norm estimates for the M/EEG inverse problem using accelerated gradient methods. Phys. Med. Biol., 57(7): 1937–1961, 2012.

Haury, A.-C., Mordelet, F., Vera-Licona, P., and Vert, J.- P. TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. BMC systems biology, 6(1):145, 2012.

Hiriart-Urruty, J.-B. and Lemar´echal, C. Convex analysis and minimization algorithms. I, volume 305. SpringerVerlag, Berlin, 1993.

Kim, S.-J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale  l1-regularized least squares. IEEE J. Sel. Topics Signal Process., 1(4):606–617, 2007.

Liang, J., Fadili, J., and Peyr´e, G. Local linear convergence of forward–backward under partial smoothness. In NIPS, pp. 1970–1978, 2014.

Lustig, M., Donoho, D. L., and Pauly, J. M. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182– 1195, 2007.

Mairal, J. Sparse coding for machine learning, image processing and computer vision. PhD thesis, ´Ecole normale sup´erieure de Cachan, 2010.

Mairal, J. and Yu, B. Complexity analysis of the lasso reg- ularization path. In ICML, 2012.

Meinshausen, N. and B¨uhlmann, P. Stability selection. J. Roy. Statist. Soc. Ser. B, 72(4):417–473, 2010.

Osborne, M. R., Presnell, B., and Turlach, B. A. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20(3):389–403, 2000.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825–2830, 2011.

Rockafellar, R. T. and Wets, R. J.-B. Variational analysis, volume 317 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 1998.

Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.

Tibshirani, R. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.

Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. Strong rules for discarding predictors in lasso-type problems. J. Roy. Statist. Soc. Ser. B, 74(2):245–266, 2012.

Tibshirani, R. J. The lasso problem and uniqueness. Electron. J. Stat., 7:1456–1490, 2013.

Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475–494, 2001.

Varoquaux, G., Gramfort, A., and Thirion, B. Smallsample brain mapping: sparse recovery on spatially correlated designs with randomization and clustering. In ICML, 2012.

Wang, J., Zhou, J., Wonka, P., and Ye, J. Lasso screening rules via dual polytope projection. In NIPS, pp. 1070– 1078, 2013.

Xiang, Z. J. and Ramadge, P. J. Fast lasso screening tests based on correlations. In ICASSP, pp. 2137–2140, 2012.

Xiang, Z. J., Xu, H., and Ramadge, P. J. Learning sparse representations of high dimensional data on large scale dictionaries. In NIPS, pp. 900–908, 2011.

Xiang, Z. J., Wang, Y., and Ramadge, P. J. Screening tests for lasso problems. arXiv preprint arXiv:1405.4897, 2014.

Xu, P. and Ramadge, P. J. Three structural results on the lasso problem. In ICASSP, pp. 3392–3396, 2013.

Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist., 38(2):894–942, 2010.

Zhang, C.-H. and Zhang, T. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576–593, 2012.

Zou, H. The adaptive lasso and its oracle properties. J. Am. Statist. Assoc., 101(476):1418–1429, 2006.

Zou, H. and Hastie, T. Regularization and variable selec- tion via the elastic net. J. Roy. Statist. Soc. Ser. B, 67(2): 301–320, 2005.

We provided in this Appendix some more details on the theoretical results given in the main part.

A.1. Dome test

Let us consider the case where the safe region C is the dome  Dpc, r, α, wq, with parameters: center c, radius r, relative distance ratio  αand unit normal vector w.

The computation of the dome test formula proceeds as follows:

image

and so

image

Using the former notation:

Let us introduce the following dome parameters, for any  θ P ∆X:

Center:  c “ py{λ ` θq{2.

Radius:  r “ qRλpθq{2.

Ratio:  α“ ´1 ` 2 pRλpθq2{ qRλpθq2.

Normal vector:  w “ py{λ ´ θq{ qRλpθq.

Reminding that the support function of a set is the same as the support function of its closed convex hull (Hiriart-Urruty & Lemar´echal, 1993)[Proposition V.2.2.1] means that we only need to optimize over the dome introduced. Therefore, one cannot improve our previous result by optimizing the problem on the intersection of the ball of radius qRλpθqand the complement of the ball of radius pRλpβq(i.e., the blue region in Figure 2).

A.2. Proof of Theorem 1

Proof. Define  maxjREλ |xJj ˆθpλq| “ t ă 1. Fix  ϵ ą 0such that  ϵ ă p1´tq{pmaxjREλ }xj}q. As  Ckis a converging sequence containing ˆθpλq, its diameter is converging to zero, and there exists  k0 P Nsuch that  @k ě k0, @θ P Ck, }θ ´ ˆθpλq} ď ϵ. Hence, for any  j R Eλand any  θ P Ck, |xJj pθ ´ ˆθpλqq| ď pmaxjREλ }xj}q}θ ´ ˆθpλq} ď pmaxjREλ }xj}qϵ. Using the triangle inequality, one gets

image

provided that  ϵ ă p1 ´ tq{pmaxjREλ }xj}q. Thus, for any  k ě k0, Ecλ Ă ZpλqpCkq “ ApλqpCkqcand  ApλqpCkq Ă Eλ.

For the reverse inclusion take  j P Eλ, i.e.,  |xJj ˆθpλq| “ 1. Since for all  k P N, ˆθpλq P Ck, then  j P ApλqpCkq “ tj P rps :maxθPCk |xJj θ| ě 1uand the result holds.

image

A.3. Proof of Proposition 3

We detail here the proof of Proposition 3.

Proof. We first use the fact that

image

to obtain

image

Then,

image

We deal with the dot product as

image

Hence,

image

We observe in the end that

image

A.4. Elastic-Net

The previously proposed tests can be adapted straightforwardly to the Elastic-Net estimator (Zou & Hastie, 2005). We provide here some more details for the interested reader.

image

One can reformulate this problem as a Lasso problem

˙P Rn`p,pand  ˜y “ˆy0˙P Rn`p.With this modification all the tests introduced for the

Lasso can be adapted for the Elastic-Net.


Designed for Accessibility and to further Open Science