b

DiscoverSearch
About
My stuff
Screening Tests for Lasso Problems
2014·arXiv
Abstract
Abstract

This paper is a survey of dictionary screening for the lasso problem. The lasso problem seeks a sparse linear combination of the columns of a dictionary to best match a given target vector. This sparse representation has proven useful in a variety of subsequent processing and decision tasks. For a given target vector, dictionary screening quickly identifies a subset of dictionary columns that will receive zero weight in a solution of the corresponding lasso problem. These columns can be removed from the dictionary prior to solving the lasso problem without impacting the optimality of the solution obtained. This has two potential advantages: it reduces the size of the dictionary, allowing the lasso problem to be solved with less resources, and it may speed up obtaining a solution. Using a geometrically intuitive framework, we provide basic insights for understanding useful lasso screening tests and their limitations. We also provide illustrative numerical studies on several datasets.

image

The sparse representation of data with respect to a dictionary of features has recently contributed to successful new methods in machine learning, pattern analysis, and signal/image processing. At the heart of many sparse representation methods is the least squares problem with ℓ1regularization, often called the lasso problem [1]:

image

where  λ > 0is a regularization parameter. The matrix B ∈ Rn×pis called the dictionary and its columns  {bi}pi=1are usually called features. Depending on the field, the terms codewords, atoms, filters, and regressors are also used. The lasso problem seeks a representation of the target vector  y ∈ Rnas a linear combination �pi=1 wibiof the features with many  wi = 0(sparse representation). Equation (1) also serves as the Lagrangian for the widely used constrained problems  minw∈Rp ∥y − Bw∥22subject to  ∥w∥1 ≤ σ, and  minw∈Rp ∥w∥1subject to ∥y − Bw∥22 ≤ ε. Many solvers of these problems address the Lagrangian formulation (1) directly [2].

The above problems are studied extensively in the signal processing, computer vision, machine learning, and statistics literature. See, for example, the general introduction to sparse dictionary representation methods in [3] and [4]. Sparse representation has proven effective in applications ranging from image restoration [5], [6], to face recognition [7], [8], object recognition [9], speech classification [10], speech recognition [11], music genre classification [12], and topic detection in text documents [13]. In these applications, it is common to encounter a large dictionary (e.g., in face recognition), data with large data dimension (e.g., in topic detection), and in

dictionary learning, a large number of dictionary iterations (e.g., in image restoration). These factors can make solving problem (1) a bottleneck in the computation. Several approaches have been suggested for addressing this computational challenge. In the context of clas-sification, Zhang et al. [14] propose abandoning sparsity and using a fast collaborative linear representation scheme based on  ℓ2regularized least squares. This improves the speed of classification in face recognition applications. However, in general the (nonlinear) Sparse Representation Classifier (SRC) [7] achieves superior classification accuracy. Another approach is to seek a sparse representation using a fast greedy method to approximate the solution of (1). There has been a considerable amount of work in this direction, see for example [3], [15], [16]. However, this approach seems best when seeking very sparse solutions and, in general, the solutions obtained can be challenging to analyze. Recently an approach known as (dictionary) screening has been proposed. For a given target vector y and regularization parameter  λ, screening quickly identifies a subset of features that is guaranteed to have zero weight in a solution  ˆwof (1). These features can be removed (or “rejected”) from the dictionary to form a smaller, more readily solved lasso problem. By padding its solution appropriately with zeros, one obtains a solution of the original problem. This approach is the focus of the paper. Screening has two potential benefits. First, it can be run in an on-line mode with very few features loaded into memory at a given time. By this means, screening can significantly reduce the size of the dictionary that needs to be loaded into memory in order to solve the lasso problem. Second, by quickly reducing the number of features we can often solve problems faster. Even small gains can become very significant when many lasso problems must be solved. Moreover, since screening is transparent to the lasso solver, it can be used in conjunction with many existing solvers. The idea of screening can be traced back to various

feature selection heuristics in which selected features {bi}are used to fit a response vector y. This is usually done by selecting features based on an empirical measure of relevance to y, such as the correlation of y and  bi. This is used, for example, in univariate voxel selection based on t-statistics in the fMRI literature [17]. Fan and Lv [18] give an excellent review of recent results on correlation based feature selection and formalize the approach in a probabilistic setting as a correlation based algorithm called Sure Independence Screening (SIS). In a similar spirit, Tibshirani et al. [19] report Strong Rules for screening the lasso, the elastic net and logistic regression. These rules are also based on thresholding correlations. With small probability, SIS and the Strong Rules can yield “false” rejections.

A second approach to screening seeks to remove dictionary columns while avoiding any false rejections. In spirit, this harks back to the problem of removing “non-binding” constraints in linear programs [20]. For the lasso problem, the first line of recent work in this direction is due to El Ghaoui et al. [21], where such screening tests are called “SAFE” tests. In addition to the lasso, this work examined screening for a variety of related sparse regularization problems. Recent work (e.g., [22], [23], [24], [25]) has focused mainly on the lasso problem and close variants.

The basic approach in the above papers is to bound the solution of the dual problem of (1) within a compact region R and find  µR(b) = maxθ∈R θT b. For simple regions  R, µRis readily computed and yields a screening test for removing a subset of unneeded features. This approach has resulted in tests based on spherical bounds [21], [22], the intersection of spheres and half spaces (domes) [21], [23], elliptical bounds [24] and novel approaches for selecting the parameters of these regions to best bound the dual solution of (1) [25]. These screening tests can execute quickly, either serially or in parallel, and require very few features to be loaded into memory at once. If one seeks a strongly to moderately sparse solution, the tests can significantly reduce dictionary size and speed up the solution of lasso problems.

To keep our survey focused, we concentrate on screening for the lasso problem. However, the methods discussed apply to any problem that can be efficiently transformed into a lasso problem. For example, the elastic net [26] and full rank generalized lasso problems [27]. Moreover, the basic ideas and methods discussed are a good foundation for applying screening to other sparse regularization problems. We will situate our exposition within the context of prior work as the development proceeds.

The main features of our survey include: (a) Our exposition uses a geometric framework which unifies many lasso screening tests and provides basic tools and geometric insights useful for developing new tests. In particular, we emphasize the separation of the structure or “architecture” of the test from the design problem of selecting its parameters.

(b) We examine whether more complex screening tests are worthwhile. For each  m ≥ 0, there is a family of tests based on the intersection of a spherical bound and m half spaces. As m increases these tests can reject more features but are also more time consuming to execute. To examine if more complex tests are worthwhile, we derive the region screening test for the intersection of a sphere and two half spaces, and use this to examine where current region screening tests stand in the trade-off between rejection rate and computational efficiency. (c) We show how composite tests can be formed from existing tests. In particular, we describe a composite test based on carefully selected dome regions that performs competitively in numerical studies. We also point out a fundamental limitation of this approach.

(d) We review sequential screening schemes that make headway on the problem of screening for small normalized values of  λ. When used in an “on-line” mode with realistic values of the regularization parameter, these methods can successfully reduce the size of large dictionaries to a manageable size, allowing larger problems to be solved, and can result in a faster overall computation.

1.1 Outline of the Paper

We begin in  §2 with a review of basic tools, especially the dual of the lasso problem and its geometric interpretation.  §3 introduces screening in greater detail and  §4 introduces region tests. After these preparations, we discuss several important forms of region tests: sphere tests (§4.2), sphere plus hyperplane tests (§4.3) and sphere plus two hyperplane tests (§4.5). We show how spherical bounds can be iteratively refined using features (§4.4), and examine ways to combine basic tests.  §5 gives a brief overview of sequential screening. We give a practical summary of screening algorithms in  §6 and illustrate the results of screening via numerical studies in  §7. We conclude in  §8. Proofs of new or key results are given in the Appendices, organized by the section in which the result is discussed.

We focus on the lasso (1), but it will be convenient to also consider the nonnegative lasso:

image

The analysis and algorithms in the paper apply (with minor changes) to both problems.

Throughout the paper we assume that a fixed dictionary B is used to solve various instances of (1) or (2). We assume that all features are nonzero and say that the dictionary is normalized if all features have unit norm. Each instance is specified by a pair  (y, λ)consisting of a target vector y and a value  λof the regularization parameter.

Multiplying the objective of (1) by  α2, with  α > 0, yields the equivalent problem:

image

where  ¯y = αy, ¯B = αB, and ¯λ = α2λ. Some lasso solvers require that  ∥B∥F ≤ 1, and problem (1) must be scaled to ensure this holds. As a result, it is meaningless to talk about the value of  λemployed when solving (1) without accounting for possible scaling. One way to do this is to define  λmax = maxpj=1 |bTj y|. Then the ratio  λ/λmaxis invariant to scaling. The parameter  λmaxis also useful for other purposes related to screening. Throughout the paper, we use the ratio  λ/λmaxas an unambiguous measure of the amount of regularization used in solving (1) and (2).

Geometric insight on lasso problems, and on screening in particular, is enhanced by bringing in the Lagrangian dual of (1). The following parameterization of the dual problem is particularly convenient [22]:

image

Solutions  ˆw ∈ Rpof (1) and ˆθ ∈ Rnof (3) satisfy:

image

The corresponding dual problem of (2) is:

image

with the primal and dual solutions related via:

image

A derivation of (3) and (4) is given in the Appendix. It will be convenient to define a feature pool B. For the lasso,  B = {±bi}pi=1and for the nonnegative lasso, B = {bi}pi=1. This allows the constraints in (3) and (5) to be stated as  ∀b ∈ B : θT b ≤ 1.

For  x ∈ Rn, let  P(x) = {z : xT z = 1}denote the hyperplane in  Rnthat has unit normal  x/∥x∥2and contains the point  x/∥x∥22. Let  H(x) = {z : xT z ≤ 1}denote the corresponding closed half space containing the origin. So a constraint of the form  bT θ ≤ 1requires that  θlies in the closed half space H(b). Hence the set of feasible points F of the dual problems is the nonempty, closed, convex set formed by the intersection of the finite set of closed half spaces  H(b), b ∈ B. This is illustrated in Fig. 1(a) and 1(b). In addition, for the lasso,  θ ∈ Fif and only if  −θ ∈ F. So  −F = F. This follows from the same property of the feature pool:  −B = B.

To maximize the objective function in (3) or (5) we seek the projection ˆθof  y/λonto the closed convex set F. This is the unique point satisfying the following set of inequalities [28,  §3.1]: for each  θ ∈ F,

image

In contrast, the lasso problem (1) may not have a unique solution [29], [30].

The set of points  {ˆθ(λ), λ > 0}is called the dual regularization path. For  λsufficiently large,  y/λlies in F and ˆθ(λ) = y/λ. To find the smallest  λfor which this holds, let

image

Then for all  b ∈ B: (y/λmax)T b ≤ yT bmax/λmax = 1. So  y/λmaxlies in the boundary of F. As  λ ≥ λmaxdecreases from a large value, ˆθ(λ) = y/λmoves in a straight line within F until  λ = λmax, at which point ˆθ(λmax) = y/λmaxfirst lies on the boundary of F. As λdecreases below  λmax, y/λmoves away from F and ˆθ(λ)is the unique projection of  y/λonto the boundary of F. Using (4), for  λ/λmax > 1, ˆw = 0, and conversely, if  ˆw = 0, then ˆθ = y/λ ∈ F. So for  λ/λmax ∈ (0, 1), y/λ /∈ F, ˆθ(λ)lines on the boundary of F, and  ˆwis nonzero.

Let I = [1, . . . , p] denote the ordered set of feature indices and  S ⊂ I. Given  w ∈ Rp, let  w↓Sdenote the vector in  R|S|obtained by subsampling w at the indices in S. Conversely, for  z ∈ R|S|, let  z↑Sdenote the vector in  Rpobtained by upsampling z: the entries of z↑Swith indices in S take the corresponding values in z and all other entries are zero. Similarly, for a dictionary B ∈ Rn×p, let  B↓Sdenote the subdictionary obtained by sampling the columns of B at the indices in S. The following properties are clear: (a)  z = (z↑S)↓S; (b) if  wi = 0for  i /∈ S, then  w = (w↓S)↑S; (c)  B↓Sz = Bz↑S; and (d) if wi = 0for  i /∈ S, Bw = B↓Sw↓S.

By (4), if we know the primal solution  ˆw, then the dual solution is ˆθ = (y − B ˆw)/λ. Conversely, if we know the dual solution ˆθ, then any point satisfying the following equations is a primal solution:

image

We now explain the idea of screening in detail. Given an instance  (y, λ)of (1), we select a partition  I = S ∪ ¯Sof the features. We say that the features indexed by S are selected and those indexed by ¯Sare rejected. Then we form the reduced dictionary  B↓Sof selected features and let ˆzdenote a solution of the corresponding lasso problem using this dictionary. In general, the upsampled vector ˆz↑Sis not a solution of the original lasso problem (1).

image

Fig. 1. The constraints and feasible set F of the dual problem for (a): general features, (b): unit norm features. (c): Examples of two spheres and a dome region bounding ˆθfor unit norm features. In all cases only the lower half of F is shown.

Here is the key point: screening seeks a partition such that the upsampled vector  ˆz↑Ssolves (1). In general, such a partition depends on the instance and hence must be computed “on-the-fly”.

By virtue of being smaller, the reduced problem is more manageable. For example, it may fit into memory when the original problem does not, and finding its solution may require less time. Hence there are two evaluation metrics of interest: the size of S (or ¯S) as a fraction of I, and the total time taken to select S and solve the reduced problem relative to the time taken to solve the original problem directly without screening. We will normally express these metrics as the rejection fraction  | ¯S|/|I|and the speedup factor  tsolve/(tscreen+trsolve). Here  tsolveis the time to solve the original lasso problem, tscreenis the time to select the partition (screen the dictionary), and  trsolveis the time to solve the reduced lasso problem.

Not surprisingly, if we know the dual solution ˆθ, then it is easy to come up with a suitable partition. To see this, consider the lasso problem. For any partition  S ∪ ¯S, let  ˆwand  ˆzdenote solutions of the original and reduced lasso problems, respectively. It is clear that the following always holds:

image

Now assume the dual solution ˆθis known, let  A(ˆθ) ={i: |ˆθT bi| = 1}denote the active constraints at  ˆθ, and consider the particular partition  A(ˆθ) ∪ ¯A(ˆθ). Equation (4) shows that if  |ˆθT bi| < 1(equivalently,  i ∈ ¯A(ˆθ)), then  ˆwi = 0. Hence for this partition,  B ˆw = B↓A(ˆθ) ˆw↓A(ˆθ), ∥ ˆw↓A(ˆθ)∥1 = ∥ ˆw∥1, and

image

Equation (12) implies that for this partition the two inequalities in (11) must be equalities. It follows that ˆw↓A(ˆθ)solves the reduced problem and  ˆz↑Ssolves the original problem. Although a simple observation, this is worth stating as a theorem.

Theorem 1. Let the solution ˆθof (3) (resp. (5)) have active set  A(ˆθ). If  ˆzis a solution of (1) (resp. (2)) with dictionary B↓A(ˆθ), then  ˆz↑A(ˆθ)solves (1) (resp. (2)). Moreover, every solution of (1) (resp. (2)) can be expressed in this way.

The fundamental partition of I into  A(ˆθ)and ¯A(ˆθ)is conceptually very important but obviously impractical. If we know ˆθ, then we can easily solve the primal problem (see (10)) and this makes screening and problem reduction unnecessary. As a first step towards finding a practical way to partition the features, we note that if  A(ˆθ) ⊆ S(screening keeps more) or equivalently ¯S ⊆ ¯A(ˆθ)(screening rejects less), then equation (12) holds with S replacing  A(ˆθ). This implies that the two inequalities in (11) hold with equality for this partition. Hence we have the following corollary of Theorem 1.

Corollary 1. Let the solution ˆθof (3) (resp. (5)) have active set  A(ˆθ). Let  A(ˆθ) ⊆ S ⊆ I. If  ˆzis a solution of (1) (resp. (2)) with dictionary  B↓S, then  ˆz↑Sis a solution of (1) (resp. (2)). Moreover, every solution of (1) (resp. (2)) can be expressed in this way.

The core idea for creating a partition of the dictionary that conforms with Corollary 1 is to bound ˆθwithin a compact region R. For each feature b, we then compute µR(b) = maxθ∈R θT b, and use this quantity to partition B [21].

We first illustrate this for the nonnegative lasso. For a compact set R, if  R = ∅, all features are rejected; otherwise for each feature  bi, µR(bi) = maxθ∈R θT biexists. Then define the partition:

image

The logic is that if ˆθ ∈ Rand  µR(bi) < 1, then ˆθT bi < 1and hence  i ∈ ¯A(ˆθ). Thus ¯S ⊆ ¯A(ˆθ), as desired.

For the lasso problem, ˆθ ∈ Rand  µR(bi) < 1ensure ˆθT bi < 1. But in this case we also need  −1 < ˆθT bior equivalently ˆθT (−bi) < 1. This holds if  µR(−bi) < 1. Effectively, we must test both  biand  −bito account for the positive or negative sign of  wi. So for the lasso the partition is:

image

For example, when  R = {ˆθ}, i ∈ ¯Sif: (a) ˆθT bi < 1(nonnegative lasso) and (b)  |ˆθT bi| < 1(lasso). So R = {ˆθ}, yields the ideal partition  A(ˆθ) ∪ ¯A(ˆθ).

From the above constructions, we see that ˆθ ∈ Rensures that the partitions (13) and (14) satisfy ¯S ⊆ ¯A(ˆθ). Hence the assumptions of Corollary 1 are satisfied. This is summarized in the following corollary.

Corollary 2. Let R be a compact region with ˆθ ∈ R. Then R defines a dictionary partition  S ∪ ¯Swith ¯S ⊆ ¯A(ˆθ).

It will be convenient to encode the partition induced by a bounding region R as a rejection test  TRwith TR(b) = 1if  b ∈ ¯Sand 0 otherwise. For example, the rejection test corresponding to (14) is:

image

We end this section by noting that for a given dictionary B, the partial order of subsets of features induces a partial order on screening tests. Test  T ′is weaker than test T, denoted  T ′ ⪯ T, if the set of features rejected by T ′is a subset of the features rejected by T. For example, if ˆθ ∈ R, then  TR ⪯ T{ˆθ}. This is a special case of the following lemma.

Lemma 1. If  R1 ⊆ R2, then  TR2 ⪯ TR1.

If  R1 ⊂ R2, then the region test for  R1can potentially reject more features than the test for  R2.

4.1 The Sphere-Hyperplane Architecture

We now consider particular forms of bounding regions for ˆθ. A natural form of bounding region consists of the intersection of a spherical bound with a finite number of half spaces. The spherical bound arises naturally once we know a dual feasible point, and half spaces arise naturally since these define the dual feasible region F (see (3)), and are integral to the projection of a point onto F (see (7)).

The intersection of a closed ball S(q, r) = {z : ∥z − q∥2 ≤ r}with center q and radius r, and m half spaces  nTi θ ≤ ci, i = 1, . . . , m, gives the region:

image

To form the corresponding region test, we find  µ(b) =maxθ∈R θT bby solving the optimization problem:

image

Once  µ(b)is known, (15) gives the corresponding screening test. Using the change of variable  z = (θ − q)/r, problem (16) can be simplified to:

image

where  ψi = (nTi q − ci)/r. The solution of (16) is then µ(b) = qT b + r¯µ(b). By decomposing z and b in terms of span{ni}mi=1and its orthogonal complement, (17) reduces to a convex program in  Rm+1.

Increasing m results in tests with the potential to reject more features, but which are also more complex and time consuming to execute. In the following two subsections, we discuss the simplest cases: m = 0 (sphere tests), and m = 1 (dome tests). This gives insight into basic tests and makes connections with the literature.

4.2 Sphere Tests

Consider bounding ˆθwithin a closed ball S(q, r) = {z : ∥z − q∥2 ≤ r}with center q and radius r. This bound gives a simple, efficiently implemented test, and it is also a useful building block for more complex tests. We first determine a close form expression for µS(q,r)(b) = maxθ∈S(q,r) θT b. An expression for a sphere test  TS(q,r)then follows from (15).

Lemma 2. For  S(q, r) = {z : ∥z − q∥2 ≤ r}and  b ∈ Rn:

image

Theorem 2. The screening test for the sphere S(q, r) is:

image

where  Vu(t) = 1 − rtand for the lasso  Vl(t) = −Vu(t), and for the nonnegative lasso  Vl(t) = −∞.

For the lasso, the test (19) can also be written as:

image

Theorem 2 defines a parametric family of tests: {ST(q, r): q ∈ Rn, r ≥ 0}, where ST(q, r) denotes the sphere test with center q and radius r. To use a sphere test one first selects values of q and r so that S(q, r) bounds ˆθ. We call this the parameter selection problem. By Lemma 1, a tighter bound has potential for better screening. So using only the information provided, and limited computation, we want to select q and r to give the “best bound”. This is a design problem involving a trade-off between the computation cost to select q and r and the resultant screening performance. Hence we don’t expect there is a “best answer”. We outline below several selection methods.

4.2.1 Parameter selection

If we know a dual feasible point  θF ∈ F, then ˆθcan’t be further away from  y/λthan  θF. This gives the basic spherical bound:

image

with center  q = y/λand radius  r = ∥θF − y/λ∥2. In particular, ˆθ(λmax) = y/λmaxis dual feasible and gives a particular instance of (21):

image

This bound is shown in Fig. 1(c) as the larger sphere in solid red. The bound (22) requires only the specification of the lasso problem and the computation of  λmax. We call it the default spherical bound.

Better bounds are possible with additional computation or if additional information is supplied. For example, [25] observed that to obtain a feasible point  θFcloser to ˆθthan  y/λmaxone can first run K steps of the homotopy algorithm on (1). This gives the solution  ˆwKof the instance  (y, λK), λK > λ, for the K-th breakpoint on the (primal) regularization path. Effectively, this first solves the lasso problem for  λK > λ, and then uses this solution to help screen for the actual instance to be solved. The sphere center can also be moved away from y/λ. Examples include the sphere tests ST2 and ST3 in [22] derived in the setting of unit norm y and  bi. In addition, [25] noted that if the dual solution ˆθ0is known for an instance  (x, λ0), then  ∥ˆθ(λ)−ˆθ0∥2 ≤ |1/λ−1/λ0| ∥y∥2(this is discussed further below). This leverages a solved instance to give a spherical bound centered at  q = ˆθ0.

4.2.2 Connections with the Literature

A variety of existing screening tests for the lasso are sphere tests. The Basic SAFE-LASSO test [31] and the test ST1 in [22, Sect. 2] are sphere tests based on the default spherical bound (22). The SAFE-LASSO test [31, Theorem 2] is also a sphere test. It assumes a dual feasible point  θ0is given and uses this to improve the default spherical bound centered at  y/λ. The sphere tests ST2 and ST3 in [22, Sect. 2] use spherical bounds not centered at  y/λ. We will comment further on the test ST3 at the end of  §4.6. The core test used in [25] is a sphere test with center ˆθ0 = ˆθ(λ0), where ˆθ0(λ0)is the dual solution at  λ0, and radius  |1/λ − 1/λ0| ∥y∥2. This bound follows from the nonexpansive property of projection onto a convex set:

image

The Strong Rule [19] is also a sphere test for the lasso problem. For notational simplicity, let the features and the target vector y have unit norm. The Strong Rule discards feature  biif  |bTi y| < 2λ−λmax. This is a sphere test with center  q = y/λand radius  rsr = (λmax − λ)/λ. The point ˆθis bounded within the default sphere (center  y/λ, radius  r = 1/λ − 1/λmax). The Strong Rule uses a sphere with the same center but a radius only a fraction of  r: rsr = rλmax. This smaller sphere is not guaranteed to contain ˆθ. So the Strong Rule can (with low probability) yield false rejections. A detailed discussion of this issue is given in [19]. A more advanced version of the Strong Rule, the Strong Sequential Rule [19], assumes a solution  ˆw0of the lasso instance  (y, λ0)is available, where  λ0 > λ. It then forms the residual r0 = y −B ˆw0and screens the lasso instance  (y, λ)using the test  |bTi r0| < 2λ−λ0. This is also a sphere test. To see this, use (4) to write  r0 = y − B ˆw = λ0ˆθ0. Then the test becomes  |bTi ˆθ0| < 1 − rssrwith  rssr = (1/λ − 1/λ0)2λ. This is a sphere test with center ˆθ0and radius  rssr = 2λrwhere r is the radius of the known bounding sphere (23). When  λ < 0.5, this test may also yield false rejections. See [19] for examples and analysis of rule violations.

The SIS test in [18] is framed in a probabilistic setting and is not intended for lasso screening. Nevertheless, if we translate SIS into our setting it is a sphere test for a lasso problem with appropriately selected  λ. SIS assumes a dictionary  B ∈ Rn×pof standardized vectors (features of unit norm) and computes the vector of (marginal) correlations  ρ = BT y. Then given  0 < γ < 1, it selects the top  [γn]features ranked by  |ρi|. Assume for simplicity that the values  |ρi|are distinct and let tγdenote the value of  |ρi|for the  [γn]-th feature in the ranking. The SIS rejection criterion can then be written as  |bTi y| < tγ. We now form a lasso problem with dictionary B, target vector y, and a value  λ/λmaxto be decided. For simplicity of notation, assume that y has unit norm. Then the default spherical bound for the dual solution of the lasso has center  y/λand radius r = 1/λ − 1/λmax, and the corresponding sphere test is  |bTi y| < λ(1 − r). Equating the right hand sides of the above test expressions, and using some algebra shows that if we take  λ/λmax = (1 + tγ)/(1 + λmax) < 1, then SIS is the default sphere test for this particular lasso problem.

4.3 Sphere Plus Halfspace Tests

Now consider a region test based on the nonempty intersection of a spherical ball  {z: ∥z − q∥2 ≤ r}and one closed half space  {z: nT z ≤ c}. Here n is the unit normal to the half space and  c ≥ 0. This yields the dome region D(q, r; n, c) = {z: nT z ≤ c, ∥z − q∥2 ≤ r}illustrated in Fig. 2(a).

The following features of the dome D(q, r; n, c) will be useful. We call the point  qdon the bounding hyperplane and the line passing through q in the direction of the hyperplane normal the dome center. The signed distance

image

Fig. 2. (a) A general dome region D(q, r; n, c) shown for 0 < ψd < 1and the dome consisting of less than half the sphere. (b) The rejection area (shaded) of a lasso dome test.

from q to  qdin the direction  −nis a fraction  ψdof the radius r of the sphere. We call the maximum straight line distance  rdone can move from  qdwithin the dome and hyperplane the dome radius. Under the sign convention indicated above, simple Euclidean geometry gives the following relationships:

image

To ensure that the dome is nondegenerate (a nonempty and proper subset of each region), we need  qdto be inside the sphere. Hence we require  −1 ≤ ψd ≤ 1. So we need  qT n ≥ c − r, this ensures that the intersection is a proper subset of the sphere and the half space; and we need  qT n ≤ c + r, this ensures the intersection is nonempty.

To find  µ(b) = maxθ∈D(q,r;n,c) θT b, for  b ∈ Rn, we solve the optimization problem (16) with m = 1. Particular instances of this problem were solved in [32, Appendix A] (by solving a Lagrange dual problem) and in [23,  §3] (by directly solving a primal problem). Both approaches can be extended to solve the general problem (16) with m = 1. This yields the following lemma, and the dome screening test.

Lemma 3. Fix a dome D = D(q, r; n, c) with  |ψd| ≤ 1. Then for  b ∈ Rn,

image

where  M1(t1, t2)is the function

image

Theorem 3. The screening test for a nondegenerate dome D(q, r; n, c) is:

image

where  Vu(t1, t2) = 1−M1(t1, t2)and for the lasso  Vl(t1, t2) =−Vu(−t1, t2), and for the nonnegative lasso  Vl(t1, t2) = −∞.

We denote a dome test by DT(q, r; n, c). Although defined piecewise, the functions  Vuand  Vlin Theorem 3 are continuous and smooth:  Vu, Vl ∈ C1. This can be checked using simple calculus. The parameters r and c of the dome do not appear as arguments in the test but play a role through  M1. The test simplifies for unit norm features. In that case,  t2 = ∥bi∥2 = 1and  M1, Vuand  Vlare only functions of  t1.

To gain some insight into this test, consider the situation when r < 1 and all features have unit norm. We can factor the test into the composition of two functions: a linear map  bi �→ [q, n]T biand a two-dimensional decision function  Hr,ψdwith  Hr,ψd(s, t) = 1if  Vl(t) < s <Vu(t), and 0 otherwise; where  s = qT bi ∈ [−∥q∥2, ∥q∥2], t = nT bi ∈ [−1, 1], and  Vu(t), Vl(t)are given in Theorem 3 with  t = t1and  t2 = 1. We can display the test rejection region by plotting  Vl(t)and  Vu(t)versus t as shown in Fig. 2(b). For the lasso, the rejection region has upper and lower boundaries. The sections of the boundaries with  Vu(t) = (1 − r)and  Vl(t) = −(1 − r), correspond to the sphere test  TS(q,r). If feature  bimaps into the shaded region in the figure, then  biis rejected. The lightly shaded (yellow) area indicates the extra rejection power of the dome test over the underlying sphere test. For a given value of  qT bi > 0, the dome test lowers the bar for rejection as  nT biincreases.

4.3.1 Parameter selection

Now consider the parameter selection problem. Since we have discussed parameter selection for a spherical bound, we assume S(q, r) is given and give examples of bounding ˆθwithin a suitable half space.

Each constraint of the dual problem  bT θ ≤ 1bounds ˆθ. This half space has  n = b/∥b∥2and  c = 1/∥b∥2. The resultant dome is nonempty since both the sphere and the half space contain ˆθ. To ensure it is proper, we require qT b ≥ 1 − r∥b∥2. This means that the sphere test does not reject the feature b. In particular, we can select b to minimize the disk radius  rd. To do so, we maximize  ψdgiven by (24):

image

image

Fig. 3. Two dome tests for unit norm features and target vector. Left: The dome (30) based on the feasible point  y/λmax and theclosed half space  bTmaxθ ≤ 1. Right: The dome (33) based on a solved instance  (y0, λ0, ˆθ0).

For unit norm features, (29) selects the feature most correlated with q. If in addition,  q = y/λ, then (29) yields bg = bmax. Selecting the default spherical bound and using (29) gives the specific dome:

image

We call this the default dome bound. When y and all the features have unit norm, this simplifies to  D(y/λ, |1/λ−1/λmax|; bmax, 1).This dome is illustrated in Fig. 3 (left).

If ˆθ0is the dual solution of an instance  (y0, λ0), then ˆθ0lies on the boundary of F. Moreover, its optimality for  (y0, λ0)ensures that it satisfies the inequalities (7). Hence for each  θ ∈ F:

image

Since  0 ∈ F, the right hand side is nonnegative. Therefore this inequality bounds F in the closed half space nT0 θ ≤ c0with

image

The intersection of this half space with the bounding sphere S(q, r) is nonempty and it is proper if  ψd ≥ −1. To check this condition note that

image

where  βis the angle between  n0and  q−ˆθ0. So if  cos β >0 or ˆθ0 ∈ S(q, r), then the dome is proper. For example, q = y/λand  r = ∥y/λ − ˆθ0∥2, yields the proper dome:

image

This dome illustrated in Fig. 3 (right).

4.3.2 Connections with the Literature

Specific dome tests were introduced in [32,  §2.4] and [23,  §3]. The dome test discussed in [23] is based on the default dome bound (30) for unit norm features and unit norm y. The SAFE-LASSO test in [32,  §2.4] is a dome test specifically designed for screening and solving lasso problems at points along the regularization path. A triple  (y, λ0, ˆθ0)is given where ˆθ0is the dual solution for instance  (y, λ0). The test uses this to screen the dictionary for an instance  (y, λ)with  λ < λ0 ≤ λmax. We show that the dome employed is (33) with  y0 = y. The solution in [32,  §2.4] entails specifying a bounding sphere and a half space, then solving the corresponding version of (16). The selected half space is  gT (θ − ˆθ0) ≥ 0where  g = ∇G(ˆθ0) = −y/λ0 + ˆθ0is the gradient of the dual objective for the solved instance evaluated at ˆθ0(up to positive scaling). The spherical bound is obtained by scaling ˆθ0to obtain the closest feasible solution to y/λ. This can be specified by letting  r0 = ∥y/λ0 − ˆθ0∥2and setting  q = y/λ, ˆr = mins∈[−1,1] ∥sˆθ0 − y/λ∥2, n =(y/λ0−ˆθ0)/r0and  c = nT ˆθ0. Assume  λ ≤ λ0 < λmaxand let  ˆs(λ)denote the optimal value of s in the definition of  ˆr. By the optimality of ˆθ0for the instance  (y, λ0), we must have  ˆs(λ0) = 1. In addition, it must hold that yT ˆθ0 ≥ 0otherwise the feasible point  −ˆθ0would be closer to  y/λ. By simple calculus we then determine that  ˆs(λ) = min{1, (yT ˆθ0)/(λ∥ˆθ0∥22)}. It follows that for all  λ < λ0, ˆs = 1. Hence for  λ < λ0we can take ˆr = ∥y/λ − ˆθ0∥2. Thus for  λ < λ0, SAFE-LASSO uses the dome (33) with the constraint  y0 = y.

4.4 Iteratively Refined Bounds

Under favorable circumstances, it is possible to refine a sphere  S(q,r)bounding ˆθto obtain a bounding sphere of smaller radius. Let the half space (n, c) also bound ˆθand its intersection with S(q, r) result in a dome D = D(q, r; n, c) with parameters  ψd, qd, and  rd. Since D is a bounded convex set, there exists a unique sphere of smallest radius that bounds D. This is called the circumsphere of D. We claim that if  0 < ψd ≤ 1, or equivalently q /∈ D, then the circumsphere of D is  S(qd, rd). In this case,  rdis strictly smaller than r and  S(qd, rd)is a tighter spherical bound on ˆθ. This is summarized below.

Lemma 4. Let S = S(q, r) and the half space (n, c) bound the dual solution ˆθ, with the resulting dome D = D(q, r; n, c) satisfying  0 < ψd ≤ 1. Then  S(qd, rd)is the circumsphere of D and hence bounds ˆθ.

If suitable half spaces can be found, e.g., among the vectors in B, the construction in Lemma 4 can be used iteratively. At step k, we have a bounding sphere  Sk =S(qk, rk)and seek  b ∈ Bsuch that  n = b/∥b∥2and c = 1/∥b∥2satisfy

image

If such b exists, set  nk = n, ck = cand

image

to obtain a tighter bounding sphere Sk+1 =S(qk+1, rk+1). A greedy strategy selects b at step k to minimize  rk+1, or equivalently to maximize  ψk:

image

When all features have equal norm, this reduces to maximizing the inner product of b and  qk. This has a simple interpretation.  Skcan be thought of as a location bound on ˆθwith the center  qkthe “estimate” of ˆθgiven the bound. The greedy strategy selects b by maximizing its alignment with the current estimate  qkof ˆθ. Since ˆθis proportional to the optimal residual in the primal problem (see (4)), this strategy selects features “best correlated” with the current estimate of the optimal residual.

4.5 Are More Half Spaces Worthwhile?

We have examined region tests defined by the intersection of a bounding sphere and one half space (m = 1), and have shown that, in general, these have additional rejection power over the simpler sphere tests (m = 0). Are more complex tests worthwhile? To examine this, we go one step further and examine the region test defined by the intersection of a bounding sphere and two half spaces (m = 2). Examining the relative performance of this test will allow us to determine where we currently stand in the trade-off between rejection power and computational efficiency.

Let  R(q, r; n1, c1; n2, c2)denote the region formed by the intersection of a sphere  S(q, r) = {θ : ∥θ − q∥2 ≤ r}and two closed half spaces  Hi = {θ : nTi θ ≤ ci}, where niis the the unit normal to  Hiand  ci ≥ 0, i = 1, 2. We call the corresponding screening test a Two Hyperplane Test (THT).

Each half space  Hi = {θ: nTi θ ≤ ci}intersects the sphere forming a dome with parameters  ψi = (nTi q −ci)/r, qi = q − ψirni, and  ri = r�1 − ψ2i, i = 1, 2. To ensure each intersection  Hi ∩ S(q, r)is nonempty and proper, we need  −1 ≤ ψi ≤ 1, i = 1, 2, and to ensure the two half spaces intersect within the sphere, we need  arccos ψ1 + arccos ψ2 ≥ arccos(nT1 n2). Under these conditions,  R(q, r; n1, c1; n2, c2)is a nonempty, proper subset of the sphere and each half space.

To find  µR(b) = maxb∈R θT bwe solve the optimization problem (16) with m = 2. Using standard techniques, this problem can be solved in closed form yielding the expressions for  µRin the following lemma. The corresponding test then follows from (15).

Lemma 5. Fix the region  R = R(q, r; n1, c1; n2, c2)and let ψisatisfy |ψi| ≤ 1, i = 1, 2, and arccos ψ1 + arccos ψ2 ≥ arccos(nT1 n2). Let h(x, y, z) =

�(1 − τ 2)z2 + 2τxy − x2 − y2, where  τ = nT1 n2. Then for b ∈ Rn,

image

where

image

and conditions (a), (b), (c) are given by

image

Theorem 4. The Two Hyperplane Test (THT) for the region R(q, r; n1, c1; n2, c2)is:

image

where condition (a’) is

image

with  Vu(t1, t2, t3) = 1 − M2(t1, t2, t3)and for the lasso, Vl(t1, t2, t3) = −Vu(−t1, −t2, t3), and for the nonnegative lasso,  Vl(t1, t2, t3) = −∞.

Theorem 4 indicates that THT uses only the 3p correlations  {qT bi, nT1 bi, nT2 bi}pi=1. So the test has time complexity O(pn).

4.5.1 Parameter selection

Assume the sphere S(q, r) has been selected. The inequality constraints in (3) provide the natural half space bounds ˆθ ∈ H(b), b ∈ B. H(b)can be equivalently spec-ified as  {θ: nT θ ≤ c}with  n = b/∥b∥2and  c = 1/∥b∥2and the resultant dome  H(b) ∩ S(q, r)has parameters given by (24), (25) and (26).

We seek two such half spaces. We can select the first by minimizing its dome radius  rd. By (26), this requires maximizing  ψd:

image

When all features have equal norm, we can simply maximize  bT qover  b ∈ B.

Suppose we have selected the first feature  b(1)using (40). This yields a dome with dome center  q(1) = qdand dome radius  r(1) = rd. Assume that  ψd ≥ 0. Then by Lemma 4, the smallest sphere containing the dome has center  q(1)and radius  r(1). To select the second feature, we can focus on the sphere  S(q(1), r(1))and repeat the above construction:

image

When all features have equal norm, we can simply maximize  bT q(1)over  b ∈ B/b(1). We call this parameter selection method Dictionary-based THT (D-THT).

Alternatively, if we have solved the instance  (y0, λ0)yielding primal and dual solutions  ˆw0and ˆθ0(see (4)), then ˆθ0must satisfy the inequalities (7). Using some algebra and (4), these inequalities can be written as: (B ˆw0)T θ ≤ (B ˆw0)T ˆθ0. Since  0 ∈ F, the right hand side is nonnegative. Hence the inequality bounds F in the half space  nT1 θ ≤ c1with

image

One can then select  n2and  c2using (41).

We will return to the THT tests in  §7 where we compare the performance of the tests with m = 0, 1, 2 and examine the trade-off between rejection rate and computational efficiency that increasing m imposes.

4.5.2 Connections with the Literature

The form of the Two Hyperplane Test was first presented (without proof) in [33], for unit norm features and target vector. The form given here (with proofs) is a generalization of that result. The general formulation allows the use of any sphere and hyperplane constraints bounding ˆθand includes the feature constraint used in [23] as a special case.

4.6 Composite Tests

The construction described in  §4.4 gives rise to a finite sequence of spheres and domes:  S1 ⊃ D1 ⊂ S2 ⊃ · · · ⊃Sk−1 ⊃ Dk−1 ⊂ Sk. Each sphere and dome has an associated test. But since  Djis contained in  Sjand  Sj+1, each dome test is stronger than the tests for the spheres that precede and succeed it.

But  Sj+1is not contained in  Sjand  Dj+1is not contained in  Dj. So we can’t claim that the last dome Dk−1leads to the strongest test. Moreover, a test based on the region  ∩k−1j=1Djis usually too complex to compute.

An alternative is to implement a composite test that rejects  biif it is rejected by any of the tests  {TDj}k−1j=1. For the nonnegative lasso,  TDjtakes the form  µj(bi) < 1, with  µj(bi) = qTj bi + M1(nTj bi, ∥bi∥2)and  M1given by (27). So the composite test rejects  biif

image

Similarly, for the lasso problem the composite test rejects biif

image

Reflecting the dome construction method, we call the tests (43) and (44) iteratively refined dome tests (IRDT). These tests can be implemented in several ways and extra domes arising in the course of the construction can also be included. This is illustrated in  §6. The major cost of the tests is calculating the inner products  qTj biand nTj bifor each feature  bito be tested. Because of the iterative construction, this can be done by computing qT1 bi, nT1 bi, . . . , nTk−1bi(see (34), (35), (37)). So to execute all of the tests  D1, . . . , Dk, only k inner products are used per feature tested. This is O(nk) time complexity per feature tested where n is the feature dimension. So the marginal cost of increasing k by 1 is the cost of computing one additional inner product per feature tested.

A composite test is mathematically equivalent to test disjunction,  (T1∨T2)(bi) = T1(bi)∨T2(bi). A disjunction of region tests is weaker than the test based on the intersection of the regions. For example, consider two spheres of equal radius with a small intersection. Both spheres can intersect a half space while the intersection does not.

Lemma 6. For compact sets  R1, R2: TR1 ∨ TR2 ⪯ TR1∩R2.

Lemma 6 indicates that a disjunction of tests is trading rejection performance for simplicity and ease of implementation. Despite the above limitation, the IRDT test is very competitive with Dictionary-based THT on the datasets used in our numerical studies.

4.6.1 Connections with the Literature

The sphere test ST3 in [22] is based on a refined spherical bound. In [22] it is assumed that y and all features have unit norm. ST3 is then constructed starting with the default spherical bound  S(q1, r1)with  q1 = y/λand  r1 = (1/λ − 1/λmax). The greedy strategy selects the feature  b = bmax. Then the dual solution ˆθlies in the default dome formed by the intersection of the spherical ball  S(q1, r1)and the half space  H(bmax). This intersection is indicated by the green dome region G in Fig. 1(c). The smallest spherical ball bounding G (dashed magenta circle in Fig. 1(c)) is obtained by substituting the values of  q1, r1and  bmaxinto (34), (35) and (36). This yields  ψ2 = λmax, q2 = y/λ − (λmax/λ − 1) bmaxand  r2 =�1/λ2max − 1 (λmax/λ − 1). These parameters are derived in [22] using a distinct approach.

A two term disjunction test is used in [24]. This test is implemented sequentially. The first test is applied and then the second is applied to the remaining features. Any disjunction test can be implemented sequentially in this fashion. The key innovation in [24] is that each test is based on an ellipsoidal bound on ˆθ. The first ellipsoid is the minimum volume ellipsoid containing the default dome (30). The second ellipsoid is constructed in a greedy fashion by selecting a feature so that the best ellipsoidal bound of the intersection of its half space and the first ellipsoid has minimum volume. The first step is in the spirit of ST3 except using an ellipsoidal bound. The second step is bound refinement based on ellipsoids rather than spheres. An ellipsoidal bound is tighter than the spherical bounds used in this section. However, its description requires a center  q ∈ Rnand a matrix  P ∈ Rn×nto encode its shape and orientation. When n is large this could be an impediment. In contrast, a sphere requires a center  q ∈ Rnand a scalar radius r.

The screening tests discussed so far screen the dictionary once, then solve the reduced lasso problem. We hence call these tests “one-shot” screening tests. These tests can perform well for moderate to large values of  λ/λmaxbut often fail to provide adequate rejection performance for smaller values of  λ/λmax. This is primarily due to the challenge of obtaining a tight region bound on ˆθwhen λ/λmaxis small.

Alternative screening methods can help with this problem. For example, [31] examined the idea of screening and solving (1) for a sequence of instances {(y, λk)}Nk=1(Recursive-SAFE). At step k the previously solved instance  (y, λk−1)defines a bound on the dual solution of the instance  (y, λk). Hence the previous solution can help screen the next instance in the sequence. A similar idea is proposed by [19] in the form of the Strong Sequential Rule. This is used to solve the lasso problem “over a grid of  λvalues”. In [32], the SAFE test for the lasso is upgraded to use a specific dome test.

In a similar spirit, [25] proposed running a homotopy algorithm to find a solution at the K-th breakpoint on the regularization path of  ˆw(λ). This effectively solves a sequence of lasso problems (via homotopy) to obtain a solution  ˆwKat  λK > λt. The dual solution ˆθKis then used to screen the instance  (y, λt). This has potential advantages, but relinquishes control of the values  λkto the breakpoints in the homotopy algorithm. In the worst case the regularization path can have  O(3p)breakpoints [34]. As a variant on homotopy, Sequential Lasso [35] solves a sequence of partially  ℓ1penalized least squares problems where features with non-zero weights in earlier steps are not penalized in subsequent steps.

With the exception of homotopy, all of the above sequential schemes use a fixed open loop design for N and the sequence  {λk}Nk=1. For example, first fix  N ≥ 2, then select  λ1 < λmax, λN = λt, and let the intermediate values be selected via geometric spacing:  λk = αλk−1with  α = (λt/λ1)1/(N−1). To solve instance  (y, λt), we first screen and solve the instance  (y, λ1). Then sequentially for k = 2, . . . , N, we screen instance  (y, λk)with the help of the known solution of the previous instance, and then solve the reduced problem. This continues until the solution for  (y, λt)is obtained. Sometimes all solutions on the grid of  λvalues are of interest, e.g., cross validation for parameter selection. But there are many other applications where only the solution of the final instance  (y, λt)is of interest – the other instances are merely waypoints in the computation.

The solution of the previous instance helps screen the next instance as follows. First use ˆθk−1as a dual feasible point to form the basic bounding sphere (21) for ˆθkwith center  y/λkand radius  ∥y/λk−ˆθk−1∥2. Then use ˆθk−1as the projection of  y/λk−1onto F, to form the bounding halfspace (31) with  nk−1 = (y/λk−1 − ˆθk−1)/∥y/λk−1 −ˆθk−1∥2, and  ck−1 = nTk−1ˆθk−1. This sphere and halfspace yield the bounding dome derived in  §4.3.2:

image

This dome, illustrated in Fig. 4, encapsulates information about ˆθkprovided by the dual solution of the previous instance.

image

Fig. 4. An illustration of the dome (45) formed at step k.

In contrast to an open loop design, one can use feedback to adaptively select N and the sequence  {λk}Nk=1as the computation proceeds [36]. This allows the value of N and the sequence  {λk}Nk=1to be adapted to each particular instance. For some instances, a small value of N is used, for others, a larger value is used. One way to see why feedback helps is to examine the diameter of Dkin (45).

Proposition 1 (From [36]). Let  Dkbe the dome (45) and δk =diam(Dk). Then

image

Using Proposition 1, it can be shown that the data adaptive feedback selection rule

image

where R > 0 is a selectable parameter, ensures that diam(Dk) ≤ Rfor all k > 1. This allows direct control of how tightly the dome (45) bounds ˆθk, k = 1, . . . , N. This is called Data-Adaptive Sequential Screening (DASS). Note that in this scheme N is not predetermined. Instead the stopping time is decided by the feedback scheme. However, (47) ensures that

image

where C is an bound on the dual regularization path [36]. We employ DASS to demonstrate the effectiveness of sequential screening in  §7. In particular, Fig. 10 shows

image

Fig. 5. Algorithm for THT. The functions  Vu and Vl are fromTheorem 4. Other Notation: For the lasso, f(z) = |z| and g(z) = sign(z) and for the nonnegative lasso, f(z) = g(z) = z. For a logical condition  c(·), [[c(z)]]evaluates to true if z satisfies condition c and false otherwise.

the range of N used by DASS on two datasets and Fig. 11 shows its performance in three sparse classification problems.

Each of the screening tests previously described requires the inputs B, y, and  λand returns  v1, . . . , vpwhere  viis a logical value indicating if  biis rejected. The algorithms can be implemented in an online fashion with very few features stored in memory at once. The critical computation is calculating inner products of the form  yT biand  nT bi. It follows that the time complexity of one-shot screening is O(np). If the features are sparse then running times are further reduced. Let s denote the average feature sparsity. Then the time complexity of one shot screening is O(sp). Reference [36] discusses the

image

Fig. 6. Algorithm for IRDT. Here  Vu and Vlare from Theorem 3. For other notation see the caption of Algorithm 1.

complexity of Data Adapted Sequential Screening and provides the upper bound (48) on the number of steps used.

A basic implementation of THT is shown in Algorithm 1. If the dictionary is unnormalized, the feature norms can be precomputed and passed as an input to the algorithm. If it is normalized, we recommend simplifying the algorithm by setting  βi = 1and removing unnecessary floating point operations (see  §4.5). The algorithm accepts two additional optional inputs: either a dual solution  (θF , λ2)or a feasible point  θF. The dual solution is useful for the application of THT in sequential screening. It is used to select the first half space used by THT (lines 7,8). If only a feasible point is provided, it is used to select the sphere radius. Otherwise, the default point  y/λmaxis used. The remaining half spaces are selected using dictionary-based selection (40), (41) (§4.5). The output values  viare determined for each  biby evaluation of the THT test in Theorem 4.

A basic implementation of IRDT is shown in Algo-

image

Fig. 7. Algorithm for Data-Adaptive Sequential Screening.

rithm 2. To keep the notation simple and the algorithm understandable, all features and y are assumed to have unit norm, but this is not required (see  §4.6). IRDT uses at most s iterations with the value of s supplied by the user (we recommend  s ≤ 5). The algorithm passes through the dictionary at most s + 1 times with the main loop executed at most s times. The break at line 15 terminates this loop early if suitable domes can’t be found. The algorithm accepts a feasible point  θFfor the dual problem as an optional input and can be adapted to accept a known dual solution. The value  vifor each  biis determined by a disjunction of a set of dome tests each based on the dome test in Theorem 3. These disjunctions are computed sequentially with subsequent tests applied only to currently surviving features.

Data-Adaptive Sequential Screening solves N lasso instances  {(y, λk)}Nk=1for a sequence of descending values λkwhere  λmax > λ1and  λN = λtis the regularization parameter value for the instance to be solved. The user must specify a radius R > 0. At each step, the algorithm uses a strong “one-shot” screening test, for example THT, provided with a solution of the previous instance, followed by an external lasso solver to solve the screened current instance. The algorithm sets  λ1 = 0.95λmaxand thereafter uses the feedback rule (47) to select  λkuntil λk ≤ λt. It then sets  N = k, λN = λtand screens and solves the final problem. See [36] for additional details on this algorithm.

We now examine the performance of the screening algorithms presented using the datasets summarized in Table

image

TABLE 1 Summary of the datasets. The reported value of  λmax isobtained by averaged over the lasso instances solved.

1, and discussed in detail below. (1) RAND: We generate lasso problems with n = 28 and p = 10, 000 by randomly generating 10, 001 28-dimensional vectors  y, b1, . . . , b10,000. These vectors are scaled to unit norm. (2) MNIST: 70, 000 images (28 × 28) of hand-written digits (60, 000 and 10, 000 in the training and testing sets, respectively) [37], [38]. We form a dictionary by randomly sampling 500 training images for each digit, and a target vector from the testing set. Each image is vectorized and scaled to unit norm. (3) YALEBXF: Frontal face images (192 × 168) of 38 subjects in the extended Yale B face dataset [39], [40]. We randomly select p = 2, 000 of the 2, 414 images as the dictionary, and y from the remaining 414 images. Each image is vectorized and scaled to unit norm. (4) RCV1: A bag-of-words representation of four classes from the Reuters Corpus Volume 1 (RCV1) dataset [41]. There are 9,625 documents with 29,992 distinct words, including categories “C15”, “ECAT”, “GCAT”, and “MCAT”, each with 2,022, 2,064, 2,901, and 2,638 documents respectively. The vector representations have an average of  75.9 ± 60.0nonzero entries; a sparsity of 0.25% ± 0.19%. (5) COIL: Images (128 × 128 × 3) of 100 objects, with 72 images per object obtained by rotating the object every 5 degrees [42]. (6) GTZAN: 100 music clips (30 sec, sampled at 22,050 Hz) for each of ten genres of music [43]. Each clip is divided into 3-sec adjacent texture windows (TW) with 50% overlap. Each TW is represented using a first order scattering vector of length 199 [44]. (7) NYT: A bag-of-words dataset in which 300,000 New York Times articles are represented as vectors with respect to a vocabulary of 102,660 words [45]. The i-th entry in vector j gives the number of occurrences of word i in document j. Documents with low word counts are removed, leaving 299,752 documents.

All experiments solve the standard lasso problem (1) using the Feature-sign [46] and FISTA [47] solvers. The grafting solver [48] was also tested and gave similar qualitative performance. We use two performance metrics: the percentage of features rejected and the speedup (time to solve the lasso problem divided by sum of the time to screen and the time to solve the reduced lasso problem). Timing and speedup results depend on the solver used. The regularization parameter  λ

image

Fig. 8. Performance of ST, DT and D-THT. Top: rejection percentage; Bottom: speedup using screening and the FeatureSign solver [46]. Solid curves lower bound and dashed curves upper bound performance for spherical bounds centered at  y/λ.

image

Fig. 9. Performance comparison of ST, DT, D-THT (all with default  θF), enhanced DPP (EDPP) [25] and the strong rule [19] using the FISTA solver on the MNIST and YALEBXF datasets.

is set using the scaling invariant ratio  λ/λmaxwhere λmax = maxi |yT bi|. So  λ/λmax ∈ [0, 1]with larger values yielding sparser solutions. For all datasets except RCV1 and NYT, we randomly select 20 dictionaries and for each dictionary we use 60 randomly selected test vectors. Averaged metrics and standard errors are reported across these 1200 lasso instances. For RCV1, since  λmaxis very low, we select 496 lasso instances with  λmax ≥ 0.5from the pool of 1200 instances and report results for these 496 instances. For the very large NYT dataset, we select the first 299,000 examples as the dictionary and 6 documents from the remaining 752 as target vectors.

7.1 The performance of one-shot screening

We first benchmark the performance of the one-shot tests: ST (§4.2), DT (§4.3), and D-THT (§4.5). We first use the default spherical bound (22). This gives a lower bound for the performance of the one-shot screening methods on each dataset. The default dome test combines this sphere with the feature  bmax, while dictionary-based THT combines it with two features using the selection scheme detailed in (40), (41). We also show results using a second “oracle” bounding sphere with center  y/λand radius  r = ∥y/λ − ˆθ∥2. This provides an upper bound on performance over bounding spheres

image

Fig. 10. Data-Adaptive Sequential Screening (DASS) applied to MNIST (top) and YALEBXF (bottom) using the feature-sign and FISTA solvers. (Left): average rejection percentage. (Middle): Speedup factor. (Right): The average value of N.

centered at  y/λ.

The performance of the one-shot screening methods on the test datasets based on the feature-sign solver [46] are shown in Fig. 8. Here are the salient points: (a) While the default one-shot tests perform well for high values of  λ/λmax, this performance quickly degrades as  λ/λmaxdecreases. At values of  λ/λmaxaround 0.2 and lower, the tests are not effective. (b) On the other hand, the upper bounds indicate potential for improvement if a better spherical bound can be found. Indeed, the significant gap between the lower and upper performance bounds suggests that it is worth investing computation to improve the default spherical bound. (c) Among the tested methods, D-THT exhibits the best performance except at very high values of  λ/λmax. On RAND, for example, using  λ/λmax = 0.5and the default spherical bound, DTHT yields a 400% rejection improvement over DT. The concurrent speedup for D-THT is about 5X while for DT is less than 2X. These effects are also seen for MNIST and YALEBXF.

Fig. 9 shows a performance comparison between ST, DT, D-THT (all with default  θF), EDPP [25] and the strong rule [19] using the FISTA solver [47]. Here are the salient points: (a) Aside from the small dip at high values of  λ/λmax, the speedup trend for the FISTA solver is similar to that for feature-sign. For the datasets we tested, feature-sign seems to be faster than FISTA, but FISTA is more sensitive to the reduction in dictionary size resulting from screening. Thus it has greater speedup. This can also been seen in Fig. 10. (b) Of the one-shot methods tested, dictionary based THT and DT consistently have the best rejection performance. But while current one-shot screening tests can perform well at moderate to high values of  λ/λmax, such performance does not extend to the important range of low values of  λ/λmax.

The rejection and speedup of IRDT (not plotted) and D-THT were very similar on the test datasets with IRDT terminating after 3 or 4 iterations at the break in line 14-16 in Algorithm 2.

7.2 The performance of sequential screening

To explore the effectiveness of sequential screening, we tested the Data-Adaptive Sequential Screening (DASS) scheme (47). The performance results are shown in Fig. 10. Here are the salient points: (a) For both MNIST and YALEBXF, with R = 0.2 the performance of DASS is robust across a variety of values of  λt; (b) DASS yields significant improvement in rejection fraction and robust speedup performance compared with one-shot tests; (c) At values of  λ/λmaxaround 0.1 and lower, DASS is rejecting 98% of the dictionary while giving speedup greater than 1. This is successful screening at much lower values of  λ/λmax.

7.3 Sequential screening and classification

Now we focus on specific values of  λ/λmaxmotivated by practical lasso problems and examine how screening

image

Fig. 11. Performance of four sequential screening algorithms (DASS, sequential dome, sequential strong rule, sequential EDDP) for the screening of the lasso problems in SRC on three datasets (COIL, GTZAN, RCV1). (Top): cross validated accuracy to determine the best  λt/λmax. (Bottom): Speedup vs Rejection at the best  λt/λmaxfor each dataset.

can help. To do so, we use the COIL, GTZAN and RCV1 datasets to examine the impact of sequential screening in Sparse Representation Classification (SRC) [7]. Although SRC was first proposed for face recognition problems, it is a generic multi-class classifier that has found success in a variety of applications. The time and memory consuming step in SRC is solving a lasso problem. For the COIL dataset we made the SRC problem more challenging by saturating a random subset of 0.5% of the pixels to white.

We first use cross-validated prediction accuracy to determine the best values for  λt/λmaxfor SRC when applied to the datasets. The results (top row of Fig. 11) are COIL:  λt/λmax = 0.15, GTZAN:  λt/λmax = 0.1, and RCV1:  λt/λmax = 0.2. For these specific values of λt/λmax, we then examine the performance of the following screening schemes in solving SRC problems for these datasets: (1) the feedback scheme DASS, and the open loop sequential screening schemes (2) sequential dome test, (3) sequential strong rule [19] and (4) sequential EDPP rule [25]. We select the parameters of each method to keep the average value of N the same. Since DASS uses a variable value of N, we first select its parameters, then use the resulting average value of N for the open loop schemes. For COIL, DASS with R = 0.5 yields an average N = 4.72; for GTZAN, DASS with R = 0.15 yields an average N = 14.63; and for RCV1, DASS with R = 1 yields an average N = 3.59. Then for the open loop sequential screening schemes we set N = 5 for COIL, N = 15 for GTZAN and N = 4 for RCV1. This keeps the average path lengths of the screening schemes the same. The results are shown in the bottom row of Fig. 11.

Here are the salient points: (a) Over 50% of the experiments (dataset+screening method) gave a speedup of at least 10X. So sequential screening offers considerable potential gain in practical applications. (b) At the high end, DASS provided 28X, 16X and 18X speedup in solving SRC lasso problems for the three datasets. That’s an average speedup of 21X. However, given that we only used three datasets and did not “tweak” each method to find its best performance on each dataset, we can’t conclude that one method is always better than the rest. That would require a more extensive investigation. Finally, although the strong rule can’t rule out false rejections, we detected no false rejections in our experiment.

7.4 Sequential screening on a large dataset

Finally, we used the NYT dataset to explore how successfully one can screen and solve lasso problems using small values of  λ/λmaxwith high dimensional data and a very large dictionary. We normalize each document and randomly selected six documents from the first 100 of the 752 held out documents subject to  0.5 < λmax < 0.9. DASS Screening (with  λt/λmax = 0.1, λ1 = 0.95λmaxand R = 0.3) was done in an “on-line” mode by loading only small amounts of the dictionary into memory at a time. The value of N is selected automatically for each instance. In all tested cases,  N ≤ 27. As a benchmark, we tested a geometrically spaced, open loop sequential screening algorithm (sequential THT) using  λt/λmax =0.1, λ1 = 0.95λmaxand N = 30.

The results for both methods are shown in Fig. 12. We can’t solve these lasso problems without using screening. Hence the usual speedup metric can’t be evaluated. The main time cost is sequentially reading features from disk into RAM. Here are the main points to note: (a) Under geometric spacing with fixed N, less than 10,000 of the features (3.3%) were held in memory at once; (b) For DASS, less than 1,000 of the features (0.33%) were held in memory at once – an order of magnitude improvement over fixed geometric spacing (The small dip at  λtis due to termination method); (c) On this dataset, both open loop sequential screening and DASS clearly exhibit a significant performance advantage over one-shot tests. The use of feedback by DASS to automatically select the number of steps N and the values  {λk}Nk=1, yields robust rejection performance. By tweaking N for each test vector in the open loop scheme, one could improve its average performance. But DASS handles this automatically and robustly.

In our survey we have emphasized separating the discussion of test structure from the problem of selecting its

image

Fig. 12. Sequential screening using on the NYT dataset. (Top): Open loop geometric spacing using N = 30 and THT test; (Bottom): DASS with R = 0.3. For six problems  (yi, λt),i = 1, . . . , 6 with λt = 0.1λmax, the plotted points indicate the number of surviving features after THT screening of each instance  (yi, λik), k = 1, . . . , N, i = 1, . . . , 6.

parameters. This allowed us to see connections between many existing screening tests, and enabled a clearer understanding of screening in general. Hopefully this will be advantageous to the development of new tests and parameter selection methods.

For one-shot screening tests, our numerical studies on THT strongly suggest that more complex region tests are indeed worthwhile. THT gave significant performance improvement beyond simpler tests in both rejection and speedup over important ranges of  λ/λmaxvalues. But the performance of one-shot tests is still inadequate at small values of  λ/λmax. The numerical studies also indicated a significant performance gap between using the default spherical bound and the best bound at the same sphere center. This indicates the value of additional computation to improve the spherical bound.

Our empirical studies have shown that sequential screening (for example, DASS) can significantly extend useful screening performance to a wider range of  λ/λmax. DASS has the additional advantage that it selects both the number N and the sequence  {λk}Nk=1automatically.

Screening is critical when the dictionary will not fit into available memory. We have demonstrated a successful application of DASS to a very large NYT dataset, of dimension 102, 660 by 299, 000. To the best of the authors’ knowledge, with constrained computational resources, screening is the only way to solve lasso problems of this size.

The concepts described in this survey should provide a

firm foundation for understanding screening for related sparse representation problems. This includes screening for the elastic net (reducible to lasso problem),  ℓ1regularized logistic regression, the graphical lasso, and the group lasso [19]. In addition, SAFE methods have been developed for the sparse support vector machine and logistic regression in [21], and the group lasso in [25]. Recently, Liu et al. [49] have proposed safe screening for generalized sparse linear models. This makes use of the variational inequality that provides a necessary and suf-ficient optimality condition for the dual problem. Dash et al., [50], consider screening for Boolean compressed sensing in which the objective is to select a sparse set of Boolean rules that are predictive of future outcomes. One of the screening rules developed is based on the duality arguments presented here. Targeting problems that use nuclear norm regularization to pursue a low rank matrix solution, Zhou et al., [51], have recently proposed safe subspace screening for nuclear norm regularized least squares problems. Wang et al., [52], have integrated DASS with sparse representation classification, to speed up classification, and Jao et al., [53], have applied screening to the problem of representing music in terms of an audio codebook (dictionary) for genre tagging. We expect to see more such applications as the size of dictionaries increase.

image

The dual lasso problem (3) is obtained as follows. Setting z = y − Bwin (1) gives the constrained problem: minz,w 1/2 zT z + λ∥w∥1, subject to  z = y − Bw. Form the Lagrangian  L(z, w, µ) = 1/2 zT z+λ∥w∥1+µT (y−Bw−z)and compute the subdifferentials with respect to z and w. Using the condition that 0 must be in each subdifferential gives  µ = ˆzand the constraints  |µT bi| ≤ λ, i = 1, . . . , p. The above equations allow the elimination of z and w from L. This leads to the dual problem: maxµ 1/2 ∥y∥22 − 1/2∥µ − y∥22, subject to  |µT bi| ≤ λ, i = 1, . . . , p. The change of variable  θ = µ/λthen gives (3). By construction, the primal and dual solutions  ˆwand ˆθare related through (4).

image

Theorem 1: The proof for the lasso is given above the theorem statement. For the nn-lasso, only the definition of the active set changes.

Corollary 1: In the proof of Theorem 1, the inclusion ¯S ⊆ ¯A(ˆθ), gives  i ∈ ¯Simplies  ˆwi = 0.

image

maxθ∈R1 θT b ≤ maxθ∈R2 θT b = µR2(b). For the lasso, if biis rejected by  TR2, then  µR2(bi) < 1and  µR2(−bi) <1. Hence  µR1(bi) < 1and  µR1(−bi) < 1. So  biis also rejected by  TR1. Therefore  TR2 ⪯ TR1. The proof for the nn-lasso is similar.

image

Lemma 2: By Cauchy-Schwarz:  θT b = (θ − q)T b +qT b ≤ ∥θ − q∥2∥b∥2 + qT bwith equality when  θ − qis aligned with b. Then  θ ∈ S(q, r)ensures  θT b ≤ r∥b∥2 +qT bwith equality when  ∥θ − q∥2 = r.

Theorem 2: For the nn-lasso we reject  biif  µS(bi) <1. So (19) follows from (18). For the lasso we reject biif  µS(bi) < 1and  µS(−bi) < 1, i.e., if  qT bi <(1 − r∥bi∥2)and  qT bi > −(1 − r∥bi∥2). This gives (19). Note  max{µS(bi), µS(−bi)} < 1 ⇔ max{qT bi +r∥bi∥2, −qT bi + r∥bi∥2} < 1 ⇔ |qT bi| < 1 − r∥bi∥2. Thus (19) and (20) are equivalent.

image

Lemma 3: Solving (16) with m = 1 is equivalent to solving the Lagrangian problem:

image

Setting the derivative w.r.t.  θequal to zero yields  θ =q + b/2µ − σn/2µ. Substituting  θinto  µDand (49):

image

where  ψ = (qT n − c)/rand  t = nT b. We now minimize this expression over  µ, σ ≥ 0. Setting the derivatives of L w.r.t.  µ, and  σequal to 0 yields two equations to solve for µand  σ: ∥b∥22+σ2−4µ2r2 = 2σtand  σ = 2µrψ+t. There

are two cases: (A) If  t ≥ −ψ∥b∥2, then  σ = t+ψ

and µ = 12r

∥b∥22−t21−ψ2; and (B) If  t < −ψ∥b∥2, then  σ = 0and  µ = ∥b∥2/(2r). Substitution of these expressions into (50) yields the result in Lemma 3.

Theorem 3: For the nn-lasso, we reject b if µD(b) = qT b + M1(nT b, ∥b∥2) < 1, i.e., if  qT b <1 − M1(nT b, ∥b∥2) = Vu(nT b, ∥b∥2). For the lasso we reject b if  qT b + M1(nT b, ∥b∥2) < 1and  −qT b +M1(−nT b, ∥b∥2)} < 1, i.e., if  qT b < 1−M1(nT b, ∥b∥2) =Vu(nT b, ∥b∥2)and  qT b > −(1 − M1(−nT b, ∥b∥2)) =Vl(nT b, ∥b∥2).

image

We make use of the following lemma from [36].

Lemma 7. If  R = S(q, r) ∩ {nT θ ≤ c}is nonempty, then

image

Lemma 5: We first solve (16) (m = 2) with  ∥b∥2 = 1by solving the Lagrangian problem:

image

Solving  ∂L/∂θ = 0for  θand substitution into  µRand (52) yields:

image

where  ψ1 = (qT n1 − c1)/r, ψ2 = (qT n2 − c2)/r, t1 =nT1 b, t2 = nT2 band  τ = nT1 n2. Setting the derivatives of L w.r.t.  µ, σand  λ, respectively, to zero yields:

image

(Case I) If  λ = 2µrψ2 + t2 − στ < 0, then set  λ = 0. Substitution into (54) yields:  σ = 2µrψ1 +t1and  1+σ2 −4µ2r2 − 2σt1 = 0. There are two subcases:

(IA) If  t1 > −ψ1, then  σ = t1 + ψ1�1−t211−ψ21,  µ = 12r�

and  λ < 0 ⇔ (ψ2 − ψ1τ)�1−t211−ψ21 + t2 − t1τ < 0. (IB) If t1 ≤ −ψ1, then  σ = 0, µ = 1/(2r)and  λ < 0 ⇔ t2 < −ψ2. (Case II) Suppose  λ = 2µrψ2 + t2 − στ > 0. Again there are two subcases: (IIA) If  σ = 2µrψ1 + t1 − λτ < 0, then set  σ = 0. Substitution into (54) yields:  λ = 2µrψ2 + t2and  1 + λ2 − 4µ2r2 − 2λt2 = 0. Solving gives,

with λ >  0 ⇔ t2 > −ψ2and σ <  0 ⇔ (ψ1−ψ2τ)�

t1 − t2τ < 0. (IIB) If  σ = 2µrψ1 + t1 − λτ > 0, then substituting  λ = 2µrψ2 +t2 −στand  σ = 2µrψ1 +t1 −λτinto (54) yields,  (1 − τ 2)σ = 2µr(ψ1 − ψ2τ) + t1 − t2τand (1 − τ 2)σ2 + 2σ(t2τ − t1) + 4µ2r2(ψ22 − 1) + 1 − t22 = 0. Solving these equations gives:  µ = 12r∆, λ = ψ2−ψ1τ1−τ 2 ∆ +

image

various conditions into (53) yields:

[(1)] t1 < −ψ1, t2 < −ψ2: µR(b) = qT b +r.

image

For general b we use  µR(b) = ∥b∥2µR(b/∥b∥2). So in each of the above expressions we replace  b, t1 = nT1 b, and  t2 = nT2 bby  b/∥b∥2, t1/∥b∥2and  t2/∥b∥2, re- spectively. Then multiply each expression by  ∥b∥2. This yields the result in Lemma 5.

Theorem 4: This is almost identical to the proof of Theorem 3 and is hence omitted.

image

Lemma 6: Note max{µR(bi), µR(−bi)} =maxθ∈R max{θT b, −θT b} = maxθ∈R |θT b|. If R1or  R2is empty, the result is clear. Hence assume each is nonempty. For the lasso, if  biis rejected by  TR1 ∨ TR2, then either  maxθ∈R1 |θT bi| < 1or  maxθ∈R2 |θT bi| < 1. Without loss of generality assume  maxθ∈R1 |θT bi| < 1. Since  R1 ∩ R2is a subset of  R1, this implies that maxθ∈R1∩R2 |θT bi| ≤ maxθ∈R1 |θT bi| < 1, so  biis also rejected by  TR1∩R2. Therefore  TR1 ∨ TR2 ⪯ TR1∩R2. The proof for the nn-lasso is similar.

This work partially supported by NSF grant CIF 1116208.

[1] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Royal. Statist. Soc B., vol. 58, no. 1, pp. 267–288, 1996.

[2] A. Y. Yang, A. Ganesh, Z. Zhou, S. S. Sastry, and Y. Ma, “A review of fast l1-minimization algorithms for robust face recognition,” Arxiv preprint arXiv:1007.3753, 2010.

[3] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.

[4] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan, “Sparse representation for computer vision and pattern recognition,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.

[5] J. Mairal, M. Elad, and G. Sapiro, “Sparse representation for color image restoration,” IEEE Transactions on Image Processing, vol. 17, no. 1, pp. 53–69, 2008.

[6] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Nonlocal sparse models for image restoration,” in IEEE 12th Int. Conf. on Computer Vision, 2009, pp. 2272–2279.

[7] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans, on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.

[8] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, H. Mobahi, and Y. Ma, “Towards a practical face recognition system: robust alignment and illumination by sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.

[9] K. Yu, T. Zhang, and Y. Gong, “Nonlinear learning using local coordinate coding,” in Advances in Neural Information Processing Systems, vol. 3, 2009.

[10] T. N. Sainath, A. Carmi, D. Kanevsky, and B. Ramabhadran, “Bayesian compressive sensing for phonetic classification,” in IEEE Int. Conf. Acoustics Speech and Signal Processing, 2010, pp. 4370–4373.

[11] T. N. Sainath, B. Ramabhadran, D. Nahamoo, D. Kanevsky, and A. Sethy, “Sparse representation features for speech recognition,” in Eleventh Annual Conf. of the Int. Speech Communication Association, 2010.

[12] K. Chang, J. Jang, and C. S. Iliopoulos, “Music genre classification via compressive sampling,” in Proc. 11th Int. Conf. on Music Information Retrieval (ISMIR), 2010, pp. 387–392.

[13] S. Prasad, P. Melville, A. Banerjee, and V. Sindhwani, “Emerging topic detection using dictionary learning,” in ACM Conference on Information and Knowledge Management, 2011.

[14] D. Zhang, M. Yang, and X. Feng, “Sparse representation or collaborative representation: Which helps face recognition?” in Int. Conf. on Computer Vision, 2011, pp. pp. 471–478.

[15] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least Angle Regression,” Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.

[16] J. A. Trop and A. C. Gilbert, “Signal Recovery From Random Mea- surements Via Orthogonal Matching Pursuit,” IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.

[17] S. M. Smith, “Overview of fMRI analysis,” British Journal of Radiology, vol. 77, no. Special Issue 2, p. S167, 2004.

[18] J. Fan and J. Lv, “Sure independence screening for ultrahigh dimensional feature space,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 5, pp. 849–911, 2008.

[19] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani, “Strong rules for discarding predictors in lasso-type problems,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 74, 2012.

[20] G. Thompson, F. Tonge, and S. Zionts, “Techniques for removing nonbonding constraints and extraneous variables from linear programming problems,” Management Science, vol. 12, no. 7, pp. 588–608, 1966.

[21] L. El Ghaoui, V. Viallon, and T. Rabbani, “Safe feature elimination in sparse supervised learning,” Pacific Journal of Optimization, vol. 8, no. 4, pp. 667–698, 2012.

[22] Z. J. Xiang, H. Xu, and P. J. Ramadge, “Learning sparse represen- tations of high dimensional data on large scale dictionaries,” in Advances in Neural Information Processing Systems, 2011.

[23] Z. J. Xiang and P. J. Ramadge, “Fast lasso screening tests based on correlations,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2012.

[24] L. Dai and K. Pelckmans, “An ellipsoidal based, two-stage screen- ing test for bpdn,” in 20th European Signal Processing Conference, Aug. 2012.

[25] J. Wang, P. Wonka, and J. Ye, “Lasso screening rules via dual poly- tope projection,” Journal of Machine Learning Research, to appear.

[26] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society B., vol. 67, no. 2, pp. 301–320, 2005.

[27] R. Tibshirani and J. Taylor, “The solution path of the generalized lasso,” Annals of Statistics, vol. 39, no. 3, pp. 1335–1371, 2011.

[28] J.-B. Hiriart-Urruty and C. Lemarechal, Fundamentals of Convex Analysis. Springer, 2001.

[29] J.-J. Fuchs, “Recovery of exact sparse representations in the pres- ence of bounded noise,” IEEE Transactions on Information Theory, vol. 51, no. 10, pp. 3601–3608, Oct. 2005.

[30] R. Tibshirani, “The lasso problem and uniqueness,” arXiv:1206.0313 [math.ST], 4th Nov. 2012.

[31] L. El Ghaoui, V. Viallon, and T. Rabbani, “Safe feature elimination in sparse supervised learning,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2010-126, Sep. 2010. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/ TechRpts/2010/EECS-2010-126.html

[32] ——, “Safe Feature Elimination for the LASSO and Sparse Super- vised Learning Problems,” arXiv:1009.4219v2 [cs.LG], 2011.

[33] Y. Wang, Z. Xiang, and P. Ramadge, “Tradeoffs in improved screening of lasso problems,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Jun. 2013.

[34] J. Mairal and B. Yu, “Complexity analysis of the lasso regulariza- tion path,” in Proc. of the 29th Int. Conf. on Machine Learning (ICML 2012), Edinburgh, Scotland, 2012.

[35] S. Luo and Z. Chen, “Sequential lasso for feature selection with ultra-high dimensional feature space,” arXiv:1107.2734 [stat.ME], 14 July 2011.

[36] Y. Wang, X. Chen, and P. J. Ramadge, “Feedback-controlled se- quential lasso screening,” Princeton University, Tech. Rep., June 2015.

[37] Y. LeCun and C. Cortes, “The MNIST database of handwritten digits,” 1998.

[38] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278 –2324, Nov. 1998.

[39] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2002.

[40] K. C. Lee, J. Ho, and D. J. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE Trans. on Pattern Analysis and Machine Intelligence, pp. 684–698, 2005.

[41] D. Cai and X. He, “Manifold adaptive experimental design for text categorization,” IEEE Transactions on Knowledge and Data Engineering, April 2011.

[42] S. A. Nene, S. K. Nayar, and H. Murase, “Columbia object image library (coil-100),” Techn. Rep. No. CUCS-006-96, dept. Comp. Science, Columbia University, 1996.

[43] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Trans. Speech and Audio Processing, vol. 10, no. 5, 2002.

[44] J. And´en and S. Mallat, “Multiscale scattering for audio classifi- cation,” Proceedings of the ISMIR 2011 Conference, 2011.

[45] A. Frank and A. Asuncion, “UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences,” 2010. [Online]. Available: http://archive. ics.uci.edu/ml

[46] H. Lee, A. Battle, R. Raina, and A. Ng, “Efficient sparse coding algorithms,” in Advances in Neural Information Processing Systems, vol. 19, 2007, p. 801.

[47] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. IMAGING SCIENCES, vol. 2, no. 1, pp. 183–202, 2009.

[48] P. S. and J. Theiler, “Online feature selection using grafting,” in Proc. of the Int. Conf. on Machine Learning, 2003, pp. 592–599.

[49] J. Liu, Z. Zhao, J. Wang, and J. Ye, “Safe screening with varia- tional inequalities and its application to lasso,” arXiv:1307.7577v2 [cs.LG], Oct. 2013.

[50] S. Dash, D. Malioutov, and K. R. Varshney, “Screening for learning classification rules via boolean compressed sensing,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, May. 2014.

[51] Q. Zhou and Q. Zhao, “Safe subspace screening for nuclear norm regularized least squares problems,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015.

[52] Y. Wang, X. Chen, and P. J. Ramadge, “Sparse representation classification via sequential lasso screening,” in 1st IEEE Global Conference on Signal and Information Processing, Dec. 2013.

[53] P.-K. Jao, C. C. M. Yeh, and Y.-H. Yang, “Modified lasso screening for audio word-based music classification using large scale dictionary,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, May. 2014.

HERE Zhen James Xiang received the B.E. degree B.E. in 2007 from Department of Electrical Engineering, Tsinghua University, China, graduating with honors and GPA rank 1/164. He received the M.A. degree and the Ph.D. degree in Electrical Engineering from Princeton University in 2009 and 2012 respectively. He is currently a quantitative researcher at Citadel LLC in Chicago. He has received several awards and honorable mentions for his scholarship, in-

cluding: Best Student Paper Honorable Mention Award, NIPS (2011); Charlotte Elizabeth Procter Honorific Fellowship of Princeton University (2011-2012); Qualcomm Innovation Fellowship Finalist (2011); Francis Robin Upton Fellowship, Princeton University (2007-2011); Distinguished Graduate Award of Beijing City (2007); Distinguished Graduate of Tsinghua University (2007); and in 2003, a Gold Medal at the International Mathematics Olympiad, Tokyo, Japan, where he ranked 12th among 457 participants from 84 countries.

HERE Yun Wang received the B.S. degree in Electrical Engineering with highest honors from Shanghai Jiao Tong University in 2011 and the M.A. and Ph.D. degrees in Electrical Engineering from Princeton University in 2013 and 2015, respectively. His doctoral research focused on machine learning, optimization and statistical signal processing. He joined Amazon as a machine learning scientist in the fall of 2015. His honors include the Distinguished Graduate Award of

the City of Shanghai (2011) and the Anthony Ephremides Fellowship in Electrical Engineering, Princeton University (2011).

HERE Peter J. Ramadge received the B.Sc., B.E. and the M.E. degree from the University of Newcastle, Australia, and the Ph.D. degree in Electrical Engineering from the University of Toronto, Canada. He joined the faculty of Princeton University in September 1984, where he is currently Gordon Y. S. Wu Professor of Engineering, and Professor of Electrical Engineering. He is a Fellow of the IEEE and a member of SIAM. He has received several honors and awards


Designed for Accessibility and to further Open Science