The sparse representation of data with respect to a dictionary of features has recently contributed to successful new methods in machine learning, pattern analysis, and signal/image processing. At the heart of many sparse representation methods is the least squares problem with regularization, often called the lasso problem [1]:
where is a regularization parameter. The matrix
is called the dictionary and its columns
are usually called features. Depending on the field, the terms codewords, atoms, filters, and regressors are also used. The lasso problem seeks a representation of the target vector
as a linear combination
of the features with many
(sparse representation). Equation (1) also serves as the Lagrangian for the widely used constrained problems
subject to
, and
subject to
. Many solvers of these problems address the Lagrangian formulation (1) directly [2].
The above problems are studied extensively in the signal processing, computer vision, machine learning, and statistics literature. See, for example, the general introduction to sparse dictionary representation methods in [3] and [4]. Sparse representation has proven effective in applications ranging from image restoration [5], [6], to face recognition [7], [8], object recognition [9], speech classification [10], speech recognition [11], music genre classification [12], and topic detection in text documents [13]. In these applications, it is common to encounter a large dictionary (e.g., in face recognition), data with large data dimension (e.g., in topic detection), and in
dictionary learning, a large number of dictionary iterations (e.g., in image restoration). These factors can make solving problem (1) a bottleneck in the computation. Several approaches have been suggested for addressing this computational challenge. In the context of clas-sification, Zhang et al. [14] propose abandoning sparsity and using a fast collaborative linear representation scheme based on regularized least squares. This improves the speed of classification in face recognition applications. However, in general the (nonlinear) Sparse Representation Classifier (SRC) [7] achieves superior classification accuracy. Another approach is to seek a sparse representation using a fast greedy method to approximate the solution of (1). There has been a considerable amount of work in this direction, see for example [3], [15], [16]. However, this approach seems best when seeking very sparse solutions and, in general, the solutions obtained can be challenging to analyze. Recently an approach known as (dictionary) screening has been proposed. For a given target vector y and regularization parameter
, screening quickly identifies a subset of features that is guaranteed to have zero weight in a solution
of (1). These features can be removed (or “rejected”) from the dictionary to form a smaller, more readily solved lasso problem. By padding its solution appropriately with zeros, one obtains a solution of the original problem. This approach is the focus of the paper. Screening has two potential benefits. First, it can be run in an on-line mode with very few features loaded into memory at a given time. By this means, screening can significantly reduce the size of the dictionary that needs to be loaded into memory in order to solve the lasso problem. Second, by quickly reducing the number of features we can often solve problems faster. Even small gains can become very significant when many lasso problems must be solved. Moreover, since screening is transparent to the lasso solver, it can be used in conjunction with many existing solvers. The idea of screening can be traced back to various
feature selection heuristics in which selected features are used to fit a response vector y. This is usually done by selecting features based on an empirical measure of relevance to y, such as the correlation of y and
. This is used, for example, in univariate voxel selection based on t-statistics in the fMRI literature [17]. Fan and Lv [18] give an excellent review of recent results on correlation based feature selection and formalize the approach in a probabilistic setting as a correlation based algorithm called Sure Independence Screening (SIS). In a similar spirit, Tibshirani et al. [19] report Strong Rules for screening the lasso, the elastic net and logistic regression. These rules are also based on thresholding correlations. With small probability, SIS and the Strong Rules can yield “false” rejections.
A second approach to screening seeks to remove dictionary columns while avoiding any false rejections. In spirit, this harks back to the problem of removing “non-binding” constraints in linear programs [20]. For the lasso problem, the first line of recent work in this direction is due to El Ghaoui et al. [21], where such screening tests are called “SAFE” tests. In addition to the lasso, this work examined screening for a variety of related sparse regularization problems. Recent work (e.g., [22], [23], [24], [25]) has focused mainly on the lasso problem and close variants.
The basic approach in the above papers is to bound the solution of the dual problem of (1) within a compact region R and find . For simple regions
is readily computed and yields a screening test for removing a subset of unneeded features. This approach has resulted in tests based on spherical bounds [21], [22], the intersection of spheres and half spaces (domes) [21], [23], elliptical bounds [24] and novel approaches for selecting the parameters of these regions to best bound the dual solution of (1) [25]. These screening tests can execute quickly, either serially or in parallel, and require very few features to be loaded into memory at once. If one seeks a strongly to moderately sparse solution, the tests can significantly reduce dictionary size and speed up the solution of lasso problems.
To keep our survey focused, we concentrate on screening for the lasso problem. However, the methods discussed apply to any problem that can be efficiently transformed into a lasso problem. For example, the elastic net [26] and full rank generalized lasso problems [27]. Moreover, the basic ideas and methods discussed are a good foundation for applying screening to other sparse regularization problems. We will situate our exposition within the context of prior work as the development proceeds.
The main features of our survey include: (a) Our exposition uses a geometric framework which unifies many lasso screening tests and provides basic tools and geometric insights useful for developing new tests. In particular, we emphasize the separation of the structure or “architecture” of the test from the design problem of selecting its parameters.
(b) We examine whether more complex screening tests are worthwhile. For each , there is a family of tests based on the intersection of a spherical bound and m half spaces. As m increases these tests can reject more features but are also more time consuming to execute. To examine if more complex tests are worthwhile, we derive the region screening test for the intersection of a sphere and two half spaces, and use this to examine where current region screening tests stand in the trade-off between rejection rate and computational efficiency. (c) We show how composite tests can be formed from existing tests. In particular, we describe a composite test based on carefully selected dome regions that performs competitively in numerical studies. We also point out a fundamental limitation of this approach.
(d) We review sequential screening schemes that make headway on the problem of screening for small normalized values of . When used in an “on-line” mode with realistic values of the regularization parameter, these methods can successfully reduce the size of large dictionaries to a manageable size, allowing larger problems to be solved, and can result in a faster overall computation.
1.1 Outline of the Paper
We begin in 2 with a review of basic tools, especially the dual of the lasso problem and its geometric interpretation.
3 introduces screening in greater detail and
4 introduces region tests. After these preparations, we discuss several important forms of region tests: sphere tests (
4.2), sphere plus hyperplane tests (
4.3) and sphere plus two hyperplane tests (
4.5). We show how spherical bounds can be iteratively refined using features (
4.4), and examine ways to combine basic tests.
5 gives a brief overview of sequential screening. We give a practical summary of screening algorithms in
6 and illustrate the results of screening via numerical studies in
7. We conclude in
8. Proofs of new or key results are given in the Appendices, organized by the section in which the result is discussed.
We focus on the lasso (1), but it will be convenient to also consider the nonnegative lasso:
The analysis and algorithms in the paper apply (with minor changes) to both problems.
Throughout the paper we assume that a fixed dictionary B is used to solve various instances of (1) or (2). We assume that all features are nonzero and say that the dictionary is normalized if all features have unit norm. Each instance is specified by a pair consisting of a target vector y and a value
of the regularization parameter.
Multiplying the objective of (1) by , with
, yields the equivalent problem:
where ,
, and
. Some lasso solvers require that
, and problem (1) must be scaled to ensure this holds. As a result, it is meaningless to talk about the value of
employed when solving (1) without accounting for possible scaling. One way to do this is to define
. Then the ratio
is invariant to scaling. The parameter
is also useful for other purposes related to screening. Throughout the paper, we use the ratio
as an unambiguous measure of the amount of regularization used in solving (1) and (2).
Geometric insight on lasso problems, and on screening in particular, is enhanced by bringing in the Lagrangian dual of (1). The following parameterization of the dual problem is particularly convenient [22]:
Solutions of (1) and
of (3) satisfy:
The corresponding dual problem of (2) is:
with the primal and dual solutions related via:
A derivation of (3) and (4) is given in the Appendix. It will be convenient to define a feature pool B. For the lasso, and for the nonnegative lasso, B =
. This allows the constraints in (3) and (5) to be stated as
.
For , let
denote the hyperplane in
that has unit normal
and contains the point
. Let
denote the corresponding closed half space containing the origin. So a constraint of the form
requires that
lies in the closed half space H(b). Hence the set of feasible points F of the dual problems is the nonempty, closed, convex set formed by the intersection of the finite set of closed half spaces
. This is illustrated in Fig. 1(a) and 1(b). In addition, for the lasso,
if and only if
. So
. This follows from the same property of the feature pool:
.
To maximize the objective function in (3) or (5) we seek the projection of
onto the closed convex set F. This is the unique point satisfying the following set of inequalities [28,
3.1]: for each
,
In contrast, the lasso problem (1) may not have a unique solution [29], [30].
The set of points is called the dual regularization path. For
sufficiently large,
lies in F and
. To find the smallest
for which this holds, let
Then for all . So
lies in the boundary of F. As
decreases from a large value,
moves in a straight line within F until
, at which point
first lies on the boundary of F. As
decreases below
moves away from F and
is the unique projection of
onto the boundary of F. Using (4), for
, and conversely, if
, then
. So for
,
,
lines on the boundary of F, and
is nonzero.
Let I = [1, . . . , p] denote the ordered set of feature indices and . Given
, let
denote the vector in
obtained by subsampling w at the indices in S. Conversely, for
, let
denote the vector in
obtained by upsampling z: the entries of
with indices in S take the corresponding values in z and all other entries are zero. Similarly, for a dictionary
, let
denote the subdictionary obtained by sampling the columns of B at the indices in S. The following properties are clear: (a)
; (b) if
for
, then
; (c)
; and (d) if
for
.
By (4), if we know the primal solution , then the dual solution is
. Conversely, if we know the dual solution
, then any point satisfying the following equations is a primal solution:
We now explain the idea of screening in detail. Given an instance of (1), we select a partition
of the features. We say that the features indexed by S are selected and those indexed by
are rejected. Then we form the reduced dictionary
of selected features and let
denote a solution of the corresponding lasso problem using this dictionary. In general, the upsampled vector
is not a solution of the original lasso problem (1).
Fig. 1. The constraints and feasible set F of the dual problem for (a): general features, (b): unit norm features. (c): Examples of two spheres and a dome region bounding for unit norm features. In all cases only the lower half of F is shown.
Here is the key point: screening seeks a partition such that the upsampled vector solves (1). In general, such a partition depends on the instance and hence must be computed “on-the-fly”.
By virtue of being smaller, the reduced problem is more manageable. For example, it may fit into memory when the original problem does not, and finding its solution may require less time. Hence there are two evaluation metrics of interest: the size of S (or ) as a fraction of I, and the total time taken to select S and solve the reduced problem relative to the time taken to solve the original problem directly without screening. We will normally express these metrics as the rejection fraction
and the speedup factor
. Here
is the time to solve the original lasso problem,
is the time to select the partition (screen the dictionary), and
is the time to solve the reduced lasso problem.
Not surprisingly, if we know the dual solution , then it is easy to come up with a suitable partition. To see this, consider the lasso problem. For any partition
, let
and
denote solutions of the original and reduced lasso problems, respectively. It is clear that the following always holds:
Now assume the dual solution is known, let
denote the active constraints at
, and consider the particular partition
. Equation (4) shows that if
(equivalently,
), then
. Hence for this partition,
,
, and
Equation (12) implies that for this partition the two inequalities in (11) must be equalities. It follows that solves the reduced problem and
solves the original problem. Although a simple observation, this is worth stating as a theorem.
Theorem 1. Let the solution of (3) (resp. (5)) have active set
. If
is a solution of (1) (resp. (2)) with dictionary
, then
solves (1) (resp. (2)). Moreover, every solution of (1) (resp. (2)) can be expressed in this way.
The fundamental partition of I into and
is conceptually very important but obviously impractical. If we know
, then we can easily solve the primal problem (see (10)) and this makes screening and problem reduction unnecessary. As a first step towards finding a practical way to partition the features, we note that if
(screening keeps more) or equivalently
(screening rejects less), then equation (12) holds with S replacing
. This implies that the two inequalities in (11) hold with equality for this partition. Hence we have the following corollary of Theorem 1.
Corollary 1. Let the solution of (3) (resp. (5)) have active set
. Let
. If
is a solution of (1) (resp. (2)) with dictionary
, then
is a solution of (1) (resp. (2)). Moreover, every solution of (1) (resp. (2)) can be expressed in this way.
The core idea for creating a partition of the dictionary that conforms with Corollary 1 is to bound within a compact region R. For each feature b, we then compute
, and use this quantity to partition B [21].
We first illustrate this for the nonnegative lasso. For a compact set R, if , all features are rejected; otherwise for each feature
exists. Then define the partition:
The logic is that if and
, then
and hence
. Thus
, as desired.
For the lasso problem, and
ensure
. But in this case we also need
or equivalently
. This holds if
. Effectively, we must test both
and
to account for the positive or negative sign of
. So for the lasso the partition is:
For example, when if: (a)
(nonnegative lasso) and (b)
(lasso). So R =
, yields the ideal partition
.
From the above constructions, we see that ensures that the partitions (13) and (14) satisfy
. Hence the assumptions of Corollary 1 are satisfied. This is summarized in the following corollary.
Corollary 2. Let R be a compact region with . Then R defines a dictionary partition
with
.
It will be convenient to encode the partition induced by a bounding region R as a rejection test with
if
and 0 otherwise. For example, the rejection test corresponding to (14) is:
We end this section by noting that for a given dictionary B, the partial order of subsets of features induces a partial order on screening tests. Test is weaker than test T, denoted
, if the set of features rejected by
is a subset of the features rejected by T. For example, if
, then
. This is a special case of the following lemma.
Lemma 1. If , then
.
If , then the region test for
can potentially reject more features than the test for
.
4.1 The Sphere-Hyperplane Architecture
We now consider particular forms of bounding regions for . A natural form of bounding region consists of the intersection of a spherical bound with a finite number of half spaces. The spherical bound arises naturally once we know a dual feasible point, and half spaces arise naturally since these define the dual feasible region F (see (3)), and are integral to the projection of a point onto F (see (7)).
The intersection of a closed ball S(q, r) = {z : with center q and radius r, and m half spaces
, gives the region:
To form the corresponding region test, we find by solving the optimization problem:
Once is known, (15) gives the corresponding screening test. Using the change of variable
, problem (16) can be simplified to:
where . The solution of (16) is then
. By decomposing z and b in terms of span
and its orthogonal complement, (17) reduces to a convex program in
.
Increasing m results in tests with the potential to reject more features, but which are also more complex and time consuming to execute. In the following two subsections, we discuss the simplest cases: m = 0 (sphere tests), and m = 1 (dome tests). This gives insight into basic tests and makes connections with the literature.
4.2 Sphere Tests
Consider bounding within a closed ball S(q, r) =
with center q and radius r. This bound gives a simple, efficiently implemented test, and it is also a useful building block for more complex tests. We first determine a close form expression for
. An expression for a sphere test
then follows from (15).
Lemma 2. For and
:
Theorem 2. The screening test for the sphere S(q, r) is:
where and for the lasso
, and for the nonnegative lasso
.
For the lasso, the test (19) can also be written as:
Theorem 2 defines a parametric family of tests: , where ST(q, r) denotes the sphere test with center q and radius r. To use a sphere test one first selects values of q and r so that S(q, r) bounds
. We call this the parameter selection problem. By Lemma 1, a tighter bound has potential for better screening. So using only the information provided, and limited computation, we want to select q and r to give the “best bound”. This is a design problem involving a trade-off between the computation cost to select q and r and the resultant screening performance. Hence we don’t expect there is a “best answer”. We outline below several selection methods.
4.2.1 Parameter selection
If we know a dual feasible point , then
can’t be further away from
than
. This gives the basic spherical bound:
with center and radius
. In particular,
is dual feasible and gives a particular instance of (21):
This bound is shown in Fig. 1(c) as the larger sphere in solid red. The bound (22) requires only the specification of the lasso problem and the computation of . We call it the default spherical bound.
Better bounds are possible with additional computation or if additional information is supplied. For example, [25] observed that to obtain a feasible point closer to
than
one can first run K steps of the homotopy algorithm on (1). This gives the solution
of the instance
, for the K-th breakpoint on the (primal) regularization path. Effectively, this first solves the lasso problem for
, and then uses this solution to help screen for the actual instance to be solved. The sphere center can also be moved away from
. Examples include the sphere tests ST2 and ST3 in [22] derived in the setting of unit norm y and
. In addition, [25] noted that if the dual solution
is known for an instance
, then
(this is discussed further below). This leverages a solved instance to give a spherical bound centered at
.
4.2.2 Connections with the Literature
A variety of existing screening tests for the lasso are sphere tests. The Basic SAFE-LASSO test [31] and the test ST1 in [22, Sect. 2] are sphere tests based on the default spherical bound (22). The SAFE-LASSO test [31, Theorem 2] is also a sphere test. It assumes a dual feasible point is given and uses this to improve the default spherical bound centered at
. The sphere tests ST2 and ST3 in [22, Sect. 2] use spherical bounds not centered at
. We will comment further on the test ST3 at the end of
4.6. The core test used in [25] is a sphere test with center
, where
is the dual solution at
, and radius
. This bound follows from the nonexpansive property of projection onto a convex set:
The Strong Rule [19] is also a sphere test for the lasso problem. For notational simplicity, let the features and the target vector y have unit norm. The Strong Rule discards feature if
. This is a sphere test with center
and radius
. The point
is bounded within the default sphere (center
, radius
). The Strong Rule uses a sphere with the same center but a radius only a fraction of
. This smaller sphere is not guaranteed to contain
. So the Strong Rule can (with low probability) yield false rejections. A detailed discussion of this issue is given in [19]. A more advanced version of the Strong Rule, the Strong Sequential Rule [19], assumes a solution
of the lasso instance
is available, where
. It then forms the residual
and screens the lasso instance
using the test
. This is also a sphere test. To see this, use (4) to write
. Then the test becomes
with
. This is a sphere test with center
and radius
where r is the radius of the known bounding sphere (23). When
, this test may also yield false rejections. See [19] for examples and analysis of rule violations.
The SIS test in [18] is framed in a probabilistic setting and is not intended for lasso screening. Nevertheless, if we translate SIS into our setting it is a sphere test for a lasso problem with appropriately selected . SIS assumes a dictionary
of standardized vectors (features of unit norm) and computes the vector of (marginal) correlations
. Then given
, it selects the top
features ranked by
. Assume for simplicity that the values
are distinct and let
denote the value of
for the
-th feature in the ranking. The SIS rejection criterion can then be written as
. We now form a lasso problem with dictionary B, target vector y, and a value
to be decided. For simplicity of notation, assume that y has unit norm. Then the default spherical bound for the dual solution of the lasso has center
and radius
, and the corresponding sphere test is
. Equating the right hand sides of the above test expressions, and using some algebra shows that if we take
, then SIS is the default sphere test for this particular lasso problem.
4.3 Sphere Plus Halfspace Tests
Now consider a region test based on the nonempty intersection of a spherical ball and one closed half space
. Here n is the unit normal to the half space and
. This yields the dome region
illustrated in Fig. 2(a).
The following features of the dome D(q, r; n, c) will be useful. We call the point on the bounding hyperplane and the line passing through q in the direction of the hyperplane normal the dome center. The signed distance
Fig. 2. (a) A general dome region D(q, r; n, c) shown for 0 < and the dome consisting of less than half the sphere. (b) The rejection area (shaded) of a lasso dome test.
from q to in the direction
is a fraction
of the radius r of the sphere. We call the maximum straight line distance
one can move from
within the dome and hyperplane the dome radius. Under the sign convention indicated above, simple Euclidean geometry gives the following relationships:
To ensure that the dome is nondegenerate (a nonempty and proper subset of each region), we need to be inside the sphere. Hence we require
. So we need
, this ensures that the intersection is a proper subset of the sphere and the half space; and we need
, this ensures the intersection is nonempty.
To find , for
, we solve the optimization problem (16) with m = 1. Particular instances of this problem were solved in [32, Appendix A] (by solving a Lagrange dual problem) and in [23,
3] (by directly solving a primal problem). Both approaches can be extended to solve the general problem (16) with m = 1. This yields the following lemma, and the dome screening test.
Lemma 3. Fix a dome D = D(q, r; n, c) with . Then for
,
where is the function
Theorem 3. The screening test for a nondegenerate dome D(q, r; n, c) is:
where and for the lasso
, and for the nonnegative lasso
.
We denote a dome test by DT(q, r; n, c). Although defined piecewise, the functions and
in Theorem 3 are continuous and smooth:
. This can be checked using simple calculus. The parameters r and c of the dome do not appear as arguments in the test but play a role through
. The test simplifies for unit norm features. In that case,
and
and
are only functions of
.
To gain some insight into this test, consider the situation when r < 1 and all features have unit norm. We can factor the test into the composition of two functions: a linear map and a two-dimensional decision function
with
if
, and 0 otherwise; where
,
, and
are given in Theorem 3 with
and
. We can display the test rejection region by plotting
and
versus t as shown in Fig. 2(b). For the lasso, the rejection region has upper and lower boundaries. The sections of the boundaries with
and
, correspond to the sphere test
. If feature
maps into the shaded region in the figure, then
is rejected. The lightly shaded (yellow) area indicates the extra rejection power of the dome test over the underlying sphere test. For a given value of
, the dome test lowers the bar for rejection as
increases.
4.3.1 Parameter selection
Now consider the parameter selection problem. Since we have discussed parameter selection for a spherical bound, we assume S(q, r) is given and give examples of bounding within a suitable half space.
Each constraint of the dual problem bounds
. This half space has
and
. The resultant dome is nonempty since both the sphere and the half space contain
. To ensure it is proper, we require
. This means that the sphere test does not reject the feature b. In particular, we can select b to minimize the disk radius
. To do so, we maximize
given by (24):
Fig. 3. Two dome tests for unit norm features and target vector. Left: The dome (30) based on the feasible point closed half space
. Right: The dome (33) based on a solved instance
For unit norm features, (29) selects the feature most correlated with q. If in addition, , then (29) yields
. Selecting the default spherical bound and using (29) gives the specific dome:
We call this the default dome bound. When y and all the features have unit norm, this simplifies to This dome is illustrated in Fig. 3 (left).
If is the dual solution of an instance
, then
lies on the boundary of F. Moreover, its optimality for
ensures that it satisfies the inequalities (7). Hence for each
:
Since , the right hand side is nonnegative. Therefore this inequality bounds F in the closed half space
with
The intersection of this half space with the bounding sphere S(q, r) is nonempty and it is proper if . To check this condition note that
where is the angle between
and
. So if
0 or
, then the dome is proper. For example,
and
, yields the proper dome:
This dome illustrated in Fig. 3 (right).
4.3.2 Connections with the Literature
Specific dome tests were introduced in [32, 2.4] and [23,
3]. The dome test discussed in [23] is based on the default dome bound (30) for unit norm features and unit norm y. The SAFE-LASSO test in [32,
2.4] is a dome test specifically designed for screening and solving lasso problems at points along the regularization path. A triple
is given where
is the dual solution for instance
. The test uses this to screen the dictionary for an instance
with
. We show that the dome employed is (33) with
. The solution in [32,
2.4] entails specifying a bounding sphere and a half space, then solving the corresponding version of (16). The selected half space is
where
is the gradient of the dual objective for the solved instance evaluated at
(up to positive scaling). The spherical bound is obtained by scaling
to obtain the closest feasible solution to
. This can be specified by letting
and setting
and
. Assume
and let
denote the optimal value of s in the definition of
. By the optimality of
for the instance
, we must have
. In addition, it must hold that
otherwise the feasible point
would be closer to
. By simple calculus we then determine that
. It follows that for all
. Hence for
we can take
. Thus for
, SAFE-LASSO uses the dome (33) with the constraint
.
4.4 Iteratively Refined Bounds
Under favorable circumstances, it is possible to refine a sphere bounding
to obtain a bounding sphere of smaller radius. Let the half space (n, c) also bound
and its intersection with S(q, r) result in a dome D = D(q, r; n, c) with parameters
, and
. Since D is a bounded convex set, there exists a unique sphere of smallest radius that bounds D. This is called the circumsphere of D. We claim that if
, or equivalently
, then the circumsphere of D is
. In this case,
is strictly smaller than r and
is a tighter spherical bound on
. This is summarized below.
Lemma 4. Let S = S(q, r) and the half space (n, c) bound the dual solution , with the resulting dome D = D(q, r; n, c) satisfying
. Then
is the circumsphere of D and hence bounds
.
If suitable half spaces can be found, e.g., among the vectors in B, the construction in Lemma 4 can be used iteratively. At step k, we have a bounding sphere and seek
such that
and
satisfy
If such b exists, set and
to obtain a tighter bounding sphere . A greedy strategy selects b at step k to minimize
, or equivalently to maximize
:
When all features have equal norm, this reduces to maximizing the inner product of b and . This has a simple interpretation.
can be thought of as a location bound on
with the center
the “estimate” of
given the bound. The greedy strategy selects b by maximizing its alignment with the current estimate
of
. Since
is proportional to the optimal residual in the primal problem (see (4)), this strategy selects features “best correlated” with the current estimate of the optimal residual.
4.5 Are More Half Spaces Worthwhile?
We have examined region tests defined by the intersection of a bounding sphere and one half space (m = 1), and have shown that, in general, these have additional rejection power over the simpler sphere tests (m = 0). Are more complex tests worthwhile? To examine this, we go one step further and examine the region test defined by the intersection of a bounding sphere and two half spaces (m = 2). Examining the relative performance of this test will allow us to determine where we currently stand in the trade-off between rejection power and computational efficiency.
Let denote the region formed by the intersection of a sphere
and two closed half spaces
, where
is the the unit normal to
and
. We call the corresponding screening test a Two Hyperplane Test (THT).
Each half space intersects the sphere forming a dome with parameters
, and
, i = 1, 2. To ensure each intersection
is nonempty and proper, we need
, and to ensure the two half spaces intersect within the sphere, we need
. Under these conditions,
is a nonempty, proper subset of the sphere and each half space.
To find we solve the optimization problem (16) with m = 2. Using standard techniques, this problem can be solved in closed form yielding the expressions for
in the following lemma. The corresponding test then follows from (15).
Lemma 5. Fix the region and let
satisfy
, and
. Let h(x, y, z) =
, where
. Then for
,
where
and conditions (a), (b), (c) are given by
Theorem 4. The Two Hyperplane Test (THT) for the region is:
where condition (a’) is
with and for the lasso,
, and for the nonnegative lasso,
.
Theorem 4 indicates that THT uses only the 3p correlations . So the test has time complexity O(pn).
4.5.1 Parameter selection
Assume the sphere S(q, r) has been selected. The inequality constraints in (3) provide the natural half space bounds can be equivalently spec-ified as
with
and
and the resultant dome
has parameters given by (24), (25) and (26).
We seek two such half spaces. We can select the first by minimizing its dome radius . By (26), this requires maximizing
:
When all features have equal norm, we can simply maximize over
.
Suppose we have selected the first feature using (40). This yields a dome with dome center
and dome radius
. Assume that
. Then by Lemma 4, the smallest sphere containing the dome has center
and radius
. To select the second feature, we can focus on the sphere
and repeat the above construction:
When all features have equal norm, we can simply maximize over
. We call this parameter selection method Dictionary-based THT (D-THT).
Alternatively, if we have solved the instance yielding primal and dual solutions
and
(see (4)), then
must satisfy the inequalities (7). Using some algebra and (4), these inequalities can be written as:
. Since
, the right hand side is nonnegative. Hence the inequality bounds F in the half space
with
One can then select and
using (41).
We will return to the THT tests in 7 where we compare the performance of the tests with m = 0, 1, 2 and examine the trade-off between rejection rate and computational efficiency that increasing m imposes.
4.5.2 Connections with the Literature
The form of the Two Hyperplane Test was first presented (without proof) in [33], for unit norm features and target vector. The form given here (with proofs) is a generalization of that result. The general formulation allows the use of any sphere and hyperplane constraints bounding and includes the feature constraint used in [23] as a special case.
4.6 Composite Tests
The construction described in 4.4 gives rise to a finite sequence of spheres and domes:
. Each sphere and dome has an associated test. But since
is contained in
and
, each dome test is stronger than the tests for the spheres that precede and succeed it.
But is not contained in
and
is not contained in
. So we can’t claim that the last dome
leads to the strongest test. Moreover, a test based on the region
is usually too complex to compute.
An alternative is to implement a composite test that rejects if it is rejected by any of the tests
. For the nonnegative lasso,
takes the form
, with
and
given by (27). So the composite test rejects
if
Similarly, for the lasso problem the composite test rejects if
Reflecting the dome construction method, we call the tests (43) and (44) iteratively refined dome tests (IRDT). These tests can be implemented in several ways and extra domes arising in the course of the construction can also be included. This is illustrated in 6. The major cost of the tests is calculating the inner products
and
for each feature
to be tested. Because of the iterative construction, this can be done by computing
(see (34), (35), (37)). So to execute all of the tests
, only k inner products are used per feature tested. This is O(nk) time complexity per feature tested where n is the feature dimension. So the marginal cost of increasing k by 1 is the cost of computing one additional inner product per feature tested.
A composite test is mathematically equivalent to test disjunction, . A disjunction of region tests is weaker than the test based on the intersection of the regions. For example, consider two spheres of equal radius with a small intersection. Both spheres can intersect a half space while the intersection does not.
Lemma 6. For compact sets
Lemma 6 indicates that a disjunction of tests is trading rejection performance for simplicity and ease of implementation. Despite the above limitation, the IRDT test is very competitive with Dictionary-based THT on the datasets used in our numerical studies.
4.6.1 Connections with the Literature
The sphere test ST3 in [22] is based on a refined spherical bound. In [22] it is assumed that y and all features have unit norm. ST3 is then constructed starting with the default spherical bound with
and
. The greedy strategy selects the feature
. Then the dual solution
lies in the default dome formed by the intersection of the spherical ball
and the half space
. This intersection is indicated by the green dome region G in Fig. 1(c). The smallest spherical ball bounding G (dashed magenta circle in Fig. 1(c)) is obtained by substituting the values of
and
into (34), (35) and (36). This yields
and
. These parameters are derived in [22] using a distinct approach.
A two term disjunction test is used in [24]. This test is implemented sequentially. The first test is applied and then the second is applied to the remaining features. Any disjunction test can be implemented sequentially in this fashion. The key innovation in [24] is that each test is based on an ellipsoidal bound on . The first ellipsoid is the minimum volume ellipsoid containing the default dome (30). The second ellipsoid is constructed in a greedy fashion by selecting a feature so that the best ellipsoidal bound of the intersection of its half space and the first ellipsoid has minimum volume. The first step is in the spirit of ST3 except using an ellipsoidal bound. The second step is bound refinement based on ellipsoids rather than spheres. An ellipsoidal bound is tighter than the spherical bounds used in this section. However, its description requires a center
and a matrix
to encode its shape and orientation. When n is large this could be an impediment. In contrast, a sphere requires a center
and a scalar radius r.
The screening tests discussed so far screen the dictionary once, then solve the reduced lasso problem. We hence call these tests “one-shot” screening tests. These tests can perform well for moderate to large values of but often fail to provide adequate rejection performance for smaller values of
. This is primarily due to the challenge of obtaining a tight region bound on
when
is small.
Alternative screening methods can help with this problem. For example, [31] examined the idea of screening and solving (1) for a sequence of instances (Recursive-SAFE). At step k the previously solved instance
defines a bound on the dual solution of the instance
. Hence the previous solution can help screen the next instance in the sequence. A similar idea is proposed by [19] in the form of the Strong Sequential Rule. This is used to solve the lasso problem “over a grid of
values”. In [32], the SAFE test for the lasso is upgraded to use a specific dome test.
In a similar spirit, [25] proposed running a homotopy algorithm to find a solution at the K-th breakpoint on the regularization path of . This effectively solves a sequence of lasso problems (via homotopy) to obtain a solution
at
. The dual solution
is then used to screen the instance
. This has potential advantages, but relinquishes control of the values
to the breakpoints in the homotopy algorithm. In the worst case the regularization path can have
breakpoints [34]. As a variant on homotopy, Sequential Lasso [35] solves a sequence of partially
penalized least squares problems where features with non-zero weights in earlier steps are not penalized in subsequent steps.
With the exception of homotopy, all of the above sequential schemes use a fixed open loop design for N and the sequence . For example, first fix
, then select
, and let the intermediate values be selected via geometric spacing:
with
. To solve instance
, we first screen and solve the instance
. Then sequentially for k = 2, . . . , N, we screen instance
with the help of the known solution of the previous instance, and then solve the reduced problem. This continues until the solution for
is obtained. Sometimes all solutions on the grid of
values are of interest, e.g., cross validation for parameter selection. But there are many other applications where only the solution of the final instance
is of interest – the other instances are merely waypoints in the computation.
The solution of the previous instance helps screen the next instance as follows. First use as a dual feasible point to form the basic bounding sphere (21) for
with center
and radius
. Then use
as the projection of
onto F, to form the bounding halfspace (31) with
, and
. This sphere and halfspace yield the bounding dome derived in
4.3.2:
This dome, illustrated in Fig. 4, encapsulates information about provided by the dual solution of the previous instance.
Fig. 4. An illustration of the dome (45) formed at step k.
In contrast to an open loop design, one can use feedback to adaptively select N and the sequence as the computation proceeds [36]. This allows the value of N and the sequence
to be adapted to each particular instance. For some instances, a small value of N is used, for others, a larger value is used. One way to see why feedback helps is to examine the diameter of
in (45).
Proposition 1 (From [36]). Let be the dome (45) and
diam
. Then
Using Proposition 1, it can be shown that the data adaptive feedback selection rule
where R > 0 is a selectable parameter, ensures that diamfor all k > 1. This allows direct control of how tightly the dome (45) bounds
. This is called Data-Adaptive Sequential Screening (DASS). Note that in this scheme N is not predetermined. Instead the stopping time is decided by the feedback scheme. However, (47) ensures that
where C is an bound on the dual regularization path [36]. We employ DASS to demonstrate the effectiveness of sequential screening in 7. In particular, Fig. 10 shows
Fig. 5. Algorithm for THT. The functions Theorem 4. Other Notation: For the lasso, f(z) = |z| and g(z) = sign(z) and for the nonnegative lasso, f(z) = g(z) = z. For a logical condition
evaluates to true if z satisfies condition c and false otherwise.
the range of N used by DASS on two datasets and Fig. 11 shows its performance in three sparse classification problems.
Each of the screening tests previously described requires the inputs B, y, and and returns
where
is a logical value indicating if
is rejected. The algorithms can be implemented in an online fashion with very few features stored in memory at once. The critical computation is calculating inner products of the form
and
. It follows that the time complexity of one-shot screening is O(np). If the features are sparse then running times are further reduced. Let s denote the average feature sparsity. Then the time complexity of one shot screening is O(sp). Reference [36] discusses the
Fig. 6. Algorithm for IRDT. Here are from Theorem 3. For other notation see the caption of Algorithm 1.
complexity of Data Adapted Sequential Screening and provides the upper bound (48) on the number of steps used.
A basic implementation of THT is shown in Algorithm 1. If the dictionary is unnormalized, the feature norms can be precomputed and passed as an input to the algorithm. If it is normalized, we recommend simplifying the algorithm by setting and removing unnecessary floating point operations (see
4.5). The algorithm accepts two additional optional inputs: either a dual solution
or a feasible point
. The dual solution is useful for the application of THT in sequential screening. It is used to select the first half space used by THT (lines 7,8). If only a feasible point is provided, it is used to select the sphere radius. Otherwise, the default point
is used. The remaining half spaces are selected using dictionary-based selection (40), (41) (
4.5). The output values
are determined for each
by evaluation of the THT test in Theorem 4.
A basic implementation of IRDT is shown in Algo-
Fig. 7. Algorithm for Data-Adaptive Sequential Screening.
rithm 2. To keep the notation simple and the algorithm understandable, all features and y are assumed to have unit norm, but this is not required (see 4.6). IRDT uses at most s iterations with the value of s supplied by the user (we recommend
). The algorithm passes through the dictionary at most s + 1 times with the main loop executed at most s times. The break at line 15 terminates this loop early if suitable domes can’t be found. The algorithm accepts a feasible point
for the dual problem as an optional input and can be adapted to accept a known dual solution. The value
for each
is determined by a disjunction of a set of dome tests each based on the dome test in Theorem 3. These disjunctions are computed sequentially with subsequent tests applied only to currently surviving features.
Data-Adaptive Sequential Screening solves N lasso instances for a sequence of descending values
where
and
is the regularization parameter value for the instance to be solved. The user must specify a radius R > 0. At each step, the algorithm uses a strong “one-shot” screening test, for example THT, provided with a solution of the previous instance, followed by an external lasso solver to solve the screened current instance. The algorithm sets
and thereafter uses the feedback rule (47) to select
until
. It then sets
and screens and solves the final problem. See [36] for additional details on this algorithm.
We now examine the performance of the screening algorithms presented using the datasets summarized in Table
TABLE 1 Summary of the datasets. The reported value of obtained by averaged over the lasso instances solved.
1, and discussed in detail below. (1) RAND: We generate lasso problems with n = 28 and p = 10, 000 by randomly generating 10, 001 28-dimensional vectors . These vectors are scaled to unit norm. (2) MNIST: 70, 000 images (
) of hand-written digits (60, 000 and 10, 000 in the training and testing sets, respectively) [37], [38]. We form a dictionary by randomly sampling 500 training images for each digit, and a target vector from the testing set. Each image is vectorized and scaled to unit norm. (3) YALEBXF: Frontal face images (
) of 38 subjects in the extended Yale B face dataset [39], [40]. We randomly select p = 2, 000 of the 2, 414 images as the dictionary, and y from the remaining 414 images. Each image is vectorized and scaled to unit norm. (4) RCV1: A bag-of-words representation of four classes from the Reuters Corpus Volume 1 (RCV1) dataset [41]. There are 9,625 documents with 29,992 distinct words, including categories “C15”, “ECAT”, “GCAT”, and “MCAT”, each with 2,022, 2,064, 2,901, and 2,638 documents respectively. The vector representations have an average of
nonzero entries; a sparsity of
. (5) COIL: Images (
) of 100 objects, with 72 images per object obtained by rotating the object every 5 degrees [42]. (6) GTZAN: 100 music clips (30 sec, sampled at 22,050 Hz) for each of ten genres of music [43]. Each clip is divided into 3-sec adjacent texture windows (TW) with 50% overlap. Each TW is represented using a first order scattering vector of length 199 [44]. (7) NYT: A bag-of-words dataset in which 300,000 New York Times articles are represented as vectors with respect to a vocabulary of 102,660 words [45]. The i-th entry in vector j gives the number of occurrences of word i in document j. Documents with low word counts are removed, leaving 299,752 documents.
All experiments solve the standard lasso problem (1) using the Feature-sign [46] and FISTA [47] solvers. The grafting solver [48] was also tested and gave similar qualitative performance. We use two performance metrics: the percentage of features rejected and the speedup (time to solve the lasso problem divided by sum of the time to screen and the time to solve the reduced lasso problem). Timing and speedup results depend on the solver used. The regularization parameter
Fig. 8. Performance of ST, DT and D-THT. Top: rejection percentage; Bottom: speedup using screening and the FeatureSign solver [46]. Solid curves lower bound and dashed curves upper bound performance for spherical bounds centered at
Fig. 9. Performance comparison of ST, DT, D-THT (all with default ), enhanced DPP (EDPP) [25] and the strong rule [19] using the FISTA solver on the MNIST and YALEBXF datasets.
is set using the scaling invariant ratio where
. So
with larger values yielding sparser solutions. For all datasets except RCV1 and NYT, we randomly select 20 dictionaries and for each dictionary we use 60 randomly selected test vectors. Averaged metrics and standard errors are reported across these 1200 lasso instances. For RCV1, since
is very low, we select 496 lasso instances with
from the pool of 1200 instances and report results for these 496 instances. For the very large NYT dataset, we select the first 299,000 examples as the dictionary and 6 documents from the remaining 752 as target vectors.
7.1 The performance of one-shot screening
We first benchmark the performance of the one-shot tests: ST (4.2), DT (
4.3), and D-THT (
4.5). We first use the default spherical bound (22). This gives a lower bound for the performance of the one-shot screening methods on each dataset. The default dome test combines this sphere with the feature
, while dictionary-based THT combines it with two features using the selection scheme detailed in (40), (41). We also show results using a second “oracle” bounding sphere with center
and radius
. This provides an upper bound on performance over bounding spheres
Fig. 10. Data-Adaptive Sequential Screening (DASS) applied to MNIST (top) and YALEBXF (bottom) using the feature-sign and FISTA solvers. (Left): average rejection percentage. (Middle): Speedup factor. (Right): The average value of N.
centered at .
The performance of the one-shot screening methods on the test datasets based on the feature-sign solver [46] are shown in Fig. 8. Here are the salient points: (a) While the default one-shot tests perform well for high values of , this performance quickly degrades as
decreases. At values of
around 0.2 and lower, the tests are not effective. (b) On the other hand, the upper bounds indicate potential for improvement if a better spherical bound can be found. Indeed, the significant gap between the lower and upper performance bounds suggests that it is worth investing computation to improve the default spherical bound. (c) Among the tested methods, D-THT exhibits the best performance except at very high values of
. On RAND, for example, using
and the default spherical bound, DTHT yields a 400% rejection improvement over DT. The concurrent speedup for D-THT is about 5X while for DT is less than 2X. These effects are also seen for MNIST and YALEBXF.
Fig. 9 shows a performance comparison between ST, DT, D-THT (all with default ), EDPP [25] and the strong rule [19] using the FISTA solver [47]. Here are the salient points: (a) Aside from the small dip at high values of
, the speedup trend for the FISTA solver is similar to that for feature-sign. For the datasets we tested, feature-sign seems to be faster than FISTA, but FISTA is more sensitive to the reduction in dictionary size resulting from screening. Thus it has greater speedup. This can also been seen in Fig. 10. (b) Of the one-shot methods tested, dictionary based THT and DT consistently have the best rejection performance. But while current one-shot screening tests can perform well at moderate to high values of
, such performance does not extend to the important range of low values of
.
The rejection and speedup of IRDT (not plotted) and D-THT were very similar on the test datasets with IRDT terminating after 3 or 4 iterations at the break in line 14-16 in Algorithm 2.
7.2 The performance of sequential screening
To explore the effectiveness of sequential screening, we tested the Data-Adaptive Sequential Screening (DASS) scheme (47). The performance results are shown in Fig. 10. Here are the salient points: (a) For both MNIST and YALEBXF, with R = 0.2 the performance of DASS is robust across a variety of values of ; (b) DASS yields significant improvement in rejection fraction and robust speedup performance compared with one-shot tests; (c) At values of
around 0.1 and lower, DASS is rejecting 98% of the dictionary while giving speedup greater than 1. This is successful screening at much lower values of
.
7.3 Sequential screening and classification
Now we focus on specific values of motivated by practical lasso problems and examine how screening
Fig. 11. Performance of four sequential screening algorithms (DASS, sequential dome, sequential strong rule, sequential EDDP) for the screening of the lasso problems in SRC on three datasets (COIL, GTZAN, RCV1). (Top): cross validated accuracy to determine the best . (Bottom): Speedup vs Rejection at the best
for each dataset.
can help. To do so, we use the COIL, GTZAN and RCV1 datasets to examine the impact of sequential screening in Sparse Representation Classification (SRC) [7]. Although SRC was first proposed for face recognition problems, it is a generic multi-class classifier that has found success in a variety of applications. The time and memory consuming step in SRC is solving a lasso problem. For the COIL dataset we made the SRC problem more challenging by saturating a random subset of 0.5% of the pixels to white.
We first use cross-validated prediction accuracy to determine the best values for for SRC when applied to the datasets. The results (top row of Fig. 11) are COIL:
, GTZAN:
, and RCV1:
. For these specific values of
, we then examine the performance of the following screening schemes in solving SRC problems for these datasets: (1) the feedback scheme DASS, and the open loop sequential screening schemes (2) sequential dome test, (3) sequential strong rule [19] and (4) sequential EDPP rule [25]. We select the parameters of each method to keep the average value of N the same. Since DASS uses a variable value of N, we first select its parameters, then use the resulting average value of N for the open loop schemes. For COIL, DASS with R = 0.5 yields an average N = 4.72; for GTZAN, DASS with R = 0.15 yields an average N = 14.63; and for RCV1, DASS with R = 1 yields an average N = 3.59. Then for the open loop sequential screening schemes we set N = 5 for COIL, N = 15 for GTZAN and N = 4 for RCV1. This keeps the average path lengths of the screening schemes the same. The results are shown in the bottom row of Fig. 11.
Here are the salient points: (a) Over 50% of the experiments (dataset+screening method) gave a speedup of at least 10X. So sequential screening offers considerable potential gain in practical applications. (b) At the high end, DASS provided 28X, 16X and 18X speedup in solving SRC lasso problems for the three datasets. That’s an average speedup of 21X. However, given that we only used three datasets and did not “tweak” each method to find its best performance on each dataset, we can’t conclude that one method is always better than the rest. That would require a more extensive investigation. Finally, although the strong rule can’t rule out false rejections, we detected no false rejections in our experiment.
7.4 Sequential screening on a large dataset
Finally, we used the NYT dataset to explore how successfully one can screen and solve lasso problems using small values of with high dimensional data and a very large dictionary. We normalize each document and randomly selected six documents from the first 100 of the 752 held out documents subject to
. DASS Screening (with
and R = 0.3) was done in an “on-line” mode by loading only small amounts of the dictionary into memory at a time. The value of N is selected automatically for each instance. In all tested cases,
. As a benchmark, we tested a geometrically spaced, open loop sequential screening algorithm (sequential THT) using
and N = 30.
The results for both methods are shown in Fig. 12. We can’t solve these lasso problems without using screening. Hence the usual speedup metric can’t be evaluated. The main time cost is sequentially reading features from disk into RAM. Here are the main points to note: (a) Under geometric spacing with fixed N, less than 10,000 of the features (3.3%) were held in memory at once; (b) For DASS, less than 1,000 of the features (0.33%) were held in memory at once – an order of magnitude improvement over fixed geometric spacing (The small dip at is due to termination method); (c) On this dataset, both open loop sequential screening and DASS clearly exhibit a significant performance advantage over one-shot tests. The use of feedback by DASS to automatically select the number of steps N and the values
, yields robust rejection performance. By tweaking N for each test vector in the open loop scheme, one could improve its average performance. But DASS handles this automatically and robustly.
In our survey we have emphasized separating the discussion of test structure from the problem of selecting its
Fig. 12. Sequential screening using on the NYT dataset. (Top): Open loop geometric spacing using N = 30 and THT test; (Bottom): DASS with R = 0.3. For six problems , the plotted points indicate the number of surviving features after THT screening of each instance
parameters. This allowed us to see connections between many existing screening tests, and enabled a clearer understanding of screening in general. Hopefully this will be advantageous to the development of new tests and parameter selection methods.
For one-shot screening tests, our numerical studies on THT strongly suggest that more complex region tests are indeed worthwhile. THT gave significant performance improvement beyond simpler tests in both rejection and speedup over important ranges of values. But the performance of one-shot tests is still inadequate at small values of
. The numerical studies also indicated a significant performance gap between using the default spherical bound and the best bound at the same sphere center. This indicates the value of additional computation to improve the spherical bound.
Our empirical studies have shown that sequential screening (for example, DASS) can significantly extend useful screening performance to a wider range of . DASS has the additional advantage that it selects both the number N and the sequence
automatically.
Screening is critical when the dictionary will not fit into available memory. We have demonstrated a successful application of DASS to a very large NYT dataset, of dimension 102, 660 by 299, 000. To the best of the authors’ knowledge, with constrained computational resources, screening is the only way to solve lasso problems of this size.
The concepts described in this survey should provide a
firm foundation for understanding screening for related sparse representation problems. This includes screening for the elastic net (reducible to lasso problem), regularized logistic regression, the graphical lasso, and the group lasso [19]. In addition, SAFE methods have been developed for the sparse support vector machine and logistic regression in [21], and the group lasso in [25]. Recently, Liu et al. [49] have proposed safe screening for generalized sparse linear models. This makes use of the variational inequality that provides a necessary and suf-ficient optimality condition for the dual problem. Dash et al., [50], consider screening for Boolean compressed sensing in which the objective is to select a sparse set of Boolean rules that are predictive of future outcomes. One of the screening rules developed is based on the duality arguments presented here. Targeting problems that use nuclear norm regularization to pursue a low rank matrix solution, Zhou et al., [51], have recently proposed safe subspace screening for nuclear norm regularized least squares problems. Wang et al., [52], have integrated DASS with sparse representation classification, to speed up classification, and Jao et al., [53], have applied screening to the problem of representing music in terms of an audio codebook (dictionary) for genre tagging. We expect to see more such applications as the size of dictionaries increase.
The dual lasso problem (3) is obtained as follows. Setting in (1) gives the constrained problem:
, subject to
. Form the Lagrangian
and compute the subdifferentials with respect to z and w. Using the condition that 0 must be in each subdifferential gives
and the constraints
, i = 1, . . . , p. The above equations allow the elimination of z and w from L. This leads to the dual problem:
, subject to
, i = 1, . . . , p. The change of variable
then gives (3). By construction, the primal and dual solutions
and
are related through (4).
Theorem 1: The proof for the lasso is given above the theorem statement. For the nn-lasso, only the definition of the active set changes.
Corollary 1: In the proof of Theorem 1, the inclusion , gives
implies
.
. For the lasso, if
is rejected by
, then
and
1. Hence
and
. So
is also rejected by
. Therefore
. The proof for the nn-lasso is similar.
Lemma 2: By Cauchy-Schwarz: with equality when
is aligned with b. Then
ensures
with equality when
.
Theorem 2: For the nn-lasso we reject if
1. So (19) follows from (18). For the lasso we reject
if
and
, i.e., if
and
. This gives (19). Note
. Thus (19) and (20) are equivalent.
Lemma 3: Solving (16) with m = 1 is equivalent to solving the Lagrangian problem:
Setting the derivative w.r.t. equal to zero yields
. Substituting
into
and (49):
where and
. We now minimize this expression over
. Setting the derivatives of L w.r.t.
, and
equal to 0 yields two equations to solve for
and
and
. There
are two cases: (A) If , then
and
; and (B) If
, then
and
. Substitution of these expressions into (50) yields the result in Lemma 3.
Theorem 3: For the nn-lasso, we reject b if , i.e., if
. For the lasso we reject b if
and
, i.e., if
and
.
We make use of the following lemma from [36].
Lemma 7. If is nonempty, then
Lemma 5: We first solve (16) (m = 2) with by solving the Lagrangian problem:
Solving for
and substitution into
and (52) yields:
where and
. Setting the derivatives of L w.r.t.
and
, respectively, to zero yields:
(Case I) If , then set
. Substitution into (54) yields:
and
. There are two subcases:
(IA) If , then
,
and . (IB) If
, then
and
. (Case II) Suppose
. Again there are two subcases: (IIA) If
, then set
. Substitution into (54) yields:
and
. Solving gives,
with λ > and σ <
. (IIB) If
, then substituting
and
into (54) yields,
and
. Solving these equations gives:
various conditions into (53) yields:
[(1)] t, t
: µ
r.
For general b we use . So in each of the above expressions we replace
, and
by
and
, re- spectively. Then multiply each expression by
. This yields the result in Lemma 5.
Theorem 4: This is almost identical to the proof of Theorem 3 and is hence omitted.
Lemma 6: Note . If
or
is empty, the result is clear. Hence assume each is nonempty. For the lasso, if
is rejected by
, then either
or
. Without loss of generality assume
. Since
is a subset of
, this implies that
, so
is also rejected by
. Therefore
. The proof for the nn-lasso is similar.
This work partially supported by NSF grant CIF 1116208.
[1] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Royal. Statist. Soc B., vol. 58, no. 1, pp. 267–288, 1996.
[2] A. Y. Yang, A. Ganesh, Z. Zhou, S. S. Sastry, and Y. Ma, “A review of fast l1-minimization algorithms for robust face recognition,” Arxiv preprint arXiv:1007.3753, 2010.
[3] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.
[4] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan, “Sparse representation for computer vision and pattern recognition,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.
[5] J. Mairal, M. Elad, and G. Sapiro, “Sparse representation for color image restoration,” IEEE Transactions on Image Processing, vol. 17, no. 1, pp. 53–69, 2008.
[6] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Nonlocal sparse models for image restoration,” in IEEE 12th Int. Conf. on Computer Vision, 2009, pp. 2272–2279.
[7] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans, on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[8] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, H. Mobahi, and Y. Ma, “Towards a practical face recognition system: robust alignment and illumination by sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[9] K. Yu, T. Zhang, and Y. Gong, “Nonlinear learning using local coordinate coding,” in Advances in Neural Information Processing Systems, vol. 3, 2009.
[10] T. N. Sainath, A. Carmi, D. Kanevsky, and B. Ramabhadran, “Bayesian compressive sensing for phonetic classification,” in IEEE Int. Conf. Acoustics Speech and Signal Processing, 2010, pp. 4370–4373.
[11] T. N. Sainath, B. Ramabhadran, D. Nahamoo, D. Kanevsky, and A. Sethy, “Sparse representation features for speech recognition,” in Eleventh Annual Conf. of the Int. Speech Communication Association, 2010.
[12] K. Chang, J. Jang, and C. S. Iliopoulos, “Music genre classification via compressive sampling,” in Proc. 11th Int. Conf. on Music Information Retrieval (ISMIR), 2010, pp. 387–392.
[13] S. Prasad, P. Melville, A. Banerjee, and V. Sindhwani, “Emerging topic detection using dictionary learning,” in ACM Conference on Information and Knowledge Management, 2011.
[14] D. Zhang, M. Yang, and X. Feng, “Sparse representation or collaborative representation: Which helps face recognition?” in Int. Conf. on Computer Vision, 2011, pp. pp. 471–478.
[15] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least Angle Regression,” Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
[16] J. A. Trop and A. C. Gilbert, “Signal Recovery From Random Mea- surements Via Orthogonal Matching Pursuit,” IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.
[17] S. M. Smith, “Overview of fMRI analysis,” British Journal of Radiology, vol. 77, no. Special Issue 2, p. S167, 2004.
[18] J. Fan and J. Lv, “Sure independence screening for ultrahigh dimensional feature space,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 5, pp. 849–911, 2008.
[19] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani, “Strong rules for discarding predictors in lasso-type problems,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 74, 2012.
[20] G. Thompson, F. Tonge, and S. Zionts, “Techniques for removing nonbonding constraints and extraneous variables from linear programming problems,” Management Science, vol. 12, no. 7, pp. 588–608, 1966.
[21] L. El Ghaoui, V. Viallon, and T. Rabbani, “Safe feature elimination in sparse supervised learning,” Pacific Journal of Optimization, vol. 8, no. 4, pp. 667–698, 2012.
[22] Z. J. Xiang, H. Xu, and P. J. Ramadge, “Learning sparse represen- tations of high dimensional data on large scale dictionaries,” in Advances in Neural Information Processing Systems, 2011.
[23] Z. J. Xiang and P. J. Ramadge, “Fast lasso screening tests based on correlations,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2012.
[24] L. Dai and K. Pelckmans, “An ellipsoidal based, two-stage screen- ing test for bpdn,” in 20th European Signal Processing Conference, Aug. 2012.
[25] J. Wang, P. Wonka, and J. Ye, “Lasso screening rules via dual poly- tope projection,” Journal of Machine Learning Research, to appear.
[26] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society B., vol. 67, no. 2, pp. 301–320, 2005.
[27] R. Tibshirani and J. Taylor, “The solution path of the generalized lasso,” Annals of Statistics, vol. 39, no. 3, pp. 1335–1371, 2011.
[28] J.-B. Hiriart-Urruty and C. Lemarechal, Fundamentals of Convex Analysis. Springer, 2001.
[29] J.-J. Fuchs, “Recovery of exact sparse representations in the pres- ence of bounded noise,” IEEE Transactions on Information Theory, vol. 51, no. 10, pp. 3601–3608, Oct. 2005.
[30] R. Tibshirani, “The lasso problem and uniqueness,” arXiv:1206.0313 [math.ST], 4th Nov. 2012.
[31] L. El Ghaoui, V. Viallon, and T. Rabbani, “Safe feature elimination in sparse supervised learning,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2010-126, Sep. 2010. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/ TechRpts/2010/EECS-2010-126.html
[32] ——, “Safe Feature Elimination for the LASSO and Sparse Super- vised Learning Problems,” arXiv:1009.4219v2 [cs.LG], 2011.
[33] Y. Wang, Z. Xiang, and P. Ramadge, “Tradeoffs in improved screening of lasso problems,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Jun. 2013.
[34] J. Mairal and B. Yu, “Complexity analysis of the lasso regulariza- tion path,” in Proc. of the 29th Int. Conf. on Machine Learning (ICML 2012), Edinburgh, Scotland, 2012.
[35] S. Luo and Z. Chen, “Sequential lasso for feature selection with ultra-high dimensional feature space,” arXiv:1107.2734 [stat.ME], 14 July 2011.
[36] Y. Wang, X. Chen, and P. J. Ramadge, “Feedback-controlled se- quential lasso screening,” Princeton University, Tech. Rep., June 2015.
[37] Y. LeCun and C. Cortes, “The MNIST database of handwritten digits,” 1998.
[38] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278 –2324, Nov. 1998.
[39] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2002.
[40] K. C. Lee, J. Ho, and D. J. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE Trans. on Pattern Analysis and Machine Intelligence, pp. 684–698, 2005.
[41] D. Cai and X. He, “Manifold adaptive experimental design for text categorization,” IEEE Transactions on Knowledge and Data Engineering, April 2011.
[42] S. A. Nene, S. K. Nayar, and H. Murase, “Columbia object image library (coil-100),” Techn. Rep. No. CUCS-006-96, dept. Comp. Science, Columbia University, 1996.
[43] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Trans. Speech and Audio Processing, vol. 10, no. 5, 2002.
[44] J. And´en and S. Mallat, “Multiscale scattering for audio classifi- cation,” Proceedings of the ISMIR 2011 Conference, 2011.
[45] A. Frank and A. Asuncion, “UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences,” 2010. [Online]. Available: http://archive. ics.uci.edu/ml
[46] H. Lee, A. Battle, R. Raina, and A. Ng, “Efficient sparse coding algorithms,” in Advances in Neural Information Processing Systems, vol. 19, 2007, p. 801.
[47] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. IMAGING SCIENCES, vol. 2, no. 1, pp. 183–202, 2009.
[48] P. S. and J. Theiler, “Online feature selection using grafting,” in Proc. of the Int. Conf. on Machine Learning, 2003, pp. 592–599.
[49] J. Liu, Z. Zhao, J. Wang, and J. Ye, “Safe screening with varia- tional inequalities and its application to lasso,” arXiv:1307.7577v2 [cs.LG], Oct. 2013.
[50] S. Dash, D. Malioutov, and K. R. Varshney, “Screening for learning classification rules via boolean compressed sensing,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, May. 2014.
[51] Q. Zhou and Q. Zhao, “Safe subspace screening for nuclear norm regularized least squares problems,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015.
[52] Y. Wang, X. Chen, and P. J. Ramadge, “Sparse representation classification via sequential lasso screening,” in 1st IEEE Global Conference on Signal and Information Processing, Dec. 2013.
[53] P.-K. Jao, C. C. M. Yeh, and Y.-H. Yang, “Modified lasso screening for audio word-based music classification using large scale dictionary,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, May. 2014.
HERE Zhen James Xiang received the B.E. degree B.E. in 2007 from Department of Electrical Engineering, Tsinghua University, China, graduating with honors and GPA rank 1/164. He received the M.A. degree and the Ph.D. degree in Electrical Engineering from Princeton University in 2009 and 2012 respectively. He is currently a quantitative researcher at Citadel LLC in Chicago. He has received several awards and honorable mentions for his scholarship, in-
cluding: Best Student Paper Honorable Mention Award, NIPS (2011); Charlotte Elizabeth Procter Honorific Fellowship of Princeton University (2011-2012); Qualcomm Innovation Fellowship Finalist (2011); Francis Robin Upton Fellowship, Princeton University (2007-2011); Distinguished Graduate Award of Beijing City (2007); Distinguished Graduate of Tsinghua University (2007); and in 2003, a Gold Medal at the International Mathematics Olympiad, Tokyo, Japan, where he ranked 12th among 457 participants from 84 countries.
HERE Yun Wang received the B.S. degree in Electrical Engineering with highest honors from Shanghai Jiao Tong University in 2011 and the M.A. and Ph.D. degrees in Electrical Engineering from Princeton University in 2013 and 2015, respectively. His doctoral research focused on machine learning, optimization and statistical signal processing. He joined Amazon as a machine learning scientist in the fall of 2015. His honors include the Distinguished Graduate Award of
the City of Shanghai (2011) and the Anthony Ephremides Fellowship in Electrical Engineering, Princeton University (2011).
HERE Peter J. Ramadge received the B.Sc., B.E. and the M.E. degree from the University of Newcastle, Australia, and the Ph.D. degree in Electrical Engineering from the University of Toronto, Canada. He joined the faculty of Princeton University in September 1984, where he is currently Gordon Y. S. Wu Professor of Engineering, and Professor of Electrical Engineering. He is a Fellow of the IEEE and a member of SIAM. He has received several honors and awards