Simultaneous Safe Screening of Features and Samples in Doubly Sparse Modeling

2016·Arxiv

Abstract

Abstract

1 Introduction

In many areas of science and industry, large-scale datasets with many features and samples are collected and analyzed for data-driven scientific discovery and decision making. One promising approach to handle large numbers of features and samples analyze these large-scale datasets is introducing sparsity constraints in statistical models [1]. The most common approach for inducing feature sparsity, e.g., in a linear least-square model fitting, is using sparsity-inducing penalties such as -norm of the coefficients [2]. A feature sparse model depends only on a subset of features (called active features), and the rest of the features (called non-active features) are irrelevant. On the other hand, the most popular machine learning algorithm that induces sample sparsity would be the support vector machine (SVM) [3]. In the SVM, the large margin principle enhances sample sparsity in the sense that it depends only on a subset of samples (called support vectors (SVs) or active samples) and does not depend on the rest of the samples (called non-SVs or non-active samples).

The problem of training a sparse model can be conceptually divided into two tasks. The first task is to identify active features or samples, and the second task is to optimize the model with respect to them. If the first task is perfectly completed, then the second task would be relatively easy because one only needs to solve a small-scale optimization problem that only depends on a subset of features or samples. In general, however, it is impossible to completely identify the set of active features or samples before actually training the model. Thus, some of existing sparse learning solvers employ working set approaches. Roughly speaking, a working set is the set of features or samples that are predicted to be active. Since the prediction is not perfect, there may be false positives (non-active features/samples in the working set) and false negatives (active features/samples not in the working set). Thus, one needs to repeat a working set prediction and optimization over the working set until the optimality condition is satisfied. For example, LIBSVM [4], a well-known SVM solver, employs working set approaches. It repeats predicting a working set (the set of samples that are predicted to be SVs), and optimizing the model by only using the samples in the working set.

The problem of learning a sparse model is conceptually interpreted as the process of identifying active features or samples and then optimizing the model over them. Numerical optimization of those sparse learning methods is often computationally expensive. A major difficulty stems from a combinatorial property of identifying active features or samples for which working set approaches have been studied to predict a set of features or samples to be active. Recently, a new approach called safe screening has been studied by several authors. Safe screening enables to identify a subset of non-active features or samples before or during the model training process. A nice thing about safe screening is that it is guaranteed to have no false negatives, i.e., safe screening never identify active features or samples as non-active. It means that, if we train the model by only using the remaining set of features or samples after safe screening, the solution is guaranteed to be optimal. The basic technical idea behind safe screening is to bound the solution of the problem within a region, and show that some features or samples cannot be active wherever the optimal

Figure 1: Schematic illustration of the proposed approach. By iterating safe feature screening and safe sample screening, irrelevant features and samples are safely removed out from the training set. A significant advantage of considering them simultaneously rather than individually is that they have a synergy effect in the sense that the screening performances (the number of features/samples that can be safely removed out) are gradually improved by exploiting the results in earlier steps.

solution is located within the region.

After the seminal work by [5], safe feature screening (safely screening a part of non-active features in sparse feature models such as LASSO) has been intensively studied in the literature [6, 7, 8, 9, 10, 11, 12, 13]. Safe feature screening exploits the fact that the sparseness of a feature is characterized by a property of the dual solution, i.e., if the dual solution satisfies a certain condition in the dual space, the feature is guaranteed to be non-active. Safe feature screening is beneficial especially when the number of features is large. There are also several studies on safe sample screening (safely screening a part of non-active samples in sparse sample models such as the SVM) [14, 15, 16]. The basic idea behind safe sample screening is that the sparseness of a sample is characterized by a property of the primal solution. If the primal solution satisfies a certain condition in the primal space, the sample is guaranteed to be non-active. Safe sample screening is useful when the number of samples is large.

In this paper, we consider problems where the numbers of features and samples are both large. In these problems, we consider a class of learning algorithms that induce both of feature sparsity and sample sparsity, which we call doubly sparse modeling. Our main contribution in this paper is to develop a safe screening method that can identify both of non-active features and non-active samples simultaneously in doubly sparse modeling. Specifically, we propose a novel method for simultaneously constructing two regions, one in the dual space and the other in the primal space. The former is used for safe feature screening, while the latter is used for safe sample screening.

A significant advantage of considering safe feature screening and safe sample screening simultaneously rather than individually is that they have a synergy effect. Specifically, we show that, after we know that a part of features are non-active based on safe feature screening, we can potentially improve the performance of safe sample screening. Our basic idea behind this property is that, by fixing a part of the primal variables, the region in the primal space, which is used for safe sample screening, can be made smaller (we can also show the converse similarly).

These findings suggest that, safe screening performances of features and samples can be both improved

by alternatively repeating them. Figure 1 is a schematic illustration of the simultaneous safe screening approach.

Another interesting finding we first introduce in this paper is that, by simultaneously considering regions both in the dual and the primal spaces, we can also identify features and samples that are guaranteed to be active. We call this technique as safe keeping. While safe screening assures no false negative findings, safe keeping guarantees no false positive findings of active features/samples. By combining these two techniques, we can better identify active features/samples. A practical advantage of safe keeping is that we do not have to consider the safe screening rules anymore for features and samples which are identified as active by safe keeping. This is helpful for reducing the computational costs of safe screening rule evaluations especially in the context of dynamic safe screening [8].

1.1 Notation and outline

We use the following notations in the rest of the paper. For any natural number n, we define [n] := {1, . . . , n}. For an matrix M, its i-th row and j-th column are denoted as and , respectively, for ] and ]. The norm, the norm and the norm of a vector are respectively written as and . For a scalar z, we define [. We write the subdifferential operator as , and remind that the subdifferential of norm is given as . For a function f, we denote its domain as domf.

Here is the outline of the paper. 2 introduces the problem formulation we mainly consider in this paper and basic concepts of safe screening. 3 describes our main contribution where we show that simultaneously screening features and samples has synergetic advantages. 4 also presents another contribution where we introduce a new approach for predicting active set without false positive errors. 5 discusses another problem that induces spasities both in features and samples. 6 covers numerical experiments for demonstrating the advantage of the simultaneous safe screening. 7 concludes the paper.

2 Preliminaries

In this section, we first describe the problem formulation in 2.1. Then, we briefly summarize the basic concepts of safe feature screening and safe sample screening in 2.2 and 2.3, respectively.

2.1 Problem formulation

Consider classification and regression problems with the number of samples n and the number of features d. The training set is written as where , and for classification and for regression. The input data matrix (design matrix) is denoted as .

We consider a linear classification and regression function in the form of , and study the problem of estimating the parameter by solving a class of regularized empirical risk minimization

problems:

where is a penalty function, is a loss function for the i-th sample, and 0 is a trade-off parameter for controlling the balance between the penalty and the loss.

In this paper, we study doubly sparse modeling, i.e., a pair of penalty and loss function that induces sparsities both in features and samples. As specific working examples, we consider -penalized smoothed hinge SV classification and -penalized smoothed -insensitive SV regression. The use of these smoothed loss functions are known to produce almost same solutions as the original hinge or -insensitive loss functions [17]. We will discuss other doubly sparse modeling problem in 5.

Remembering that the original SVMs are trained with penalty, by combining an additional -penalty, our penalty function is written as elastic net penalty:

where 0 is a balancing parameter which we omit hereafter by substituting

The smoothed hinge loss and the smoothed -insensitive loss are respectively written as

1 w (yw < 1 γ),

(1 (otherwise),

where 0 is a tuning parameter. There are profiles the smoothed hinge loss and the smoothed -insensitive loss as shown in Figure 2.

Using Fenchel’s duality theorem (see, e.g., Corollary 31.2.1 in [18]), the dual problem of (1) is written as

where and is convex conjugate function of and , respectively. The convex conjugate function of the

penalty term in (2) is given as (see [17] )

Figure 2: Illustrations of the smoothed hinge loss (orange) and the smoothed -insensitive loss (green), where 5 and 5.

The convex conjugate functions of the smoothed hinge loss and the smoothed -insensitive loss are respec-

tively written as

and

We call the problems in (1) and (5) as primal problem and dual problem, respectively, and denote the primal optimal solution as and the dual optimal solution as .

2.2 Safe feature screening

The goal of safe feature screening is to identify a part of non-active features before or during the optimization process. Safe feature screening is built on the following KKT optimality condition

(see Theorem 31.3 in [18])

In the case of our specific regularization term (2), the optimality condition (8) is written as

The optimality condition (9) indicates that

The basic idea behind safe feature screening is to construct a region Θin the dual space so that , and then compute an upper bound . Using this upper bound,

we can construct a safe feature screening rule in the form of

After the seminal work [5], many different approaches for constructing a region Θhave been developed (see 1). Among those, we use an approach in [13]. Noting the fact that the convex conjugate function is -strongly convex, and henceforth the dual objective function ) is -strongly concave, we can define

a region

where ( ˆ( ˆ( ˆw) is the duality gap defined by an arbitrary pair of primal feasible solution ˆdomand dual feasible solution ˆdom. Since the region Θis a sphere, we can explicitly write

the upper bound as

2.3 Safe sample screening

The goal of safe sample screening is to identify a part of non-active samples before or during the optimization process. Here, we slightly abuse the word “non-active” in the sense that we call a sample to be non-active not only when the corresponding is 0, but also when it is 1. Although the i-th sample can be removed out only when guarantee that 1 because the size of the optimization problem can be reduced.

Safe sample screening is also built on the KKT optimality condition

In the case of smoothed hinge loss, the KKT condition (12) is written when

We construct a region Θin the primal space so that , and then compute a lower bound and an upper bound . The optimality condition

(13) suggests that

Similarly, for 1, the optimality condition is written as

suggesting that

In the case of smoothed -insensitive loss, the optimality condition (12) is written as

It indicates that

In order to develop a sphere region Θin the primal space, we extend the duality GAP-based safe feature screening approach proposed in [13] into safe sample screening context. The result is summarized in the following lemma.

Lemma 1. For any ˆdomand ˆdom,

Furthermore, using the sphere form region Θin (16), for any ˆdomand ˆdom, a pair of lower

and upper bounds of are given as

The proof of Lemma 1 is presented in Appendix A.

3 Simultaneous safe screening

We have shown that, for doubly sparse modeling problems, safe screening rules for both of features and samples can be constructed respectively. This has not been explored in depth regardless of its practical importance. In this paper we further develop the framework in which safe feature screening and safe sample screening are alternately iterated. A significant additional benefit of this framework is that the result of the previous safe feature screening can be exploited for making the primal region Θsmaller, meaning that the performance of the next safe sample screening can be improved, and vice-versa.

The following two theorems formally state these ideas. First, Theorem 2 states that we can obtain tighter upper bound for feature screening by exploiting the result of the previous safe sample screening.

Theorem 2. Consider a safe feature screening problem given arbitrary pair of primal and dual feasible solution ˆdomand ˆdom. Furthermore, suppose that the result of the previous safe sample screening step assures the optimal values for a subset of the samples ]. Let , ( ˆ, and ˜be an n-dimensional vector whose element is defined as ˜for and ˜for . Then, is bounded from above by the following upper bound:

and the upper bound in (18) is tighter than or equal to that in (11), i.e., ˜).

The proof of Theorem 2 is presented in Appendix. By replacing the upper bounds in (11) with that in (18) in the safe feature screening step, there are more chance for screening out non-active features.

Next, Theorem 3 states that we can obtain tighter lower and upper bounds for sample screening by exploiting the result of the previous safe feature screening.

Theorem 3. Consider a safe sample screening problem given arbitrary pair of primal and dual feasible solutions ˆdomand ˆdom. Furthermore, suppose that the result of the previous safe feature screening step assures that for a subset of the features ]. Let ( ˆ, and ˜w be a d-dimensional vector whose element is defined as ˜for and ˜for . Then, is bounded from below and above respectively by the following lower and upper bounds:

and these bounds in (19) are tighter than or equal to those in (17), i.e., ˜) and

˜UB(xwUB(xw

The proof of Theorem 3 is presented in Appendix. By replacing the lower and the upper bounds in (17) with those in (19) in the safe sample screening step, there are more chance to be able to screen out non-active samples. Again, we note that the tighter bounds in (19) are different from the bounds that would be obtained just by using ˜w as the primal feasible solution ˆw in (11).

Theorems 2 and 3 suggests that, by alternately iterating feature screening and sample screening, more and more features and samples could be screened out. This iteration process can be terminated when there are few chances to be able to screen out additional features and samples. Such a termination condition can be developed by using the results in the next section.

4 Safe keeping of active features and samples

Safe screening studies initiated by the seminal work by [5] enabled us to identify a part of non-active features/samples before actually solving the optimization problem. In other words, safe screening is interpreted as an active set prediction method without false negative error (an error that truly active features/samples are predicted as non-active). In this section, we show that, by exploiting the two regions in the dual and the primal spaces, we can develop an active set prediction method without false positive error (an error that truly non-active features/samples are predicted as active). We call the latter approach as safe feature/sample

keeping.

Safe feature keeping rule can be constructed by using the region in the primal space. Using Θin (16), we can get the lower bound of for ] as ( ˆ). Using this lower bound, safe feature keeping rule is simply formulated as the following theorem.

Theorem 4. For an arbitrary pair of primal feasible solution ˆdomand dual feasible solution ˆ

dom

Similarly, safe sample keeping rule can be constructed by using a region in the dual space. The condition for being active is written as 1. This can be guaranteed when the condition (0, 1) or 0) holds for the i-th element of . Since Θis a sphere, safe sample keeping rule can be simply derived by (10).

Theorem 5. For an arbitrary pair of primal feasible solution ˆdomand dual feasible solutionx

ˆdom

When we use safe screening approaches, there is a trade-off between the computational costs of evaluating safe screening rules and the computational time saving by screening out some features/samples. If we know in advance that some of the features/samples cannot be non-active by using safe keeping approaches, we do not have to waste the rule evaluation costs for those features/samples. Safe screening rule evaluation costs would be more significant in dynamic screening and our simultaneous screening scenarios because rules are repeatedly evaluated. By combining safe screening and safe keeping approaches, we can get an information about how many features/samples are not yet determined to be active or non-active. This information can be also used as a stopping criteria of dynamic screening and our simultaneous screening. We can stop evaluating safe screening rules when there only remain few features/samples that have not been determined to be active or non-active. Specifically, if the fraction of the sum of the safely screened features/samples and the safely kept features/samples is close to one, then we have few chances to be able to safely screen out additional features/samples. The additional computational complexitiesof safe feature/sample keeping for a single feature/sample is O(1), which is negligible compared with O(n) complexity for safe feature screening and O(d) complexity for safe sample screening.

We finally note that, in our particular working problem of smooth SVC and smooth SVR, safe keeping is also possible by using the KKT optimality conditions in (9) for features, and (13) - (15) for samples. We describe the details in the Appendix.

5 LP-based simultaneous safe screening

In this section, we consider another empirical risk minimization problem that induces sparsities both in features and samples. Specifically, we study a problem with -penalty and vanilla hinge loss , which we call LP-based SVM because it is casted into a linear program (LP). LP-based SVM has been studied in [19, 20], and also in boosting context. LPBoost [21] solves LP-based SVM via the column generation approach of linear programming. while ERLPBoost [22] is a variant of LPBoost obtained by adding a small entropic term in the LP objective. Sparse LPBoost [23] is simiilar to simultaneous screening in that it iteratively solves LP sub-problems for features and samples. LP-based SVM induces feature sparsity due to -penalty and sample sparsity due to the property of hinge loss.

The convex conjugate functions of -penalty and vanilla hinge loss are respectively written as

and the dual problem is written as

5.1 Safe feature screening for LP-based SVM

Feature safe screening for LP-based SVM was studied in the seminal safe feature screening paper by [5]. However, the method presented in their paper requires a precise optimal solution of an LP-based SVM with a different penalty parameter . This requirement is impractical because precise optimal solutions are often difficult to get numerically as recently pointed out by [12]. Here, we present a novel safe feature screening method for LP-based SVM that only requires an arbitrary pair of a primal feasible solution ˆdomand a dual feasible solution ˆdom. The proposed safe feature screening method for LP-based SVM is summarized in the following theorem.

Theorem 6. Consider safe feature screening problem given an arbitrary pair of a primal feasible solution ˆdomand a dual feasible solution ˆdom. Let ˆ( ˆ

, and ], be the vector obtained by sorting in increasing order. Furthermore, let and represent the numbers of negative and positive elements of , respectively. Then,

where

5.2 Safe sample screening for LP-based SVM

Here, we develop a novel safe sample screening method for LP-based SVM as summarized in the following theorem.

Theorem 7. Consider safe sample screening given an arbitrary primal feasible solution ˆdom. Let ) be a subgradient of vanilla hinge loss ) for dom, and define + ˆw.

Then,

where

The proof of Theorem 7 is presented in Appendix. Although the lower and the upper bounds are not explicitly presented, these optimization problems can be easily solved because they are just linear programs with one variable 0. As we discussed in 3, by alternatively iterating safe feature screening in 5.1 and safe sample screening in 5.2, we can make the regions in the dual and the primal regions step by step, indicating that the chance of screening out more features and samples increases .

6 Numerical experiments

We demonstrate the advantage of simultaneous safe screening through numerical experiments. After we describe the experimental setups in 6.1, we report the results on safe screening and keeping rates, and computation time savings in 6.2 and 6.2, respectively.

6.1 Experimental setups

Table 1 summarizes the datasets used in the experiments. We picked up four datasets whose numbers of features and samples are both large from libsvm dataset repository [4]. These are all classification datasets. For regression experiments, we just used the label indicator variables as scalar response variables.

Table 1: Benchmark datasets used in the experiments.

Here, we report the results on -penalized smoothed hinge SV classification and -penalized smoothed -insensitive SV regression. We set for classification and for regression, and considered problems with various values of the penalty parameter between and 10. The parameter in the smoothed hinge loss is set to be 0.5. Also, the parameters in the smoothed -insensitive loss is set to be 0.1 and is set to be 0.5. The proposed methods can be used with any optimization solvers as long as they provide both primal and dual sequences of solutions that converge to the optimal solution. For the experiments, we used Proximal Stochastic Dual Coordinate Ascent (SDCA) [17] and Stochastic Primal-Dual Coordinate (SPDC) [24] because they are state-of-the-art optimization methods for general

Figure 3: Safe screening and keeping rates for classification problems (for real-sim and rcv1-test datasets). The three plots in the left show the individual safe feature/sample screening rates (the middle and the bottom ones are for 1, respectively). The three plots in the center show the additional safe screening rates by simultaneously considering feature and sample screenings. The gray area in these center plots corresponds to the blue area in the corresponding left plot. In these gray area, the individual safe screening performances are good enough (screening rate > 0.95) and additional screening is unnecessary. The top right and middle right plots show the safe keeping rates for feature and samples, respectively. The bottom right plot shows the numbers of active features and samples for various values of .

Figure 4: Safe screening and keeping rates in regression problems. See the caption in Figure 3.

large-scale regularized empirical risk minimization problems. We wrote the code in C++ along with Eigen library for some numerical computations. The code is provided as a supplementary material, and will be put in public domain after the paper is accepted. All the computations were conducted by using a single core of an Intel Xeon CPU E5-2643 v2 (3.50GHz), 64GB MEM.

6.2 Safe screening and keeping rates

For demonstrating the synergy effect of simultaneous safe screening, we compared the simultaneous screening rates with individual safe screening rates. Figures 3 and 4 shows the results on classification and regression problems,respectively. In the 3 3 subplots, the letf plots indicate the individual screening rates defined as (#(screened features or samples) / #(1)). The center plots indicate the additional screening rates by the synergy effect defined as (#(additionally screened features or samples) / #(1)). The top plots represent the results on feature screening, while the middle and the bottom plots show the results on sample screening (for each of 1). We investigated the screening rates for various values of in the horizontal axis and for various quality of the solutions measured in terms of the duality gap in the vertical axis. In all the datasets, we observed that it is valuable to consider both feature and sample screening when the numbers of features and samples are both large. In addition, we confirmed that there are improvements in screening rates by the synergy effect both in feature and sample screenings especially when duality gap ( ˆ) is large. Note that gray areas in the center plots corresponds to the blue area in the corresponding left plot, where the individual safe screening

Figure 5: Total computation time for training 100 solutions for various values of in classification problems.

Figure 6: Total computation time for training 100 solutions for various values of in regression problems.

performances are good enough (screening rate > 0.95) and additional screening is unnecessary.

The top left and the middle left plots show the rates of features and samples, respectively, that are determined to be active or non-active by using safe keeping and safe screening approaches, respectively. We see that, by combining safe keeping and safe screening approaches, a large portion of features/samples can be determined to be active or non-active without actually solving the optimization problems. The bottom right plot shows the number of non-active features/samples for various values of .

6.3 Computation time savings

We compared the computational costs of simultaneous safe screening and individual safe feature/sample screening with the naive baseline (denoted as “non-screening”). We compared the computation costs in a realistic model building scenario. Specifically, we computed a sequence of solutions at 100 different penalty parameter values evenly allocated in [10] in the logarithmic scale. In all the cases, we used warm-start approach, i.e., when we computed a solution at a new , we used the solution at the previous as the initial starting point of the optimizer. In addition, whenever possible, we used dynamic safe screening strategies [8] in which safe screening rules are evaluated every time the duality gap ( ˆ) was 0.1 times smaller than before. Here, we exploited the information obtained by safe keeping as well, i.e., we did not evaluate safe screening rules for features and samples which are safely kept as active, and the rate of features/samples that are determined to be active or non-active (see the left top and left middle plots in Figure 3) is used as the stopping criterion for safe feature rule evaluations. (we terminated safe screening rule evaluations when the rate reaches 0.95).

Figures 5 and 6 show the entire computation time for training 100 different solutions. In all the datasets, simultaneous safe screening was significantly faster than individual safe feature/sample screening and non-screening. Figure 7 shows a sequence of computation times for various values of for the classification problem on rcv1-test and real-sim datasets with SPDC optimization solver. These plots suggest that the computation time savings by individual safe feature screening was better than individual safe sample screening when is large because the feature screening rates are high when is large, while the difference between the two individual screening approaches gets smaller as gets smaller (see Figure 3). Simultaneous safe screening was consistently faster than individual safe feature/sample screening and non-screening in all the problem setups.

7 Conclusions

We introduced a new approach for safely screening out features and samples simultaneously. We showed that alternatively iterating feature and sample screening steps has synergetic advantages in that screening performances can be radily improved in steps. We also introduced a new approach for predicting active set without false positive error, which we called safe keeping. Intensive numerical experiments demonstrated the advantage of our approaches in classification and regression problems with large numbers of features and samples.

Figure 7: Number of optimization steps and computation time. Sequences of the number of passes through the entire dataset and computation time to convergence for various values of for classification problems with SPDC solver are plotted.

A Proofs

A.1 Proof of Lemma 1

Proof. Since ) is -strongly convex, dom,

where, ). On the other hand, dom( ˆ0 (see Proposition B.24 in

[25]). Also, from weak duality, dom). By substituting ,

Therefore, is within a region Θ, where

Since Θis Sphere, a lower bound of and an upper bound of are given in closed form as follows:

A.2 Proof of Theorem 2

Proof. Supposing that the result of the previous safe sample screening step assures the optimal values

for a subset of the samples ], the the dual optimal solution region is written as

Therefore,

Since ˜Θ, the upper bound in (18) is tighter than or equal to that in (11), i.e., ˜

A.3 Proof of Theorem 3

Proof. Supposing that the result of the previous safe feature screening step assures that

of the features ], the the primal optimal solution region is written as

Since ˜Θ, these bounds in (19) are tighter than or equal to those in (17), i.e., ˜

A.4 Proof of Theorem 6

We first construct the the dual optimal solution region ˜Θ.

Lemma 8. For an arbitrary pair of primal feasible solution ˆdomand dual feasible solution ˆ

dom, the dual optimal solution region is written as

Proof of Lemma 8 . From the optimality and weak duality ˆand ( ˆw), respectively.

Therefore,

Noting that ˆΘ

Proof of Theorem 6. From Lemma 8,

Moreover,

where . Let us define three n-dimensional vectors ¯, ¯and ¯as follows:

If + 1 , then ¯is an element of Θand minimizes . If + 1 then ¯, ¯is an element of Θand minimizes because . If then ¯,

meaning that ¯is an element of Θand minimizes because . Therefore,

Similarly, from Lemma 8,

Moreover,

Let us define three n-dimensional vectors ¯, ¯and ¯as follows:

If + 1 , then ¯is an element of Θand maximizes . If + 1 then ¯, ¯is an element of Θand maximizes because . If then ¯, meaning

that ¯is an element of Θand maximizes because . Therefore,

On the other hand, from KKT condition(8),

A.5 Proof of Theorem 7

First, we construct the primal optimal solution region Θ.

Lemma 9. The primal optimal solution region Θis given domas

Proof. From Proposition B.24 in [25],

where ). Form the convexity of for ] and the definition of subgradient

and thus, ( ˆdom. Therefore, dom,

Since ( ˆˆˆw and 1]we have

By combining these results,

Proof of Theorem 7. From Lemma 9,

Using a Lagrange multiplier 0,

Since 0 , which is written as ) + ( ˆw), we have

where the constraint comes from (26). Similarly, since

B Safe keeping by using KKT optimality conditions

In this appendix, we describe another type of safe keeping approaches based on KKT optimality conditions.

Theorem 10. For an arbitrary pair of primal feasible solution ˆdomand dual feasible solution

ˆdom

for ], where

Proof. In the case that is -strongly concave, is bounded from below and above respectively by the following lower and upper bounds:

On the other hand, in the case of our specific regularization term (2), from KKT optimality condition (8), if then

Therefore,

Similarly, we can develop safe sample keeping based on KKT optimiality condition.

Theorem 11. For an arbitrary pair of primal feasible solution ˆdomand dual feasible solution

ˆdom, if is smoothed hinge loss then, for ,

and, for 1,

If is smoothed -insensitive loss then

for ], where

Proof. In the case that is -strongly convex, is bounded from below and above respectively by the following lower and upper bounds:

On the other hand, from KKT optimality condition (8), in the case of smoothed hinge loss (3), if

Also, in the case of smoothed -insensitive (4), if or

References

[1] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

[2] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[3] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152. ACM, 1992.

[4] Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[5] Laurent El Ghaoui, Vivian Viallon, and Tarek Rabbani. Safe feature elimination for the lasso and sparse supervised learning problems. Pacific Journal of Optimization, 8(4):667–698, 2012.

[6] Zhen J Xiang, Hao Xu, and Peter J Ramadge. Learning sparse representations of high dimensional data on large scale dictionaries. In Advances in Neural Information Processing Systems, pages 900–908, 2011.

[7] Jie Wang, Jiayu Zhou, Peter Wonka, and Jieping Ye. Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, pages 1070–1078, 2013.

[8] Antoine Bonnefoy, Valentin Emiya, Liva Ralaivola, and R´emi Gribonval. A dynamic screening principle for the lasso. In Signal Processing Conference (EUSIPCO), 2014 Proceedings of the 22nd European, pages 6–10. IEEE, 2014.

[9] Jun Liu, Zheng Zhao, Jie Wang, and Jieping Ye. Safe Screening with Variational Inequalities and Its Application to Lasso. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[10] Jie Wang, Jiayu Zhou, Jun Liu, Peter Wonka, and Jieping Ye. A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems, pages 1053–1061, 2014.

[11] Zhen James Xiang, Yun Wang, and Peter J Ramadge. Screening tests for lasso problems. arXiv preprint arXiv:1405.4897, 2014.

[12] Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. Mind the duality gap: safer rules for the lasso. In Proceedings of the 32nd International Conference on Machine Learning, pages 333–342, 2015.

[13] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. Gap safe screening rules for sparse multi-task and multi-class models. In Advances in Neural Information Processing Systems, pages 811–819, 2015.

[14] Kohei Ogawa, Yoshiki Suzuki, and Ichiro Takeuchi. Safe screening of non-support vectors in pathwise svm computation. In Proceedings of the 30th International Conference on Machine Learning, pages 1382–1390, 2013.

[15] Jie Wang, Peter Wonka, and Jieping Ye. Scaling svm and least absolute deviations via exact data reduction. Proceedings of The 31st International Conference on Machine Learning, 2014.

[16] Julian Zimmert, Christian Schroeder de Witt, Giancarlo Kerg, and Marius Kloft. Safe screening for support vector machines. NIPS 2015 Workshop on Optimization in Machine Learning (OPT), 2015.

[17] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pages 1–41, 2015.

[18] Ralph Tyrell Rockafellar. Convex analysis. Princeton university press, 1970.

[19] Paul S Bradley and Olvi L Mangasarian. Feature selection via concave minimization and support vector machines. In ICML, volume 98, pages 82–90, 1998.

[20] Ji Zhu, Saharon Rosset, Trevor Hastie, and Rob Tibshirani. 1-norm support vector machines. Advances in neural information processing systems, 16(1):49–56, 2004.

[21] Ayhan Demiriz, Kristin P Bennett, and John Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3):225–254, 2002.

[22] Manfred Warmuth, Karen Glocer, and S V N Vishwanathan. Entropy Regularized LPBoost. In Proceedings of the 19th International Conference on Algorithmic Learning Theory, volume 5254 of LNCS, pages 256–271, 2008.

[23] Kohei Hatano and Eiji Takimoto. Linear Programming Boosting by Column and Row Generation. In Proceedings of the 12th International Conference on Dicovery Science (DS 2009), volume 5808 of LNCS, pages 401–408, 2009.

[24] Yuchen Zhang and Xiao Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. Proceedings of The 32nd International Conference on Machine Learning, pages 353–361, 2015.

[25] D. P. Bertsekas. Nonlinear Programming (2nd edition). Athena Scientific, 1999.

designed for accessibility and to further open science