Regularization Path of Cross-Validation Error Lower Bounds

2015·Arxiv

Abstract

Abstract

1 Introduction

Many machine learning tasks involve careful tuning of a regularization parameter that controls the balance between an empirical loss term and a regularization term. A regularization parameter is usually selected by comparing the cross-validation (CV) errors at several different regularization parameters. Although its choice has a significant impact on the generalization performances, the current practice is still more of an art than a science. For example, in commonly used grid-search, it is hard to tell how many grid points we should search over for obtaining sufficiently small CV error.

In this paper we introduce a novel framework for a class of regularized binary classification problems

regularization parameters to be a set of regularization parameters such that the CV error of the solution at the regularization parameter is guaranteed to be no greater by than the best possible CV error in the entire range of regularization parameters. Given a set of solutions obtained, for example, by grid-search, the proposed framework allows us to provide a theoretical guarantee of the current best solution by explicitly quantifying its approximation level in the above sense. Furthermore, when a desired approximation level is specified, the proposed framework can be used for efficiently finding one of the -approximate regularization parameters.

The proposed framework is built on a novel CV error lower bound that can be represented as a function of the regularization parameter, and this is why we call it as a regularization path of CV error lower bounds. For computing a path, no special optimization algorithm is needed. We only need to have a finite number of solutions obtained by any algorithms. It is thus easy to apply our framework to common regularization parameter tuning strategies such as grid-search or Bayesian optimization. Furthermore, the proposed framework can be used not only with exact optimal solutions but also with sufficiently good approximate solutions, which is computationally advantageous because completely solving an optimization problem is often much more costly than obtaining a reasonably good approximate solution.

Our main contribution in this paper is to show that a theoretically guaranteed choice of a regularization parameter in the above sense is possible with reasonable computational costs. To the best of our knowledge, there is no other existing methods for providing such a theoretical guarantee on CV error that can be used as generally as ours. Figure 1 illustrates the behavior of the algorithm for obtaining = 0.1 approximate regularization parameter (see 5 for the setup).

Related works Optimal regularization parameter can be found if its exact regularization path can be com-

puted. Exact regularization path has been intensively studied [8, 15], but they are known to be numerically

unstable and do not scale well. Furthermore, exact regularization path can be computed only for a limited class of problems whose solutions are written as piecewise-linear functions of the regularization parameter [22]. Our framework is much more efficient and can be applied to wider classes of problems whose exact regularization path cannot be computed. This work was motivated by recent studies on approximate regularization path [13, 11, 12, 20]. These approximate regularization paths have a property that the objective function value at each regularization parameter value is no greater by than the optimal objective function value in the entire range of regularization parameters. Although these algorithms are much more stable and efficient than exact ones, for the task of tuning a regularization parameter, our interest is not in objective

Figure 1: An illustration of the proposed framework. One

function values but in CV errors. Our approach is more suitable for regularization parameter tuning tasks in the sense that the approximation quality is guaranteed in terms of CV error.

As illustrated in Figure 1, we only compute a finite number of solutions, but still provide approximation guarantee in the whole interval of the regularization parameter. To ensure such a property, we need to introduce a novel CV error lower bound that is sufficiently tight and represented as a monotonic function of the regularization parameter. Although several CV error bounds (mostly for leave-one-out CV) of SVM and other similar learning frameworks exist (e.g., [26, 16, 7, 17]), none of them satisfy the above required properties. The idea of our CV error bound is inspired from recent studies on safe screening [9, 28, 21, 19, 27] (see Appendix A for the detail). Furthermore, we emphasize that our contribution is not in presenting a new generalization error bound, but in introducing a practical framework for providing a theoretical guarantee on the choice of a regularization parameter. Although generalization error bounds such as structural risk minimization [25] might be used for a rough tuning of a regularization parameter, they are known to be too loose to use as an alternative to CV (see, e.g., 11 in [23]). We also note that our contribution is not in presenting new method for regularization parameter tuning such as Bayesian optimization [24], random search [1] and gradient-based search [6]. As we demonstrate in experiments, our approach can provide a theoretical approximation guarantee of the regularization parameter selected by these existing methods.

2 Problem Setup

We consider linear binary classification problems. Let be the training set where n is the size of the training set, d is the input dimension, and [n] := {1, . . . , n}. An independent held-out validation set with size is denoted similarly as . A linear decision function is written as f(x) = , where is a vector of coefficients, and represents the transpose. We assume the availability of a held-out validation set only for simplifying the exposition. All the proposed methods presented in this paper can be straightforwardly adapted to a cross-validation setup. Furthermore, the proposed methods can be kernelized if the loss function satisfies a certain condition. In this paper we focus on the following class of regularized convex loss minimization problems:

where C > 0 is the regularization parameter, and is the Euclidean norm. The loss function is denoted as . We assume that ) is convex and subdifferentiable in the 2nd argument. Examples of such loss functions include logistic loss, hinge loss, Huber-hinge loss, etc. For notational convenience, we denote the individual loss as ) := ) for all ]. The optimal solution for the regularization parameter C is explicitly denoted as . We assume that the regularization parameter is defined in a finite interval [], e.g., = 10and = 10as we did in the experiments.

For a solution , the validation erroris defined as

where ) is the indicator function. In this paper, we consider two problems. In the first problem, given a set of (either optimal or approximate) solutions at T different regularization parameters

], we compute the approximation level such that

In the second problem, we find an -approximate regularization parameter within an interval ],

which is defined as an element of the following set

Both of these two problems can be solved by using our proposed framework for computing a path of validation error lower bounds.

3 Validation error lower bounds as a function of regularization

In this section, we derive a validation error lower bound which is represented as a function of the regularization parameter C. Our basic idea is to compute a lower and an upper bound of the inner product score for each validation input ], as a function of the regularization parameter C. For computing the bounds of , we use a solution (either optimal or approximate) for a different regularization parameter ˜.

3.1 Score bounds

We first describe how to obtain a lower and an upper bound of inner product score based on an approximate solution ˆat a different regularization parameter ˜.

Lemma 1. Let ˆbe an approximate solution of the problem (1) for a regularization parameter value ˜C

The proof is presented in Appendix A. Lemma 1 tells that we have a lower and an upper bound of the score for each validation instance that linearly change with the regularization parameter C. When ˆis optimal, it can be shown that (see Proposition B.24 in [2]) there exists a subgradient such that g( ˆ) = 0, meaning that the bounds are tight because ( ˆ) = ( ˆ) = 0.

The results in Corollary 2 are obtained by simply substituting C = ˜C into (5a) and (5b).

3.2 Validation Error Bounds

Given a lower and an upper bound of the score of each validation instance, a lower bound of the validation error can be computed by simply using the following facts:

Furthermore, since the bounds in Lemma 1 linearly change with the regularization parameter C, we can identify the interval of C within which the validation instance is guaranteed to be mis-classified.

This lemma can be easily shown by applying (5) to (6).

Using Lemma 3, the lower bound of the validation error is represented as a function of the regularization parameter C in the following form.

Theorem 4. Using an approximate solution ˆfor a regularization parameter ˜C, the validation error

Theorem 4 is a direct consequence of Lemma 3. The lower bound (7) is a staircase function of the regularization parameter C.

By setting C = ˜C, we can obtain a lower and an upper bound of the validation error for the regularization parameter ˜C itself, which are used in the algorithm as a stopping criteria for obtaining an approximate solution ˆ.

————————————————– algorithm ————————————————–

4 Algorithm

In this section we present two algorithms for each of the two problems discussed in 2. Due to the space limitation, we roughly describe the most fundamental forms of these algorithms. Details and several extensions

of the algorithms are presented in supplementary appendices B , C and D.

4.1 Problem 1: Computing the approximation level ε from a given set of solutions

Given a set of (either optimal or approximate) solutions ˆ, obtained e.g., by ordinary grid-search, our first problem is to provide a theoretical approximation level in the sense of (3). This problem can be solved easily by using the validation error lower bounds developed in 3.2. The algorithm is presented in Algorithm 1, where we compute the current best validation error in line 1, and a lower bound of the best possible validation error := min) in line 2. Then, the approximation level can be simply obtained by subtracting the latter from the former. We note that ), the lower bound of , can be easily computed by using T valuation error lower bounds = 1, . . . , T , because they are represented as staircase functions of C.

4.2 Problem 2: Finding an ε-approximate regularization parameter

Given a desired approximation level such as = 0.01, our second problem is to find an -approximate regularization parameter. To this end we develop an algorithm that produces a set of optimal or approximate soluitons ˆsuch that, if we apply Algorithm 1 to this sequence, then approximation level would be smaller than or equal to . Algorithm 2 is the pseudo-code of this algorithm. It computes approximate solutions for an increasing sequence of regularization parameters in the main loop (lines 2-11).

Let us now consider iteration in the main loop, where we have already computed 1 approximate

is the best (in worst-case) regularization parameter obtained so far and it is guaranteed to be an -

approximate regularization parameter in the interval [˜] in the sense that the validation error,

the interval [˜]. Using the validation error lower bound in Theorem 4, the task is to find the smallest

˜that violates

In order to formulate such a ˜, let us define

Furthermore, let

and denote the -smallest element of Γ as (Γ) for any natural number k. Then, the smallest ˜˜

that violates (9) is given as

Figure 2: Illustrations of Algorithm 1 on three benchmark datasets (D2, D3, D4). The plots indicate how the approximation level improves as the number of solutions T increases in grid-search (red), Bayesian optimization (blue) and our own method (green, see the main text).

Figure 3: Illustrations of Algorithm 2 on ionosphere (D3) dataset for (a) op2 with = 0.10, (b) op2 with = 0.05 and (c) op3 with = 0.05, respectively. Figure 1 also shows the result for op3 with = 0.10.

5 Experiments

In this section we present experiments for illustrating the proposed methods. Table 2 summarizes the datasets used in the experiments. They are taken from libsvm dataset repository [4]. All the input features except D9 and D10 were standardized to [1]. For illustrative results, the instances were randomly divided into a training and a validation sets in roughly equal sizes. For quantitative results, we used 10-fold CV. We used Huber hinge loss (e.g., [5]) which is convex and subdifferentiable with respect to the second argument. The proposed methods are free from the choice of optimization solvers. In the experiments, we used an optimization solver described in [18], which is also implemented in well-known liblinear software [10]. Our slightly modified code (for adaptation to Huber hinge loss) is provided as a supplementary material, and it will be put in public domain after the paper is accepted. Whenever possible, we used warm-start approach, i.e., when we trained a new solution, we used the closest solutions trained so far (either approximate or optimal ones) as the initial starting point of the optimizer. All the computations were conducted by using a single core of an HP workstation Z800 (Xeon(R) CPU X5675 (3.07GHz), 48GB MEM). In all the experiments, we set = 10and = 10.

Results on problem 1 We applied Algorithm 1 in 4 to a set of solutions obtained by 1) grid-search, 2) Bayesian optimization (with expected improvement acquisition function), and 3) our own method that exploits information on the CV error lower bound available during the search process. Figure 2 illustrates the results on three datasets, where we see how the approximation level in the vertical axis changes as the number of solutions (T in our notation) increases. The red plots indicate the results of grid-search. As we increase the grid points, the approximation level was tended to be improved. The blue plots indicate the results of Bayesian Optimization (BO). Since BO tends to focus its search on a small region of the regularization parameter, it was difficult to tightly bound the approximation level. The green plots indicate the result of the third option, where we sequentially computed a solution whose validation error lower bound is smallest based on the information obtained so far. The results suggest that this naive approach seems to offer slight improvement from grid-search.

Results on problem 2 We applied Algorithm 2 to benchmark datasets for demonstrating theoretically guaranteed choice of a regularization parameter is possible with reasonable computational costs. Besides the algorithm presented in 4, we also tested a variant described in supplementary Appendix B. Specifically, we have three algorithm options. In the first option (op1), we used optimal solutions for computing CV error lower bounds. In the second option (op2), we instead used approximate solutions . In the last option (op3), we additionally used speed-up tricks described in supplementary Appendix B. We considered four different choices of . Note that = 0 indicates the task of finding the exactly optimal regularization parameter. In some datasets, the smallest validation errors are less than 0.1 or 0.05, in which cases we do not report the results (indicated as “05” etc.). In trick1, we initially computed solutions at four different regularization parameter values evenly allocated in [1010] in the logarithmic scale. In trick2, the next regularization parameter ˜was set by replacing in (10) with 1(see supplementary Appendix B).

For the purpose of illustration, we plot examples of validation error curves in several setups. Figure 3 shows the validation error curves of ionosphere (D3) dataset for several options and .

Next, we report the results on computational costs in CV setups. Table 1 shows the number of optimization problems we actually solved in the algorithm (which is denoted as T ), and the total computation time in seconds. The computational costs of the methods mostly depend on T . As is evident from the algorithm description in gets smaller as increases. Two tricks in supplementary Appendix B seem to be helpful in most cases for reducing T . In addition, we see the advantage of using approximate solutions by comparing the computation times of op1 and op2, although approximate solutions can be only used for = 0. Over-

Table 1: Computational costs. For each of the three options and , the number of optimization problems solved (denoted as T ) and the total computational costs (denoted as time) are listed. Note that, for op2, there are no results for = 0.

all, the results suggest that the proposed algorithm allows us to find theoretically guaranteed approximate regularization parameters with reasonable costs except for = 0 cases. For example, the algorithm found an = 0.01 approximate regularization parameter within a minute in 10-fold CV for a dataset with more than 50000 instances (see the results on D10 for = 0.01 with op2 and op3 in Table 1).

6 Conclusions and future works

We presented a novel algorithmic framework for computing CV error lower bounds as a function of the regularization parameter. The proposed framework can be used for a theoretically guaranteed choice of a regularization parameter. Additional advantage of this framework is that we only need to compute a set of sufficiently good approximate solutions for obtaining such a theoretical guarantee, which is computationally advantageous. As demonstrated in the experiments, our algorithm is practical in the sense that the computational cost is reasonable as long as the approximation quality is not too close to 0. An important future work is to extend the approach to multiple hyper-parameters tuning setups.

A Proof of Lemma 1

In this section we prove Lemma 1. First we present two propositions which are used of proving Lemma 1.

See, for example, Proposition B.24 in [2] for the proof of Proposition 6.

Proposition 7. Let be arbitrary d-dimensional vectors and r > 0 be an arbitrary positive constant. Then, the solutions of the following optimization problem can be explicitly obtained as follows:

Proof of Proposition 7. Using a Lagrange multiplier 0, the problem (12) is rewritten as

where is strictly positive because the constraint is strictly active at the optimal solution. By

letting = 0, the optimal is written as

Substituting into ),

The upper bound of in (13) can be shown similarly.

Proof of Lemma 1. From Proposition 6, the optimal solution satisfies

Since from is convex for any ] and the definition of a subgradient, we have the following two inequalities:

Combining these two inequalities, we have

Substituting (15) into (14),

From (4),

Substituting (17) into (16),

Using Proposition 7, the solution of (18) is given as

Figure 4: An illustrative example of Algo-

and the solution of (19) is given as

Remark 8. We note that the idea of using Propositions 6 and 7 for proving Lemma 1 is inspired from recent studies on safe screening [9, 28, 21, 19, 27]. Safe screening has been introduced in the context of sparse modeling. It allows us to identify sparse features or instances before actually solving the optimization problem. A key technique used in those studies is to bound Lagrange multipliers at the optimal solution (Lagrange multiplier values at the optimal solution tell us which features or instances are active or non-active) in somewhat similar way as we did in 3. Our main contribution is to borrow this idea for representing a validation error lower bound as a function of the regularization parameter, and show that it can be used for finding an approximately optimal regularization parameter with theoretical guarantee.

B Details of the speed-up tricks for ﬁnding an ε-approximate regularization parameter

In this appendix, we first describe two modifications of the basic algorithm for finding an -approximate regularization parameter presented in 4.2 for further speed-up.

Trick1 The efficiency of the algorithm depends on how far one can step forward in each iteration. We see in (10) that the step size ˜˜is large if the current minimum validation error upper bound is small. In other words, the step size will be small until we have sufficiently small . It suggests that, if we can find small enough at an earlier stage of the algorithm, we can reduce the total computational cost of the algorithm. In order to find sufficiently small as early as possible, we propose a simple heuristic approach, where we first roughly search over the entire range by a rough grid search.

Trick2 Our next modification for speed-up is to use

for computing the validation error lower bound in [ ˜˜]. It provides a tighter validation error lower bounds than using ) alone, meaning that larger step might be allowed in each iteration. However, we cannot actually compute ) before we fix ˜. We thus propose a simple trial-and-error approach. Specifically, we step forward a little bit further than (10) when we select the next ˜. After we fix ˜, we compute an approximate solution ˆand then check whether the validation error ) is not smaller by than the current minimum for [ ˜] by using now available ).

Algorithm 3 is the pseudo-code of the proposed algorithm along with tricks 1 and 2.

There are two additional input parameters and 1. The former is used for trick1, where we initially compute m approximate solutions for regularization parameter values evenly allocated in the interval [] in the logarithmic scale. Trick1 is described at lines 2-9 in Algorithm 3.

The latter 1 is used for trick2, where the next regularization parameter value is determined in trial-and-error manner. To formally describe trick2, let us define a set Γ as a function of w in the following

way

Then, our initial trial step is written as

where 1 represents how far we step forward. We then compute an approximate solution ˆ,

and obtain a validation error lower bound ) by combining ) and ). For accepting this trial step, we need to make sure that the lower bounds are not smaller by than the current best for any ]. To this end, we investigate where the two lower bounds ) and ) go below . To formulate this, let us define

the following two functions

where, for the latter, we define

and denote the -largest element of ∆ as (∆) for any natural number k. The trial step to is

accepted if

If not, we need to shrink the trial step by using the procedure described in Algorithm 4. Briefly speaking, Algorithm 4 conducts a bisection search until we find two approximate solutions ˆand ˆthat satisfy ( ˆ( ˆ). We note that, with the use of trick2, the sequence of the regularization parameter values ˜˜is not necessarily in increasing order because they are computed in trial-and-error manner.

C Approximate regularization path in terms of validation errors

In this appendix, we describe the details of approximate regularization path in terms of validation errors and its experimental results.

By slightly modifying the algorithm, we can compute an -approximate regularization path whose approximation level is measured in terms of the validation errors. Such an -approximate regularization path

is formulated as a function

such that

In order to compute W, we need an upper bound of the validation errors as well as a lower bound represented as a function of the regularization parameter. Given a solution ˆfor a regularization parameter ˜C, our basic idea is to go forward the regularization path as long as the difference between the upper and the lower bounds are not greater than . We note that, the approximation quality of our approximate regularization path is measured in terms of the validation errors, which is more advantageous for hyper-parameter tuning tasks than existing approaches [13, 11, 12, 20] in which the approximation quality is evaluated in terms of the objective function values.

We compute a validation error upper bound based on the following simple facts:

Based on these facts, we have a lemma for validation error upper bounds similar to Lemma 3:

This lemma can be easily shown by applying (5) to (23).

Using Lemma 9, an upper bound of the validation errors is represented as a function of the regularization parameter C in the following form.

Theorem 10. Using an approximate solution ˆfor a regularization parameter ˜C, the validation error

Theorem 10 is a direct consequence of Lemma 9.

C.1 Algorithm

Algorithm 5 is the pseudo-code of our approximate regularization path.

Figure 5: An illustrative image of tack-

The main difference between Algorithm 2 and Algorithm 5 is in how to determine the next regularization parameter value ˜. For tracking an approximate solution path, we need to find the smallest ˜such that the difference between the upper and the lower bounds ) is

greater than or equal to . To formulate this, let us define

and

Then, ˜that meets the above requirement is formulated as

Figure 5 depicts how is determined.

Using the output of Algorithm 5 , our approximate regularization path is written as

where

In approximate regularization path computation, we need a special treatment in a pathological situation that the signs of the scores of multiple validation instances change at one time at a regularization parameter

value C. Such a pathological situation is formally stated as follows. Let

Then, if the size of Ω is greater than

Algorithm 5 does not work properly. Although such a pathological situation can be considered as an exceptional case and treated by tedious book-keeping operations, in the following experiments, we simply add an constraint that ˜˜10.

C.2 Experiments

Here, we describe the experimental results on approximate regularization path computation. The experimental setup is same as that in 5. Since we cannot use speed-up tricks here, we have two algorithm options. In the first option (op4), we used optimal solutions for computing CV error lower bounds. In the second option (op5), we instead used approximate solutions . Table 3 shows the experimental results. Compared with the results in Table 1, we needed to solve more optimization problems (denoted as T ) and hence the total computational cost is larger than simply finding an -approximate regularization parameter. For large datasets D9 and D10 with = 0, we could not finish the computations within 100 hours.

Table 3: Complexities and computational costs of approximate regularization path computation experiments. For each of the three options and , the number of optimization problems solved (denoted as T ) and the total computational costs (denoted as time) are listed. Note that, for op5, there are no results for = 0. For D9 and D10 with = 0, we could not finish the computations within 100 hours.

D Adaptation to cross-validation setup

All the methods presented above can be straightforwardly adapted to a cross-validation (CV) setup. Consider k-fold CV where n instances are divided into k disjoint subsets with almost equal size. Let

be the optimal solution trained without using the instances in . Then, the k-fold CV error is defined as

where, note that, the CV error is not a function of w, but a function of C. Our algorithm can find an -approximate regularization parameter at which the k-fold CV error is guaranteed to be no greater by than the smallest possible k-fold CV error. For each of the k folds, we can compute a validation error lower bound as described before. A lower bound of the entire k-fold CV error can be obtained by simply summing them up.

Algorithm 3 : Finding an -approximate regularization parameter with approximate solutions using tricks 1

and 2

¯1

= 0 to

, ˆ

Set by (20) using ˆ

Set by (22) using ˆ

break while loop

solve (1) approximately for

RecursiveCheck( ˜)

+ 1

References

[1] J. Bergstra and Y. Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13:281–305, 2012.

[2] P D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

[3] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.

[4] C. Chang and C. Lin. LIBSVM : A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:1–39, 2011.

[5] O. Chapelle. Training a support vector machine in the primal. Neural computation, 19:1155–1178, 2007.

[6] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46:131–159, 2002.

[7] K. Chung, W. Kao, C. Sun, L. Wang, and C. Lin. Radius margin bounds for support vector machines with the RBF kernel. Neural computation, 2003.

[8] B. Efron, T. Hastie, I. Johnstone, and R. TIbshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

[9] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization, 2012.

[10] R. Fan, K. Chang, and C. Hsieh. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning, 9:1871–1874, 2008.

[11] J. Giesen, M. Jaggi, and S. Laue. Approximating Parameterized Convex Optimization Problems. ACM Transactions on Algorithms, 9, 2012.

[12] J. Giesen, S. Laue, and Wieschollek P. Robust and Efficient Kernel Hyperparameter Paths with Guarantees. In International Conference on Machine Learning, 2014.

[13] J. Giesen, J. Mueller, S. Laue, and S. Swiercy. Approximating Concavely Parameterized Optimization Problems. In Advances in Neural Information Processing Systems, 2012.

[14] G. Guennebaud, B. Jacob, and Others. Eigen v3. http://eigen.tuxfamily.org, 2010.

[15] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–1415, 2004.

[16] T. Joachims. Estimating the generalization performance of a SVM efficiently. In International Conference on Machine Learning, 2000.

[17] M. Lee, S. Keerthi, C. Ong, and D. DeCoste. An efficient method for computing leave-one-out error in support vector machines with Gaussian kernels. IEEE Transactions on Neural Networks, 15:750–7, 2004.

[18] C. Lin, R. Weng, and S. Keerthi. Trust Region Newton Method for Large-Scale Logistic Regression. The Journal of Machine Learning Research, 9:627–650, 2008.

[19] J. Liu, Z. Zhao, J. Wang, and J. Ye. Safe Screening with Variational Inequalities and Its Application to Lasso. In International Conference on Machine Learning, volume 32, 2014.

[20] J. Mairal and B. Yu. Complexity analysis of the Lasso reguralization path. In International Conference on Machine Learning, 2012.

[21] K. Ogawa, Y. Suzuki, and I. Takeuchi. Safe screening of non-support vectors in pathwise SVM computation. In International Conference on Machine Learning, 2013.

[22] S. Rosset and J. Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35:1012–1030, 2007.

[23] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning. Cambridge University Press, 2014.

[24] J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Sysrtems, 2012.

[25] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1996.

[26] V. Vapnik and O. Chapelle. Bounds on Error Expectation for Support Vector Machines. Neural Computation, 12:2013–2036, 2000.

[27] J. Wang, J. Zhou, J. Liu, P. Wonka, and J. Ye. A Safe Screening Rule for Sparse Logistic Regression. In Advances in Neural Information Processing Sysrtems, 2014.

[28] Z. Xiang, H. Xu, and P. Ramadge. Learning sparse representations of high dimensional data on large scale dictionaries. In Advances in Neural Information Processing Sysrtems, 2011.

designed for accessibility and to further open science