Safe Sample Screening for Robust Support Vector Machine

2019·Arxiv

Abstract

Abstract

Robust support vector machine (RSVM) has been shown to perform remarkably well to improve the generalization performance of support vector machine under the noisy environment. Unfortunately, in order to handle the non-convexity induced by ramp loss in RSVM, existing RSVM solvers often adopt the DC programming framework which is computationally inefficient for running multiple outer loops. This hinders the application of RSVM to large-scale problems. Safe sample screening that allows for the exclusion of training samples prior to or early in the training process is an effective method to greatly reduce computational time. However, existing safe sample screening algorithms are limited to convex optimization problems while RSVM is a non-convex problem. To address this challenge, in this paper, we propose two safe sample screening rules for RSVM based on the framework of concave-convex procedure (CCCP). Specifically, we provide screening rule for the inner solver of CCCP and another rule for propagating screened samples between two successive solvers of CCCP. To the best of our knowledge, this is the first work of safe sample screening to a non-convex optimization problem. More importantly, we provide the security guarantee to our sample screening rules to RSVM. Experimental results on a variety of benchmark datasets verify that our safe sample screening rules can significantly reduce the computational time.

Introduction

In supervised learning, support vector machine (SVM) (Chang and Lin 2011; Platt 1998; Gu and Sheng 2016; Li et al. 2015; Gu et al. 2018a; Gu et al. 2015; Gu et al. 2016; Gu, Sheng, and Li 2015; Gu 2018) is a powerful classifica-tion method that is widely used to separate data by maximizing the margin between two classes. However, realworld data tend to be massive in quantity but with quite a few unreliable outliers. Traditional SVM usually use convex hinge loss function to calculate the loss of misclassi-fied samples. Since the convex function is unbounded and puts an extremely large penalty on outliers, traditional SVM is unstable in the presence of outliers. Robust support vector machine (RSVM) (Wu and Liu 2007; Shen et al. 2003; Xu, Crammer, and Schuurmans 2006) suppresses the influ-ence of outliers on the decision function through clipping the convex hinge loss to the non-convex ramp loss, and has been shown to perform remarkably well under the noisy environment.

The non-convex objective function of RSVM can be viewed as a difference of convex (DC) (Sriperumbudur and Lanckriet 2009) programming problem which is normally solved by the concave-convex procedure (CCCP) (Yuille and Rangarajan 2003; Collobert et al. 2006; Gu et al. 2018c; Yu et al. 2019) algorithm. The CCCP algorithm iteratively solves a sequence of constrained convex optimization problems. For each loop of CCCP algorithm, it solves a surrogate convex optimization problem which linearizes the concave part of the original DC programming problem. The inner surrogate convex optimization problem is very similar to the problem of SVM, and is normally solved by the sequential minimal optimization (SMO) algorithm (Platt 1998; Vapnik 2013). As pointed out in (Chang and Lin 2011), the time complexity of SMO algorithm is , where is the number of the training samples. Thus, the time complexity of CCCP algorithm to solve RSVM is , where t is the number of loops of the CCCP algorithm. The high computational cost severely hinders the implementation of RSVM and its application to big data.

To address the above challenging problem, one promising approach is safe screening. Ghaoui et al. (2010) first exploited safe screening rules to discard inactive features prior to starting a Lasso solver. They exploited the geometric quantities of the feature space to bound the Lasso dual solution to be within a compact region and only need to solve a smaller optimization problem on the reduced datasets which leads to huge savings in the computational cost and memory usage. Since then, the concept of safe screening has been expanded in two main directions. The first direction is called sequential screening, which performs screening along the entire regularization path which is the sequence of optimal solutions w.r.t. different values of regularization parameter. Sequential screening relies on an additional feasible or optimal solution obtained in advance, which can provide a warm start of the screening process. This direction has been pursued in (Wang et al. 2013; Wang et al. 2014; Liu et al. 2013;

Table 1: Representative safe screening algorithm. (“Type” represents the algorithm screening samples or features).

Xu and Ramadge 2013; Xiang, Wang, and Ramadge 2016; El Ghaoui, Viallon, and Rabbani 2011). However, they are only applicable to algorithms that also compute the regularization paths. The second direction is called dynamic screening (Bonnefoy et al. 2014; Bonnefoy et al. 2015), which performs the screening throughout the optimization algorithm itself. For example, Fercoq et al. (2015) proposed a duality gap based safe feature screening algorithm for lasso. Although dynamic screening might be useless early in the training process, it might become efficient as the algorithm proceeds towards the optimal solution. Further, Rakotomamonjy et al. (2019) expanded safe feature screening rule to lasso with non-convex sparse regularizers. They handled the non-convexity of the objective through the majorizationminimization (MM) principle and provided a warm-start process that allows to propagate screened features from one MM iteration to the next.

Recently, Ogawa et al. (2013) first proposed a safe screening to indentify non-support vectors for SVM. They extended the existing feature-screening methods to samplescreening. On this basis, Ogawa et al. (2014) and Wang et al. (2014) improved its ability to screen inactive samples. However, as sequential screening algorithms, they relies on an additional feasible or optimal solution obtained in advance, which can be very time consuming. To overcome this dif-ficult, Zimmert et al. (2015) proposed a dynamic screening rule using a duality gap function in the primal variables of hinge loss kernel SVM. We summarized several representative safe screening algorithms in table 1. It shows that existing safe feature screening algorithms have been widely used in convex and non-convex problems while existing safe samples algorithms are limited to convex problems. Dynamic screening algorithm for SVM can not provide a warm-start for training the model, so it only works during the training the model. It is obvious that the dynamic samples screening rule for RSVM is still an open problem.

In this paper, we propose two safe sample screening rules for RSVM based on the framework of concave-convex procedure (CCCP). Specifically, we first provide a screening rule for the inner solver of CCCP. Secondly, we provide a new rule for propagating screened samples between two successive solvers of CCCP. To the best of our knowledge, this is the first work of safe sample screening to a non-convex optimization problem. More importantly, we provide the security guarantee to our sample screening rules to RSVM. Experimental results on a variety of benchmark datasets verify that our safe sample screening rules can significantly reduce

the computational time.

Contributions. The main contributions of this paper are summarized as follows:

1. To the best of our knowledge, we are the first to propose a safe samples screening rule for the non-convex problem.

2. By utilizing an iterative CCCP strategy to solve RSVM, we proposed a safe samples screening rule for propagating screened samples between two successive solvers of CCCP.

Preliminaries of Robust Support Vector Machine

In this section, we first give a brief review of RSVM. Then, we give the primal and dual form of RSVM. At last, we give the screening set in the RSVM.

(1) where are the parameters of the model, and is a transformation function from an input space to a highdimensional reproducing kernel Hilbert space. SVM solves the following minimization problem:

where the function is the hinge loss. Since the convex hinge loss function is unbounded and puts an extremely large penalty on outliers, traditional SVM is unstable in the presence of outliers. We clip the hinge loss to get the ramp loss , where . The ramp loss is bounded, meaning that noisy samples cannot influence the solution beyond that of any other misclassified point. Thus, RSVM can effectively suppress the influence of outliers and it solves the following minimization problem:

where o and v are real-valued convex functions. It is easy to see that the objective function (3) is a form of DC.

Primal and Dual Problem

The non-convex objective function of RSVM can be viewed as a DC programming problem which is normally solved by the CCCP algorithm. The main mechanism of CCCP algorithm is to iteratively construct an optimized surrogate objective function which linearizes the concave part of the original DC programming problem. In order to apply the CCCP algorithm to solve the problem (3), we first have to calculate derivative of the concave part with respect to :

The primal problem can be transformed into the following optimized surrogate objective:

In this paper, we call the formulation (6) as convex inner loop (CIL) problem. Using Lagrange multiplier method (Bertsekas 2014), we directly give the dual form of the primal problem as follows:

where H is a positive semidefinite matrix with for all is the kernel function, and is the Lagrange multiplier.

According to the convex optimization theory (Boyd and Vandenberghe 2004), the dual CIL problem (7) can be transformed into the following min-max form:

where are the parameters of the dual CIL problem and is the Lagrangian multiplier. Further, from the KKT theorem (Bazaraa and Shetty 2012), the first-order derivative of with respect to leads to the following KKT conditions:

Screening Set

Safe sample screening is built on the KKT optimality condition (Bertsekas 1997). According to the gradient , we can categorize the n training samples into three cases:

Switching to the primal problem, (10) leads to the following four cases:

If some of the training samples are known to satisfy the case (11) or (14) in advance, we can throw away those samples prior to the training stage. Similarly, if we know that some samples satisfy case (13), we can fix the corresponding at the following training process. Namely, if some knowledge on these five cases are known a-priori, our training task would be extremely easy. The samples satisfy the case (11) or (14) are often called non-support vectors because they have no influence on the resulting classifier.

In this paper, we show that, through our safe sample screening rule, some of the non-SVs and some of the samples satisfying case (13) or (14) can be screened out prior to the training process. Then, in the latter training process, we can train the model with fewer samples to reduce computation time while ensuring consistent results. Suppose that we obtain an active set A (a subset of D) after applying our safe sample screening rule, correspondingly, we can define an inactive set that the variables are fixed. The original optimization (7) can be reduced into a smaller optimization problem as follows.

which is a smaller optimization problem. Notice that different from existing approximate shrinking heuristics (Chang and Lin 2011; Joachims 1999b; Gu et al. 2018b; Joachims 1999a; Fan, Chen, and Lin 2005) which sample through a boundary without well theory guarantees, the active set obtained by our sample screening rule is safe and reliable.

Safe Screening Rule for Single CIL Problem

In this section, we first provide the safe screening rule for single CIL problem. Then, we give the implementation of single CIL problem. Finally, we analyze the security analysis of sample screening rule.

Safe Screening Rule

As stated in the duality theory, the dual problem is a lower bound on the primal problem . When strong duality theorem (Boyd and Vandenberghe 2004) is satisfied, the optimal solution of the dual problem is equal to the optimal solution of the primal problem. We define the duality gap functions as follows:

and respectively. The weak duality theorem (Boyd and Vandenberghe 2004) guarantees that duality gap is always greater than 0. In the following, we first show that the duality gap is a strong convex function. Property 1. The duality gap is strongly convex with parameter . Then

We provide the detailed proof in the appendix. According to the strongly convex property of the duality gap, we can easily get that the euclidean distance between arbitrarily feasible solution and the optimal solution is always less than the current duality gap. Corollary 1. Let be the optimal solution of the primal problem. Then we have

Based on Corollary 1, we further obtain the inequality relation between the euclidean distance from the feasible solution to the optimal solution of any sample and the current duality gap. Corollary 2. Let be the optimal solution of the dual problem. Denote the entries of the associated kernel matrix, then for all we have:

to Corollary 2, we know that the optimal solution is always in a circle with the feasible solution as the center of the circle and as the radius. Thus, when this circle does not contain the point , we can screen out this sample. The safe sample screening rule is summarized as follows:

Figure 1: Illustration of safe sample screening rule. The points O in the figure represents support vector i.e. . During the training process, if a certain sample i is a support vector, the feasible solution must

be in a circle centered at O and radius in any iteration. Correspondingly, if the feasible solution is at point B, its optimal solution must be in a circle of the same radius r. At this time, the optimal solution must be greater than 0. On the contrary, when the feasible solution is at point A, we are not sure that the optimal solution is equal to 0.

We give the illustration of safe sample screening rule in Figure 1.

Interpretation

We use a SMO algorithm to solve the CIL problem in its dual form (7). The core idea of SMO algorithm is to heuristically select two samples that violate the KKT condition to the largest extent to update. Then, we will update the gradient of all the samples and the parameter b. SMO algorithm repeat the process until it converges. The gradient of all the samples is defined as follows:

During the training process, in order to use safe sample screening rule, we need to compute the duality gap. Major time-consuming of duality gap is compute . In the following, we will show that how to use the gradient in the update process to compute the duality gap easily. According to the KKT conditions, the is defined as follows:

and the can be easily solved as:

Thus, we can avoid recalculating kernel by maintaining the gradient in each iteration of SMO algorithm. The duality gap in the early stage can always be large, which makes the dual and primal estimations inaccurate and finally results in ineffective screening rules. We typically start sample screening after 50 iterations and screen the samples every 10 iterations. We summarized the safe sample screening rule for single CIL problem in Algorithm 1.

Theoretical Analysis

The main advantage of our safe sample screening algorithm is its theoretic guarantee. In the following, we prove that all inactive samples would be detected and screened by our screening rule after a finite number of iterations of the algorithm.

Property 2. Define the screening set of the CIL problem as and obtained at iteration k of an algorithm solving the CIL problem. Then, there exists s.t. .

We provide the detailed proof in the appendix.

Safe Screening Rule for Successive CIL Problems

In this section, we give the propagation behavior of the screened samples.

Propagating Screening Rule

Prior to this we have introduced the inner solver for single CIL problem and its safe screening rule, we are going to analyze how this rule can be improved into solving successive CIL problems. Each iteration of the CCCP algorithm approximates the concave part by its tangent and minimizes the resulting convex function, and the its tangent is always an upper bound of the concave part:

For each inner problem of CCCP, we can perform screening by using as defined in (5), which improves the efficiency of CCCP. However, since the ’s are also expected to vary for different iterations, we do not know whether the a screen sample of iteration k can be safely screened in iteration k+1. Thus, in the next iteration, we usually need to solve a new CIL problem due to the change of . In the following, we derive conditions that could be used to propagate screened samples from one iteration to the next in a CCCP algorithm.

First of all, we give the relation of feasible solutions of any two subproblems:

Then, we consider the relation of duality gap of any two subproblems:

By utilizing the relation from one iteration to the next in a CCCP algorithm, we can obtain the Property 3.

Property 3. Consider a CIL problem with and its feasible solutions allowing to screen samples according to (20)-(21). Suppose that we have a new set of weight of defining a new CIL problem. Given a primal-dual feasible solution for the latter problem, a safe sample screening rule for sample i reads

where m, n and q are constants such that anddenote vector for all .

We provide the detailed proof in the appendix. To make this safe sample screening rule tractable, we first need a feasible solution of the dual CIL problem in iteration k + 1, then we can obtain an upper bound on the norm of and a bound on the difference of the duality gap and threshold respectively. Interestingly, when using the primal solution as our feasible solution in iteration k + 1, the safe sample screening rule given in Property 3 does not involve any additional dot product and is cheaper to compute. The propagation of screened samples provide a warm-start process for the next iteration in a CCCP algorithm. We summarized the safe sample screening rule for successive CIL problems in Algorithm 2.

Experiments

In this section, we first present the experimental setup, and then provide the experimental results and discussions.

Figure 2: Average computational time of four contrast algorithms under different setting.

Experimental Setup

Design of Experiments: In the experiments, we compare the computation time of different algorithms for computing the optimization problem (3) to verify the effectiveness of our algorithm. The active set technique (also called shrinking technique) is used by two of the most commonly used state-of-the-art SVM solvers, LIBSVM (Chang and Lin 2011) and SVMLight (Joachims 1999b). Similarly, the active set technique solves a smaller optimization problem (15) by screening inactive samples to reduce the computational time. However, these methods do not have theoretic guarantee w.r.t. whether a training sample can be safely remove. In the experiments, we combine our safe sample screening rules with active set technology to reduce the computational time. Specifically, we use our safe sample screening rules at the beginning of the experiment during which the screening operation is invoked every 10 iterations until the duality gap is smaller than . Then, we use active set technique for the rest of the training process.

In addition, we compared our safe sample screening algo-

rithm with traditional RSVM algorithm, which uses all the samples to train the model during the whole training process. The compared algorithms are summarized as follows.

1. Safe: Our proposed safe sample screening algorithm.

2. Shrink: The active set technique without safe screening guarantee (Chang and Lin 2011; Joachims 1999b).

3. Shrink+Safe: our safe sample screening rules combined with active set technique.

4. Non screening: The traditional RSVM algorithm with CCCP (Collobert et al. 2006).

Implementation: We implement our algorithm in MATLAB. For kernel, the linear kernel and Gaussian kernel expare used in all experiments. The parameter C is selected from the set {0.1, 1, 10, 100}. The Gaussian kernel parameter is selected from the set {0.05, 0.5, 5}. The ramp loss function parameter s is fixed at 0. The optimization precision is set to be . For each dataset, we randomly selected 20000

Figure 3: The screening rate of different datasets.

samples for training.

Datasets: Table 2 summarizes the four benchmark datasets (CodRNA, ijcnn1, a9a, letter) used in the experiments. There are from LIBSVM 1sources. Originally, the Letter is a 26-class dataset (i.e., the alphabet “A”-“Z”). We created a binary version of Letter dataset by classifying alphabet A to M versus N to Z.

Table 2: The benchmark datasets used in the experiments.

Results and Discussions Figure 2 presents the average computational time of four competing algorithms under different setting. Compared with Non screening, our safe sample screening rule can effectively reduce almost 50% computational time in most settings. The result clearly demonstrate that our safe sample screening rule combined with active set technology is the most efficient method for reducing computational time, sig-nificantly outperforming the standalone active set technique. This is because the active set technique is not safe, meaning that, when it erroneously screens useful samples, it needs to repeat the training after correcting those mistakes. Our safe sample screening rule can safely screen samples until close to the optimal solution. Then, when using the activity

set technique, we can try to avoid screening active samples by mistake as much as possible. Even if some samples are wrongly screened, the retrained process only requires fewer samples.

Figure 3 presents the screening rate of different datasets. The results clearly demonstrate that when the Gaussiam kernel parameter is small, our safe sample screening rule can effectively screen half of the inactive samples at the beginning of training process. As the number of iterations increases, our safe sample screening rule can screen almost all inactive samples. In Figure 3 (d), (k), (I), we only screen a few inactive samples because the SVM model contains a lot of support vectors and is not sparse in samples at this setting.

Conclusion

In this paper, we propose two safe sample screening rules for RSVM based on the CCCP algorithm. Specifically, we first provide a screening rule for the inner solver of CCCP. Secondly, we provide a new rule for propagating screened samples between two successive solvers of CCCP. We also provide the security guarantee to our sample screening rules to RSVM. To the best of our knowledge, this is the first work of safe sample screening to a non-convex optimization problem. Experimental results on a variety of benchmark datasets verify that our safe sample screening rules can significantly reduce the computational time.

Acknowledgments

This work was supported by Six talent peaks project (No. XYDXX-042) and the 333 Project (No. BRA2017455) in Jiangsu Province and the National Natural Science Foundation of China (No: 61573191).

References

[Bazaraa and Shetty 2012] Bazaraa, M. S., and Shetty, C. M. 2012. Foundations of optimization, volume 122. Springer Science & Business Media.

[Bertsekas 1997] Bertsekas, D. P. 1997. Nonlinear programming. Journal of the Operational Research Society 48(3):334–334.

[Bertsekas 2014] Bertsekas, D. P. 2014. Constrained optimization and Lagrange multiplier methods. Academic press.

[Bonnefoy et al. 2014] Bonnefoy, A.; Emiya, V.; Ralaivola, L.; and Gribonval, R. 2014. A dynamic screening principle for the lasso. In 2014 22nd European Signal Processing Conference (EUSIPCO), 6–10. IEEE.

[Bonnefoy et al. 2015] Bonnefoy, A.; Emiya, V.; Ralaivola, L.; and Gribonval, R. 2015. Dynamic screening: Accelerating first-order algorithms for the lasso and group-lasso. IEEE Transactions on Signal Processing 63(19):5121–5132.

[Boyd and Vandenberghe 2004] Boyd, S., and Vandenberghe, L. 2004. Convex optimization. Cambridge university press.

[Chang and Lin 2011] Chang, C.-C., and Lin, C.-J. 2011. Libsvm: A library for support vector machines. ACM transactions on intelligent systems and technology (TIST) 2(3):27.

[Collobert et al. 2006] Collobert, R.; Sinz, F.; Weston, J.; and Bottou, L. 2006. Large scale transductive svms. Journal of Machine Learning Research 7(Aug):1687–1712.

[El Ghaoui, Viallon, and Rabbani 2011] El Ghaoui, L.; Vial- lon, V.; and Rabbani, T. 2011. Safe feature elimination for the lasso. Submitted, April.

[Fan, Chen, and Lin 2005] Fan, R.-E.; Chen, P.-H.; and Lin, C.-J. 2005. Working set selection using second order information for training support vector machines. Journal of machine learning research 6(Dec):1889–1918.

[Fercoq, Gramfort, and Salmon 2015] Fercoq, O.; Gramfort, A.; and Salmon, J. 2015. Mind the duality gap: safer rules for the lasso. arXiv preprint arXiv:1505.03410.

[Ghaoui, Viallon, and Rabbani 2010] Ghaoui, L. E.; Viallon, V.; and Rabbani, T. 2010. Safe feature elimination for the lasso and sparse supervised learning problems. arXiv preprint arXiv:1009.4219.

[Gu and Sheng 2016] Gu, B., and Sheng, V. S. 2016. A ro- bust regularization path algorithm for -support vector clas-sification. IEEE Transactions on neural networks and learning systems 28(5):1241–1248.

[Gu et al. 2015] Gu, B.; Sheng, V. S.; Wang, Z.; Ho, D.; Os- man, S.; and Li, S. 2015. Incremental learning for -support vector regression. Neural Networks 67:140–150.

[Gu et al. 2016] Gu, B.; Sheng, V. S.; Tay, K. Y.; Romano, W.; and Li, S. 2016. Cross validation through twodimensional solution surface for cost-sensitive svm. IEEE transactions on pattern analysis and machine intelligence 39(6):1103–1121.

[Gu et al. 2018a] Gu, B.; Quan, X.; Gu, Y.; Sheng, V. S.; and Zheng, G. 2018a. Chunk incremental learning for cost-

sensitive hinge loss support vector machine. Pattern Recognition 83:196–208.

[Gu et al. 2018b] Gu, B.; Shan, Y.; Geng, X.; and Zheng, G. 2018b. Accelerated asynchronous greedy coordinate descent algorithm for svms. In IJCAI, 2170–2176.

[Gu et al. 2018c] Gu, B.; Yuan, X.-T.; Chen, S.; and Huang, H. 2018c. New incremental learning algorithm for semi-supervised support vector machine. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1475–1484. ACM.

[Gu, Sheng, and Li 2015] Gu, B.; Sheng, V. S.; and Li, S. 2015. Bi-parameter space partition for cost-sensitive svm. In Twenty-Fourth International Joint Conference on Artifi-cial Intelligence.

[Gu 2018] Gu, B. 2018. A regularization path algorithm for support vector ordinal regression. Neural Networks 98:114– 121.

[Joachims 1999a] Joachims, T. 1999a. Making large-scale support vector machine learning practical, advances in kernel methods. Support vector learning.

[Joachims 1999b] Joachims, T. 1999b. Svmlight: Support vector machine. SVM-Light Support Vector Machine http://svmlight. joachims. org/, University of Dortmund 19(4).

[Li et al. 2015] Li, X.; Wang, H.; Gu, B.; and Ling, C. X. 2015. Data sparseness in linear svm. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

[Liu et al. 2013] Liu, J.; Zhao, Z.; Wang, J.; and Ye, J. 2013. Safe screening with variational inequalities and its application to lasso. arXiv preprint arXiv:1307.7577.

[Ogawa et al. 2014] Ogawa, K.; Suzuki, Y.; Suzumura, S.; and Takeuchi, I. 2014. Safe sample screening for support vector machines. arXiv preprint arXiv:1401.6740.

[Ogawa, Suzuki, and Takeuchi 2013] Ogawa, K.; Suzuki, Y.; and Takeuchi, I. 2013. Safe screening of non-support vectors in pathwise svm computation. In International conference on machine learning, 1382–1390.

[Platt 1998] Platt, J. 1998. Sequential minimal optimization: A fast algorithm for training support vector machines.

[Rakotomamonjy, Gasso, and Salmon 2019] Rakotomamonjy, A.; Gasso, G.; and Salmon, J. 2019. Screening rules for lasso with non-convex sparse regularizers. arXiv preprint arXiv:1902.06125.

[Shen et al. 2003] Shen, X.; Tseng, G. C.; Zhang, X.; and Wong, W. H. 2003. On -learning. Journal of the American Statistical Association 98(463):724–734.

[Sriperumbudur and Lanckriet 2009] Sriperumbudur, B. K., and Lanckriet, G. R. 2009. On the convergence of the concave-convex procedure. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, 1759–1767. Curran Associates Inc.

[Vapnik 2013] Vapnik, V. 2013. The nature of statistical learning theory. Springer science & business media.

[Wang et al. 2013] Wang, J.; Zhou, J.; Wonka, P.; and Ye, J. 2013. Lasso screening rules via dual polytope projection. In

Advances in Neural Information Processing Systems, 1070– 1078.

[Wang et al. 2014] Wang, J.; Zhou, J.; Liu, J.; Wonka, P.; and Ye, J. 2014. A safe screening rule for sparse logistic regression. In Advances in neural information processing systems, 1053–1061.

[Wang, Wonka, and Ye 2014] Wang, J.; Wonka, P.; and Ye, J. 2014. Scaling svm and least absolute deviations via exact data reduction. In International conference on machine learning, 523–531.

[Wu and Liu 2007] Wu, Y., and Liu, Y. 2007. Robust trun- cated hinge loss support vector machines. Journal of the American Statistical Association 102(479):974–983.

[Xiang, Wang, and Ramadge 2016] Xiang, Z. J.; Wang, Y.; and Ramadge, P. J. 2016. Screening tests for lasso problems. IEEE transactions on pattern analysis and machine intelligence 39(5):1008–1027.

[Xu and Ramadge 2013] Xu, P., and Ramadge, P. J. 2013. Three structural results on the lasso problem. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 3392–3396. IEEE.

[Xu, Crammer, and Schuurmans 2006] Xu, L.; Crammer, K.; and Schuurmans, D. 2006. Robust support vector machine training via convex outlier ablation. In AAAI, volume 6, 536–542.

[Yu et al. 2019] Yu, S.; Gu, B.; Ning, K.; Chen, H.; Pei, J.; and Huang, H. 2019. Tackle balancing constraint for incremental semi-supervised support vector learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1587–1595. ACM.

[Yuille and Rangarajan 2003] Yuille, A. L., and Rangarajan, A. 2003. The concave-convex procedure. Neural computation 15(4):915–936.

[Zimmert et al. 2015] Zimmert, J.; de Witt, C. S.; Kerg, G.; and Kloft, M. 2015. Safe screening for support vector machines. In NIPS 2015 Workshop on Optimization in Machine Learning (OPT).