Quick sensitivity analysis for incremental data modification and its application to leave-one-out CV in linear classification problems

2015·arXiv

Abstract

1 Introduction

Given a large number of training instances, the initial training cost of a classifier such as logistic regression (LR) or support vector machine (SVM) would be quite expensive. In principle, there is no simple way around this initial training cost except when suboptimal approximate classifiers (e.g., trained by using random small sub-samples) are acceptable. Unfortunately, such an initial training cost is not the only thing we must care about in practice. In many practical data engineering tasks, the training set with which the initial classifier was trained might be slightly modified. In such a case, it is important to check the sensitivity of the classifier, i.e., how the results would change when the classifier is updated with the slightly modified training set.

Machine learning algorithms particularly designed for updating a classifier when a small number of instances are incrementally added or removed are called incremental learning [3]. For example, when a single instance is added or removed, the solution of a linear predictor can be efficiently computed (see, e.g., [8]). Incremental learning algorithms for SVMs and other related learning frameworks have been intensively studied in the literature [3, 7, 17, 12, 11, 13]. Even for problems whose explicit incremental learning algorithm does not exist, warm start approach, where the original optimal solution is used as an initial starting point for the updating optimization problem, is usually very helpful for reducing incremental learning costs [5, 18].

However, the computational cost of incremental learning is still very expensive if the original training set is large. Except for special cases, any incremental learning algorithms must go through the entire training data matrix at least once, meaning that the complexities depend on the entire training set size. When only a small number of instances are modified, spending a great amount of computational cost for re-optimizing the classifier does not seem to be a well worthy effort because inference results on the updated classifier would not be so different from the original ones. Furthermore, in practical applications, it might be computationally intractable to completely update the classifier every time there is a tiny modification of the training set. In such a situation, it would be nice if we could quickly check the sensitivity of the classifier without actually updating it. Unless the sensitivity is unacceptably large, we might want to use the original classifier as it is.

Our key observation here is that the goal of sensitivity analysis is not to update the classifier itself, but to know how much the results of our interest would change when the classifier is updated with the slightly modified training set. Suppose, for example, that we have a test instance. Then, we would be interested in whether there is a chance that the class label of the test instance could be changed by a minor data modification or not. In order to answer such a question, we propose a novel approach that can quickly compute the sensitivity of a quantity depending on the unknown updated classifier without actually re-optimizing it.

In this paper we study a class of regularized linear binary classification problems with convex loss. We propose a novel framework for this class of problems that can efficiently compute a lower and an upper bounds of a general linear score of the updated classifier. Specifically, denoting the coefficient vector of the updated linear classifier as , our framework allows us to obtain a lower and an upper bounds of a general linear score in the form of , where is an arbitrary vector of the appropriate dimension. An advantage of our framework is that the complexity of computing the bounds depends only on the number of updated instances, and does not depend on the size of the entire training set. This property is quite advantageous in a typical sensitivity analysis where only a small number of instances are updated.

Bounding a linear score in the form of is useful in a wide range of sensitivity analysis tasks. First, by setting , where is a vector with all 0 except 1 in the position, we can obtain a lower and an upper bounds of each coefficient = 1, . . . , d, where d is the input dimension. Another interesting example is the case where , where x is a test instance of our interest. Note that, if the lower/upper bound of is positive/negative, then we can make sure that the test instance is classified as positive/negative, respectively. It means that the class label of a test instance might be available even if we do not know the exact value of .

To the best of our knowledge, there are no other existing studies on sensitivity analysis that can be used as generally as our framework. However, there are some closely-related methods designed for particular tasks. One such example that has been intensively studied in the literature is leave-one-out cross-validation (LOOCV). In each step of an LOOCV, a single instance is taken out from the original training set, and we check whether the left-out instance is correctly classified or not by using the updated classifier. This task exactly fits into our framework because we are only interested in the class label of the left-out instance, and the optimal updated classifier itself is not actually required. Efficient LOOCV methods have been studied for SVMs and other related learning methods [9, 10, 20, 23]. Some of these existing methods are built on a similar idea as ours in the sense that the class label of a left-out test instance is efficiently determined by computing bounds of the linear score . The bounds obtained by our proposed framework are different from the bounds used in these existing LOOCV methods. We empirically show that LOOCV computation algorithm using our framework is much faster than existing methods.

The bound computation technique we use here is inspired from recent studies on safe feature screening, which was introduced in the context of sparse feature modeling [6]. It allows us to identify sparse features whose coefficients turn out to be zero at the optimal solution. The key idea used there is to bound the Lagrange multipliers before actually solving the optimization problem for model fitting. The idea of bounding the optimal solution without actually solving the optimization problem has been recently extended to various directions [6, 22, 15, 14, 21, 16]. Our main technical contribution in this paper is to bring this idea to sensitivity analysis problems and develop a novel framework for efficiently bounding general linear scores with the cost depending only on the number of updated instances.

The rest of the paper is organized as follows. In 2, we describe the problem setup and present three sensitivity analysis tasks that our framework can be applied to. In 3 we present our main result which enables us to compute a lower and an upper bounds of a general linear score with the computational cost depending only on the number of updated instances. In addition, we apply the framework to the three tasks described in 2. In 4, we discuss how to tighten the bounds when the bounds provided by the framework are not sufficiently tight for making a desired inference. 5 is devoted for numerical experiments. 6 concludes the paper and discuss a few future directions of this work. All the proofs are presented in Appendix A.

2 Preliminaries and basic idea

In this section we first formulate the problem setup and clarify the difference between the proposed framework and conventional incremental learning approaches. Then, we discuss three sensitivity analysis tasks in which the proposed framework is useful.

2.1 Problem setup

In this paper we study binary classification problems. We consider an incremental learning setup, where we have already trained a classifier by using a training set, and then a small number of instances are added to and/or removed from the original training set. The goal of conventional incremental learning problems is to update the classifier by re-training it with the updated training set. Hereafter, we denote the original and the updated training sets as and , respectively, where and are the set of indices of the instances in old and new training sets with the sizes := and := , respectively. The input is assumed to be d-dimensional vector and the class label takes either 1 or +1. We denote the set of added and removed instances as and , where and are the set of indices of the added and removed instances with the sizes := |A| and := |R|, respectively. Note that, if one wants to modify an instance in the training set, one can first remove it and then add the modified one.

We consider a linear classifier in the form of

where the classifier predicts the class label ˆ+1} for the given input , while is a vector of classifier’s coefficients. In this paper we consider a class of problems represented as a minimization of an

regularized convex loss. Specifically, the old and the new classifiers are defined as

and

where the first and the second terms of the objective function represent an empirical loss term and an -regularization term, respectively, and 0 is a regularization parameter that controls the balance between these two terms. We assume that ) is differentiable and convex with respect to the second argument.

Examples of such a loss function includes logistic regression loss

and -hinge loss

For any and any , we denote the gradient of the individual loss as

Our main interest is in the cases where the number of added instances and removed instances are both much smaller than the entire training set size or . In such a case, we expect that the difference between and is small. However, if the training set size is large, solving the optimization problem (2) by using an incremental learning algorithm is still very expensive because any incremental learning algorithms require working through the entire training data matrix at least once, meaning that the complexity of such an incremental learning is at least ).

Our approach is different from conventional incremental learning. In this paper we propose a novel framework that enables us to make inferences about the new solution without actually solving the optimization problem (2). The proposed framework can efficiently compute a lower and an upper bounds

of, what we call, a linear score

where is an arbitrary vector of dimension d. An advantage of this framework is that the computational cost of computing these bounds depends only on the number of updated instances and does not depend on the entire training set size or , i.e., the complexity is ). This property is quite advantageous in a typical sensitivity analysis where is much larger than . These bounds are computed based on the old optimal solution . We denote the lower and the upper bounds as

The proposed framework can be kernelized for nonlinear classification problems if the inner products and ) for any can be represented by the kernel function.

In the following three subsections, we discuss three sensitivity analysis tasks in which the above proposed framework might be useful.

Figure 1: Examples of coefficient bounds ) and [5], for an artificial toy dataset with = 1000 and d = 5. The blue, red and pink error bars indicate the bounds when = 1 (0.1%), 5 (0.5%), and 10 (1%), respectively. The unknown true coefficients [5], are indicated by .

2.2 Sensitivity of coefficients

Let ], be a vector of all 0 except 1 in the element. Then, by setting := in (5), we can

compute a lower and an upper bounds of the new classifier’s coefficient = ] such that

Figure 1 illustrates such coefficient’s bounds for a simple toy dataset. Given a lower and an upper bounds of the coefficients, we can, in principle, obtain the bounds of any quantities depending on . Bounding the largest possible change of the new classifier’s coefficients or a quantity depending on it would be beneficial for making decisions in practical tasks.

2.3 Sensitivity of class labels

Next, let us consider sensitivity analysis of the new class label for a test instance , i.e., we would like

to know

By setting := x in (5), we can compute a lower and an upper bounds such that

Here, it is interesting to note that, using the following simple facts:

Figure 2: Examples of test instance score bounds ) and ) for 10 test instances in the same dataset as in Figure 1. The blue, red and pink error bars indicate the bounds when = 1 (0.1%), 5 (0.5%), and 10 (1%), respectively, and the unknown true scores are indicated by . Note that, except for the 2nd and the 3rd test instances with = 10 (pink), the signs of the lower and the upper bounds are same, meaning that the class labels of these test instances are immediately available without actually updating the classifier.

the class label ˆy can be available without actually obtaining if the bounds are sufficiently tight such that the signs of the lower and the upper bounds are same. If the number of updated instances is relatively smaller than the entire training set size or , we expect that the two solutions and would not be so different. In such cases, as we demonstrate empirically in 5, the bounds in (6) are sufficiently tight in many cases. Figure 2 illustrates the tightness of the bounds in a toy dataset.

2.4 Leave-one-out cross-validation (LOOCV)

One of the traditional problem setups to which our proposed framework can be naturally applied is leave-

one-out cross-validation (LOOCV). The LOOCV error is defined as

as

Here, our idea is to regard the solution obtained by the whole training set as , and as . By

setting := in (5), we can compute the bounds such that

These bounds in (8) can be used to know whether the left out instance is correctly classified or not. If the lower bound is positive, the left-out instance will be correctly classified, while it will be mis-classified if the upper bound is negative.

Using (8), we can also obtain the bounds on the LOOCV error itself:

where ) is the indicator function. In numerical experiments, we illustrate that this approach works quite well.

3 Quick sensitivity analysis

In this section we present our main results on our quick sensitivity analysis framework. The following theorem tells that we can compute a lower and an upper bounds of a general linear score by using the original solution .

The proof is presented in Appendix A.

An advantage of the bounds in (10) is that the computational complexity does not depend on the total number of instances, but only on the number of modified instances. It is easy to confirm that the main computational cost of these bounds is in the computation of in (9), and its complexity is ).

The tightness of the bounds, i.e., the difference between the upper and the lower bounds is written as

In a typical sensitivity analysis where and are much smaller than , the tightness in (11) would be small. Note also that the tightness depends inversely on the regularization parameter . If is very small and close to zero, the bounds become very loose.

3.1 Sensitivity analysis of coefficients

As discussed in 2.2, by substituting := ], into (10), we obtain a lower and an upper bounds of the coefficient of the new classifier.

Note that the third term does not depend on ], i.e., the tightness of the bounds in (12) is common for all the coefficients ].

Given a lower and an upper bounds of the coefficients ], we can obtain the bounds of any quantities depending on . For example, it is straightforward to know how much the classifier’s coefficients can change by the incremental operation when the amount of the change is measured in terms of some norm of .

Some readers might note that a lower and an upper bounds of a general linear score can be

simply obtained by using the bounds of each coefficients ]. Such naive bounds are given as

The tightness of the bounds in (14) is written as

which is clearly much looser than (11). Thus, if the quantity of the interest is written in a linear score form , we should use the bounds in (10) rather than (14).

3.2 Sensitivity analysis of class labels

Next, we use Theorem 1 for sensitivity analysis of new class labels. As discussed in 2.3, for an input vector , we can obtain a lower and an upper bounds of a linear score by setting . From (7), we can know the new class label if the signs of the lower and the upper bounds are same.

Corollary 4 is useful in transductive setups [19] where we are only interested in the class labels of the prespecified set of test inputs.

3.3 Quick leave-one-out cross-validation

In LOOCV, we repeat leaving out a single instance from the training set, and check whether it is correctly classified or not by the new classifier which is trained without the left-out instance. Thus, each step of LOOCV computation can be considered as an incremental operation with = 0 and = 1. Denoting the left-out instance as (], the task is to inquire whether the left-out instance is correctly classified or not, which is known by checking the sign of ) = .

Corollary 5. Consider a single step of LOOCV computation where an instance (], is left

4 Tightening linear score bounds via a suboptimal solution

In the previous section we introduced a framework that can quickly compute a lower and an upper bounds of a linear score of the new classifier. Unfortunately, it is not always the case that these bounds are sufficiently tight for making a desired inference on the new classifier. For example, if the lower and the upper bounds of do not have the same sign for a test input x, we cannot tell which class it would be classified to. In this section we discuss how to deal with such a situation.

The simplest way to handle such a situation is just to use conventional incremental learning algorithms. If we completely solve the optimization problem (2) by an incremental learning algorithm, we can obtain itself. However, if our goal is only to make a particular inference about the new classifier, we do not have to solve the optimization problem (2) completely until convergence. In this section we propose a similar framework for computing a lower and an upper bounds of a linear score by using a suboptimal solution before convergence which would be obtained during the optimization of problem (2).

We denote such a suboptimal solution as ˆ. In order to compute the bounds, we use the gradient

information of the problem (2), which we denote

The complexity of computing the gradient vector from scratch is ). However, if we are using a gradient-based optimization algorithm such as conjugate gradient or quasi-Newton methods, we should have already computed the gradient vector in each iteration of the optimization algorithm. The following theorem provides a lower and an upper bound of a linear score by using the current gradient information. If we already have computed g( ˆ), these bounds can be obtained very cheaply.

The proof is presented in Appendix A.

A nice property of the bounds in (17) is that the tightness

is linear in the norm of the gradient ( ˆ. It means that, as the optimization algorithm for (2) proceeds, the gap between the lower and the upper bounds in (17) decreases, and it converges to zero as

Table 1: Benchmark datasets used in the experiments.

the solution converges to the optimal one. Theorem 6 can be used as a stopping criterion for incremental learning optimization problem (2). For example, in a sensitivity analysis of class labels, one can proceed the optimization process until the signs of the lower and the upper bounds in (17) become same.

5 Numerical experiments

In this section we describe numerical experiments. In 5.1 we illustrate the tightness and the computational efficiency of our bounds in two sensitivity analysis tasks described in 2.2 and 2.3. In 5.2 we apply our framework to LOOCV computation as described in 2.4 and compare its performance with conventional LOOCV computation methods.

Table 1 summarizes the datasets used in the experiments. They are all taken from libsvm dataset repository [4]. For the experiments in 5.1, we used larger datasets D5-D8. For LOOCV experiments in 5.2, we used smaller datasets D1-D4. As examples of the loss function , we used LR loss (3) and SVM loss (4). In 5.1 we only show the results on logistic regression. In 5.2 we compare our results on SVMs with conventional LOOCV methods particularly designed for SVMs. For logistic regression, we only compare our framework with conventional incremental learning algorithm because there is no particular LOOCV computation method for logistic regression. As an incremental learning algorithm, which is used as competitor and as a part of our algorithm for LOOCV computation, we used the approach in [18]. All the computations were conducted by using a single core of an HP workstation Z820 (Xeon(R) CPU E5-2693 (3.50GHz), 64GB MEM).

5.1 Results on two sensitivity analysis tasks

Here we show the results on two sensitivity analysis tasks described in 2.2 and 2.3. We empirically evaluate the tightness of the bounds and the computational costs for larger datasets D5-D8. First, we see how the results change as the number of added and/or removed instances changes among 01%, 0.02%, 0.05%, 0.1%, 0.2%, 0.5%, 1%} of the entire training set size . Next, we see the results when the number of the entire training set size changes among 10%, 20%, . . ., 90%, 99%} of , while the number of added and/or removed instances is fixed to = 0.001.

Algorithm 1 Proposed LOOCV method (op1)

In the first sensitivity analysis task about coefficients (see 2.2 and 3.1), we simply computed the difference between the upper and the lower bounds ) for evaluating the tightness of the bounds. For the second sensitivity analysis task about class labels (see 2.3 and 3.2), we examined the percentage of the test instances for which the signs of the lower and the upper bounds are same. Remember that the class label can be immediately available when the lower and the upper bounds have same sign.

Tables 2 - 7 show the results for the former task. (Figure 3 depicts the results on D8 as an example). These results indicate that the bounds are fairly tight if the is relatively smaller than . The computational costs of our proposed framework (blue thick curves) are negligible compared with the costs of actual incremental learning (red thick curves).

Tables 8 - 13 show the results for the latter task (Figure 4 depicts the results on D8 as an example). The results here indicate that, in most cases, the bounds are sufficiently tight for making the signs of the lower and the upper bounds same. It means that, in most cases, the new class labels after incremental operation are available without actually updating the classifier itself.

The results presented here were obtained with the regularization parameter = 0.01, 0.1, 1. We observed that, for larger , the bounds became tighter. These all experiment results are mean and variance of performing 30 times.

5.2 Leave-one-out cross-validation

We applied the proposed framework to LOOCV task, and compare its computational efficiency with existing approaches. We consider two options. In the first option (op1), we only used the method described in 3.3. In the second option (op2), we also used the method described in 4. Algorithm 1 is the pseudo-code for computing LOOCV errors by using the proposed framework with op1. Briefly speaking, for each of the left-out instance (), we first compute the lower and the upper bounds of ). Then, if the lower/upper bound is greater/smaller than zero, we confirm that the instance is correctly/incorrectly classified. When the signs of the two bounds are different, the class label is unknown. In such a case, we use conventional incremental learning algorithm in [18]. In op1, we ran the incremental learning algorithm until its convergence and obtain itself. In op2, we stopped the optimization process at which the signs of the bounds in (17) became same.

For SVMs, several LOOCV methods have been studied in the literature [19, 10]. For the experiments with SVM loss, we thus compare our approach with the methods in [19] and [10]. The former approach merely exploits the fact that adding and/or removing non-support vectors does not change the classifier. The method called estimator [10] also provides a lower bound of ) without actually obtaining . For the experiments with logistic regression loss, we compare our approaches only with incremental learning approach [18] because there are no other competing methods.

We used the above LOOCV computation methods in model selection tasks for linear and nonlinear classification problems. In linear case, the task is to find the regularization parameter that minimizes the LOOCV error. In nonlinear case, we used Gaussian RBF in the form of ) = exp(), where [100] were randomly selected from []. Here, the task is to select the optimal combination of (that minimizes the LOOCV error.

For further speed-up, we also conducted experiments with two simple tricks. In the first trick we used the lower and the upper bounds of the LOOCV error itself. If the lower bound of one model is greater than the upper bound of another model, the former model would never be selected as the best model, meaning that the LOOCV error computation process can be stopped. The second simple trick is to conduct incremental learning operations in the increasing order of ). It is based on a simple observation that the class label of an instance whose ) value is small tends to be mis-classified. Note that these two tricks can be used not only for our proposed framework, but also for other competing approaches.

Tables 14 and 15 show the results without and with the tricks, respectively. We see that the computational cost of our proposed framework (especially op2) are much smaller than competing methods. It indicates that our bounds in (15) is tighter in many cases than the existing bounds for LOOCV error computation.

6 Conclusions and future works

In this paper we introduced a novel framework for sensitivity analysis of large scale classification problems. The proposed framework provides a lower and an upper bounds of a general linear score on the updated classifier without actually re-optimized it. The advantage of the proposed framework is that the computational cost only depends on the sizes of the modified instances, which is particularly advantageous in typical sensitivity analysis task where only relatively small number of instances are updated. We discussed three tasks to which the proposed framework can be applied. As a future work, we plan to apply the proposed framework to stream learning.

7 Acknowledgement

For this work, IT was partially supported by JST CREST, MEXT Kakenhi 26106513, and MEXT Kakenhi 26106513.

A Proofs

In this section we prove Theorems 1 and 6. First we present the following proposition.

See, for example, Proposition 2.1.2 in [1] or Section 4.2.3 in [2] for the proof of Proposition 7.

Proof of Theorem 1. From Proposition 7 and the optimality of for the problem (2) 1

From the convexity of ,

By summing up (22) for all ,

By completing the square of (24), we have

Let

Then, (29) is compactly written as

Eq. (30) indicates that the new optimal solution is within a ball with center m and radius r. Thus, we have a lower and an upper bounds of a linear score as follows:

In fact, the solutions of (31) and (32) can be analytically obtained, and thus the lower bound ) and the upper bound ) can be explicitly obtained by using Lagrange multiplier method. Using a

Lagrange multiplier 0, the problem (31) is rewritten as

where is strictly positive because the constraint is strictly active at the optimal solution.

By letting = 0, the optimal is written as

Substituting into ),

Therefore,

The upper bound of in (32) can be similarly obtained as

By substituting m and r in (29) into (33) and (34), we have (10a) and (10b).

Theorem 6 can be shown in a similar way as above.

Proof of Theorem 6. From Proposition 7 and the optimality of ,

From the convexity of ,

Using (36) and (37) and the definition of g( ˆ), the inequality (35) can be rewritten as the following condition on the new optimal solution :

where

Then, the lower and the upper bounds in (17) are obtained by solving the following minimization and

maximization problems

Using the standard Lagrange multiplier method, the solutions of (39) and (40) are analytically obtained as (17a) and (17b), respectively.

References

[1] P D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

[2] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.

[3] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, 2001.

[4] C. Chang and C. Lin. LIBSVM : A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:1–39, 2011.

[5] D. DeCoste and K. Wagstaff. Alpha seeding for support vector machines. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2000.

[6] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization, 2012.

[7] S. Fine and K. Scheinberg. Incremental learning and selective sampling via parametric optimization framework for SVM. In Advances in Neural Information Processing Systems, 2001.

[8] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. Springer, 2001.

[9] T. Jaakkola and D. Haussler. Probabilistic kernel regression models. In International Conference on Artificial Intelligence and Statistics, 1999.

[10] T. Joachims. Estimating the generalization performance of a SVM efficiently. In International Conference on Machine Learning, 2000.

[11] M. Karasuyama and I. Takeuchi. Multiple incremental decremental learning of support vector machine. In Advances in Neural Information Processing Systems, 2009.

[12] P. Laskov, C. Gehl, S. Kr¨uger, and K. R. M¨uller. Incremental support vector learning: analysis, implementation and applications. Journal of Machine Learning Research, 7:1909–1936, 2006.

[13] Z. Liang and Y. Li. Incremental support vector machine learning in the primal and applications. Neurocomputing, 72:2249–2258, 2009.

[14] J. Liu, Z. Zhao, J. Wang, and J. Ye. Safe Screening with Variational Inequalities and Its Application to Lasso. In International Conference on Machine Learning, volume 32, 2014.

[15] K. Ogawa, Y. Suzuki, and I. Takeuchi. Safe screening of non-support vectors in pathwise SVM compu- tation. In International Conference on Machine Learning, 2013.

[16] A. Shibagaki, Y. Suzuki, and I. Takeuchi. Approximately optimal selection of regularization parameters in cross-validation for regularized classifiers. arXiv, 2015.

[17] A. Shilton, M. Palaniswami, D. Ralph, and A. C. Tsoi. Incremental training of support vector machines. IEEE Transactions on Neural Networks, 16:114–131, 2005.

[18] C. H. Tsai, C. Y. Lin, and C. J. Lin. Incremental and decremental training for linear classification. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2014.

[19] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1996.

[20] V. Vapnik and O. Chapelle. Bounds on Error Expectation for Support Vector Machines. Neural Computation, 12:2013–2036, 2000.

[21] J. Wang, J. Zhou, J. Liu, P. Wonka, and J. Ye. A Safe Screening Rule for Sparse Logistic Regression. In Advances in Neural Information Processing Sysrtems, 2014.

[22] Z. Xiang, H. Xu, and P. Ramadge. Learning sparse representations of high dimensional data on large scale dictionaries. In Advances in Neural Information Processing Sysrtems, 2011.

[23] T. Zhang. Leave-one-out bounds for kernel methods. Neural computation, 15:1397–1437, 2003.

Figure 3: Results on sensitivity analysis of coefficients for D8. The tightness of the bounds and the compu- tation time in seconds are plotted for = 0.01 (top), 0.1 (middle), and 1 (bottom).

Figure 4: Results on sensitivity analysis of class labels for D8. The fraction of the test instances whose lower and upper bounds of the decision score have same signs, and the computation time in seconds are plotted for = 0.01 (top), 0.1 (middle), and 1 (bottom).

Table 2: Results on sensitivity analysis of coefficients for various values of (. The tightness of the bounds and the computation time in seconds are listed (= 0.01).

Table 3: Results on sensitivity analysis of coefficients for various values of (. The tightness of the bounds and the computation time in seconds are listed (= 0.1).

Table 4: Results on sensitivity analysis of coefficients for various values of (. The tightness of the bounds and the computation time in seconds are listed (= 1).

Table 5: Results on sensitivity analysis of coefficients for various values of . The tightness of the bounds and the computation time in seconds are listed (= 0.01).

Table 6: Results on sensitivity analysis of coefficients for various values of . The tightness of the bounds and the computation time in seconds are listed (= 0.1).

Table 7: Results on sensitivity analysis of coefficients for various values of . The tightness of the bounds and the computation time in seconds are listed (= 1).

Table 8: Results on sensitivity analysis on class labels for various values of (. The fraction of the test instances whose lower and upper bounds of the decision score have same signs, and the computation time in seconds are listed (= 0.01).

Table 9: Results on sensitivity analysis on class labels for various values of (. The fraction of the test instances whose lower and upper bounds of the decision score have same signs, and the computation time in seconds are listed (= 0.1).

Table 10: Results on sensitivity analysis on class labels for various values of (. The fraction of the test instances whose lower and upper bounds of the decision score have same signs, and the computation time in seconds are listed (= 1).

Table 11: Results on sensitivity analysis on class labels for various values of . The fraction of the test instances whose lower and upper bounds of the decision score have same signs, and the computation time in seconds are listed (= 0.01).

Table 12: Results on sensitivity analysis on class labels for various values of . The fraction of the test instances whose lower and upper bounds of the decision score have same signs, and the computation time in seconds are listed (= 0.1).

Table 13: Results on sensitivity analysis on class labels for various values of . The fraction of the test instances whose lower and upper bounds of the decision score have same signs, and the computation time in seconds are listed (= 1).

Table 14: Computation time [sec] of model selection based on LOOCV (without tricks).

Table 15: Computation time [sec] of model selection based on LOOCV (with tricks).

Designed for Accessibility and to further Open Science