F-SVM: Combination of Feature Transformation and SVM Learning via Convex Relaxation

2015·Arxiv

Abstract

Abstract

The generalization error bound of support vector machine (SVM) depends on the ratio of radius and margin, while standard SVM only considers the maximization of the margin but ignores the minimization of the radius. Several approaches have been proposed to integrate radius and margin for joint learning of feature transformation and SVM classifier. However, most of them either require the form of the transformation matrix to be diagonal, or are non-convex and computationally expensive.

In this paper, we suggest a novel approximation for the radius of minimum enclosing ball (MEB) in feature space, and then propose a convex radius-margin based SVM model for joint learning of feature transformation and SVM classifier, i.e., FSVM. An alternating minimization method is adopted to solve the F-SVM model, where the feature transformation is updated via gradient descent and the classifier is updated by employing the existing SVM solver. By incorporating with kernel principal component analysis, F-SVM is further extended for joint learning of nonlinear transformation and classifier. Experimental results on the UCI machine learning datasets and the LFW face datasets show that F-SVM outperforms the standard SVM and the existing radius-margin based SVMs, e.g., RMM, R-SVMR-SVM

Index Terms—Support vector machine, radius margin bound, convex relaxation, max-margin.

I. INTRODUCTION

SUPPORT vector machine (SVM) and its extensions havebeen one of the most successful machine learning methods [1], [2], and have been adopted in various fields, e.g., computer vision [3], [4], [5], [6], signal processing [7], [8], natural language processing [9], [10] and bioinformatics [11], [12], [13], [14]. Despite its popularity, SVM aims to seek the optimal hyperplane with the maximum margin principle, but the generalization error of SVM actually is a function of the ratio of radius and margin [15]. Given feature space, the radius is fixed and can be ignored, thus SVM can minimize the generalization error by maximizing the margin. However, for joint learning of feature transformation and classifier, the radius information will be valuable and cannot be ignored.

By minimizing the radius-margin ratio, the generalization error of SVM can be optimized for joint learning of feature transformation and classifier. Since the radius-margin error bound is non-convex, relaxation and approximation of radius is generally adopted in the existing models [16], [17]. Several approaches have been proposed from the perspective of radius-margin error [11], [16], [17], [18], but most ones suffer from the limitations of computational burden and simplified forms

X. Wu, W. Zuo and Y. Zhu are with School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.

L. Lin is with School of Advanced Computing, Sun Yat-sen University, Guangzhou, China.

of transformation. RMM [16] only considers the spread of the data along the direction perpendicular to the classification hyperplane. Radius-margin based SVMs, e.g., MR-SVM [11], R-SVM[18] and RSVM[18], are based on the constraint that the linear transformation matrix should be diagonal.

Another strategy for joint feature transformation and clas-sifier learning is to incorporate metric learning with SVM, where metric learning can be adopted to learn a better linear transformation matrix [19], [17], [20], [21], [22], [23], [24], [25], [26]. One simple approach to combine metric learning and SVM is to directly deploy the transformation obtained using metric learning into SVM. This approach, however, usually cannot lead to satisfactory performance improvement [27]. Therefore, other approaches have been proposed to integrate metric learning to SVM, e.g., support vector metric learning (SVML) [27] and metric learning with SVM (MSVM) [17]. But SVML [27] was designed for RBF-SVM and ignored the radius information, while MSVM [17] is non-convex.

In this paper, we propose a novel radius-margin based SVM model for joint learning of feature transformation and SVM classifier, i.e., F-SVM. Compared with the existing radius-margin based SVM methods, we derive novel lower and upper bounds for the relaxation of the radius-margin ratio. Unlike MR-SVM [11], R-SVM[18] and RSVM[18] which are suggested for joint feature weighting and SVM learning, FSVM can simultaneously learn feature transformation L and classifier (w, b). Compared with the existing metric learning for SVM methods, our F-SVM model considers both the radius and the margin information, and is convex. Then, an alternating minimization algorithm is proposed to solve our FSVM model, which iterates by updating feature transformation and classifier alternatively. Note that kernel SVM is equivalent to perform linear SVM in the kernel PCA space. We further suggest to conduct linear FSVM in the kernel PCA space for joint learning of nonlinear transformation and classifier. The contribution of this paper is of three-fold:

• A novel convex formulation of radius-margin based SVM model, i.e., F-SVM, is proposed. Unlike MR-SVM [11], R-SVM[18] and RSVM[18], our F-SVM is capable of joint learning feature transformation and classifier, and is robust against outliers. Experimental results show that F-SVM outperforms SVM and the existing radius-margin based SVMs.

• In F-SVM, we derive the lower and upper bounds for the radius of minimum enclosing ball (MEB) in feature space, resulting in a novel approximation of the radius. Compared with the approximations proposed in [18], ours is much simple and can be easily adopted in developing

• An alternating minimization algorithm is developed for solving F-SVM via iterating between gradient descent and SVM learning. Therefore, the off-the-shelf SVM solvers can be employed to improve the computational efficiency. Moreover, a semi-whitened PCA method is developed for the initialization of .

The remainder of the paper is organized as follows: Section 2 reviews the related work on the radius-margin ratio based bounds and their applications. Section 3 describes the model and algorithm of the proposed F-SVM method. Section 4 extends F-SVM to the kernelized version for nonlinear clas-sification. Section 5 provides the experimental results on the UCI machine learning datasets and the LFW dataset. Finally, we conclude the paper in Section 6.

II. RELATED WORK

The radius-margin bound not only provides theoretical explanation on the generalization performance of SVM [1], but also has been extensively adopted for improving kernel classification methods, e.g., model selection [28], [29], multiple kernel learning (MKL) [30], [31], [32], [33], and mapping of nominal attributes [34]. Denote a training set by and a feature space by . In [30], [31], the radius R of minimum enclosing ball (MEB) in feature space is computed as:

Assuming that the training set is separable, given the optimal hyperplane (w, b), Vapnik [1] suggested a radius-margin error bound which showed that the expectation of the misclassifica-tion probability depends on .

The standard SVM is known as a max-margin method which only considers the margin in the algorithm. When the feature space is fixed, the radius is a constant and can thus be ignored. But in many classification tasks, the model parameters [28], combination of basis kernels [31], feature reweighting or transformation [12] usually should be learned or tuned based on the training data, where integration of radius has been demonstrated to be very effective in improving the classification performance. In model selection, radius-margin bound has been applied for choosing tradeoff parameter and scaling factors of SVM and -SVM [28]. In multiple kernel learning (MKL) [30], [31], [32] and feature reweighting [12], several variants of radius had been developed.

This paper aims to jointly learn SVM together with feature transformation by minimizing the radius-margin ratio, i.e., radius-margin based SVM, and more detailed review is given on this topic. Except [17], most existing approaches [11], [18], [21] require the transformation matrix to be diagonal, i.e., feature reweighting and selection. Direct use of radius-margin ratio in SVM results in a non-convex optimization problem, which makes the learning algorithm computationally expensive and unstable. By restricting the feature transformation to be diagonal with , Do et al. [11] suggested that the radius is bounded with , where is the radius on dimension k. By approximating with its upper bound , MR-SVM in [11] solved the following convex relaxation problem:

Denote by the half value of the maximum pairwise distances. Do et al. in [18] introduced a tighter bound of the radius and proposed another convex model R-SVM:

Furthermore, R-SVMwas developed in [18] by controlling both the radius and margin with w.

Rather than feature reweighting and selection, Zhu et al. [17] proposed a metric learning with SVM (MSVM) method for joint learning of the linear transformation and SVM classi-fier. In [17], given the transformation matrix A, an alternative of the radius R was adopted, where is the mean of the training samples. Although Zhu et al. [17] claimed that , as demonstrated in Theorem 1 of this work, is an upper bound of R. The MSVM model in [17] was formulated as:

Note that MSVM is non-convex and solved using gradient projection.

In this paper, we propose a novel relaxed convex model of radius-margin based SVM, i.e., F-SVM, for joint learning of feature transformation and SVM classifier. Compared with the existing radius-margin based SVM methods, F-SVM has some distinguishing features. Our F-SVM model is convex, while MSVM [17] is non-convex. Unlike RMM [16], the transformation in F-SVM is learned to minimize the radius of the enclosing ball of all samples rather than to only shrink the sample span along the direction perpendicular to the hyperplane. Moreover, F-SVM is also different with MR-SVM [11], R-SVM[18] and R-SVM[18] from three aspects: (i) Instead of feature reweighting and selection, F-SVM can learn feature transformation and classifier simultaneously; (ii) F-SVM adopts a new approximation for the radius of MEB in feature space; (iii) In F-SVM, individual inequality constraints are combined into one holistic inequality constraint to improve the robustness against outliers. All these make F-SVM very promising for joint learning of feature transformation and SVM classifier, and the results also validate the effectiveness of F-SVM.

III. RADIUS-MARGIN BASED SUPPORT VECTOR MACHINE

A. Problem Formulation

Denote by a training set, where and denote the ith training sample and the corresponding class label, respectively. By introducing the slack variables , SVM aims to find the optimal separating hyperplane by solving the following optimization problem:

where (u, b) are the parameters to describe the learned hyperplane denotes the ith slack variable, and C stands for the tradeoff parameter. The objective function in Eq. (5) aims to maximize the margin while minimizing the empirical risk . For joint learning, we introduce a linear transformation matrix A and integrate the radius information, resulting in the following radius-margin based SVM model:

where the radius R is defined as:

Note that depends on matrix A and the problem in Eq. (6) is non-convex [18]. Denote by the center of all instances, and by the largest squared distance from the center in transformed feature space. Let and . We prove that the radius R is bounded by .

Please refer to Appendix A for the proof of Theorem 1. In [17], Zhu et al. claimed that . From Theorem 1, is only an approximation of R, and counter examples can be easily found to illustrate . Let and . Since the radius R is upper bounded by , we can approximate R with . With simple algebra, the radius-margin SVM model in Eq. (6) is relaxed into the following formulation:

Theorem 2. The problem in Eq. (9) is equivalent with the following problem:

Proof. Denote by the optimal solution to the problem in Eq. (9). Let and . It is obvious to see that is also the optimal solution to the problem in Eq. (9) because .

Next we will show that is the optimal solution to the problem in Eq. (10). If is not the optimal solution to Eq. (10), there must exist some that satisfies all inequality constraints and . Then we can define 1 and have , which is contradictory with the assumption that is the optimal solution to Eq. (9). Thus, we can solve the problem in Eq. (10) with the optimal solution , and then obtain the optimal solution to Eq. (9).

Without loss of generality, we assume and seek the corresponding optimal w and M by solving Eq. (10). Moreover, to make the model robust against outliers and noisy samples, we combine the individual inequality constraints into one integrated inequality constraint [32], resulting in the following radius-margin based SVM model:

By defining the scattering matrix of the training set S = , based on the Lagrangian multiplier method [35], the problem expressed in Eq. (11) can be equivalently reformulated as the following F-SVM model:

where is the regularization parameter determined by . In the following, we prove that our F-SVM model is convex.

Theorem 3. The F-SVM model is a convex optimization problem. The proof can be found in Appendix B.

Fig. 1. Illustration of the alternating minimization algorithm for F-SVM. First, a semi-whitened based initialization scheme is proposed to initialize M. Then, the alternating minimization algorithm is adopted by updating (w, b) and M alternatively. When the algorithm converges, as shown in Fig. 1, we can learn both a better matrix M to reduce the radius of MEB in feature space and a max-margin classifier (w, b).

B. Alternating minimization

In this section, we propose an efficient alternating minimization algorithm to solve the proposed F-SVM model, as illustrated in Fig. 1. First, a semi-whitened PCA based initialization scheme is proposed to initialize M. Then, the alternating minimization algorithm is adopted by updating (w, b) and M alternatively. Fixing M, the model can be reformulated as the SVM in the transformed feature space, and be solved using the off-the-shelf SVM solvers to update (w, b). Fixing (w, b), the gradient descent method [35] is adopted to update the matrix M. When the algorithm converges, as shown in Fig. 1, we can learn both a better matrix M to reduce the radius of MEB in feature space and a max-margin classifier (w, b). The alternating optimization procedure is summarized in Algorithm 1.

Algorithm 1 The alternating minimization algorithm for FSVM

1) Initialization of M: Because the proposed F-SVM model is convex, alternating minimization can converge to global optimum for any initialization of M and (w, b), but proper initialization is helpful in improving the computational effi-ciency. Thus, by further relaxing the F-SVM model in Eq. (12), we propose a semi-whitened PCA based initialization method on M.

Note that is upper bounded by [36]:

where and denote the L-norm and the nuclear norm of a positive semi-definite matrix A, respectively. Based on Eq. (12) and Eq. (13), by setting , the subproblem on M can be rewritten as the problem on B formulated

as:

where . The eigenvalue decomposition of S is , where and the ith column of U denote the ith eigenvalue and eigenvector, rsespectively. With U and , we define as:

Theorem 4. Given a SPD matrix S and , defined in Eq. (15) is the optimal solution to the problem:

The proof can be found in Appendix C. With , the initialization of M in Eq. (12) is then defined as:

Noted that we assume that is known for the initialization of M. From Eq. (17), only affects the scale factorto the linear transformation. Thus, we simply let in our implementation.

It is interesting to point out that in Eq. (17) implies a semi-whitening PCA transformation because , where the linear transformation can then be defined as . In literature [37], was called the PCA whitening transformation, and data whitening has been widely exploited in many applications, e.g., face recognition, object detection, and image classification [38], [39], [40], [41]. Considering its connection with data whitening, semi-whitening is also expected to be effective in other tasks and applications.

2) The subproblem on (w, b): Given M, the F-SVM model can be formulated as:

where . The eigenvalue decomposition of B is B = . By introducing , the transformation matrix L can be rewritten as to . Let and v = Lw. With simple algebra, the problem in Eq. (18) can be reformulated as:

which can be solved using the off-the-shelf SVM solvers. Given the solution can then be obtained.

3) The subproblem on M: Given (w, b), the sub-problem on M can be reformulated as:

Since the objective function in Eq. (20) is convex and differentiable with respect to M, the gradient projection method [35] is adopted to update M. According to [42], the gradient of f(M) can be obtained by:

As presented in Algorithm 1, we use gradient projection

to update M by choosing proper stepsize t and gradually decreasing it along with iterations, where projects a matrix onto the cone of positive semidefinite matrices.

C. Discussion

The proposed F-SVM method has several interesting advantages while compared with the other radius-margin based SVMs, e.g., RMM [16], MR-SVM [11], R-SVM[18], RSVM[18] and M-SVM [17]. RMM [16] is suggested to max- imize the margin while restricting the spread of the data along the direction perpendicular to the separating hyperplane, while our F-SVM is proposed to minimize the convex relaxation of the radius-margin ratio. The generalization error is bounded by the radius and margin ratio, and the radius is determined by the spread along all possible directions rather than only the direction perpendicular to the separating hyperplane, making F-SVM theoretically more promising.

MR-SVM [11], R-SVM[18] and R-SVM[18] aim to learn the diagonal feature transformation with , while F-SVM is developed for joint learning of feature transformation and SVM classifier. Both R-SVMand R-SVMneed to solve a Quadratically Constrained Quadratic Programming (QCQP) optimization problem, which is computationally expensive than the alternating minimization method used in our F-SVM. Moreover, R-SVMand RSVMadopted a tighter approximation of the radius. In F- SVM, a new approximation of the radius is proposed, which is also tighter than that used in MR-SVM [11]. Moreover, the individual inequality constraints on are combined to improve the robustness against outliers. It is interesting to note that we have:

where . Eq. (23) indi- cates that, if all the inequality constraints on , i.e., , are combined into one integrated inequality constraint:

Let and . One can see that the integrated inequality constraint will be equivalent with that adopted in Eq. (11).

MSVM [17] was developed for simultaneous learning of the linear transformation and SVM classifier, but the MSVM model is non-convex and solved using gradient projection. Moreover, although Zhu et al. [17] claimed that , as discussed in Section 3.1, is only a lower bound of R and counter examples can be easily found to illustrate . Compared with MSVM [17], the F-SVM model is convex and robust against noise and outliers, and can be efficient solved using the optimization method introduced in Section 3.2.

IV. KERNELIZATION OF F-SVM

With the incorporation of kernel principal component analysis, linear F-SVM can be extended to kernel version for nonlinear classification. First, we show that kernel SVM is equivalent to perform linear SVM in the kernel PCA space. Then, kernel F-SVM is introduced by conducting linear FSVM in the kernel PCA space.

Let the kernel function be , where defines an implicit mapping from the data space to high or infinite dimensional feature space. For the training set , we use to denote all the PCA eigenvectors corresponding to positive eigenvalues. Let be a set of basis vectors in the complementary space of W. Assuming the training set is centered, for any , we have , and thus can get:

TABLE I SUMMARY OF THE UCI DATASETS USED IN THE EXPERIMENTS.

Let . The dual problem of SVM in the kernel PCA space can be formulated as:

where . Therefore, kernel SVM is equivalent to performing linear SVM in the kernel PCA space. To extend F-SVM to its kernelized version, we first project each training sample to the kernel PCA space , and then solve the following F-SVM model:

where . Algorithm 1 can be adopted to solve the model in Eq. (27). In our implementation, instead of using all the eigenvectors, we also consider to employ the PCA eigenvectors corresponding to first d largest eigenvalues, i.e., , and Section 5.2 reports the empirical result on the influence of d on classification accuracy.

V. EXPERIMENTS

In this section, we use both the UCI machine learning datasets and the Labeled Faces in the Wild (LFW) database to evaluate the proposed F-SVM method, and compare our FSVM with the competing methods, including SVM and several representative radius-margin based SVM methods, i.e., RMM [16], R-SVM[18] and R-SVM[18]. MR-SVM [11] and MSVM [17] are not considered in our experiments because their source codes are not publicly available. The 10-fold cross validation (CV) is adopted to determine the optimal values of hyper-parameters for each method. The mean classification accuracy is adopted by averaging the 100 runs of the 10-fold CV. The methods are evaluated by two performance indicators: accuracy and training time (seconds, s).

A. Evaluation on linear F-SVM using the UCI datasets

We evaluate the performance of linear F-SVM on the 11 datasets from the UCI machine learning repository, where the

TABLE II COMPARISON OF THE AVERAGE CLASSIFICATION ACCURACY (%) OF LINEAR SVM, LINEAR RMM [16], LINEAR R-SVMR-SVMAND LINEAR F-SVM.

TABLE III COMPARISON OF THE TRAINING TIME (RMM [16], LINEAR R-SVMAND LINEAR F-SVM.

reason to choose them is that they had been widely adopted for evaluating SVM and kernel methods [43], [44], [45]. Table I provides a brief summary of these UCI datasets, which includes 6 2-class problems and 5 multi-class problems. Tables II and III list the mean classification accuracy and training time of five linear classifiers, i.e., linear SVM, linear RMM [16], linear R-SVM[18], linear R-SVM[18], and linear F- SVM. RMM [16], R-SVM[18], R-SVM[18], and F-SVM consider both margin and radius information, while SVM only considers margin. As shown in Table II, the radius-margin based SVM methods generally outperform SVM in terms of classification accuracy, which indicates that the incorporation of radius can improve the classification performance. As listed in Table III, the training time of SVM is much less than the other four methods, indicating that the introduction of radius makes the model more complex to train.

We further compare linear F-SVM with the competing methods. From Table II, F-SVM achieves higher classification accuracy than SVM on 9, RMM [16] on 7, R-SVM[18] on 7, and R-SVM[18] on 8 of the 11 datasets. The better classification accuracy of our F-SVM should be attributed to that: (i) compared with SVM, F-SVM incorporates radius in the convex model; (ii) unlike RMM [16], our F-SVM considers the spread along all directions rather than only the direction perpendicular to the separating hyperplane. (iii) instead of feature reweighting and selection in R-SVM[18] and R-SVM[18], general linear transformation is learned in F-SVM. To improve the efficiency of F-SVM in training, we adopt the warm-start strategy, where the solution (w, b) of the previous iteration is used as the initialization of the next iteration. From Table III, one can see that our F-SVM is only a little slower than SVM, but is much more efficient than the other competing methods in training. F-SVM is about times faster than RMM [16], R-SVMand R-SVMin [18]. In summary, F-SVM obtains the best classification accuracy among all competing methods, and is more efficient in training than the other radius-margin based SVM methods.

TABLE IV COMPARISON OF THE AVERAGE CLASSIFICATION ACCURACY (%) OF KERNEL SVM, KERNEL RMM [16], KERNEL R-SVMR-SVMAND KERNEL F-SVM.

B. Evaluation on kernel F-SVM using the UCI datasets

In this subsection, we evaluate the performance of kernel F-SVM on the 11 UCI datasets. The Gaussian RBF kernel is adopted in our experiments, which includes an extra kernel parameter . As discussed in Section 4, we also consider the number of kernel PCA components in kernel F-SVM. Using four datasets, i.e., Breast cancer, Thyroid, Heart, and German, Fig. 2 illustrates the classification accuracy of kernel SVM and kernel F-SVM under different kernel PCA dimensions. It is interesting to note that, the proper decreasing of kernel PCA dimension can consistently improve the classification accuracy both for kernel SVM and kernel F-SVM. Also from Fig. 2 one can see that the kernel F-SVM is superior to kernel SVM under different dimensions. One possible explanation may be that the decreasing of kernel PCA dimension would make the learned transformation more stable.

Tables IV and V list the mean classification accuracy and training time of five linear classifiers, i.e., kernel SVM, kernel RMM [16], kernel R-SVM[18], kernel R-SVM[18], and kernel F-SVM. For kernel methods, the superiority of F-SVM against the competing methods is more significant. The Kernel F-SVM outperforms kernel SVM on 10, kernel RMM [16] on 9, kernel R-SVM[18] on 11, and kernel R-SVM[18] on 11 of all the 11 datasets in terms of classification accuracy. By training time, the kernel F-SVM is a little slower than SVM, but is about times faster than the other competing methods.

C. Results on the LFW Database

In this subsection, the LFW database is used to evaluate FSVM for face verification. The database consists of more than 13,233 face images from 5,749 persons. The face images in the LFW database were collected from the Internet, and vary

TABLE V COMPARISON OF THE TRAINING TIME (RMM [16], KERNEL R-SVMKERNEL F-SVM.

Fig. 2. Classification accuracy (%) of kernel SVM and kernel F-SVM under different kernel PCA dimensions.

in pose, illumination, expression, and age, making LFW very suitable for studying unconstrained face verification. The face recognition method can be evaluated with two test protocols for LFW: the restricted and the unrestricted settings. Under the restricted setting, the only available information is whether each pair of training images is matched or not, and the performance of the face verification method is evaluated by 10-fold cross validation on a set of 3000 positive and 3000 negative image pairs.

In our experiment, we adopt the restricted setting with the face images aligned by the funneling method [46]. Fig. 3 shows some examples of similar and dissimilar pairs. We extract two kinds of features for each face image: SIFT feature and attribute feature, and compare F-SVM with SVM, RMM [16], R-SVM[18], and R-SVM[18], and several represen- tative face verification methods, including LDML [24], Nowak [47], V1-like/MKL [48], and MERL+Nowak [49].

Fig. 4 shows the verification accuracies of SVM and F-SVM under different PCA dimensions by using the attribute feature and the combined features of SIFT and attributes, respectively. F-SVM using the combined features of SIFT and attributes (F-SVM-combined) achieves its best performance of 83.25% when the dimension d = 300, and 82.58% when the dimension d = 73 using the attribute features (F-SVM-attribute). SVM using the combined features of SIFT and attributes (SVMcombined) achieves its best performance of 81.90% when the

Fig. 3. Examples of face image pairs in the LFW database.

dimension d = 400, and 80.12% when the dimension d = 73 using the attribute features (SVM-attribute). Thus, F-SVM can get better accuracy than SVM on the LFW database.

We further compare F-SVM with several other face verifi-cation methods. Table 6 lists the accuracy of F-SVM, SVM, RMM [16], R-SVM[18], R-SVM[18], LDML [24],Nowak [47], V1-like/MKL [48], and MERL+Nowak [49]. We report the accuracy of F-SVM, SVM, RMM [16], R-SVM[18], RSVM[18], LDML [24] using the attribute and the SIFT fea- tures, report the accuracy of Nowak [47] and MERL+Nowak [49] using SIFT and geometry feature, and report the accuracy of V1-like/MKL [48] using the V1-like features. For either the combined features of SIFT and attributes or the attribute features, F-SVM achieves higher accuracy than SVM, RMM [16],R-SVM[18], R-SVM[18], LDML [24] separately from Table VI. Fig. 5 shows the ROC curves of the competing methods. Also one can see that F-SVM-combined gets better performance than other face verification methods.

VI. CONCLUSION

In this paper, we proposed a convex radius-margin based SVM model (F-SVM) for joint learning of feature transformation and SVM classifier. For the formulation of F-SVM, lower and upper bounds of the radius of MEB are introduced to derive a novel approximation of radius-margin ratio, and all the individual inequality constraints are combined into one integrated inequality constraint, resulting in a convex relaxation of the radius-margin based SVM model. For model optimization, a semi-whitened PCA based method is proposed for the initialization of the learned transformation, and an alternating minimization algorithm is adopted to learn the

Fig. 4. Examples of face image pairs in the LFW database.

Fig. 5. ROC curves of different face verification methods on the restricted LFW-funneled database.

TABLE VI COMPARISON OF ACCURACY OBTAINED USING DIFFERENT FACE VERIFICATION METHODS. THE TOP TWO RESULTS ARE SHOWN IN RED AND BLUE FONTS, RESPECTIVELY.

feature transformation and SVM classifier. Further, F-SVM is kernelized by using kernel PCA. Our experimental results show that, F-SVM obtains higher classification accuracy than SVM and the state-of-the-art radius-margin based SVM methods, and is more efficient in training than the other radius-margin based SVM methods, e.g., RMM [16], R-SVM[18] and R-SVM[18]. In our future work, we will extend the proposed relaxed radius-margin based error bound to other classification methods and extend the proposed model for learning other forms of feature transformation tailored for specific applications.

APPENDIX A

Proof. Based on the definition of the radius, we get

Denote by the maximum pairwise distance. We have

Proof. Let . We have . Based on the definition of

we will prove that there exist some which makes . Based on the definition of , we have . Then, we derive

Since and , one can easily see that

Based on the definition of :

Combining the two inequalities above, we can prove .

Finally, by combining Lemmas 13, we obtain the following theorem: Theorem A.1. The margin R is bounded by by:

APPENDIX B

Lemma B.1 [42]. Given two symmetric positive definite (SPD) matrices A and B, we have

Theorem B.2. The problem Eq. (12) is a convex optimization problem.

Proof. Note that all the constraints define a convex set, and and tr (MS) are linear to and M, respectively. Then the key step is to prove that the function is convex for , i.e., for any ,

Note that contains three terms:

First, we have

and then we get

Analogously, we get

With Eq. (30), we have

Combining Eqns. (29)-(32), we get

Thus, the F-SVM model is convex.

APPENDIX C

Theorem 4. Given a SPD matrix S and , defined in Eq. (17) is the optimal solution to the problem:

Proof. is strictly convex with respect to B [28]. Given , we have:

where is the eigenvalue decomposition of B, each column of is a eigenvector, is a diagonal matrix with ).To prove that is the optimal solution, we will show that

With the matrix in Eq. (14), we have

Let W = 0. We have , and 1. Thus , and is the optimal minimizer of .

ACKNOWLEDGMENT

The work is partially supported by the NSFC fund of China under contract NO. 61271093, the program of ministry of education for New Century Excellent Talents in University under Grant NCET-12-0150.

REFERENCES

[1] V. Vapnik, Statistical learning theory. Wiley New York, 1998, vol. 1.

[2] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000.

[3] E. Osuna, R. Freund, and F. Girosi, “Training support vector machines: an application to face detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997, pp. 130–136.

[4] B. Heisele, P. Ho, and T. Poggio, “Face recognition with support vector machines: Global versus component-based approach,” in IEEE International Conference on Computer Vision, vol. 2, 2001, pp. 688– 694.

[5] G. Guo, S. Z. Li, and K. L. Chan, “Support vector machines for face recognition,” Image and Vision computing, vol. 19, no. 9, pp. 631–638, 2001.

[6] J. Wu and H. Yang, “Linear regression-based efficient svm learning for large-scale classification.” IEEE transactions on neural networks and learning systems, 2015.

[7] B. Samanta, K. Al-Balushi, and S. Al-Araimi, “Artificial neural networks and support vector machines with genetic algorithm for bearing fault detection,” Engineering Applications of Artificial Intelligence, vol. 16, no. 7, pp. 657–665, 2003.

[8] S. Chen, A. K. Samingan, and L. Hanzo, “Support vector machine multiuser receiver for ds-cdma signals in multipath channels,” Neural Networks, vol. 12, no. 3, pp. 604–611, 2001.

[9] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” The Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.

[10] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, “Support vec- tor machine learning for interdependent and structured output spaces,” in Proc. of international conference on Machine learning, 2004, p. 104.

[11] H. Do, A. Kalousis, and M. Hilario, “Feature weighting using mar- gin and radius based error bound optimization in svms,” in Machine Learning And Knowledge Discovery In Databases, vol. 5781, 2009, pp. 315–329.

[12] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, “Feature selection for svms,” in NIPS, vol. 12, 2000, pp. 668–674.

[13] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine learning, vol. 46, no. 1-3, pp. 389–422, 2002.

[14] A. Rakotomamonjy, “Variable selection using svm based criteria,” The Journal of Machine Learning Research, vol. 3, pp. 1357–1370, 2003.

[15] V. Vapnik and O. Chapelle, “Bounds on error expectation for support vector machines,” Neural computation, vol. 12, no. 9, pp. 2013–2036, 2000.

[16] P. K. Shivaswamy and T. Jebara, “Maximum relative margin and data- dependent regularization,” The Journal of Machine Learning Research, vol. 11, pp. 747–788, 2010.

[17] X. Zhu, P. Gong, Z. Zhao, and C. Zhang, “Learning similarity metric with svm,” in International Joint Conference on Neural Networks (IJCNN), 2012, pp. 1–8.

[18] H. Do and A. Kalousis, “Convex formulations of radius-margin based support vector machines,” in Proc. of International Conference on Machine Learning, 2013, pp. 169–177.

[19] E. P. Xing, M. I. Jordan, S. Russell, and A. Y. Ng, “Distance metric learning with application to clustering with side-information,” in Advances in neural information processing systems, 2002, pp. 505–512.

[20] A. Globerson and S. T. Roweis, “Metric learning by collapsing classes,” in Advances in neural information processing systems, 2005, pp. 451– 458.

[21] M. Schultz and T. Joachims, “Learning a distance metric from relative comparisons,” Advances in neural information processing systems, p. 41, 2004.

[22] F. Wang, W. Zuo, L. Zhang, D. Meng, and D. Zhang, “A kernel classification framework for metric learning,” 2013.

[23] C. Shen, J. Kim, F. Liu, L. Wang, and A. Van Den Hengel, “Efficient dual approach to distance metric learning,” IEEE transactions on neural networks and learning systems, vol. 25, no. 2, pp. 394–406, 2014.

[24] M. Guillaumin, J. Verbeek, and C. Schmid, “Is that you? metric learning approaches for face identification,” in IEEE International Conference on Computer Vision, 2009, pp. 498–505.

[25] Q. Wang, W. Zuo, L. Zhang, and P. Li, “Shrinkage expansion adaptive metric learning,” in Proc. European Conference on Computer Vision. Springer, 2014, pp. 456–471.

[26] D. Tran and A. Sorokin, “Human activity recognition with metric learning,” in Proc. European Conference on Computer Vision, 2008, pp. 548–561.

[27] Z. Xu, K. Q. Weinberger, and O. Chapelle, “Distance metric learning for kernel machines,” arXiv preprint arXiv:1208.3422, 2012.

[28] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, “Choosing multiple parameters for support vector machines,” Machine learning, vol. 46, no. 1-3, pp. 131–159, 2002.

[29] V. Sydorov, M. Sakurada, and C. H. Lampert, “Deep fisher kernels–end to end learning of the fisher kernel gmm parameters,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2014, pp. 1402–1409.

[30] K. Gai, G. Chen, and C.-s. Zhang, “Learning kernels with radiuses of minimum enclosing balls,” in Advances in neural information processing systems, 2010, pp. 649–657.

[31] H. Do, A. Kalousis, A. Woznica, and M. Hilario, “Margin and radius based multiple kernel learning,” in Machine Learning and Knowledge Discovery in Databases. Springer, 2009, pp. 330–343.

[32] X. Liu, L. Wang, J. Yin, E. Zhu, and J. Zhang, “An efficient approach to integrating radius information into multiple kernel learning,” IEEE Transactions on Cybernetics, vol. 43, no. 2, pp. 557–569, 2013.

[33] X. Liu, L. Zhou, L. Wang, J. Zhang, J. Yin, and D. Shen, “An efficient radius-incorporated mkl algorithm for alzheimer s disease prediction,” Pattern Recognition, 2014.

[34] S. Peng, Q. Hu, Y. Chen, and J. Dang, “Improved support vector machine algorithm for heterogeneous data,” Pattern Recognition, 2014.

[35] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.

[36] R. A. Hom and C. R. Johnson, “Topics in matrix analysis,” Cambridge UP, New York, 1991.

[37] A. Hyv¨arinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural networks, vol. 13, no. 4, pp. 411–430, 2000.

[38] J. Yang, D. Zhang, and J.-y. Yang, “Is ica significantly better than pca for face recognition?” in IEEE International Conference on Computer Vision, vol. 1, 2005, pp. 198–203.

[39] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Computer Science Department, University of Toronto, Tech. Rep, vol. 1, no. 4, p. 7, 2009.

[40] B. Hariharan, J. Malik, and D. Ramanan, “Discriminative decorrelation for clustering and classification,” in Proc. European Conference on Computer Vision. Springer, 2012, pp. 459–472.

[41] R. Girshick and J. Malik, “Training deformable part models with decorrelated features,” in IEEE International Conference on Computer Vision, 2013, pp. 3016–3023.

[42] K. B. Petersen, M. S. Pedersen et al., “The matrix cookbook,” Technical University of Denmark, vol. 450, pp. 7–15, 2008.

[43] G. R¨atsch, T. Onoda, and K.-R. M¨uller, “Soft margins for adaboost,” Machine learning, vol. 42, no. 3, pp. 287–320, 2001.

[44] B. Scholkopft and K.-R. Mullert, “Fisher discriminant analysis with kernels,” in Signal Processing Society Workshop Neural Networks for Signal Processing, 1999, pp. 23–25.

[45] G. C. Cawley and N. L. Talbot, “Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers,” Pattern Recognition, vol. 36, no. 11, pp. 2585–2592, 2003.

[46] G. B. Huang, V. Jain, and E. Learned-Miller, “Unsupervised joint alignment of complex images,” in IEEE International Conference on Computer Vision, 2007, pp. 1–8.

[47] E. Nowak and F. Jurie, “Learning visual similarity measures for com- paring never seen objects,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[48] N. Pinto, J. J. DiCarlo, and D. D. Cox, “How far can you get with a modern face recognition test set using only simple features?” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009, pp. 2591– 2598.

[49] G. B. Huang, M. J. Jones, and E. Learned-Miller, “Lfw results using a combined nowak plus merl recognizer,” in Workshop on Faces in’RealLife’Images: Detection, Alignment, and Recognition, 2008.

[50] J.-F. Cai, E. J. Cand`es, and Z. Shen, “A singular value thresholding al- gorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.

[51] G. A. Watson, “Characterization of the subdifferential of some matrix norms,” Linear Algebra and its Applications, vol. 170, pp. 33–45, 1992.

designed for accessibility and to further open science