b

DiscoverSearch
About
My stuff
F-SVM: Combination of Feature Transformation and SVM Learning via Convex Relaxation
2015·arXiv
Abstract
Abstract

The generalization error bound of support vector machine (SVM) depends on the ratio of radius and margin, while standard SVM only considers the maximization of the margin but ignores the minimization of the radius. Several approaches have been proposed to integrate radius and margin for joint learning of feature transformation and SVM classifier. However, most of them either require the form of the transformation matrix to be diagonal, or are non-convex and computationally expensive.

In this paper, we suggest a novel approximation for the radius of minimum enclosing ball (MEB) in feature space, and then propose a convex radius-margin based SVM model for joint learning of feature transformation and SVM classifier, i.e., FSVM. An alternating minimization method is adopted to solve the F-SVM model, where the feature transformation is updated via gradient descent and the classifier is updated by employing the existing SVM solver. By incorporating with kernel principal component analysis, F-SVM is further extended for joint learning of nonlinear transformation and classifier. Experimental results on the UCI machine learning datasets and the LFW face datasets show that F-SVM outperforms the standard SVM and the existing radius-margin based SVMs, e.g., RMM, R-SVM+ andR-SVM+µ .

Index Terms—Support vector machine, radius margin bound, convex relaxation, max-margin.

SUPPORT vector machine (SVM) and its extensions havebeen one of the most successful machine learning methods [1], [2], and have been adopted in various fields, e.g., computer vision [3], [4], [5], [6], signal processing [7], [8], natural language processing [9], [10] and bioinformatics [11], [12], [13], [14]. Despite its popularity, SVM aims to seek the optimal hyperplane with the maximum margin principle, but the generalization error of SVM actually is a function of the ratio of radius and margin [15]. Given feature space, the radius is fixed and can be ignored, thus SVM can minimize the generalization error by maximizing the margin. However, for joint learning of feature transformation and classifier, the radius information will be valuable and cannot be ignored.

By minimizing the radius-margin ratio, the generalization error of SVM can be optimized for joint learning of feature transformation and classifier. Since the radius-margin error bound is non-convex, relaxation and approximation of radius is generally adopted in the existing models [16], [17]. Several approaches have been proposed from the perspective of radius-margin error [11], [16], [17], [18], but most ones suffer from the limitations of computational burden and simplified forms

X. Wu, W. Zuo and Y. Zhu are with School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.

L. Lin is with School of Advanced Computing, Sun Yat-sen University, Guangzhou, China.

of transformation. RMM [16] only considers the spread of the data along the direction perpendicular to the classification hyperplane. Radius-margin based SVMs, e.g., MR-SVM [11], R-SVM+[18] and RSVM+µ[18], are based on the constraint that the linear transformation matrix should be diagonal.

Another strategy for joint feature transformation and clas-sifier learning is to incorporate metric learning with SVM, where metric learning can be adopted to learn a better linear transformation matrix [19], [17], [20], [21], [22], [23], [24], [25], [26]. One simple approach to combine metric learning and SVM is to directly deploy the transformation obtained using metric learning into SVM. This approach, however, usually cannot lead to satisfactory performance improvement [27]. Therefore, other approaches have been proposed to integrate metric learning to SVM, e.g., support vector metric learning (SVML) [27] and metric learning with SVM (MSVM) [17]. But SVML [27] was designed for RBF-SVM and ignored the radius information, while MSVM [17] is non-convex.

In this paper, we propose a novel radius-margin based SVM model for joint learning of feature transformation and SVM classifier, i.e., F-SVM. Compared with the existing radius-margin based SVM methods, we derive novel lower and upper bounds for the relaxation of the radius-margin ratio. Unlike MR-SVM [11], R-SVM+[18] and RSVM+µ[18] which are suggested for joint feature weighting and SVM learning, FSVM can simultaneously learn feature transformation L and classifier (w, b). Compared with the existing metric learning for SVM methods, our F-SVM model considers both the radius and the margin information, and is convex. Then, an alternating minimization algorithm is proposed to solve our FSVM model, which iterates by updating feature transformation and classifier alternatively. Note that kernel SVM is equivalent to perform linear SVM in the kernel PCA space. We further suggest to conduct linear FSVM in the kernel PCA space for joint learning of nonlinear transformation and classifier. The contribution of this paper is of three-fold:

A novel convex formulation of radius-margin based SVM model, i.e., F-SVM, is proposed. Unlike MR-SVM [11], R-SVM+[18] and RSVM+µ[18], our F-SVM is capable of joint learning feature transformation and classifier, and is robust against outliers. Experimental results show that F-SVM outperforms SVM and the existing radius-margin based SVMs.

In F-SVM, we derive the lower and upper bounds for the radius of minimum enclosing ball (MEB) in feature space, resulting in a novel approximation of the radius. Compared with the approximations proposed in [18], ours is much simple and can be easily adopted in developing

image

An alternating minimization algorithm is developed for solving F-SVM via iterating between gradient descent and SVM learning. Therefore, the off-the-shelf SVM solvers can be employed to improve the computational efficiency. Moreover, a semi-whitened PCA method is developed for the initialization of  M = LT L.

The remainder of the paper is organized as follows: Section 2 reviews the related work on the radius-margin ratio based bounds and their applications. Section 3 describes the model and algorithm of the proposed F-SVM method. Section 4 extends F-SVM to the kernelized version for nonlinear clas-sification. Section 5 provides the experimental results on the UCI machine learning datasets and the LFW dataset. Finally, we conclude the paper in Section 6.

The radius-margin bound not only provides theoretical explanation on the generalization performance of SVM [1], but also has been extensively adopted for improving kernel classification methods, e.g., model selection [28], [29], multiple kernel learning (MKL) [30], [31], [32], [33], and mapping of nominal attributes [34]. Denote a training set by S = {(x1, y1) , ..., (xn, yn)}and a feature space by  H : Φ (x). In [30], [31], the radius R of minimum enclosing ball (MEB) in feature space is computed as:

image

Assuming that the training set is separable, given the optimal hyperplane (w, b), Vapnik [1] suggested a radius-margin error bound which showed that the expectation of the misclassifica-tion probability depends on  R2∥w∥22.

The standard SVM is known as a max-margin method which only considers the margin  1/∥w∥22in the algorithm. When the feature space is fixed, the radius is a constant and can thus be ignored. But in many classification tasks, the model parameters [28], combination of basis kernels [31], feature reweighting or transformation [12] usually should be learned or tuned based on the training data, where integration of radius has been demonstrated to be very effective in improving the classification performance. In model selection, radius-margin bound has been applied for choosing tradeoff parameter and scaling factors of SVM and  L1-SVM [28]. In multiple kernel learning (MKL) [30], [31], [32] and feature reweighting [12], several variants of radius had been developed.

This paper aims to jointly learn SVM together with feature transformation by minimizing the radius-margin ratio, i.e., radius-margin based SVM, and more detailed review is given on this topic. Except [17], most existing approaches [11], [18], [21] require the transformation matrix to be diagonal, i.e., feature reweighting and selection. Direct use of radius-margin ratio  R2∥w∥22in SVM results in a non-convex optimization problem, which makes the learning algorithm computationally expensive and unstable. By restricting the feature transformation to be diagonal  Dµ = Diag (µ)with  µk ≥ 0, Do et al. [11] suggested that the radius is bounded with  maxk µkR2k ≤R2µ ≤ �k µkR2k, where  Rkis the radius on dimension k. By approximating  R2µwith its upper bound  �k µkR2k, MR-SVM in [11] solved the following convex relaxation problem:

image

Denote  ROby the half value of the maximum pairwise distances. Do et al. in [18] introduced a tighter bound of the radius  RO ≤ Rµ ≤ 1+√32 ROand proposed another convex model R-SVM+µ:

image

Furthermore, R-SVM+was developed in [18] by controlling both the radius and margin with w.

Rather than feature reweighting and selection, Zhu et al. [17] proposed a metric learning with SVM (MSVM) method for joint learning of the linear transformation and SVM classi-fier. In [17], given the transformation matrix A, an alternative ¯R = maxi ∥Axi − A¯x∥22of the radius R was adopted, where ¯xis the mean of the training samples. Although Zhu et al. [17] claimed that  R = ¯R, as demonstrated in Theorem 1 of this work, ¯Ris an upper bound of R. The MSVM model in [17] was formulated as:

image

Note that MSVM is non-convex and solved using gradient projection.

In this paper, we propose a novel relaxed convex model of radius-margin based SVM, i.e., F-SVM, for joint learning of feature transformation and SVM classifier. Compared with the existing radius-margin based SVM methods, F-SVM has some distinguishing features. Our F-SVM model is convex, while MSVM [17] is non-convex. Unlike RMM [16], the transformation in F-SVM is learned to minimize the radius of the enclosing ball of all samples rather than to only shrink the sample span along the direction perpendicular to the hyperplane. Moreover, F-SVM is also different with MR-SVM [11], R-SVM+[18] and R-SVM+µ[18] from three aspects: (i) Instead of feature reweighting and selection, F-SVM can learn feature transformation and classifier simultaneously; (ii) F-SVM adopts a new approximation for the radius of MEB in feature space; (iii) In F-SVM, individual inequality constraints are combined into one holistic inequality constraint to improve the robustness against outliers. All these make F-SVM very promising for joint learning of feature transformation and SVM classifier, and the results also validate the effectiveness of F-SVM.

A. Problem Formulation

Denote  S = {(x1, y1) , ..., (xn, yn)}by a training set, where  xi ∈ Rdand  yi ∈ {−1, +1}denote the ith training sample and the corresponding class label, respectively. By introducing the slack variables  ξi (i = 1, 2, ..., n), SVM aims to find the optimal separating hyperplane by solving the following optimization problem:

image

where (u, b) are the parameters to describe the learned hyperplane  uT x + b = 0, ξidenotes the ith slack variable, and C stands for the tradeoff parameter. The objective function in Eq. (5) aims to maximize the margin  γ = 1/∥u∥2while minimizing the empirical risk �ni=1 ξi. For joint learning, we introduce a linear transformation matrix A and integrate the radius information, resulting in the following radius-margin based SVM model:

image

where the radius R is defined as:

image

Note that  R2depends on matrix A and the problem in Eq. (6) is non-convex [18]. Denote  x0by the center of all instances, and ¯Rby the largest squared distance from the center in transformed feature space. Let  x0 = ¯x = �ni=1 xiand ¯R = maxi ∥Axi − A¯x∥22. We prove that the radius R is bounded by ¯R.

image

Please refer to Appendix A for the proof of Theorem 1. In [17], Zhu et al. claimed that  R = ¯R. From Theorem 1, ¯Ris only an approximation of R, and counter examples can be easily found to illustrate  R ̸= ¯R. Let  w = AT uand M = AT A. Since the radius R is upper bounded by ¯R, we can approximate R with ¯R. With simple algebra, the radius-margin SVM model in Eq. (6) is relaxed into the following formulation:

image

Theorem 2. The problem in Eq. (9) is equivalent with the following problem:

image

Proof. Denote  (ˆw,ˆb, ˆξ, ˆM, ˆR)by the optimal solution to the problem in Eq. (9). Let ˜M = ˆM/ ˆR2and ˜R = 1. It is obvious to see that  (ˆw,ˆb, ˆξ, ˜M, ˜R)is also the optimal solution to the problem in Eq. (9) because  F(ˆw,ˆb, ˆξ, ˆM, ˆR) =F(ˆw,ˆb, ˆξ, ˜M, ˜R).

Next we will show that  (ˆw,ˆb, ˆξ, ˜M)is the optimal solution to the problem in Eq. (10). If  (ˆw,ˆb, ˆξ, ˜M)is not the optimal solution to Eq. (10), there must exist some (w∗, b∗, ξ∗, M∗)that satisfies all inequality constraints and L(w∗, b∗, ξ∗, M∗) < L(ˆw,ˆb, ˆξ, ˜M). Then we can define ˜R =1 and have  F(w∗, b∗, ξ∗, M∗, ˜R) < F(ˆw,ˆb, ˆξ, ˜M, ˜R), which is contradictory with the assumption that  (ˆw,ˆb, ˆξ, ˜M, ˜R)is the optimal solution to Eq. (9). Thus, we can solve the problem in Eq. (10) with the optimal solution  (ˆw,ˆb, ˆξ, ˜M), and then obtain the optimal solution  (ˆw,ˆb, ˆξ, ˜M, ˜R)to Eq. (9).

Without loss of generality, we assume ˜R = 1and seek the corresponding optimal w and M by solving Eq. (10). Moreover, to make the model robust against outliers and noisy samples, we combine the individual inequality constraints (xi − ¯x)T M(xi − ¯x) ≤ 1, i = 1, 2, · · · , ninto one integrated inequality constraint [32], resulting in the following radius-margin based SVM model:

image

By defining the scattering matrix of the training set S = �ni=1 (xi − ¯x)(xi − ¯x)T, based on the Lagrangian multiplier method [35], the problem expressed in Eq. (11) can be equivalently reformulated as the following F-SVM model:

image

where  ρis the regularization parameter determined by  κ. In the following, we prove that our F-SVM model is convex.

Theorem 3. The F-SVM model is a convex optimization problem. The proof can be found in Appendix B.

image

Fig. 1. Illustration of the alternating minimization algorithm for F-SVM. First, a semi-whitened based initialization scheme is proposed to initialize M. Then, the alternating minimization algorithm is adopted by updating (w, b) and M alternatively. When the algorithm converges, as shown in Fig. 1, we can learn both a better matrix M to reduce the radius of MEB in feature space and a max-margin classifier (w, b).

B. Alternating minimization

In this section, we propose an efficient alternating minimization algorithm to solve the proposed F-SVM model, as illustrated in Fig. 1. First, a semi-whitened PCA based initialization scheme is proposed to initialize M. Then, the alternating minimization algorithm is adopted by updating (w, b) and M alternatively. Fixing M, the model can be reformulated as the SVM in the transformed feature space, and be solved using the off-the-shelf SVM solvers to update (w, b). Fixing (w, b), the gradient descent method [35] is adopted to update the matrix M. When the algorithm converges, as shown in Fig. 1, we can learn both a better matrix M to reduce the radius of MEB in feature space and a max-margin classifier (w, b). The alternating optimization procedure is summarized in Algorithm 1.

Algorithm 1 The alternating minimization algorithm for FSVM

image

1) Initialization of M: Because the proposed F-SVM model is convex, alternating minimization can converge to global optimum for any initialization of M and (w, b), but proper initialization is helpful in improving the computational effi-ciency. Thus, by further relaxing the F-SVM model in Eq. (12), we propose a semi-whitened PCA based initialization method on M.

Note that  wT M−1wis upper bounded by [36]:

image

where  ∥A∥2and  ∥A∥∗denote the L2-norm and the nuclear norm of a positive semi-definite matrix A, respectively. Based on Eq. (12) and Eq. (13), by setting  B = M−1, the subproblem on M can be rewritten as the problem on B formulated

as:

image

where  τ ′ = ρ/∥w∥2. The eigenvalue decomposition of S is S = UΛUT, where  Λ = diag(λ1, λ2, · · · , λd) (λ1 ≥ λ2 ≥· · · ≥ λd ≥ 0), λiand the ith column of U denote the ith eigenvalue and eigenvector, rsespectively. With U and  Λ, we define ˆBas:

image

Theorem 4. Given a SPD matrix S and  τ ′ > 0, ˆBdefined in Eq. (15) is the optimal solution to the problem:

image

The proof can be found in Appendix C. With ˆB, the initialization of M in Eq. (12) is then defined as:

image

Noted that we assume that  ∥w∥2is known for the initialization of M. From Eq. (17),  ∥w∥2only affects the scale factor√τ ′to the linear transformation. Thus, we simply let  ∥w∥2 = 1in our implementation.

It is interesting to point out that  M0in Eq. (17) implies a semi-whitening PCA transformation because  M0 =UΛ−1/2UT, where the linear transformation can then be defined as  Λ−1/4UT xi. In literature [37],  Λ−1/2UT xiwas called the PCA whitening transformation, and data whitening has been widely exploited in many applications, e.g., face recognition, object detection, and image classification [38], [39], [40], [41]. Considering its connection with data whitening, semi-whitening is also expected to be effective in other tasks and applications.

2) The subproblem on (w, b): Given M, the F-SVM model can be formulated as:

image

where  B = M−1. The eigenvalue decomposition of B is B = VΣVT. By introducing  B = LT L, the transformation matrix L can be rewritten as to  L = Σ12 VT. Let  zi = Σ− 12 VT xiand v = Lw. With simple algebra, the problem in Eq. (18) can be reformulated as:

image

which can be solved using the off-the-shelf SVM solvers. Given the solution  v, w = VΣ− 12 vcan then be obtained.

3) The subproblem on M: Given (w, b), the sub-problem on M can be reformulated as:

image

Since the objective function in Eq. (20) is convex and differentiable with respect to M, the gradient projection method [35] is adopted to update M. According to [42], the gradient of f(M) can be obtained by:

image

As presented in Algorithm 1, we use gradient projection

image

to update M by choosing proper stepsize t and gradually decreasing it along with iterations, where  PS+ (·)projects a matrix onto the cone of positive semidefinite matrices.

C. Discussion

The proposed F-SVM method has several interesting advantages while compared with the other radius-margin based SVMs, e.g., RMM [16], MR-SVM [11], R-SVM+[18], RSVM+µ[18] and M-SVM [17]. RMM [16] is suggested to max- imize the margin while restricting the spread of the data along the direction perpendicular to the separating hyperplane, while our F-SVM is proposed to minimize the convex relaxation of the radius-margin ratio. The generalization error is bounded by the radius and margin ratio, and the radius is determined by the spread along all possible directions rather than only the direction perpendicular to the separating hyperplane, making F-SVM theoretically more promising.

MR-SVM [11], R-SVM+[18] and R-SVM+µ[18] aim to learn the diagonal feature transformation  Dµ = Diag (µ)with  µk ≥ 0, while F-SVM is developed for joint learning of feature transformation and SVM classifier. Both R-SVM+and R-SVM+µneed to solve a Quadratically Constrained Quadratic Programming (QCQP) optimization problem, which is computationally expensive than the alternating minimization method used in our F-SVM. Moreover, R-SVM+and RSVM+µadopted a tighter approximation  ROof the radius. In F- SVM, a new approximation ¯Rof the radius is proposed, which is also tighter than that used in MR-SVM [11]. Moreover, the individual inequality constraints on ¯Rare combined to improve the robustness against outliers. It is interesting to note that we have:

image

where  St = �i,j (xi − xj)(xi − xj)T. Eq. (23) indi- cates that, if all the inequality constraints on  RO, i.e., ��D√µxi − D√µxj��2 ≤ r, are combined into one integrated inequality constraint:

image

Let  κ′ = 4nκand  M = DT√µD√µ. One can see that the integrated inequality constraint will be equivalent with that adopted in Eq. (11).

MSVM [17] was developed for simultaneous learning of the linear transformation and SVM classifier, but the MSVM model is non-convex and solved using gradient projection. Moreover, although Zhu et al. [17] claimed that  R = ¯R, as discussed in Section 3.1, ¯Ris only a lower bound of R and counter examples can be easily found to illustrate  R ̸= ¯R. Compared with MSVM [17], the F-SVM model is convex and robust against noise and outliers, and can be efficient solved using the optimization method introduced in Section 3.2.

With the incorporation of kernel principal component analysis, linear F-SVM can be extended to kernel version for nonlinear classification. First, we show that kernel SVM is equivalent to perform linear SVM in the kernel PCA space. Then, kernel F-SVM is introduced by conducting linear FSVM in the kernel PCA space.

Let the kernel function be  K(xi, xj) = ϕ(xi)T ϕ(xj), where  ϕ (x)defines an implicit mapping from the data space to high or infinite dimensional feature space. For the training set S = {(x1, y1) , ..., (xn, yn)}, we use  W = [w1, w2, ..., wD]to denote all the PCA eigenvectors corresponding to positive eigenvalues. Let ¯Wbe a set of basis vectors in the complementary space of W. Assuming the training set is centered, for any  xi, we have ¯WT ϕ(xi) = 0, and thus can get:

image

TABLE I SUMMARY OF THE UCI DATASETS USED IN THE EXPERIMENTS.

image

Let  fi = WT ϕ(xi). The dual problem of SVM in the kernel PCA space can be formulated as:

image

where  ⟨fi, fj⟩ = K (xi, xj). Therefore, kernel SVM is equivalent to performing linear SVM in the kernel PCA space. To extend F-SVM to its kernelized version, we first project each training sample  xito the kernel PCA space  fi = WT ϕ(xi), and then solve the following F-SVM model:

image

where  Sf = �ni=1 fif Ti. Algorithm 1 can be adopted to solve the model in Eq. (27). In our implementation, instead of using all the eigenvectors, we also consider to employ the PCA eigenvectors corresponding to first d largest eigenvalues, i.e., W = [w1, w2, ..., wD], and Section 5.2 reports the empirical result on the influence of d on classification accuracy.

In this section, we use both the UCI machine learning datasets and the Labeled Faces in the Wild (LFW) database to evaluate the proposed F-SVM method, and compare our FSVM with the competing methods, including SVM and several representative radius-margin based SVM methods, i.e., RMM [16], R-SVM+[18] and R-SVM+µ[18]. MR-SVM [11] and MSVM [17] are not considered in our experiments because their source codes are not publicly available. The 10-fold cross validation (CV) is adopted to determine the optimal values of hyper-parameters for each method. The mean classification accuracy is adopted by averaging the 100 runs of the 10-fold CV. The methods are evaluated by two performance indicators: accuracy and training time (seconds, s).

A. Evaluation on linear F-SVM using the UCI datasets

We evaluate the performance of linear F-SVM on the 11 datasets from the UCI machine learning repository, where the

TABLE II COMPARISON OF THE AVERAGE CLASSIFICATION ACCURACY (%) OF LINEAR SVM, LINEAR RMM [16], LINEAR R-SVM+ [18], LINEARR-SVM+µ [18],AND LINEAR F-SVM.

image

TABLE III COMPARISON OF THE TRAINING TIME (s) OF LINEAR SVM, LINEARRMM [16], LINEAR R-SVM+ [18], LINEAR R-SVM+µ [18],AND LINEAR F-SVM.

image

reason to choose them is that they had been widely adopted for evaluating SVM and kernel methods [43], [44], [45]. Table I provides a brief summary of these UCI datasets, which includes 6 2-class problems and 5 multi-class problems. Tables II and III list the mean classification accuracy and training time of five linear classifiers, i.e., linear SVM, linear RMM [16], linear R-SVM+[18], linear R-SVM+µ[18], and linear F- SVM. RMM [16], R-SVM+[18], R-SVM+µ[18], and F-SVM consider both margin and radius information, while SVM only considers margin. As shown in Table II, the radius-margin based SVM methods generally outperform SVM in terms of classification accuracy, which indicates that the incorporation of radius can improve the classification performance. As listed in Table III, the training time of SVM is much less than the other four methods, indicating that the introduction of radius makes the model more complex to train.

We further compare linear F-SVM with the competing methods. From Table II, F-SVM achieves higher classification accuracy than SVM on 9, RMM [16] on 7, R-SVM+[18] on 7, and R-SVM+µ[18] on 8 of the 11 datasets. The better classification accuracy of our F-SVM should be attributed to that: (i) compared with SVM, F-SVM incorporates radius in the convex model; (ii) unlike RMM [16], our F-SVM considers the spread along all directions rather than only the direction perpendicular to the separating hyperplane. (iii) instead of feature reweighting and selection in R-SVM+[18] and R-SVM+µ[18], general linear transformation is learned in F-SVM. To improve the efficiency of F-SVM in training, we adopt the warm-start strategy, where the solution (w, b) of the previous iteration is used as the initialization of the next iteration. From Table III, one can see that our F-SVM is only a little slower than SVM, but is much more efficient than the other competing methods in training. F-SVM is about  103∼104times faster than RMM [16], R-SVM+and R-SVM+µin [18]. In summary, F-SVM obtains the best classification accuracy among all competing methods, and is more efficient in training than the other radius-margin based SVM methods.

TABLE IV COMPARISON OF THE AVERAGE CLASSIFICATION ACCURACY (%) OF KERNEL SVM, KERNEL RMM [16], KERNEL R-SVM+ [18], KERNELR-SVM+µ [18],AND KERNEL F-SVM.

image

B. Evaluation on kernel F-SVM using the UCI datasets

In this subsection, we evaluate the performance of kernel F-SVM on the 11 UCI datasets. The Gaussian RBF kernel is adopted in our experiments, which includes an extra kernel parameter  σ. As discussed in Section 4, we also consider the number of kernel PCA components in kernel F-SVM. Using four datasets, i.e., Breast cancer, Thyroid, Heart, and German, Fig. 2 illustrates the classification accuracy of kernel SVM and kernel F-SVM under different kernel PCA dimensions. It is interesting to note that, the proper decreasing of kernel PCA dimension can consistently improve the classification accuracy both for kernel SVM and kernel F-SVM. Also from Fig. 2 one can see that the kernel F-SVM is superior to kernel SVM under different dimensions. One possible explanation may be that the decreasing of kernel PCA dimension would make the learned transformation more stable.

Tables IV and V list the mean classification accuracy and training time of five linear classifiers, i.e., kernel SVM, kernel RMM [16], kernel R-SVM+[18], kernel R-SVM+µ[18], and kernel F-SVM. For kernel methods, the superiority of F-SVM against the competing methods is more significant. The Kernel F-SVM outperforms kernel SVM on 10, kernel RMM [16] on 9, kernel R-SVM+[18] on 11, and kernel R-SVM+µ[18] on 11 of all the 11 datasets in terms of classification accuracy. By training time, the kernel F-SVM is a little slower than SVM, but is about  103∼104times faster than the other competing methods.

C. Results on the LFW Database

In this subsection, the LFW database is used to evaluate FSVM for face verification. The database consists of more than 13,233 face images from 5,749 persons. The face images in the LFW database were collected from the Internet, and vary

TABLE V COMPARISON OF THE TRAINING TIME (s) OF KERNEL SVM, KERNELRMM [16], KERNEL R-SVM+ [18], KERNEL R-SVM+µ [18], ANDKERNEL F-SVM.

image

Fig. 2. Classification accuracy (%) of kernel SVM and kernel F-SVM under different kernel PCA dimensions.

in pose, illumination, expression, and age, making LFW very suitable for studying unconstrained face verification. The face recognition method can be evaluated with two test protocols for LFW: the restricted and the unrestricted settings. Under the restricted setting, the only available information is whether each pair of training images is matched or not, and the performance of the face verification method is evaluated by 10-fold cross validation on a set of 3000 positive and 3000 negative image pairs.

In our experiment, we adopt the restricted setting with the face images aligned by the funneling method [46]. Fig. 3 shows some examples of similar and dissimilar pairs. We extract two kinds of features for each face image: SIFT feature and attribute feature, and compare F-SVM with SVM, RMM [16], R-SVM+[18], and R-SVM+µ[18], and several represen- tative face verification methods, including LDML [24], Nowak [47], V1-like/MKL [48], and MERL+Nowak [49].

Fig. 4 shows the verification accuracies of SVM and F-SVM under different PCA dimensions by using the attribute feature and the combined features of SIFT and attributes, respectively. F-SVM using the combined features of SIFT and attributes (F-SVM-combined) achieves its best performance of 83.25% when the dimension d = 300, and 82.58% when the dimension d = 73 using the attribute features (F-SVM-attribute). SVM using the combined features of SIFT and attributes (SVMcombined) achieves its best performance of 81.90% when the

image

Fig. 3. Examples of face image pairs in the LFW database.

dimension d = 400, and 80.12% when the dimension d = 73 using the attribute features (SVM-attribute). Thus, F-SVM can get better accuracy than SVM on the LFW database.

We further compare F-SVM with several other face verifi-cation methods. Table 6 lists the accuracy of F-SVM, SVM, RMM [16], R-SVM+[18], R-SVM+µ[18], LDML [24],Nowak [47], V1-like/MKL [48], and MERL+Nowak [49]. We report the accuracy of F-SVM, SVM, RMM [16], R-SVM+[18], RSVM+µ[18], LDML [24] using the attribute and the SIFT fea- tures, report the accuracy of Nowak [47] and MERL+Nowak [49] using SIFT and geometry feature, and report the accuracy of V1-like/MKL [48] using the V1-like features. For either the combined features of SIFT and attributes or the attribute features, F-SVM achieves higher accuracy than SVM, RMM [16],R-SVM+[18], R-SVM+µ[18], LDML [24] separately from Table VI. Fig. 5 shows the ROC curves of the competing methods. Also one can see that F-SVM-combined gets better performance than other face verification methods.

In this paper, we proposed a convex radius-margin based SVM model (F-SVM) for joint learning of feature transformation and SVM classifier. For the formulation of F-SVM, lower and upper bounds of the radius of MEB are introduced to derive a novel approximation of radius-margin ratio, and all the individual inequality constraints are combined into one integrated inequality constraint, resulting in a convex relaxation of the radius-margin based SVM model. For model optimization, a semi-whitened PCA based method is proposed for the initialization of the learned transformation, and an alternating minimization algorithm is adopted to learn the

image

Fig. 4. Examples of face image pairs in the LFW database.

image

Fig. 5. ROC curves of different face verification methods on the restricted LFW-funneled database.

TABLE VI COMPARISON OF ACCURACY OBTAINED USING DIFFERENT FACE VERIFICATION METHODS. THE TOP TWO RESULTS ARE SHOWN IN RED AND BLUE FONTS, RESPECTIVELY.

image

feature transformation and SVM classifier. Further, F-SVM is kernelized by using kernel PCA. Our experimental results show that, F-SVM obtains higher classification accuracy than SVM and the state-of-the-art radius-margin based SVM methods, and is more efficient in training than the other radius-margin based SVM methods, e.g., RMM [16], R-SVM+[18] and R-SVM+µ[18]. In our future work, we will extend the proposed relaxed radius-margin based error bound to other classification methods and extend the proposed model for learning other forms of feature transformation tailored for specific applications.

image

Proof. Based on the definition of the radius, we get

image

Denote  Rpby the maximum pairwise distance. We have

image

Proof. Let  x′i = Axi−A¯x. We have  Axi−Axj = x′i−x′j. Based on the definition of ¯R

image

we will prove that there exist some  j∗which makes ∥x′i∗ − x′j∗∥2 ≥ ∥x′i∗∥2. Based on the definition of  ¯x, we have �j x′j = 0. Then, we derive

image

Since  ∥x′j∗∥2 ≥ 0and  −2x′j∗x′i∗ ≥ 0, one can easily see that

image

Based on the definition of  Rp:

image

Combining the two inequalities above, we can prove ¯R ≤Rp.

Finally, by combining Lemmas 1∼3, we obtain the following theorem: Theorem A.1. The margin R is bounded by ¯Rby:

image

Lemma B.1 [42]. Given two symmetric positive definite (SPD) matrices A and B, we have

image

Theorem B.2. The problem Eq. (12) is a convex optimization problem.

Proof. Note that all the constraints define a convex set, and �i ξiand tr (MS) are linear to  ξand M, respectively. Then the key step is to prove that the function  wT M−1wis convex for  M ≻ 0, i.e., for any  1 ≥ θ ≥ 0,

image

Note that  (θw1+(1−θ)w2)T (θM1+(1−θ)M2)−1 (θw1+(1−θ)w2)contains three terms:

image

First, we have

image

and then we get

image

Analogously, we get

image

With Eq. (30), we have

image

Combining Eqns. (29)-(32), we get

image

Thus, the F-SVM model is convex.

Theorem 4. Given a SPD matrix S and  τ ′ > 0, ˆBdefined in Eq. (17) is the optimal solution to the problem:

image

Proof.  L(B, τ ′)is strictly convex with respect to B [28]. Given  g(B) = τ ′tr�B−1S�, we have:

image

where ¯U¯Σ ¯UTis the eigenvalue decomposition of B, each column of ¯Uis a eigenvector, ¯Σis a diagonal matrix with ¯Σ = diag(σ1, σ2, · · · , σd) (0 ≤ σ1 ≤ σ2 ≤ · · · ≤ σd).To prove that ˆBis the optimal solution, we will show that

image

With the matrix ˆBin Eq. (14), we have

image

Let W = 0. We have  UT W = 0, WU = 0, and  ∥W∥2 ≤1. Thus  UUT ∈ ∂���ˆB���∗, and  ˆBis the optimal minimizer of L(B, τ ′).

The work is partially supported by the NSFC fund of China under contract NO. 61271093, the program of ministry of education for New Century Excellent Talents in University under Grant NCET-12-0150.

[1] V. Vapnik, Statistical learning theory. Wiley New York, 1998, vol. 1.

[2] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000.

[3] E. Osuna, R. Freund, and F. Girosi, “Training support vector machines: an application to face detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997, pp. 130–136.

[4] B. Heisele, P. Ho, and T. Poggio, “Face recognition with support vector machines: Global versus component-based approach,” in IEEE International Conference on Computer Vision, vol. 2, 2001, pp. 688– 694.

[5] G. Guo, S. Z. Li, and K. L. Chan, “Support vector machines for face recognition,” Image and Vision computing, vol. 19, no. 9, pp. 631–638, 2001.

[6] J. Wu and H. Yang, “Linear regression-based efficient svm learning for large-scale classification.” IEEE transactions on neural networks and learning systems, 2015.

[7] B. Samanta, K. Al-Balushi, and S. Al-Araimi, “Artificial neural networks and support vector machines with genetic algorithm for bearing fault detection,” Engineering Applications of Artificial Intelligence, vol. 16, no. 7, pp. 657–665, 2003.

[8] S. Chen, A. K. Samingan, and L. Hanzo, “Support vector machine multiuser receiver for ds-cdma signals in multipath channels,” Neural Networks, vol. 12, no. 3, pp. 604–611, 2001.

[9] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” The Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.

[10] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, “Support vec- tor machine learning for interdependent and structured output spaces,” in Proc. of international conference on Machine learning, 2004, p. 104.

[11] H. Do, A. Kalousis, and M. Hilario, “Feature weighting using mar- gin and radius based error bound optimization in svms,” in Machine Learning And Knowledge Discovery In Databases, vol. 5781, 2009, pp. 315–329.

[12] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, “Feature selection for svms,” in NIPS, vol. 12, 2000, pp. 668–674.

[13] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine learning, vol. 46, no. 1-3, pp. 389–422, 2002.

[14] A. Rakotomamonjy, “Variable selection using svm based criteria,” The Journal of Machine Learning Research, vol. 3, pp. 1357–1370, 2003.

[15] V. Vapnik and O. Chapelle, “Bounds on error expectation for support vector machines,” Neural computation, vol. 12, no. 9, pp. 2013–2036, 2000.

[16] P. K. Shivaswamy and T. Jebara, “Maximum relative margin and data- dependent regularization,” The Journal of Machine Learning Research, vol. 11, pp. 747–788, 2010.

[17] X. Zhu, P. Gong, Z. Zhao, and C. Zhang, “Learning similarity metric with svm,” in International Joint Conference on Neural Networks (IJCNN), 2012, pp. 1–8.

[18] H. Do and A. Kalousis, “Convex formulations of radius-margin based support vector machines,” in Proc. of International Conference on Machine Learning, 2013, pp. 169–177.

[19] E. P. Xing, M. I. Jordan, S. Russell, and A. Y. Ng, “Distance metric learning with application to clustering with side-information,” in Advances in neural information processing systems, 2002, pp. 505–512.

[20] A. Globerson and S. T. Roweis, “Metric learning by collapsing classes,” in Advances in neural information processing systems, 2005, pp. 451– 458.

[21] M. Schultz and T. Joachims, “Learning a distance metric from relative comparisons,” Advances in neural information processing systems, p. 41, 2004.

[22] F. Wang, W. Zuo, L. Zhang, D. Meng, and D. Zhang, “A kernel classification framework for metric learning,” 2013.

[23] C. Shen, J. Kim, F. Liu, L. Wang, and A. Van Den Hengel, “Efficient dual approach to distance metric learning,” IEEE transactions on neural networks and learning systems, vol. 25, no. 2, pp. 394–406, 2014.

[24] M. Guillaumin, J. Verbeek, and C. Schmid, “Is that you? metric learning approaches for face identification,” in IEEE International Conference on Computer Vision, 2009, pp. 498–505.

[25] Q. Wang, W. Zuo, L. Zhang, and P. Li, “Shrinkage expansion adaptive metric learning,” in Proc. European Conference on Computer Vision. Springer, 2014, pp. 456–471.

[26] D. Tran and A. Sorokin, “Human activity recognition with metric learning,” in Proc. European Conference on Computer Vision, 2008, pp. 548–561.

[27] Z. Xu, K. Q. Weinberger, and O. Chapelle, “Distance metric learning for kernel machines,” arXiv preprint arXiv:1208.3422, 2012.

[28] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, “Choosing multiple parameters for support vector machines,” Machine learning, vol. 46, no. 1-3, pp. 131–159, 2002.

[29] V. Sydorov, M. Sakurada, and C. H. Lampert, “Deep fisher kernels–end to end learning of the fisher kernel gmm parameters,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2014, pp. 1402–1409.

[30] K. Gai, G. Chen, and C.-s. Zhang, “Learning kernels with radiuses of minimum enclosing balls,” in Advances in neural information processing systems, 2010, pp. 649–657.

[31] H. Do, A. Kalousis, A. Woznica, and M. Hilario, “Margin and radius based multiple kernel learning,” in Machine Learning and Knowledge Discovery in Databases. Springer, 2009, pp. 330–343.

[32] X. Liu, L. Wang, J. Yin, E. Zhu, and J. Zhang, “An efficient approach to integrating radius information into multiple kernel learning,” IEEE Transactions on Cybernetics, vol. 43, no. 2, pp. 557–569, 2013.

[33] X. Liu, L. Zhou, L. Wang, J. Zhang, J. Yin, and D. Shen, “An efficient radius-incorporated mkl algorithm for alzheimer s disease prediction,” Pattern Recognition, 2014.

[34] S. Peng, Q. Hu, Y. Chen, and J. Dang, “Improved support vector machine algorithm for heterogeneous data,” Pattern Recognition, 2014.

[35] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.

[36] R. A. Hom and C. R. Johnson, “Topics in matrix analysis,” Cambridge UP, New York, 1991.

[37] A. Hyv¨arinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural networks, vol. 13, no. 4, pp. 411–430, 2000.

[38] J. Yang, D. Zhang, and J.-y. Yang, “Is ica significantly better than pca for face recognition?” in IEEE International Conference on Computer Vision, vol. 1, 2005, pp. 198–203.

[39] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Computer Science Department, University of Toronto, Tech. Rep, vol. 1, no. 4, p. 7, 2009.

[40] B. Hariharan, J. Malik, and D. Ramanan, “Discriminative decorrelation for clustering and classification,” in Proc. European Conference on Computer Vision. Springer, 2012, pp. 459–472.

[41] R. Girshick and J. Malik, “Training deformable part models with decorrelated features,” in IEEE International Conference on Computer Vision, 2013, pp. 3016–3023.

[42] K. B. Petersen, M. S. Pedersen et al., “The matrix cookbook,” Technical University of Denmark, vol. 450, pp. 7–15, 2008.

[43] G. R¨atsch, T. Onoda, and K.-R. M¨uller, “Soft margins for adaboost,” Machine learning, vol. 42, no. 3, pp. 287–320, 2001.

[44] B. Scholkopft and K.-R. Mullert, “Fisher discriminant analysis with kernels,” in Signal Processing Society Workshop Neural Networks for Signal Processing, 1999, pp. 23–25.

[45] G. C. Cawley and N. L. Talbot, “Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers,” Pattern Recognition, vol. 36, no. 11, pp. 2585–2592, 2003.

[46] G. B. Huang, V. Jain, and E. Learned-Miller, “Unsupervised joint alignment of complex images,” in IEEE International Conference on Computer Vision, 2007, pp. 1–8.

[47] E. Nowak and F. Jurie, “Learning visual similarity measures for com- paring never seen objects,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[48] N. Pinto, J. J. DiCarlo, and D. D. Cox, “How far can you get with a modern face recognition test set using only simple features?” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009, pp. 2591– 2598.

[49] G. B. Huang, M. J. Jones, and E. Learned-Miller, “Lfw results using a combined nowak plus merl recognizer,” in Workshop on Faces in’RealLife’Images: Detection, Alignment, and Recognition, 2008.

[50] J.-F. Cai, E. J. Cand`es, and Z. Shen, “A singular value thresholding al- gorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.

[51] G. A. Watson, “Characterization of the subdifferential of some matrix norms,” Linear Algebra and its Applications, vol. 170, pp. 33–45, 1992.


Designed for Accessibility and to further Open Science