Worst-Case Linear Discriminant Analysis as Scalable Semidefinite Feasibility Problems

2014·Arxiv

Abstract

Abstract

In this paper, we propose an efficient semidefinite programming (SDP) approach to worst-case linear discriminant analysis (WLDA). Compared with the traditional LDA, WLDA considers the dimensionality reduction problem from the worst-case viewpoint, which is in general more robust for classifica-tion. However, the original problem of WLDA is non-convex and difficult to optimize. In this paper, we reformulate the optimization problem of WLDA into a sequence of semidefinite feasibility problems. To efficiently solve the semidefinite feasibility problems, we design a new scalable optimization method with quasi-Newton methods and eigen-decomposition being the core components. The proposed method is orders of magnitude faster than standard interior-point based SDP solvers.

Experiments on a variety of classification problems demonstrate that our approach achieves better performance than standard LDA. Our method is also much faster and more scalable than standard interior-point SDP solvers based WLDA. The computational complexity for an SDP with m constraints and matrices of size d by d is roughly reduced from to in our case).

Index Terms—Dimensionality Reduction, Worst-Case Linear Discriminant Analysis, Semidefinite Programming

I. INTRODUCTION

Dimensionality reduction is a critical problem in machine learning, pattern recognition and computer vision, which for linear case learns a transformation matrix to project the input data to a lower-dimensional space such that the important structure or geometry of input data is preserved. It can help us to eliminate the inherent noise of data, and improve the classification performance. It can also decrease the computational complexities of subsequent machine learning tasks. There are two classical dimensionality reduction methods used widely in many applications, principal component analysis (PCA) and linear discriminant analysis (LDA). PCA is an unsupervised linear dimensionality reduction method, which seeks a subspace of the data that have the maximum variance and subsequently projects the input data onto it. PCA may not give good classification performance due to its unsupervised nature. LDA is in supervised fashion, which aims to maximize the average distance between each two class means and minimize the average distance between each two samples within the same class. However, it has some limitations: 1) For c-class data, the target dimension of the projected subspace is restricted to be at most . In this sense, LDA is suboptimal and

The authors are with School of Computer Science, The University of Adelaide, Australia. C. Shen and A. van den Hengel are also with Australian Centre for Robotic Vision. Correspondence should be addressed to C. Shen (chunhua.shen@adelaide.edu.au).

may cause so called class separation problem that LDA tends to merge classes which are close in the original space; 2) It assumes that conditional probability density functions are normally distributed, which does not hold in many cases; 3) The scatter matrices are required to be nonsingular.

There are a large number of extension works to LDA and PCA [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. Among these methods, our focus in this paper is on the discrimination criterion of worst-case linear discriminant analysis (WLDA), which was proposed by Zhang et al. [5]. Instead of using average between-class and within-class distances as LDA, WLDA considers scatter measures from the worst-case view, which arguably is more suitable for applications like classi-fication. Specifically, WLDA tries to maximize the minimum of pairwise distances between class means, and minimize the maximum of within-class pairwise distances over all classes. Due to the complex formulation of its criterion, the problem of WLDA is difficult to optimize.

In [5], the problem of WLDA was firstly relaxed to a metric learning problem on , which can be solved by a sequence of SDP optimization procedures, where SDP problems are solved by standard interior-point methods (We denote it as Zhang et al. (SDP)). However, standard interior-point SDP solvers scale poorly to problems with high dimensionality, as the computational complexity is , where d is the problem size, and m is the number of constraints in SDP. An alternative optimization procedure was then proposed in [5] for large-scale WLDA problems, in which one column of the transformation matrix was found at each iteration while fixing the other directions. We denote this method as Zhang et al. (SOCP) since it was reformulated as a series of secondorder cone programming (SOCP) problems lastly. Typically, this greedy method does not guarantee to find a globally optimal solution.

In this paper we propose a fast SDP approach to solve WLDA problem. The problem is converted to a sequence of SDP feasibility problems using bisection search strategy, which can find a globally optimal solution to the relaxed problem. More importantly, we adopt a novel approach to solve SDP feasibility problem at each iteration. Motivated by [11], a Frobenius-norm regularized SDP formulation is used, and its Lagrangian dual can be solved effectively by quasi-Newton methods. The computational complexity of this optimization method is dominated by eigen-decomposition at each iteration, which is . The proposed method is denoted as SDWLDA. The main contributions of this work are: 1) By introducing an auxiliary variable, the original WLDA problem is reformulated and can be solved via a sequence of convex feasibility problems, by which the global optimum can be obtained for the relaxed metric learning problem. 2) By virtue of the use of Frobenius norm regularization, the optimization problem can be addressed by solving its Lagrange dual, where first-order methods such as quasi-Newton can be used. This approach is much faster than solving the corresponding primal problem using standard interior-point methods, and can be applied to large-scale problems. Next, we briefly review some relevant work.

Dimensionality reduction In order to overcome the drawbacks of LDA and improve the accuracy in classification, many extensions have been proposed, such as relevant component analysis (RCA) [1], neighborhood component analysis (NCA) [2], null space LDA (NLDA) [3], orthogonal LDA (OLDA) [4], Enhanced fisher discriminant criterion (EFDC) [6], Geometric mean-based subspace selection (GMSS) [7], Harmonic mean-based subspace selection (HMSS) [8], and Max-min distance analysis (MMDA) [9]. Assuming dimensions with large within-class covariance are not relevant to subsequent classification tasks, RCA [1] assigns large weights to “relevant dimensions” and small weights to “irrelevant dimensions”, where the relevance is estimated using equivalence constraints. NCA [2] learns the transformation matrix W directly by minimizing the expected leave-one-out classification error of k-nearest neighbours on the transformed space. Because the objective function to be optimized is not convex, NCA tends to converge to a local optimum. NLDA [3], OLDA [4] and EFDC [6] were proposed to address the problem that standard LDA fails when scatter matrices are singular. NLDA maximizes the between-class distance in the null space of the within-class scatter matrix, while OLDA calculates a set of orthogonal discriminant vectors by diagonalizing the scatter matrices simultaneously. The resulting transformation matrices are both orthogonal for NLDA and OLDA, and they are equivalent to each other under a mild condition [12]. EFDC incorporates the intra-class variation into the Fisher discriminant criterion, so that data from the same class can be mapped to a subspace where both the intra-class compactness and intraclass variation are well preserved. In this way, this method is robust to the intraclass variation and results in a good generalization capability. To avoid the class separation problem of LDA, Tao et al. [13] proposed a general averaged divergence analysis (GADA) framework, which presented a general mean function in place of the arithmetic mean used in LDA. By choosing different mean functions, several subspace selection algorithms have been developed. GMSS [7] investigates the effectiveness of the geometric mean-based subspace selection, which maximizes the geometric mean of Kullback-Leibler (KL) divergences between different class pairs. HMSS [8] maximizes the harmonic mean of the symmetric KL divergences between all class pairs. They adaptively give large weights to class pairs that are close to each other, and result in better class separation performance than LDA. Instead of assigning weights to class pairs, MMDA [9] directly maximizes the minimum pairwise distance of all class pairs in the low-dimensional subspace, which guarantees the separation of all class pairs. However, MMDA does not take into account of the within-class pairwise distances over all classes. Recently, Bian et al. [10] presented an asymptotic generalization analysis of LDA, which enriched the existing theory of LDA further. They showed that the generalization ability of LDA is mainly determined by the ratio of dimensionality to training sample size, where both feature dimensionality and training data size can be proportionally large.

Many dimensionality reduction algorithm such as PCA and LDA can be formulated into a trace ratio optimization problem [14]. Guo et al. [15] presented a generalized Fisher discriminant criterion, which is essentially a trace ratio. They proposed a heuristic bisection way, which was proven to converge to the precise solution. Wang et al. [16] tackled the trace ratio problem directly by an efficient iterative procedure, where a trace difference problem was solved via the eigen-decomposition method in each step. Shen et al. [17] provided a geometric revisit to the trace ratio problem in the framework of optimization on the Grassmann manifold. Different from [16], they proposed another efficient algorithm, which employed only one step of the parallel Rayleigh quotient iteration at each iteration. Kokiopoulou et al. [18] also treated the dimensionality reduction problem as trace optimization problems, and gave an overview of the eigenvalue problems encountered in dimensionality reduction area. They made a comparition between nonlinear and linear methods for dimensionality reduction, including Locally Linear Embedding (LLE), Laplacean Eigenmaps, PCA, Locality Preserving Projections (LPP), LDA, etc., and showed that all the eigenvalue problems in explicit linear projections can be regarded as projected analogues of the socalled nonlinear projections.

Different from the aforementioned methods, WLDA considers the dimensionality reduction problem from a worst-case viewpoint. It maximizes the worst-case between-class scatter matrix and minimizes the worst-case within-class scatter matrix simultaneously, which can lead to more robust classifica-tion performance. The inner maximization and minimization over discrete variables make it different from the general trace ratio problem, and difficult to solve. The method of solving the general trace ratio problem cannot be extended here directly. Furthermore, different from the iterative algorithm for trace ratio optimization problem [16], we formulate the WLDA problem as a sequence of SDP problems, and propose an efficient SDP solving method. The eigen-decomposition we used is to solve the Lagrange dual gradient, which differs from that employed in solving the trace ratio optimization problem.

Solving large-scale SDP problems Instead of learning the transformation matrix W, quadratic Mahalanobis distance metric learning methods (which are highly related to dimensionality reduction methods) optimize over , in order to obtain a convex relaxation. The transformation matrix W can be recovered from the eigen-decomposition of Z. Because Z is positive semidefinite (p.s.d.) by definition, quadratic Mahalanobis metric learning methods optimizing on Z usually need to solve an SDP problem.

Xing et al. [19] formulated metric learning as a convex (SDP) optimization problem, and a globally optimal solution can be obtained. Weinberger et al. [20] presented a distance metric learning method, which optimizes a Mahalanobis metric such that the k-nearest neighbours always belong to the same class while samples from different classes are separated by a large margin. In terms of SDP solver, they proposed an alternate projection method, where the learned metric Z is projected back onto the p.s.d. cone by eigen-decomposition at each iteration. MMDA [9] was solved approximately by a sequence of SDP problems using standard interior-point methods. Shen et al. [21] proposed a novel SDP based method for directly solving trace quotient problems for dimensionality reduction. With this method, globally-optimal solutions can be obtained for trace quotient problems.

As we can see, many aforementioned methods used standard interior-point SDP solvers, which are unfortunately computationally expensive (computational complexity is ) and scale poorly to large-scale problems. Thus an efficient SDP optimization approach is critical for large-scale metric learning problems.

There are many recent work to address large-scale SDP problems arising from distance metric learning and other computer vision tasks. Shen et al. [11] proposed a fast SDP approach for solving Mahalanobis metric learning problem. They introduced a Frobenius-norm regularization in the objective function of SDP problems, which leads to a much simpler Lagrangian dual problem: the objective function is continuously differentiable and p.s.d. constraints in the dual can be eliminated. L-BFGS-B was used to solve the dual, where a partial eigen-decomposition needed to be calculated at each iteration. Wang et al. [22] also employed a similar dual approach to solve binary quadratic problems for computer vision tasks, such as image segmentation, co-segmentation, image registration. SDP optimization method in [11], [22] can be seen as an extension of the works in [23], [24], which considered semidefinite least-squares problems. The key motivation of [23], [24] is that the objective function of the corresponding dual problem is continuously differentiable but not twice differentiable, therefore first-order methods can be applied. Malick [24] and Boyd and Xiao [23] proposed to use quasi-Newton methods and projected gradient methods respectively, to solve the Lagrangian dual of semidefinite least-squares problems. Semismooth Newton-CG methods [25] and smoothing Newton methods [26] are also exploited for semidefinite least-squares problems, which require much less number of iterations at the cost of higher computational complexity per iteration (full eigen-decomposition plus conjugate gradient).

Alternatively, stochastic (sub)gradient descent (SGD) methods [27] were also employed to solve SDP problems. Combining with alternating direction methods [28], [29], SGD can be used for SDP problems with inequality and equality constraints. The computational bottleneck of typical SGD is the projection of one infeasible point onto the p.s.d. cone at each iteration, which leads to the eigen-decomposition of a matrix. A number of methods have been proposed to speed up the projection operation at each iteration. Chen et al. [30] proposed a low-rank SGD method, in which rank- k stochastic gradient is constructed and then the projection operation is simplified to compute at most k eigenpairs. In the works of [31], [32], [33], [34], the distance metric is updated by rank-one matrices iteratively, and no eigen-decomposition or only one leading eigenpair is required. Note that SGD methods usually need more iterations to converge than the dual approaches based on quasi-Newton methods [11].

The most related work to ours may be Shen et al.’s [11]. We use similar SDP optimization technique as that in [11]. However, SDP feasibility problems are considered in our paper while the work in [11] focuses on standard SDP problems with linear objective functions.

Notation We use a bold lower-case letter x to denote a column vector, and a bold capital letter X to denote a matrix. is the transposition of indicates the set of matrices. represents an identity matrix. indicates that the matrix is positive semidefinite. denotes the inner product of two matrices or vectors. indicates the trace of a matrix. is the Frobenius norm of a matrix. returns a diagonal matrix with the input elements on its main diagonal. Suppose that the eigen-decomposition of a symmetric matrix is , where U is the orthonormal matrix of eigenvectors of X, and are the corresponding eigenvalues, we define the positive and negative parts of X respectively as

It is clear that holds.

II. WORST-CASE LINEAR DISCRIMINANT ANALYSIS

We briefly review WLDA problem proposed by [5] firstly. Given a training set of n samples ), which consists of classes , where class contains samples. As we mentioned before, the aim of linear dimensionality reduction is to find a transformation matrix with .

We define the within-class scatter matrix of the kth class as

which is also the covariance matrix for the kth class, and is the class mean of the kth class . The between-class scatter matrix of the ith and jth classes is defined as

Unlike LDA which seeks to minimize the average within-class pairwise distance, the within-class scatter measure used in WLDA is defined as

which is the maximum of the within-class pairwise distances over all classes.

On the other hand, the between-class scatter measure used in WLDA is defined as

uses the minimum of the pairwise distances between class means, instead of the average of distances between each class mean and the sample mean employed in LDA.

The optimality criterion of WLDA is defined as to maximize the ratio of the between-class scatter measure to the within-class scatter measure:

As stated in [5], this problem (7) is not easy to optimize with respect to W, so a new variable is introduced, and the problem (7) is formulated as a metric learning problem.

Theorem 2.1. Define sets , and . Then is the set of extreme points of , and is the convex hull of .

This theorem has been widely used and its proof can be found in [35]. According to Theorem 2.1, the orthonormal constraint on W can be relaxed to convex constraints on Z = , and the problem (7) can be relaxed to the following problem defined on a p.s.d. variable Z [5]:

Once the optimal solution is obtained, the optimal for problem (7) can be recovered using the top r eigenvectors of .

In [5], Zhang et al. proposed two methods to solve (7), as we stated in Section I. In the first one, an iterative algorithm was presented to solve the relaxed problem (8), where an SDP problem needs to be solved at each step by standard SDP solver. This method is not scalable to problems with high dimensionality or large training data points. The second one is based on a greedy approach, which cannot guarantee to find a globally optimal solution.

Hence, in the next section, we will describe our algorithm (so called SD-WLDA) of finding the transformation matrix that maximizes (8), and demonstrate how to solve it using an efficient approach.

III. A FAST SDP APPROACH TO WLDA

In this section, problem (8) is firstly reformulated into a sequence of SDP optimization problems based on bisection search. Then, a Frobenius norm regularization is introduced and the SDP problem in each step is solved through Lagrangian dual formulation. With this SD-WLDA method, the global optimum can be acquired for the relaxed problem (8). The computational complexity can be reduced as well by solving the dual problem using quasi-Newton methods, compared with solving the primal problem directly using interior-point based algorithm.

A. Problem Reformulation

By introducing an auxiliary variable , problem (8) can be rewritten as

There are two variables to be optimized in problem (9), but we are interested in finding Z that can maximize . Problem (9) is clearly non-convex with respect to and Z since the constraint (9b) is not convex. However, noting that (9b) will become linear if is given, we employ the bisection search strategy and convert the optimization problem (9) into a set of convex feasibility problems, by which the global optimum can be computed effectively.

Let denote the optimal value of (9a). Given , if the convex feasibility problem

find Z, (10a)

i < j c, c, (10b)

is feasible, then . Otherwise, if the above problem is infeasible, then .

Algorithm 1 shows the bisection search based optimization process. Once a feasible solution is obtained which maximize will be the globally optimal solution to the relaxed problem (8). The optimal for problem (7) can be acquired using the top r eigenvectors of .

B. Lagrangian Dual Formulation

Algorithm 1 shows that an SDP feasibility problem needs to be solved at each step during the bisection search process. Considering that standard interior-point SDP solvers have a computational complexity of , where d is the dimension of input data, and m is the number of constraints in SDP, it becomes quite expensive for processing high-dimensional data. In this subsection, we reformulate the feasibility problem (10) into a Frobenius norm regularized SDP problem, which can be efficiently solved via its Lagrangian dual using first-order methods like quasi-Newton. The computational complexity will be reduced to . The primal solution can then be calculated from the dual solution based on Karush-Kuhn-Tucker (KKT) [36, p. 243] conditions.

The problem (10) can be expressed equivalently in the following form:

i < j c, c, (11b) r, (11c)

In the above formulation, the variable is replaced by X =

constraints (11b) and (11c) correspond to (10b) and (10c), respectively. The constraints (11d) stem from the fact that .

Proposition 3.1. Given , if the problem (10) and equivalently (11) is feasible, one feasible solution exists for the following semidefinite least-squares problem:

If the problem (12) is feasible, its optimal solution can be used as a feasible solution to (11), and one solution to problem (10) can be acquired as well.

The problem (12) is a standard semidifinite least-square problem and can be solved readily by off-the-shelf SDP solvers. However, as we mentioned before, the computational complexity is really high if we solve the primal problem directly by standard interior-point SDP solvers. It will greatly hamper the use of WLDA in large-scale problems. Thanks to the Frobenius norm regularization in the objective function of (12), we can use Lagrangian dual approach to solve the problem easily.

Introducing the Lagrangian multipliers corresponding to the constraints (12b)-(12d), and a symmetric matrix corresponding to the p.s.d. constraint (12e), the following result can be acquired.

Proposition 3.2. The Lagrangian dual problem of (12) can be simplified in the following form:

where

Furthermore, the optimal solution to problem (12) is , where , which is calculated based on the optimal dual variables , and .

From the definition of and the operator is forced to be p.s.d. and block-diagonal, so the optimal solution to problem (10) can be acquired easily. In addition, it is noticed that the objective function of (13) is differentiable (but not necessarily to be twice differentiable), it allows us to solve the dual problem efficiently using first-order methods, such as quasi-Newton methods.

is g(u, v, p) =

where

C. Feasibility Condition

As stated in Proposition 3.1, if (10) is feasible, a solution can be found by solving the problem (12). During running quasi-Newton algorithms to solve the problem (12), an infeasibility condition of the problem (10) needs to be checked iteratively, which is presented here:

Proposition 3.3. If the following conditions are satisfied

then the problem (10) is considered as infeasible.

This infeasibility condition can be deduced from a general conic feasibility problem presented in [37]. Explanations are presented in detail in the appendix. We check this condition at each iteration of quasi-Newton algorithms. is evaluated during calculating the dual objective function, so it will not bring extra computational cost. Once the condition (15) is satisfied, problem (10) (equivalently (11)) is not feasible and quasi-Newton algorithms will be stopped. Otherwise, quasiNewton algorithms keep running until convergence, and then a feasible solution to the problem (11) is found.

D. Solving the Feasibility Problem

In this subsection, we summarize the procedure of solving the problem (10) by our fast SDP optimization algorithm. It has been domenstrated that we can find the feasible solution to (10) by solving the dual problem (13) with quasi-Newton

methods. In this work, we use L-BFGS-B [38], [39], a limitedmemory quasi-Newton algorithm package, which can handle the problem with box constraints. Here we only need to provide the callback function to L-BFGS-B, which calculates the objective function of (13) and its gradient. The procedure of finding the feasible solution Z is described in Algorithm 2.

E. Computational Complexity Analysis

The computational complexity of L-BFGS-B is O(Km), where K is a moderate number between 3 to 20, m =

is the problem size to be solved by L-BFGS-B, which is equal to the number of constraints in the primal SDP problem (12). At each iteration of L-BFGS-B, the eigen-decomposition of a matrix is carried out to compute , which is used to evaluate all the dual objective values, gradients, as well as the feasibility conditions (15). The computational complexity is . Hence, the overall computational complexity of our algorithm SD-WLDA is . Since , eigen-decomposition dominates the most computational time of SD-WLDA, which is . On the other hand, solving an SDP problem using standard interior-point methods needs a computational complexity of . Since m > d in our case, our algorithm is much faster than interior-point methods, and can be used to large-scale problems.

IV. EXPERIMENTS

In this section, experiments are performed to verify the performance of SD-WLDA. We conduct comparisons between SD-WLDA and other methods on both classification performance and computational complexity. The classification performance is contrasted between SD-WLDA and LDA, LMNN, OLDA. We also compare the performance of our SDWLDA with both optimization methods proposed by Zhang et al. in [5] (Zhang et al. (SDP) and Zhang et al. (SOCP) receptively). This can be used to verify the correctness of our algorithm. The computational complexity is compared between SD-WLDA, standard interior-point algorithms to solve our SDP formulation (SDPT3 [40] and SeDuMi [41]), Zhang et al. (SDP) which uses SDPT3 as well, and Zhang et al. (SOCP).

All algorithms are tested on a 2.7GHz Intel CPU with 20G memory. The SD-WLDA algorithm is implemented in Matlab, where the Fortran code of L-BFGS-B is employed to solve the dual problem (13). The Matlab routine “eig” is used to compute eigen-decomposition. The tolerance setting of L-BFGS-B is set to default. The tolerance in Algorithm 1 is set to , and the parameter in the feasibility condition (15) is set to .

A. Experiments on UCI Datasets

Some UCI datasets [42] are used here firstly. We perform 30 random splits for each dataset, with 70% as training samples and 30% as test samples. The classification performance is evaluated based on 5 nearest neighbour (5-NN) classifier. For fair comparison with LDA, the final dimension is set to .

The experimental results are presented in Table I, where the baseline results are obtained by applying 5-NN classifier on the original feature space directly. For each dataset, the experiment runs 30 times, and the error rate is reported by the mean error as well as the standard deviation. The smallest classification error is shown in bold. The results illustrate that WLDA gives smaller classification error rates compared to other algorithms in most datasets. The classification results by our fast SDP solving algorithm SD-WLDA and Zhang et al. (SDP) are quite similar, with small difference coming from numerical error during computation. The error rates calculated by Zhang et al. (SOCP) are sometimes quite different from that by Zhang et al. (SDP) and SD-WLDA, e.g., on “Heart” and “Waveform” datasets, which results from different relaxation methods and optimization procedures employed.

In terms of computational speed, SD-WLDA approach is much faster than other methods. Zhang et al. (SDP), which is also solved by standard interior-point algorithm SDPT3, is faster than Ours (SDPT3), because of different SDP problem formulations. The merit of SD-WLDA on computation is even more dramatic for high dimensional problems. For example, we compare the computational speeds of SD-WLDA and SDPT3 on the datasets of “Iris” and “Waveform”, which have the same number of classes. SD-WLDA is about 5 times faster than Zhang et al. (SDP), and 20 times faster than Ours (SDPT3) on “Iris” which has 105 training samples with the input dimension as 4, whereas it becomes 12 times quicker than Zhang et al. (SDP), and 300 times quicker than Ours (SDPT3) on “Waveform” which has 3500 training samples with the input dimension as 40. The computational time increases more significantly for SeDuMi with respect to input dimension. Zhang et al. (SOCP) has no computational advantage on solving problems with few training samples and low dimensionality. The computational superiority appears when dimensionality increases, referring to the results on “Waveform” dataset, which is quicker than Zhang et al. (SDP). Because of the column-wise iteration solving method of Zhang et al. (SOCP), the computational complexity of Zhang et al. (SOCP) relates closely with the final dimension. That is why Zhang et al. (SOCP) is quite

TABLE I TEST ERRORS AND COMPUTATIONAL TIME OF DIFFERENT METHODS ON UCI DATASETS WITH CLASSIFIER. THE TEST ERROR IS THE AVERAGE OVER RANDOM SPLITS, WITH STANDARD DEVIATION SHOWN IN THE BRACKET. THE COMPUTATIONAL TIME IS ALSO THE AVERAGE OVER SD-WLDA IS EFFICIENT IN COMPUTATION, AND GIVES COMPARABLE CLASSIFICATION PERFORMANCE COMPARED TO OTHER METHODS.

fast on “Sonar” and “Ionosphere” datasets, which set the final dimension to 1. However, it still slower than SD-WLDA.

To prove the robust classification performance of SDWLDA, we change the ratio of training samples from 20% to 80% on datasets “Sonar” and “Ionosphere”. For each value of , we calculate the average test error as well as the standard deviation across 10 trials by SD-WLDA and LDA respectively. The results in Fig. 1 demonstrate that SD-WLDA is more superior than LDA when there is small number of training samples. This phenomenon illustrates that WLDA alleviates the dependence of classification performance on large number of training samples.

LDA requires the data to map to at most dimension, while SD-WLDA, which is based on an SDP optimization method, does not have such a restriction. Here we perform another experiment by SD-WLDA on “Heart” dataset, with different final dimensions. The result in Table II shows that is not the best final dimension for “Heart”. So SDWLDA algorithm is more flexible.

B. Experiments on Face, Object and Letter Datasets

As shown before, SD-WLDA algorithm can be used to solve large-scale problems due to its efficiency on computation. In this section, experiments are carried out on face, object and letter image datasets, which have high input dimension and more classes. The images are resized to different resolution before experiments, as shown in Table III. PCA has been applied beforehand to reduce the original dimension and also to remove noises. The final dimension is still set to be for fair comparison to LDA.

1) Face recognition: four face databases are employed here. ORL [43] consists of 400 face images of 40 individuals, each with 10 images. We randomly choose 70% of the samples for training and the remaining 30% for test. The Yale dataset [44] contains 165 grey-scale images of 15 individuals, 11 images per subject. We split them into training and test sets by 7/3 sampling as well. PIE dataset (Pose C29) [45] has 40 subjects, and 24 images for each individual. 80% of the samples are chosen randomly for training. UMist dataset [46]

Fig. 1. Test errors of SD-WLDA and LDA on “Sonar” and “Ionosphere” datasets, with different sizes of the training set. The test error is shown by marker, which is the average error of 10 trails for each split. The vertical bar represents the standard deviation. SD-WLDA produces significantly smaller errors than LDA with less number of training samples.

contains 564 grayscale images of 20 different people. We only use 30% of the samples for training to test the performance of SD-WLDA.

Experimental results in Table III show that SD-WLDA gives better classification performance for all datasets. The classification error rates by SD-WLDA and Zhang et al. (SDP) are identical with each other, as PCA used before has already removed the noises, which proves the correctness of our algorithm. However, SD-WLDA is much faster than Zhang et

TABLE II IN CONTRAST TO LDA, SD-WLDA CAN BE USED TO PROJECT DATA TO FINAL DIMENSION LARGER THAN HE TEST ERROR IS BY SD-WLDA WITH CLASSIFIER ON “HEART” DATASET USING DIFFERENT FINAL DIMENSION. THE TEST ERROR IS THE AVERAGE ERROR OF STANDARD DEVIATION IN THE BRACKET.

al. (SDP) method. For example, SD-WLDA runs almost 14 times faster than Zhang et al. (SDP) on Yale dataset. The error rates calculated by Zhang et al. (SOCP) are rather larger, which result from the non-globally optimal solution the algorithm reached. The computational superiority of Zhang et al. (SOCP) does not show up as well.

In order to illustrate the computational speed of SD-WLDA and both methods in [5] with respect to the number of classes and the input data dimension (here it refers to the dimension after PCA) respectively, more experiments are performed on Yale dataset. Firstly, we set the dimension after PCA to be 50, and change the number of classes from 9 to 15. Experimental results in Fig. 2(1) demonstrate that compared with the other methods, the speed of SD-WLDA is less sensitive to the increase of the amount of classes. Since the final dimension r is set to be , the computational complexity of Zhang et al. (SOCP) jumps up obviously with the increase of classes. Secondly, we use all classes, and let the dimension after PCA change from 30 to 115. Experimental results in Fig. 2(2) show that the computational cost of SD-WLDA rises up pretty slower with the growth of input dimension, in contrast to Zhang et al. (SDP). Zhang et al. (SOCP) becomes faster than Zhang et al. (SDP) when input dimension is larger than 90. This phenomenon certifies that Zhang et al. (SOCP) is more suitable for processing high dimensional datasets than Zhang et al. (SDP), as [5] presented. However, this method cannot guarantee to find a global optimal solution. Finally, we test on Yale dataset with an even high input dimension as 800. Experimental results in Table IV demonstrate that SD-WLDA is absolutely faster than Zhang et al. (SDP), both of which lead to similar classification error rates. Although Zhang et al. (SOCP) is comparable with SD-WLDA on computational time, the error rate it obtained is relatively bigger.

In addition, to show the superiority of SD-WLDA on clas-sification, another experiment is conducted using SD-WLDA and LDA on Yale dataset, with the dimension after PCA as 50 and 15 classes. We reduce the final dimension from to 1. The test results shown in Fig. 3 demonstrate that SDWLDA always gives lower test error than LDA, which further proves the good classification performance of SD-WLDA.

2) Object recognition: Three datasets are used here: Coil20 [47], Coil30 [48], and ALOI [49]. Coil20 dataset contains 1440 grey-scale images with black background for 20 objects, with each containing 72 different images. Coil30 dataset consists of 750 RGB images for 100 objects. We choose the first 30 objects and convert them into greyscale images in our experiment. ALOI dataset consists of 1000 objects taken at varied viewing angles, illumination angles, etc.. We use the first 25 objects here, with 24 images for every object. Different training and test splitting ratios are

Fig. 2. Computational time of different SDP optimization methods with the increase of number of classes and input dimensionality, respectively. The computational time is the average time of 10 trials for each setting. SDWLDA is more efficient than Zhang et al. (SDP) and Zhang et al. (SOCP) on computation. The computational complexity of SD-WLDA is less sensitive to the increase of number of classes and input dimensionality, compared to Zhang et al. (SOCP) and Zhang et al. (SDP) respectively.

adopted for different datasets in order to test the performance under different situations, which can be found in Table III. The experimental results demonstrate that SD-WLDA is better in computational speed. Although OLDA produces the smallest classification error on Coil20 (2.91%), SD-WLDA gives a comparable result (3.14%).

3) Letter recognition: The Binary Alphadigits dateset [50] is employed here. The dataset BA1 contains 10 digits of 0 to 9, and BA2 contains 15 capital letters A through O. Experimental results in Table III present that WLDA produces better classification performance on both databases. Zhang et al. (SDP) and SD-WLDA give similar classification results, while Zhang et al. (SOCP) lead to much larger error rates. In terms of computational speed, SD-WLDA runs 5 times faster than Zhang et al. (SDP) on BA1, and more than 9 times faster than Zhang et al. (SDP) on BA2, which has more training samples and number of classes. This experiment demonstrates again that SD-WLDA is more efficient in processing large-scale datasets.

TABLE III TEST ERRORS AND COMPUTATIONAL TIME OF DIFFERENT METHODS ON LAGER-SCALE DATASETS WITH CLASSIFIER. PCA IS APPLIED FIRSTLY. THE TEST ERROR IS THE AVERAGE ERROR WITH STANDARD DEVIATION IN THE BRACKET. THE COMPUTATIONAL TIME IS THE AVERAGE TIME TOO. SD-WLDA GIVES BETTER CLASSIFICATION PERFORMANCE AND FASTER COMPUTATIONAL SPEED THAN OTHER ALGORITHMS.

TABLE IV CLASSIFICATION PERFORMANCE AND COMPUTATIONAL SPEED OF DIFFERENT METHODS ON YALE DATASET, WITH HIGH INPUT DIMENSION. (TRAINING SAMPLES, EST SAMPLES, IMENSION AFTER PCA IS IMENSION IS PRESENTS FASTER COMPUTATIONAL SPEED THAN ZHANG et al. (SDP) AND BETTER CLASSIFICATION PERFORMANCE THAN ZHANG et al. (SOCP).

TABLE V TEST ERRORS OF SD-WLDA AND LDA ON ORL DATASET WITH CLASSIFIER USING DIFFERENT NUMBER OF CLASSES (RAINING SAMPLES, EST SAMPLES, DIMENSION AFTER PCA IS IMENSION IS PRESENTS BETTER CLASSIFICATION PERFORMANCE ESPECIALLY FOR LARGE NUMBER OF CLASSES.

TABLE VI TEST ERRORS OF SD-WLDA AND LDA ON COIL20 DATASET WITH CLASSIFIER, USING DIFFERENT NUMBER OF CLASSES (SAMPLES, EST SAMPLES, DIMENSION AFTER PCA IS IMENSION IS PRESENTS BETTER CLASSIFICATION PERFORMANCE ESPECIALLY FOR LARGE NUMBER OF CLASSES.

4) Classification performance regarding to number of classes: It has been validated that SD-WLDA can give smaller classification errors than LDA using a small size of training set. In this section, we evaluate the classification performance of SD-WLDA and LDA with respect to the number of classes c. Take the datasets ORL and Coil20 as examples. The experimental settings for each dataset are shown in the captions of Table V and VI respectively. We choose different numbers of classes from each dataset, and compare the test errors of SDWLDA and LDA. The results in Tables V and VI demonstrate that when the number of classes is small, SD-WLDA and LDA have comparable classification results, whereas when the number of classes increases, SD-WLDA shows better classification performance.

V. CONCLUSION

In this work, an efficient SDP optimization algorithm has been proposed to solve the problem of worst-case linear discriminant analysis. WLDA takes into account the worst-case pairwise distance between and within classes, which achieves better classification performance than conventional LDA. In order to reduce the computational complexity so that it can be applied to large-scale problems, a fast algorithm has been presented by introducing the Frobenius norm regularization, and its Lagrangian dual can be simplified. Using our algorithm, the global optimum can be obtained in time. The algorithm is simple to implement and much faster than conventional SDP solvers. Experimental results on some UCI databases as well as face and object recognition tasks show the effectiveness on classification performance and the efficiency on computation of SD-WLDA.

Fig. 3. Test errors of SD-WLDA and LDA on Yale dataset with 5-NN classifier using different final dimension (120 training samples, 45 test samples, 15 classes, and dimension after PCA is 50). Test error is expressed by marker, which is the average error of 10 trials. The vertical bar represents the standard deviation. SD-WLDA always produces lower test errors than LDA in this experiment.

REFERENCES

[1] N. Shental, T. Hertz, D. Weinshall, and M. Pavel, “Adjustment learning and relevant component analysis,” in Proc. Eur. Conf. Comp. Vis., 2002, pp. 776–790.

[2] K. Q. Weinberger, J. Blitzer, and L. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Proc. Adv. Neural Inf. Process. Syst., vol. 18, 2006, pp. 1473–1481.

[3] L.-F. Chen, H.-Y. M. Liao, M.-T. Ko, J.-C. Lin, and G.-J. Yu, “A new lda-based face recognition system which can solve the small sample size problem,” Pattern Recogn., vol. 33, no. 10, pp. 1713–1726, 2000.

[4] J. Ye and B. Yu, “Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems,” J. Mach. Learn. Res., vol. 6, no. 4, pp. 483–502, 2005.

[5] Y. Zhang and D.-Y. Yeung, “Worst-case linear discriminant analysis,” in Proc. Adv. Neural Inf. Process. Syst., vol. 23, 2010, pp. 2568–2576.

[6] Q. Gao, J. Liu, J. Zhang, J. Hou, and X. Yang, “Enhanced fisher discriminant criterion for image recognition,” Pattern Recogn., vol. 45, no. 10, pp. 3717–3724, 2012.

[7] D. Tao, X. Li, X. Wu, and S. J. Maybank, “Geometric mean for subspace selection,” IEEE Trans. Anal. Mach. Intell., vol. 31, no. 2, pp. 260–274, 2009.

[8] W. Bian and D. Tao, “Harmonic mean for subspace selection,” in in Proceedings of the 19th International Conference on Pattern Recognition, 2008, pp. 1–4.

[9] ——, “Max-min distance analysis by using sequential sdp relaxation for dimension reduction,” IEEE Trans. Anal. Mach. Intell., vol. 33, no. 5, pp. 1037–1050, 2011.

[10] ——, “Asymptotic generalization bound of fisher’s linear discriminant analysis,” CoRR, vol. abs/1208.3030, 2012. [Online]. Available: http://arxiv.org/abs/1208.3030

[11] C. Shen, J. Kim, and L. Wang, “A scalable dual approach to semidefinite metric learning,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2011, pp. 2601–2608.

[12] J. Ye and T. Xiong, “Computational and theoretical analysis of null space and orthogonal linear discriminant analysis,” J. Machine Learning Research, vol. 7, pp. 1183–1204, 2006.

[13] D. Tao, X. Li, X. Wu, and S. J. Maybank, “General averaged divergence analysis,” in Proc. of IEEE International Conference on Data Mining, 2007, pp. 302–311.

[14] Y. Jia, F. Nie, and C. Zhang, “Trace ratio problem revisited,” IEEE Trans. Neural Netw., vol. 20, no. 4, pp. 729–735, 2009.

[15] Y. Guo, S. Li, J. Yang, T. Shu, and L. Wu, “A generalized foley- sammon transform based on generalized fisher discriminant criterion and its application to face recognition,” Pattern Recogn. Lett., vol. 24, no. 1-3, pp. 147–158, 2003.

[16] H. Wang, S. Yan, D. Xu, X. Tang, and T. Huang, “Trace ratio vs. ratio trace for dimensionality reduction,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2007, pp. 1–8.

[17] H. Shen, K.Diepold, and K. Hueper, “A geometric revisit to the trace quotient problem,” in Proc. of 19th International Symposium of Mathematical Theory of Networks and Systems, 2010.

[18] E. Kokiopoulou, J. Chen, and Y. Saad, “Trace optimization and eigen- problems in dimension reduction methods,” Numerical Linear Algebra with Applications, vol. 18, no. 3, pp. 565–602, 2011.

[19] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, “Distance metric learning with application to clustering with side-information,” in Proc. Adv. Neural Inf. Process. Syst., vol. 15, 2003, pp. 521–528.

[20] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbour classification,” J. Mach. Learn. Res., vol. 10, pp. 207–244, 2009.

[21] C. Shen, H. Li, and M. J. Brooks, “Supervised dimensionality reduction via sequential semidefinite programming,” Pattern Recogn., vol. 41, pp. 3644–3652, 2008.

[22] P. Wang, C. Shen, and A. van den Hengel, “A fast semidefinite approach to solving binary quadratic problems,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013, pp. 1312–1319.

[23] S. Boyd and L. Xiao, “Least-squares covariance matrix adjustment,” SIAM J. Matrix Anal. Appl., vol. 27, no. 2, pp. 532–546, 2005.

[24] J. Malick, “A dual approach to semidefinite least-squares problems,” SIAM J. Matrix Anal. Appl., vol. 26, no. 1, pp. 272–284, 2004.

[25] X.-Y. Zhao, D. Sun, and K.-C. Toh, “A newton-cg augmented lagrangian method for semidefinite programming,” SIAM J. Optim., vol. 20, no. 4, pp. 1737–1765, 2010.

[26] Y. Gao and D. Sun, “Calibrating least squares covariance matrix prob- lems with equality and inequality constraints,” SIAM J. Matrix Anal. Appl., vol. 31, pp. 1432–1457, 2009.

[27] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM J. Optim., vol. 19, no. 4, pp. 1574–1609, 2009.

[28] Z. Wen, D. Goldfarb, and W. Yin, “Alternating direction augmented lagrangian methods for semidefinite programming,” Mathematical Programming Computation, vol. 2, no. 3-4, pp. 203–230, 2010.

[29] H. Ouyang, N. He, L. Tran, and A. Gray, “Stochastic alternating direction method of multipliers,” in Proc. Int. Conf. Mach. Learn., 2013, pp. 80–88.

[30] J. Chen, T. Yang, and S. Zhu, “Efficient low-rank stochastic gradient descent methods for solving semidefinite programs,” in Proc. Int. Workshop Artificial Intell. & Statistics, 2014, pp. 122–130.

[31] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng, “Online and batch learning of pseudo-metrics,” in Proc. Int. Conf. Mach. Learn., 2004, pp. 743–750.

[32] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information- theoretic metric learning,” in Proc. Int. Conf. Mach. Learn., 2007, pp. 209–216.

[33] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online metric learning and fast similarity search,” in Proc. Adv. Neural Inf. Process. Syst., vol. 8, 2008, pp. 761–768.

[34] C. Shen, J. Kim, L. Wang, and A. van den Hengel, “Positive semidefinite metric learning using boosting-like algorithms,” J. Machine Learning Research, vol. 9, no. 1, pp. 1007–1036, 2012.

[35] M. L. Overton and R. S. Womersley, “On the sum of the largest eigenvalues of a symmetric matrix,” SIAM J. Matrix Anal. Appl., vol. 13, no. 1, pp. 41–45, 1992.

[36] S. P. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.

[37] D. Henrion and J. Malick, “Projection methods for conic feasibility problems, applications to polynomial sum-of-squares decompositions,” Optim. Methods and Softw., vol. 26, no. 1, pp. 23–46, 2009.

[38] C. Zhu, R. Byrd, P. Lu, and J. Nocedal, “Algorithm 778: L-BFGS- B: Fortran subroutines for large-scale bound-constrained optimization,” ACM Transaction on Mathematical Software, vol. 23, pp. 550–560, 1997.

[39] J. L. Morales and J. Nocedal, “Remark on algorithm 778: L-bfgs- b: Fortran subroutines for large-scale bound constrained optimization,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, p. 7, 2011.

[40] K. C. Toh, M. Todd, and R. H. Ttnc, “SDPT3—a MATLAB software package for semidefinite programming,” Optimizat. Methods & Softw., vol. 11, pp. 545–581, 1999.

[41] J. F. Sturm, “Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones,” Optimizat. Methods & Softw., vol. 11, pp. 625– 653, 1999.

[42] A. Frank and A. Asuncion. (2010) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. [Online]. Available: http://archive.ics.uci.edu/ml

[43] F. S. Samaria and A. C. Harter, “Parameterisation of a stochastic model for human face identification,” in Proc. of 2nd IEEE workshop on Applications of Computer Vision, 1994, pp. 138–142.

[44] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE Trans. Anal. Mach. Intell., vol. 19, no. 7, pp. 711–720, 1997.

[45] D. Cai, X. He, Y. Hu, J. Han, and T. Huang. Popular face data sets in matlab format. [Online]. Available: http://www.cad.zju.edu.cn/home/ dengcai/Data/FaceData.html

[46] D. B. Graham and N. M. Allinson, “Characterizing virtual eigensigna- tures for general purpose face recognition,” Face Recognition: From Theory to Applications; NATO ASI Series F, Computer and Systems Sciences, vol. 163, pp. 446–456, 1998.

[47] S. A. Nene, S. K. Nayar, and H. Murase, “Columbia object image library (coil-20),” in Technical Report CUCS-005-96, Feb 1996.

[48] ——, “Columbia object image library (coil-100),” in Technical Report CUCS-006-96, Feb 1996.

[49] J. Geusebroek, G. Burghouts, and A. Smeulders, “The amsterdam library of object images,” Int’l J. Computer Vision, vol. 61, no. 1, pp. 103–112, 2005.

[50] S.Roweis. Data for MATLAB hankers: Handwritten digits. University of Toronto, Department of Computer Science. [Online]. Available: http://www.cs.nyu.edu/roweis/data.html

A.6. PROOF OF THE PROPOSITION 3.2 This section presents the derivation of Proposition 3.2.

Proof: With the Lagrangian dual multipliers u, v, p and Y, the Lagrangian function of the primal problem (12) can be written as

with and defined as (14).

Based on KKT conditions for convex problems, we have at the , where , and stand for primal and dual optimal variables, respectively. Then the relationship between primal and dual optimal variables is:

Substituting the expression for X back into the Lagrangian (A16), the dual problem is formulated as

Given fixed u, v, p, problem (A18) can be simplified to

which is equivalent to projecting the matrix to the positive semidefinite cone. The closed-form solution to (A19) is:

Substituting this solution Y back into the dual problem (A18), the simplified dual problem can be expressed as (13) presented.

Once the optimal solutions and are obtained by solving (13), we can obtain the primal optimal variable based on (A17) and (A20):

A.7. EXPLANATION ON THE FEASIBILITY CONDITION In this part, we will briefly review a feasibility condition to a conic feasibility problem described in [37], and then extend it to our feasibility problem (10). Consider a conic feasibility problem of finding a point such that

where and are given. Ax = b defines an affine subspace A, and is a convex cone.

Defining the polar cone of K as the set of points whose projection into K is 0, i.e.,

Henrion and Malick [37] proposed the following proposition.

Proposition 7.1. If there exists a point such that the following conditions

are satisfied, there would be no feasible solution to (A22).

Based on this proposition, we can get a feasible condition to our problem (10). In our problem, K is the set of positive semidefinite matrices, and the polar cone of K is the set of negative semidefinite matrices. Then the feasibility problem

gives the certificate of infeasibility of problem (11) which is equivalent to (10). That is to say, if there is a feasible solution to (A25), there is no feasible solution to (10). We formalize this result in the following remark.

Proposition 7.2. (i) If problem (11) is feasible, then (A25) is infeasible;

(ii) If there exists u, v and p such that , then there is no feasibility solution to (11).

Proof: (i) Suppose that there is a feasible solution u, v and p to (A25), and take a feasible point of the problem (11). implies that , i.e.,

Observing that X satisfying the constraints in (11), the above inequality (A26) is equivalent to

Since , and , the above inequality (A27) cannot hold at all, which means there is no feasible solution to (A25).

(ii) is equivalent to . Combining with the condition , (A25) is feasible. Therefore problem (11) would be infeasible according to the contrapositive of (i).

is also equivalent to . Due to numerical reasons, the latter is adopted in the feasibility condition (15).

This section presents the experimental results of data visualization. The input data is projected to two dimensional subspace using the linear transformation matrix learned by PCA, LDA, OLDA and the proposed algorithm SD-WLDA. The data distributions after projection are shown in Fig. A4. Several datasets are evaluated: Yale face dataset [44] with 5 classes (5th to 9th) adopted, ALOI object dataset [49] with 5 classes (10th to 14th) used, Coil20 object dataset [47] with 10 classes (1st to 10th) employed, and the Binary Alphadigits dateset [50] with images of digits 3, 4 and 5 used. As shown in Fig. A4, SD-WLDA separates data better than PCA, LDA and OLDA on those datasets. PCA (unsupervised) preserves directions with the largest variance but much of the discriminant information is lost. LDA (supervised) considers the scatter measure in the average view, so the poor separations of two classes are probably to be concealed by other good separations. For example, in Fig. A4 (6), LDA separates most of classes, but fails to separate one pair. Because some classes are separated far from each other, the LDA criterion cannot demonstrate the fact that one pair of classes have not been separated yet. OLDA solved the nonsingularity limitation of scatter matrices in LDA, however, the scatter measures are still from the average viewpoint. Alternatively, SD-WLDA tries to separate data from the worst-case viewpoint, so the separation of every class-pair is taken into account.

Fig. A4. Visualization of dimensionality reduction results of PCA, LDA, OLDA and SD-WLDA applied to (from left) the Yale, ALOI, Coil20 and BA1 datasets. The feature dimensions are reduced from 50, 100, 50 and 80 (which have been processed by PCA in advance) respectively to 2. The figures show explicitly that the best classification results are achieved by SD-WLDA.

designed for accessibility and to further open science