Fisher’s linear discriminant analysis (FLDA) [1] [2] is one of the most representative dimension reduction techniques in statistical pattern recognition . By projecting examples into a low dimensional subspace with maximum discrimination power, FLDA helps improve the accuracy
The authors are with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia. E-mail: wei.bian@student.uts.edu.au and dacheng.tao@uts.edu.au
and the robustness of a decision system [3] [4] [5] [6]. During the past decades, FLDA has been applied to a wide range of areas, from speech/music classification [7] [8], face recognition [9] [10] to financial data analysis [11] [12].
An important property of FLDA is its asymptotic Bayes optimality under the homoscedastic Gaussian assumption [13] [14] [15] , which is a corollary of classical results from multivariate statistics [16]. Actually, as training sample size N goes to infinity, both the within-class scatter matrix (sample covariance) and the between-class scatter matrix
converge to their population counterparts
. Therefore, the empirically optimal projection matrix
of FLDA, obtained by generalized eigendecomposition over
, also converges to its population counterpart W. Thanks to the asymptotic Bayes optimality, we can expect an acceptable performance of FLDA as long as N is sufficiently large. However, this classical result, i.e., the asymptotic Bayes optimality, suffers from two major limitations:
1) It is obtained by fixing the dimensionality D and letting only N increase to infinity.
2) It does not provide quantitative description on the performance of FLDA, especially, how
A. The Contribution of this Paper
To address aforementioned limitations of the classical result, in this paper, we present an asymptotic generalization analysis of FLDA. Our analysis is superior from two aspects. First, we modify the setting of analysis by allowing both D and N to increase and assuming the dimensionality to training sample size ratio has a limit in [0, 1). This makes our result applicable in the case where D and N are proportionally large. Second, we quantitatively examine the generalization ability of FLDA. Denoting by
the generalization discrimination power of FLDA, we intend to bound it from the lower side in terms of D and N, with respect to the population discrimination power
. Taking a binary-class problem, for example: suppose
, then our asymptotic generalization bound shows that
is almost surely larger than
under mild conditions. Further, as a corollary of the discrimination power bound, we also obtain an asymptotic generalization error bound for binary classification with FLDA.
Based on the obtained asymptotic generalization bound, we can get better insight of FLDA. It is commonly known that the performance of covariance estimation has a severe influence to the generalization ability of FLDA. By assuming a sufficient population discrimination power so as to eliminate the influence from between-class matrix estimation, we show that the mere influence from covariance estimation is proportional to the ratio , i.e., due to the imperfection of covariance estimation,
worth noticing that such result holds independent of the covariance
. Besides, the bound shows that the performance of FLDA is substantially determined by the ratio
, given a fixed population discrimination power
. Therefore, N only needs to scale linearly with respect to D for an acceptable generalization ability of FLDA, although a quadratic number of parameters are to be estimated in the sample covariance.
B. Tools
The technical tools used in our asymptotic generalization analysis are from random matrix theory (RMT) [17] [18] [19] [20], the main goal of which is to provide understanding of the statistics of eigenvalues of matrices with entries drawn randomly from various probability distributions. RMT was originally motivated by applications in nuclear physics in 1950’s, and then it was intensively studied in mathematics and statistics. It also found successful applications in engineering fields, e.g., wireless communications [21], recently. In this paper, we make use of two important results from RMT. The first one is the Marˇcenko-Pastur Law [19], which states that the empirical spectral distribution of a Wishart random matrix converges almost surely to a deterministic distribution . The second one is the almost sure convergence of the extreme singular values of a large Gaussian random matrix [20]. We formulate these two results in following propositions.
Proposition 1: Given , whose entries are independently sampled from standard Gaussian distribution N (0, 1), then as both
empirical distribution of the eigenvalues of
converges almost surely to a deterministic limit distribution with density
where
Proposition 2: Letting with i.i.d. entries sampled from
and
C. Notations
Throughout this paper, we will use the following notations. Bold lower case letter a denotes a vector. Bold upper case letter A denotes a matrix. -dimensional vector space.
denotes the set of all
denotes the i-th diagonal entry of a symmetric matrix
denotes the i-th column of
denotes the matrix composed by the first c columns of
denotes the D-dimensional unit sphere located on the original point.
denotes the set of all D by D positive definite matrices.
denotes the
of
are the extreme singular values of
denotes the operator norm of
denotes the i-th eigenvalue of A, sorted in a descent order.
denotes the diagonal matrix composed of the eigenvalues of A, with the eigenvalues sorted in a descent order. R(A) denotes an orthogonal basis of the range or the column space of A.
is the canonical basis of
A. Bounding Generalization Discrimination Power
Suppose we have c + 1 classes, represented by homoscedastic Gaussian distributions in a high-dimensional space , with class means
common covariance matrix
. Assuming the classes have equal prior probability
the following matrix S, which is referred to as the between-class scatter matrix, gives a measure
of class separation,
Suppose the eigendecomposition of has (at most) c nonzero eigenvalues
and associated eigenvectors
. FLDA uses W as a projection matrix to obtain a low-dimensional data representation, and according to Fisher’s criterion, the discrimination
power in the dimension reduced space is given by [22]
In practice, we do not have access to population parameters , but their estimates, i.e., the sample covariance
and the sample between-class scatter matrix
via sample class means
. Denoting by
the empirical projection matrix obtained from generalized eigendecompo-
sition of , the generalization discrimination power of FLDA is given by
which measures how the classes are separated in the dimension reduced space. When data dimensionality D is fixed and training sample size N goes to infinity, the generalization discrimination power (8) will converge to its population counterpart (7), since converges to W. However, such classical result is invalid when D increases proportionally with N. Regarding this, the following theorem gives a new asymptotic result on FLDA’s generalization ability, in a setting where D and N increase to infinity proportionally.
Theorem 1: Suppose the population discrimination power
eralization discrimination power can be factorized as
where . Further, as both the dimensionality D and the training sample size N increase
(, it holds asymptotically
Theorem 1 gives an asymptotically lower bound on the generalization ability of FLDA, in terms of the population discrimination power and the dimensionality to training sample size ratio
. An important feature of the bound is that it is determined by the ratio
rather than the dimensionality D. In other words, a good generalization performance of FLDA only requires a training sample size that scales linearly with respect the dimensionality, although there are a quadratic number of parameters to be estimated in the sample covariance. Figure 1 (a) gives an illustration of the bound under different values of the ratio
Besides, according to (10), the influence of the ratio to the lower bound comes from
two aspects, each through the termand the term
. Note that
allows a tradeoff between
, i.e., when
is sufficiently large,
approaches 0 and thus vanishes from the lower bound (10). The second term
depends on
, and later proofs reveal that it measures how covariance estimation influences the generalization of FLDA. Assuming a sufficient large
which shows that the loss of discrimination power due to the imperfection of covariance estimation is approximately proportion to . To the best of our knowledge, this is the simplest quantitative result on the influence of covariance estimation to FLDA, compared with related studies in the literature [14] [23] [24]. It is worth noticing that, as long as
, the result is independent of the spectrum of the population covariance
, e.g., the extreme eigenvalues
, or the conditional number
B. Bounding Generalization Error of Binary Classification
In binary-class case, FLDA can also be regarded as a linear classifier, where the hyperplane of the linear classifier is perpendicular to the one-dimensional projection vector of dimension reduction. Without loss of generality, suppose
, the generalization error P of
Fig. 1. Asymptotic Generalization Bound of Fisher’s Linear Discriminant Analysis.
binary classification with FLDA can be calculated analytically by [25]
where is the cumulative distribution function (CDF) of the standard Gaussian. If we replace
by its population counterpart
, then (12) gives the Bayes error
Below, we present a corollary of Theorem 1, which gives an asymptotic upper bound of P in terms of
Corollary 1: For binary classification with equal prior probabilities, suppose the population discrimination power , then if both dimensionality D and training sample size N increase (
, the generalization error P of FLDA can be upper
bounded asymptotically by
where
Further since the Bayes error , it holds asymptotically
with
Similar to the discrimination power bound, Corollary 1 shows that, given a binary classification problem with Bayes error , the generalization error of FLDA is also determined by the dimensionality to training sample size ratio
. Figure 1 (b) gives an illustration of the generalization error bound under different values of
C. Related Work
In recent years, asymptotic analysis on FLDA have also been performed in the case where D > N. For example, [14] found that when D increases faster than N the the pseudo-inverse based FLDA approaches to a random guess and therefore suggested a “naive Bayes” approach in this situation. A more detailed analysis on pseudo-inverse FLDA was given in [24] by investigating the estimation error of pseudo-inverse of the sample covariance. Random matrix theory, e.g., Marˇcenko-Pastur Law, was also utilized in [24], so as to bound the expected estimation error in the asymptotic case. The result in this paper provides a complementary theory of FLDA in the setting of D < N, which shows that the generalization ability of FLDA in such situation is mainly determined by the ratio
In contrast to asymptotic analysis, generalization bounds in finite sample case were derived most recently in both linear and kernel spaces, and by using random projection as regularization if D > N [23] [26] [27]. The advantage of these results is they provide explicit probability bounds for finite N and D, while asymptotic results inherently require sufficient large N and D. However, we would like to emphasize that the bounds obtained in this paper have their own merit, by linking the generalization discrimination power (or generalization error) to the population discrimination power (or Bayes error) directly in terms of the ratio as shown by empirical evaluation in later section IV, the bounds hold with high probability (in the empirical sense) for moderate D and N, though they are obtained asymptotically.
In this section, we present the proof of Theorem 1, which are mainly based upon the asymptotic results on eigensystems of the sample covariance and the sample between-class scatter matrix.
A. On ∆(Σ, S|�W)
We begin the proof by bounding the generalization discrimination power of eigenvalues and/or eigenvectors of a normalized version of the sample covariance and sample between-class scatter matrix.
Lemma 1: Given a problem with population discrimination power
there is a nonsingular matrix X that simultaneously diagonalizes
Lemma 2: Given the normalized estimates , and their eigendecompositions
, the generalization discrimination power
can be expressed as
where
Lemma 3: Given from Lemma 2, it holds
where is a unit-length random vector uniformly distributed on the unit sphere
Lemma 2 and Lemma 3 show that the generalization discrimination power of FLDA are determined by the eigensystems of the normalized estimates is actually an estimate of the identity covariance matrix I, we have that given the population discrimination power
, the generalization ability of FLDA, i.e.,
independent of the population covariance
. Next, we present properties on the eigensymstems of
, which are necessary for evaluating the lower bound of
B. Properties of �Σ0
We have the following lemma on the eigensystem of the normalized sample covariance
Lemma 4: Given the eigendecomposition
1) are independent random variables;
2) U follows the Haar distribution, i.e., it is uniformly distributed on the set of all orthonormal
3) denoting by the empirical spectral distribution of the eigenvalues of
The first and the second statements in Lemma 4 can be understood by the fact that empirical estimate of I, whose probability density is invariant to any orthogonal transformation. The last statement is a corollary of the Marˇcenko-Pastur law, i.e., Proposition 1, which says that the empirical spectral distribution of the matrix
entries sampled from N (0, 1), converges almost surely to the deterministic distribution
Further, we need the following lemma on the inverse of the eigenvalues , which says that the energy of
projected onto a random direction is almost surely deterministic in the limit. It is worth noticing that the results in Lemma 5 generalize the results on the expectations
Lemma 5: Suppose is a unit-length random vector uniformly distributed on the unit sphere
and it is independent of
and
C. Properties of �S0
We have the following lemma on the eigenvectors of
Lemma 6: Given the eigendecomposition
where is from the population discrimination power
Recalling Lemma 1, the population counterpart of is actually the diagonal matrix
. Therefore, we expect the first c eigenvectors
to be close to
Lemma 6 shows that the performance of eigenvector estimation is determined by the
and in particular, as
approaches 0 the estimation becomes consistent.
D. Proof of Theorem 1
Now, we are ready to prove our main result Theorem 1, which is a conclusion out of the combination of Lemmas 2, 3, 5 and 6.
By Lemma 6, we have
Then the proof is completed by substituting (29) and (30) into Lemma 2 and Lemma 3.
A. On the Bound of Generalization Discrimination Power
According to Theorem 1, the generalization discrimination power of FLDA for dimension reduction can be factorized as measures the population discrimination power, and each component
of the generalization discrimination power can
be lower bounded by
We evaluate this result on both simulated and real datasets by comparing with the lower bound above.
For simulated data, we fix the ratio the settings give moderate size problems; however, due to the asymptotic characteristic of the bound, which inherently fits to large size problem, the evaluation on moderate size problems is more critical. We generate 1,000 experiments, each having 5 classes with randomly generated population covariance
and class means
. The population discrimination power
, are calculated via eigendecomposition of
is the between-class scatter matrix. For the generalization discrimination power
, the factor
has a close form
formulation as shown by Lemma 2, i.e.,
where are the eigensystems of
are the first c eigenvectors of
with
being the normalized sample covariance and between-class scatter matrix and X simultaneously diagonalizing
. Since a larger discrimination power means a better separation between classes, we expect that on most of the experiments the generalization discrimination power of FLDA can be bounded from the lower side by the generalization bound. Indeed, as shown by Figure 2, the bound holds with an overwhelming probability in the empirical sense (i.e., on more than 990 out of the 1,000 experiments).
We further evaluate the bound of generalization discrimination power on four benchmark datesets from the UCI machine learning repository [28]: 1) the image segmentation (ImageSeg)
Fig. 2. Evaluation of the Generalization Discrimination Power Bound with Simulated Data.
dataset 2, which contains 7 classes and in total 2,310 examples from ; 2) the Landsat dataset, which constants 6 classes and in total 6,435 examples from
; 3) the optical recognition of handwritten digits (Optdigits) dataset, which contains 10 classes and in total 5,620 examples from
; and 4) the USPS handwritten digits dataset, which contains 10 classes and in total 9,298 examples from
. Note that for real dataset, the population parameters
unknown. Thus, we use the entire dataset to get their estimates and treat them as population parameters. Again, we fix the ratio
, i.e., we randomly select examples twice
Fig. 3. Evaluation of the Generalization Discrimination Power Bound with Real Data.
of the dimensionality as the training data. The generalization discrimination powers over 1,000 random experiments are shown in Figure 3. On the panel for each dataset, the columns of the scatters correspond to different components of the generalization discrimination power the horizontal axis location of each column equals the population discrimination power
column number is class number minus 1). On three out of the four datasets, including LandSat, Optdigits and USPS, the generalization discrimination power is properly bounded by the lower bound, with a high probability in the empirical sense. On the ImageSeg dataset, the bound does not hold with high probability as on the other three datasets. The major reason is that the size of the problem is considerably small, with D = 18 and N = 36, while the bound favors large
Fig. 4. Evaluation of the Generalization Error Bound with Simulated Data.
or moderate size problems.
B. On the Bound of Generalization Errors
According to Corollary 1, suppose the Bayes error of a binary classification problem is
then the generalization error P of FLDA can be boudned by
where is the CDF of the standard Gaussian distribution and
To evaluate this result, we perform binary classification with FLDA on 1,000 experiments, with randomly generated covariance matrix and class means. The same as in previous simulation, we fix the ratio . Figure 4 shows the result, where the generalization error of FLDA is properly bounded by the upper bound.
In addition, we run experiments on the previous four real datasets to evaluate the generalization error bound. We randomly select class pairs from each dataset to perform binary classification. We hold out 10% data as the evaluation set, which is used to estimate the “Bayes” error and generalization error. The “Bayes” classifier is obtained by training FLDA on the rest 90% data,
Fig. 5. Evaluation of the Generalization Error Bound with Real Data.
and the empirical classifier is trained with a subset of the rest data, such that N = 2D, namely fixing the ratio . On each dataset, 1,000 random experiments are performed, with the result shown in Figure 5. Similar to the result in Figure 3, on three out of the four datasets, the generalization error can be bounded by the upper bound, while the bound does not dominate all the experiment on the ImageSeg dataset due to the small size of the problem.
This section provides detailed proofs of Lemmas in Section III and Corollary 1 in Section II.
A. Proof of Lemma 1
It is a direct result of the simultaneous diagonalization theorem for a pair of semidefinite matrices [22].
B. Proof of Lemma 2
The proof is divided into two steps.
i) Since X in Lemma 1 is nonsingular, there exists some
where contains the first
is the upper-left
submatrix of
clearly,
ii) In FLDA, are the eigenvectors of
, and we can restrict the scale of
where diagonal matrix. Substituting
into (33) and recalling
Given the eigendecomposition , we have from the first equation in (34) that there must exist some orthogonal matrix
, such that
Further, given the eigendecomposition , we get from the second equation in
(34) that
In addition, since , we can rewrite (36) as
where is the upper-left
submatrix of
. (37) implies the columns of O
must be the left singular vectors of
exist some matrix
where the nonsingularity of A is implied by the nonsingularity of
and
Therefore,
Letting
which together with (41) gives
This completes the proof.
C. Proof of Lemma 3
Recall Lemma 2 that . Denote by
the angle between vector and subspace
Two basic facts that hold for arbitrary vector and subspace A are
and
Then, by using (45) and (46), we get
Denoting is positive and decreasing on
is increasing on
is nonnegative, we have
It remains to calculate
which gives
For , as rescaling does not change the direction of a vector, we can rewrite
where
Note that is a unit-length random vector and is independent of U due to the independency
between . Then, we have
We have known, from Lemma 4, U is uniformly distributed on the set of all orthonormal matrices in is a unit-length random vector independent of
unit-length random vector uniformly distributed on the unit sphere . Finally, (53) gives
This completes the proof.
D. Proof of Lemma 4
Since is a normalized sample covariance, wherein
where is sampled from some
is the sample mean. Letting
which implies
is sampled from the standard Gaussian distribution
then can be rewritten as
One property of in (56) is that, as a random variable, its distribution is invariant to orthogonal similarity transformation, i.e.,
, have the same distribution. This is due to the fact that
corresponds to (56) in the case of replacing
has the same distribution with
, i.e., the standard Gaussian distribution N (0, I). Then, according to Theorem 3.2 in [29], the invariant property to orthogonal similarity transformation implies that the distribution of
is independent of its eigenvectors U but only depends on its eigenvalues
is a random matrix uniformly distributed on the set of all possible orthonormal matrices in
. This completes the statements 1) and 2) in Lemma 4.
Further, (56) can be rewritten as
where . For the first term
, by Proposition 1, we know that the empirical distribution of its eigenvalues converges almost surely to
density,
where
For the second term , clearly it has finite rank c + 1. According to [30], a finite rank perturbation does not effect the convergence of the empirical spectral distribution, i.e.,
. This completes the proof.
E. Proof of Lemma 5
The condition that is a unit-length random vector uniformly distributed on the unit sphere
can be replaced by
with entries independently sampled from N (0, 1/D). This is because, in the later case,
is uniformly distributed on
due to the Strong Law of Large Numbers.
For (26), we divide the proof into two steps. First, we show that and then we calculate the integral.
i) Recall
side of (26) into three terms
and
We show that all the three terms converge almost surely to zero.
For the first term (60), we have
By the same argument in the proof of Lemma 4, we know that
where , with entries independently sampled from N (0, 1). By Proposi- tion 2, we have
. Accordingly,
Then, by , (63) and (65), we have
For the second term (61), since , i.e., it is uniformly bounded, we
apply Theorem 3.4 in [21] and get
For the third term (62), since is nonzero only on
, it is sufficient to examine
Sine is bounded on
, it holds [31]
Further, sine
and
Thus,
ii) We now calculate the integral
Letting and substituting it into (73), we have
Further, letting
Letting
Denoting by the terms in the righthand side of (76), we have
Combining (75) to (79) and noticing
This completes the proof of (26).
For (27), by the same strategy as used in the proof of (26), we have . Below, we calculate the integral.
substituting it into (73), we have
Further, letting
Letting
Thus, by
F. Proof of Lemma 6
By Lemmas 1 and 2, is an estimate of
. Suppose the original distributions of the c+1 classes are
and the between-class scatter matrix is S. Then,
should be the between-class scatter matrix of an equivalent problem with distributions
. Therefore,
is an estimate of
. As there are n training examples per class, we have
, where the entries of
are i.i.d. samples from N (0, 1/n).
Note that the nonzero diagonal entries of , which are actually eigenvalues of
, associated with eigenvectors
implies that
has singular values
and left singular vectors
. Denoting by
the right singular vectors of
we have
Consequently, by
where
Then, by , we have for the first c eigenvectors
Accordingly,
It can be verified that as
where the inequality is due to and the limit is because
follows the distribution
In addition, by Proposition 2 and letting
Thus,
Combining (89), (90) and (92), we obtain
This completes the proof.
G. Proof of Corollary 1
Recall that
and similarly
As long as , we have by Theorem 1 that
with
Below, we verify that it indeed holds
By using similar strategy in the proof of Lemma 2, in particular (39), we have
wherein
since is the first eigenvector of the normalized sample between-scatter matrix
. Substituting (100) into T, we have
For the numerator, we have
Due to the normalization, we know that follows the multivariate Gaussian distribution
being the training data number per class. Then, by Lemma
5 and
Similarly, letting , the same argument gives
and recalling Lemma 5, we have
Similarly,
Therefore, we have the numerator
For the dominator, letting
Note that almost surely. Thus, the dominator must be positive. Therefore, we have T in (99) has limit 0.
FLDA is an important statistical model in pattern recognition. The result obtain in this paper enriches the existing theory of FLDA, by showing that the generalization ability of FLDA is mainly determined by the dimensionality to training sample size ratio N are reasonably large and N > D. Important conclusions from this result include: 1) to ensure FLDA performing well, training sample size only needs to scale linearly with respect to data dimensionality, although a quadratic number of parameters are to be estimated in the sample covariance; and 2) the generalization ability of FLDA (with respect to the Bayes optimum) is independent of the spectral structure of the population covariance, given its nonsingularity and above conditions.
[1] R. Fisher, “The use of multiple measurements in taxonomic problems,” Annals Eugen., vol. 7, pp. 179–188, 1936.
[2] C. Rao, “The utilization of multiple measurements in problems of biological classification,” Journal of the Royal Statistical Society series B: Methodological, vol. 10, pp. 159–203, 1948.
[3] M. Loog, R. P. W. Duin, and R. Haeb-Umbach, “Multiclass linear dimension reduction by weighted pairwise fisher criteria,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 7, pp. 762–766, 2001.
[4] D. Tao, X. Li, X. Wu, and S. Maybank, “Geometric mean for subspace selection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 260–274, 2009.
[5] W. Bian and D. Tao, “Max-min distance analysis by using sequential sdp relaxation for dimension reduction.” IEEE Transaction on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 1037–1050, 2011.
[6] F. De la Torre and T. Kanade, “Multimodal oriented discriminant analysis,” in Proceedings of the 22nd international conference on Machine learning, 2005, pp. 177–184.
[7] G. Potamianos and H. Graf, “Linear discriminant analysis for speechreading,” in Workshop on Multimedia Signal Process, 1998, pp. 221– 226.
[8] E. Alexandre-Cortizo, M. Rosa-Zurera, and F. Lopez-Ferreras, “Application of Fisher linear discriminant analysis to speech/music classification,” in The International Conference on Computer as a Tool, 2005, pp. 1666–1669.
[9] P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
[10] T. Kim and J. Kittler, “Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 318–327, 2005.
[11] E. Altman, “Financial ratios, discriminant analysis and the prediction of corporate bankruptcy,” The Journal of Finance, vol. 23, no. 4, pp. 589–609, 1968.
[12] K. Kumar and S. Bhattacharya, “Artificial neural network vs linear discriminant analysis in credit ratings forecast: A comparative study of prediction performances,” Review of Accounting and Finance, vol. 5, no. 3, pp. 216–227, August 2006.
[13] O. Hamsici and A. Martinez, “Bayes optimality in linear discriminant analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 647–657, 2008.
[14] P. J. Bickel and E. Levina, “Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations,” Bernoulli, vol. 10, no. 6, pp. 989–1010, 2004.
[15] J. Fan, Y. Fan, and Y. Wu, “High-dimensional classification,” in High-dimensional Data Analysis, T. Cai and X. Shen, Eds. New Jersey: World Scientific, 2011, pp. 3–37.
[16] T. Anderson, An Introduction to Multivariate Statistical Analysis, 2nd ed. New York, NY: Wiley, 1984.
[17] E. Wigner, “Characteristic vectors of bordered matrices with infinite dimensions,” The Annals of Mathematics, vol. 62, no. 3, pp. 548–564, 1955.
[18] ——, “On the distribution of the roots of certain symmetric matrices,” The Annals of Mathematics, vol. 67, no. 2, pp. 325–327, 1958.
[19] V. Marˇcenko and L. Pastur, “Distribution of eigenvalues for some sets of random matrices,” Mathematics of the USSRSbornik, vol. 1, p. 457, 1967.
[20] A. Edelman and N. Rao, “Random matrix theory,” Acta Numerica, vol. 14, no. 233-297, p. 139, 2005.
[21] A. Tulino and S. Verd´u, Random matrix theory and wireless communications. Now Publishers Inc, 2004, vol. 1.
[22] K. Fukunaga, Introduction to Statistical Pattern Recognition, Second Edition. Academic Press, September 1990.
[23] R. J. Durrant and A. Kab´an, “A bound on the performance of lda in randomly projected data spaces,” in International Conference on Pattern Recognition, 2010, pp. 4044–4047.
[24] D. C. Hoyle, “Accuracy of pseudo-inverse covariance learning – a random matrix theory analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence., vol. 33, no. 7, pp. 1470–1481, Jul. 2011.
[25] G. J. Mclachlan, Discriminant Analysis and Statistical Pattern Recognition (Wiley Series in Probability and Statistics). Wiley-Interscience, Aug. 2004.
[26] R. J. Durrant and A. Kaban, “Compressed fisher linear discriminant analysis: classification of randomly projected data,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 1119–1128.
[27] R. J. Durrant and A. Kab´an, “Error bounds for kernel fisher linear discriminant in gaussian hilbert space,” Journal of Machine Learning Research - Proceedings Track, vol. 22, pp. 337–345, 2012.
[28] C. Blake and C. Merz, “UCI repository of machine learning databases,” Dept. of Information and Computer Sciences, University of California, Irvine, Tech. Rep., 1998.
[29] A. Edelman, “Eigenvalues and condition numbers of random matrices,” Ph.D. dissertation, Massachusetts Institute of Technology, 1989.
[30] T. Tao, Topics in Random Matrix Theory. American Mathematical Society, 2012.
[31] P. Billingsley, Convergence of Probability Measures, ser. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons Inc., 1999, vol. 175.