Asymptotic Generalization Bound of Fisher's Linear Discriminant Analysis

2012·Arxiv

Abstract

Abstract

I. INTRODUCTION

Fisher’s linear discriminant analysis (FLDA) [1] [2] is one of the most representative dimension reduction techniques in statistical pattern recognition . By projecting examples into a low dimensional subspace with maximum discrimination power, FLDA helps improve the accuracy

The authors are with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia. E-mail: wei.bian@student.uts.edu.au and dacheng.tao@uts.edu.au

and the robustness of a decision system [3] [4] [5] [6]. During the past decades, FLDA has been applied to a wide range of areas, from speech/music classification [7] [8], face recognition [9] [10] to financial data analysis [11] [12].

An important property of FLDA is its asymptotic Bayes optimality under the homoscedastic Gaussian assumption [13] [14] [15] , which is a corollary of classical results from multivariate statistics [16]. Actually, as training sample size N goes to infinity, both the within-class scatter matrix (sample covariance) and the between-class scatter matrix converge to their population counterparts . Therefore, the empirically optimal projection matrix of FLDA, obtained by generalized eigendecomposition over , also converges to its population counterpart W. Thanks to the asymptotic Bayes optimality, we can expect an acceptable performance of FLDA as long as N is sufficiently large. However, this classical result, i.e., the asymptotic Bayes optimality, suffers from two major limitations:

1) It is obtained by fixing the dimensionality D and letting only N increase to infinity.

2) It does not provide quantitative description on the performance of FLDA, especially, how

A. The Contribution of this Paper

To address aforementioned limitations of the classical result, in this paper, we present an asymptotic generalization analysis of FLDA. Our analysis is superior from two aspects. First, we modify the setting of analysis by allowing both D and N to increase and assuming the dimensionality to training sample size ratio has a limit in [0, 1). This makes our result applicable in the case where D and N are proportionally large. Second, we quantitatively examine the generalization ability of FLDA. Denoting by the generalization discrimination power of FLDA, we intend to bound it from the lower side in terms of D and N, with respect to the population discrimination power . Taking a binary-class problem, for example: suppose , then our asymptotic generalization bound shows that

is almost surely larger than

under mild conditions. Further, as a corollary of the discrimination power bound, we also obtain an asymptotic generalization error bound for binary classification with FLDA.

Based on the obtained asymptotic generalization bound, we can get better insight of FLDA. It is commonly known that the performance of covariance estimation has a severe influence to the generalization ability of FLDA. By assuming a sufficient population discrimination power so as to eliminate the influence from between-class matrix estimation, we show that the mere influence from covariance estimation is proportional to the ratio , i.e., due to the imperfection of covariance estimation, worth noticing that such result holds independent of the covariance . Besides, the bound shows that the performance of FLDA is substantially determined by the ratio , given a fixed population discrimination power . Therefore, N only needs to scale linearly with respect to D for an acceptable generalization ability of FLDA, although a quadratic number of parameters are to be estimated in the sample covariance.

B. Tools

The technical tools used in our asymptotic generalization analysis are from random matrix theory (RMT) [17] [18] [19] [20], the main goal of which is to provide understanding of the statistics of eigenvalues of matrices with entries drawn randomly from various probability distributions. RMT was originally motivated by applications in nuclear physics in 1950’s, and then it was intensively studied in mathematics and statistics. It also found successful applications in engineering fields, e.g., wireless communications [21], recently. In this paper, we make use of two important results from RMT. The first one is the Marˇcenko-Pastur Law [19], which states that the empirical spectral distribution of a Wishart random matrix converges almost surely to a deterministic distribution . The second one is the almost sure convergence of the extreme singular values of a large Gaussian random matrix [20]. We formulate these two results in following propositions.

Proposition 1: Given , whose entries are independently sampled from standard Gaussian distribution N (0, 1), then as both

empirical distribution of the eigenvalues of

converges almost surely to a deterministic limit distribution with density

where

Proposition 2: Letting with i.i.d. entries sampled from

and

C. Notations

Throughout this paper, we will use the following notations. Bold lower case letter a denotes a vector. Bold upper case letter A denotes a matrix. -dimensional vector space. denotes the set of all denotes the i-th diagonal entry of a symmetric matrix denotes the i-th column of denotes the matrix composed by the first c columns of denotes the D-dimensional unit sphere located on the original point. denotes the set of all D by D positive definite matrices. denotes the of are the extreme singular values of denotes the operator norm of denotes the i-th eigenvalue of A, sorted in a descent order. denotes the diagonal matrix composed of the eigenvalues of A, with the eigenvalues sorted in a descent order. R(A) denotes an orthogonal basis of the range or the column space of A. is the canonical basis of

II. MAIN RESULT

A. Bounding Generalization Discrimination Power

Suppose we have c + 1 classes, represented by homoscedastic Gaussian distributions in a high-dimensional space , with class means common covariance matrix . Assuming the classes have equal prior probability the following matrix S, which is referred to as the between-class scatter matrix, gives a measure

of class separation,

Suppose the eigendecomposition of has (at most) c nonzero eigenvalues and associated eigenvectors . FLDA uses W as a projection matrix to obtain a low-dimensional data representation, and according to Fisher’s criterion, the discrimination

power in the dimension reduced space is given by [22]

In practice, we do not have access to population parameters , but their estimates, i.e., the sample covariance and the sample between-class scatter matrix via sample class means . Denoting by the empirical projection matrix obtained from generalized eigendecompo-

sition of , the generalization discrimination power of FLDA is given by

which measures how the classes are separated in the dimension reduced space. When data dimensionality D is fixed and training sample size N goes to infinity, the generalization discrimination power (8) will converge to its population counterpart (7), since converges to W. However, such classical result is invalid when D increases proportionally with N. Regarding this, the following theorem gives a new asymptotic result on FLDA’s generalization ability, in a setting where D and N increase to infinity proportionally.

Theorem 1: Suppose the population discrimination power

eralization discrimination power can be factorized as

where . Further, as both the dimensionality D and the training sample size N increase

(, it holds asymptotically

Theorem 1 gives an asymptotically lower bound on the generalization ability of FLDA, in terms of the population discrimination power and the dimensionality to training sample size ratio . An important feature of the bound is that it is determined by the ratio rather than the dimensionality D. In other words, a good generalization performance of FLDA only requires a training sample size that scales linearly with respect the dimensionality, although there are a quadratic number of parameters to be estimated in the sample covariance. Figure 1 (a) gives an illustration of the bound under different values of the ratio

Besides, according to (10), the influence of the ratio to the lower bound comes from

two aspects, each through the termand the term . Note thatallows a tradeoff between , i.e., when is sufficiently large, approaches 0 and thus vanishes from the lower bound (10). The second term depends on , and later proofs reveal that it measures how covariance estimation influences the generalization of FLDA. Assuming a sufficient large

which shows that the loss of discrimination power due to the imperfection of covariance estimation is approximately proportion to . To the best of our knowledge, this is the simplest quantitative result on the influence of covariance estimation to FLDA, compared with related studies in the literature [14] [23] [24]. It is worth noticing that, as long as , the result is independent of the spectrum of the population covariance , e.g., the extreme eigenvalues , or the conditional number

B. Bounding Generalization Error of Binary Classification

In binary-class case, FLDA can also be regarded as a linear classifier, where the hyperplane of the linear classifier is perpendicular to the one-dimensional projection vector of dimension reduction. Without loss of generality, suppose , the generalization error P of

Fig. 1. Asymptotic Generalization Bound of Fisher’s Linear Discriminant Analysis.

binary classification with FLDA can be calculated analytically by [25]

where is the cumulative distribution function (CDF) of the standard Gaussian. If we replace by its population counterpart , then (12) gives the Bayes error

Below, we present a corollary of Theorem 1, which gives an asymptotic upper bound of P in terms of

Corollary 1: For binary classification with equal prior probabilities, suppose the population discrimination power , then if both dimensionality D and training sample size N increase (, the generalization error P of FLDA can be upper

bounded asymptotically by

where

Further since the Bayes error , it holds asymptotically

with

Similar to the discrimination power bound, Corollary 1 shows that, given a binary classification problem with Bayes error , the generalization error of FLDA is also determined by the dimensionality to training sample size ratio . Figure 1 (b) gives an illustration of the generalization error bound under different values of

C. Related Work

In recent years, asymptotic analysis on FLDA have also been performed in the case where D > N. For example, [14] found that when D increases faster than N the the pseudo-inverse based FLDA approaches to a random guess and therefore suggested a “naive Bayes” approach in this situation. A more detailed analysis on pseudo-inverse FLDA was given in [24] by investigating the estimation error of pseudo-inverse of the sample covariance. Random matrix theory, e.g., Marˇcenko-Pastur Law, was also utilized in [24], so as to bound the expected estimation error in the asymptotic case. The result in this paper provides a complementary theory of FLDA in the setting of D < N, which shows that the generalization ability of FLDA in such situation is mainly determined by the ratio

In contrast to asymptotic analysis, generalization bounds in finite sample case were derived most recently in both linear and kernel spaces, and by using random projection as regularization if D > N [23] [26] [27]. The advantage of these results is they provide explicit probability bounds for finite N and D, while asymptotic results inherently require sufficient large N and D. However, we would like to emphasize that the bounds obtained in this paper have their own merit, by linking the generalization discrimination power (or generalization error) to the population discrimination power (or Bayes error) directly in terms of the ratio as shown by empirical evaluation in later section IV, the bounds hold with high probability (in the empirical sense) for moderate D and N, though they are obtained asymptotically.

III. PROOF OF MAIN RESULT

In this section, we present the proof of Theorem 1, which are mainly based upon the asymptotic results on eigensystems of the sample covariance and the sample between-class scatter matrix.

A. On ∆(Σ, S|�W)

We begin the proof by bounding the generalization discrimination power of eigenvalues and/or eigenvectors of a normalized version of the sample covariance and sample between-class scatter matrix.

Lemma 1: Given a problem with population discrimination power

there is a nonsingular matrix X that simultaneously diagonalizes

Lemma 2: Given the normalized estimates , and their eigendecompositions , the generalization discrimination power

can be expressed as

where

Lemma 3: Given from Lemma 2, it holds

where is a unit-length random vector uniformly distributed on the unit sphere

Lemma 2 and Lemma 3 show that the generalization discrimination power of FLDA are determined by the eigensystems of the normalized estimates is actually an estimate of the identity covariance matrix I, we have that given the population discrimination power , the generalization ability of FLDA, i.e., independent of the population covariance . Next, we present properties on the eigensymstems of , which are necessary for evaluating the lower bound of

B. Properties of �Σ0

We have the following lemma on the eigensystem of the normalized sample covariance

Lemma 4: Given the eigendecomposition

1) are independent random variables;

2) U follows the Haar distribution, i.e., it is uniformly distributed on the set of all orthonormal

3) denoting by the empirical spectral distribution of the eigenvalues of

The first and the second statements in Lemma 4 can be understood by the fact that empirical estimate of I, whose probability density is invariant to any orthogonal transformation. The last statement is a corollary of the Marˇcenko-Pastur law, i.e., Proposition 1, which says that the empirical spectral distribution of the matrix entries sampled from N (0, 1), converges almost surely to the deterministic distribution

Further, we need the following lemma on the inverse of the eigenvalues , which says that the energy of projected onto a random direction is almost surely deterministic in the limit. It is worth noticing that the results in Lemma 5 generalize the results on the expectations

Lemma 5: Suppose is a unit-length random vector uniformly distributed on the unit sphere

and it is independent of

and

C. Properties of �S0

We have the following lemma on the eigenvectors of

Lemma 6: Given the eigendecomposition

where is from the population discrimination power

Recalling Lemma 1, the population counterpart of is actually the diagonal matrix . Therefore, we expect the first c eigenvectors to be close to Lemma 6 shows that the performance of eigenvector estimation is determined by the and in particular, as approaches 0 the estimation becomes consistent.

D. Proof of Theorem 1

Now, we are ready to prove our main result Theorem 1, which is a conclusion out of the combination of Lemmas 2, 3, 5 and 6.

By Lemma 6, we have

Then the proof is completed by substituting (29) and (30) into Lemma 2 and Lemma 3.

IV. EMPIRICAL EVALUATIONS

A. On the Bound of Generalization Discrimination Power

According to Theorem 1, the generalization discrimination power of FLDA for dimension reduction can be factorized as measures the population discrimination power, and each component of the generalization discrimination power can

be lower bounded by

We evaluate this result on both simulated and real datasets by comparing with the lower bound above.

For simulated data, we fix the ratio the settings give moderate size problems; however, due to the asymptotic characteristic of the bound, which inherently fits to large size problem, the evaluation on moderate size problems is more critical. We generate 1,000 experiments, each having 5 classes with randomly generated population covariance and class means . The population discrimination power , are calculated via eigendecomposition of is the between-class scatter matrix. For the generalization discrimination power , the factor has a close form

formulation as shown by Lemma 2, i.e.,

where are the eigensystems of are the first c eigenvectors of with being the normalized sample covariance and between-class scatter matrix and X simultaneously diagonalizing . Since a larger discrimination power means a better separation between classes, we expect that on most of the experiments the generalization discrimination power of FLDA can be bounded from the lower side by the generalization bound. Indeed, as shown by Figure 2, the bound holds with an overwhelming probability in the empirical sense (i.e., on more than 990 out of the 1,000 experiments).

We further evaluate the bound of generalization discrimination power on four benchmark datesets from the UCI machine learning repository [28]: 1) the image segmentation (ImageSeg)

Fig. 2. Evaluation of the Generalization Discrimination Power Bound with Simulated Data.

dataset 2, which contains 7 classes and in total 2,310 examples from ; 2) the Landsat dataset, which constants 6 classes and in total 6,435 examples from ; 3) the optical recognition of handwritten digits (Optdigits) dataset, which contains 10 classes and in total 5,620 examples from ; and 4) the USPS handwritten digits dataset, which contains 10 classes and in total 9,298 examples from . Note that for real dataset, the population parameters unknown. Thus, we use the entire dataset to get their estimates and treat them as population parameters. Again, we fix the ratio , i.e., we randomly select examples twice

Fig. 3. Evaluation of the Generalization Discrimination Power Bound with Real Data.

of the dimensionality as the training data. The generalization discrimination powers over 1,000 random experiments are shown in Figure 3. On the panel for each dataset, the columns of the scatters correspond to different components of the generalization discrimination power the horizontal axis location of each column equals the population discrimination power column number is class number minus 1). On three out of the four datasets, including LandSat, Optdigits and USPS, the generalization discrimination power is properly bounded by the lower bound, with a high probability in the empirical sense. On the ImageSeg dataset, the bound does not hold with high probability as on the other three datasets. The major reason is that the size of the problem is considerably small, with D = 18 and N = 36, while the bound favors large

Fig. 4. Evaluation of the Generalization Error Bound with Simulated Data.

or moderate size problems.

B. On the Bound of Generalization Errors

According to Corollary 1, suppose the Bayes error of a binary classification problem is

then the generalization error P of FLDA can be boudned by

where is the CDF of the standard Gaussian distribution and

To evaluate this result, we perform binary classification with FLDA on 1,000 experiments, with randomly generated covariance matrix and class means. The same as in previous simulation, we fix the ratio . Figure 4 shows the result, where the generalization error of FLDA is properly bounded by the upper bound.

In addition, we run experiments on the previous four real datasets to evaluate the generalization error bound. We randomly select class pairs from each dataset to perform binary classification. We hold out 10% data as the evaluation set, which is used to estimate the “Bayes” error and generalization error. The “Bayes” classifier is obtained by training FLDA on the rest 90% data,

Fig. 5. Evaluation of the Generalization Error Bound with Real Data.

and the empirical classifier is trained with a subset of the rest data, such that N = 2D, namely fixing the ratio . On each dataset, 1,000 random experiments are performed, with the result shown in Figure 5. Similar to the result in Figure 3, on three out of the four datasets, the generalization error can be bounded by the upper bound, while the bound does not dominate all the experiment on the ImageSeg dataset due to the small size of the problem.

V. PROOFS OF LEMMAS AND COROLLARY

This section provides detailed proofs of Lemmas in Section III and Corollary 1 in Section II.

A. Proof of Lemma 1

It is a direct result of the simultaneous diagonalization theorem for a pair of semidefinite matrices [22].

B. Proof of Lemma 2

The proof is divided into two steps.

i) Since X in Lemma 1 is nonsingular, there exists some

where contains the first is the upper-left submatrix of

clearly,

ii) In FLDA, are the eigenvectors of , and we can restrict the scale of

where diagonal matrix. Substituting into (33) and recalling

Given the eigendecomposition , we have from the first equation in (34) that there must exist some orthogonal matrix , such that

Further, given the eigendecomposition , we get from the second equation in

(34) that

In addition, since , we can rewrite (36) as

where is the upper-left submatrix of . (37) implies the columns of O

must be the left singular vectors of

exist some matrix

where the nonsingularity of A is implied by the nonsingularity of

and

Therefore,

Letting

which together with (41) gives

This completes the proof.

C. Proof of Lemma 3

Recall Lemma 2 that . Denote by

the angle between vector and subspace

Two basic facts that hold for arbitrary vector and subspace A are

and

Then, by using (45) and (46), we get

Denoting is positive and decreasing on is increasing on

is nonnegative, we have

It remains to calculate

which gives

For , as rescaling does not change the direction of a vector, we can rewrite

where

Note that is a unit-length random vector and is independent of U due to the independency

between . Then, we have

We have known, from Lemma 4, U is uniformly distributed on the set of all orthonormal matrices in is a unit-length random vector independent of

unit-length random vector uniformly distributed on the unit sphere . Finally, (53) gives

This completes the proof.

D. Proof of Lemma 4

Since is a normalized sample covariance, wherein

where is sampled from some is the sample mean. Letting which implies is sampled from the standard Gaussian distribution

then can be rewritten as

One property of in (56) is that, as a random variable, its distribution is invariant to orthogonal similarity transformation, i.e., , have the same distribution. This is due to the fact that corresponds to (56) in the case of replacing has the same distribution with , i.e., the standard Gaussian distribution N (0, I). Then, according to Theorem 3.2 in [29], the invariant property to orthogonal similarity transformation implies that the distribution of is independent of its eigenvectors U but only depends on its eigenvalues is a random matrix uniformly distributed on the set of all possible orthonormal matrices in . This completes the statements 1) and 2) in Lemma 4.

Further, (56) can be rewritten as

where . For the first term , by Proposition 1, we know that the empirical distribution of its eigenvalues converges almost surely to

density,

where

For the second term , clearly it has finite rank c + 1. According to [30], a finite rank perturbation does not effect the convergence of the empirical spectral distribution, i.e., . This completes the proof.

E. Proof of Lemma 5

The condition that is a unit-length random vector uniformly distributed on the unit sphere can be replaced by with entries independently sampled from N (0, 1/D). This is because, in the later case, is uniformly distributed on due to the Strong Law of Large Numbers.

For (26), we divide the proof into two steps. First, we show that and then we calculate the integral.

i) Recall

side of (26) into three terms

and

We show that all the three terms converge almost surely to zero.

For the first term (60), we have

By the same argument in the proof of Lemma 4, we know that

where , with entries independently sampled from N (0, 1). By Proposi- tion 2, we have . Accordingly,

Then, by , (63) and (65), we have

For the second term (61), since , i.e., it is uniformly bounded, we

apply Theorem 3.4 in [21] and get

For the third term (62), since is nonzero only on , it is sufficient to examine

Sine is bounded on , it holds [31]

Further, sine

and

Thus,

ii) We now calculate the integral

Letting and substituting it into (73), we have

Further, letting

Letting

Denoting by the terms in the righthand side of (76), we have

Combining (75) to (79) and noticing

This completes the proof of (26).

For (27), by the same strategy as used in the proof of (26), we have . Below, we calculate the integral.

substituting it into (73), we have

Further, letting

Letting

Thus, by

F. Proof of Lemma 6

By Lemmas 1 and 2, is an estimate of . Suppose the original distributions of the c+1 classes are and the between-class scatter matrix is S. Then, should be the between-class scatter matrix of an equivalent problem with distributions . Therefore,

is an estimate of . As there are n training examples per class, we have , where the entries of are i.i.d. samples from N (0, 1/n).

Note that the nonzero diagonal entries of , which are actually eigenvalues of , associated with eigenvectors implies that has singular valuesand left singular vectors . Denoting by the right singular vectors of

we have

Consequently, by

where

Then, by , we have for the first c eigenvectors

Accordingly,

It can be verified that as

where the inequality is due to and the limit is because follows the distribution

In addition, by Proposition 2 and letting

Thus,

Combining (89), (90) and (92), we obtain

This completes the proof.

G. Proof of Corollary 1

Recall that

and similarly

As long as , we have by Theorem 1 that

with

Below, we verify that it indeed holds

By using similar strategy in the proof of Lemma 2, in particular (39), we have

wherein

since is the first eigenvector of the normalized sample between-scatter matrix

. Substituting (100) into T, we have

For the numerator, we have

Due to the normalization, we know that follows the multivariate Gaussian distribution being the training data number per class. Then, by Lemma

5 and

Similarly, letting , the same argument gives

and recalling Lemma 5, we have

Similarly,

Therefore, we have the numerator

For the dominator, letting

Note that almost surely. Thus, the dominator must be positive. Therefore, we have T in (99) has limit 0.

VI. CONCLUSION

FLDA is an important statistical model in pattern recognition. The result obtain in this paper enriches the existing theory of FLDA, by showing that the generalization ability of FLDA is mainly determined by the dimensionality to training sample size ratio N are reasonably large and N > D. Important conclusions from this result include: 1) to ensure FLDA performing well, training sample size only needs to scale linearly with respect to data dimensionality, although a quadratic number of parameters are to be estimated in the sample covariance; and 2) the generalization ability of FLDA (with respect to the Bayes optimum) is independent of the spectral structure of the population covariance, given its nonsingularity and above conditions.

REFERENCES

[1] R. Fisher, “The use of multiple measurements in taxonomic problems,” Annals Eugen., vol. 7, pp. 179–188, 1936.

[2] C. Rao, “The utilization of multiple measurements in problems of biological classification,” Journal of the Royal Statistical Society series B: Methodological, vol. 10, pp. 159–203, 1948.

[3] M. Loog, R. P. W. Duin, and R. Haeb-Umbach, “Multiclass linear dimension reduction by weighted pairwise fisher criteria,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 7, pp. 762–766, 2001.

[4] D. Tao, X. Li, X. Wu, and S. Maybank, “Geometric mean for subspace selection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 260–274, 2009.

[5] W. Bian and D. Tao, “Max-min distance analysis by using sequential sdp relaxation for dimension reduction.” IEEE Transaction on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 1037–1050, 2011.

[6] F. De la Torre and T. Kanade, “Multimodal oriented discriminant analysis,” in Proceedings of the 22nd international conference on Machine learning, 2005, pp. 177–184.

[7] G. Potamianos and H. Graf, “Linear discriminant analysis for speechreading,” in Workshop on Multimedia Signal Process, 1998, pp. 221– 226.

[8] E. Alexandre-Cortizo, M. Rosa-Zurera, and F. Lopez-Ferreras, “Application of Fisher linear discriminant analysis to speech/music classification,” in The International Conference on Computer as a Tool, 2005, pp. 1666–1669.

[9] P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.

[10] T. Kim and J. Kittler, “Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 318–327, 2005.

[11] E. Altman, “Financial ratios, discriminant analysis and the prediction of corporate bankruptcy,” The Journal of Finance, vol. 23, no. 4, pp. 589–609, 1968.

[12] K. Kumar and S. Bhattacharya, “Artificial neural network vs linear discriminant analysis in credit ratings forecast: A comparative study of prediction performances,” Review of Accounting and Finance, vol. 5, no. 3, pp. 216–227, August 2006.

[13] O. Hamsici and A. Martinez, “Bayes optimality in linear discriminant analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 647–657, 2008.

[14] P. J. Bickel and E. Levina, “Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations,” Bernoulli, vol. 10, no. 6, pp. 989–1010, 2004.

[15] J. Fan, Y. Fan, and Y. Wu, “High-dimensional classification,” in High-dimensional Data Analysis, T. Cai and X. Shen, Eds. New Jersey: World Scientific, 2011, pp. 3–37.

[16] T. Anderson, An Introduction to Multivariate Statistical Analysis, 2nd ed. New York, NY: Wiley, 1984.

[17] E. Wigner, “Characteristic vectors of bordered matrices with infinite dimensions,” The Annals of Mathematics, vol. 62, no. 3, pp. 548–564, 1955.

[18] ——, “On the distribution of the roots of certain symmetric matrices,” The Annals of Mathematics, vol. 67, no. 2, pp. 325–327, 1958.

[19] V. Marˇcenko and L. Pastur, “Distribution of eigenvalues for some sets of random matrices,” Mathematics of the USSRSbornik, vol. 1, p. 457, 1967.

[20] A. Edelman and N. Rao, “Random matrix theory,” Acta Numerica, vol. 14, no. 233-297, p. 139, 2005.

[21] A. Tulino and S. Verd´u, Random matrix theory and wireless communications. Now Publishers Inc, 2004, vol. 1.

[22] K. Fukunaga, Introduction to Statistical Pattern Recognition, Second Edition. Academic Press, September 1990.

[23] R. J. Durrant and A. Kab´an, “A bound on the performance of lda in randomly projected data spaces,” in International Conference on Pattern Recognition, 2010, pp. 4044–4047.

[24] D. C. Hoyle, “Accuracy of pseudo-inverse covariance learning – a random matrix theory analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence., vol. 33, no. 7, pp. 1470–1481, Jul. 2011.

[25] G. J. Mclachlan, Discriminant Analysis and Statistical Pattern Recognition (Wiley Series in Probability and Statistics). Wiley-Interscience, Aug. 2004.

[26] R. J. Durrant and A. Kaban, “Compressed fisher linear discriminant analysis: classification of randomly projected data,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 1119–1128.

[27] R. J. Durrant and A. Kab´an, “Error bounds for kernel fisher linear discriminant in gaussian hilbert space,” Journal of Machine Learning Research - Proceedings Track, vol. 22, pp. 337–345, 2012.

[28] C. Blake and C. Merz, “UCI repository of machine learning databases,” Dept. of Information and Computer Sciences, University of California, Irvine, Tech. Rep., 1998.

[29] A. Edelman, “Eigenvalues and condition numbers of random matrices,” Ph.D. dissertation, Massachusetts Institute of Technology, 1989.

[30] T. Tao, Topics in Random Matrix Theory. American Mathematical Society, 2012.

[31] P. Billingsley, Convergence of Probability Measures, ser. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons Inc., 1999, vol. 175.

designed for accessibility and to further open science