Linear Discriminant Analysis (LDA) [1] is a well-known technique for feature extraction and dimension reduction. It has been widely used in many applications [2,3] such as face recognition, image retrieval, etc. Typically, LDA finds the projection directions such that for the projected data, the between-class variance has been maximized relative to the within-class variance, thus achieving maximum discrimination. An intrinsic limitation of classical LDA is that its objective function requires the nonsingularity of one of the scatter matrices. For many applications, such as the microarray data analysis, all scatter matrices can be singular or ill-posed since the data is often with high dimension but low sample size (HDLSS) [4].
Recently, many efforts have been devoted to bear on such HDLSS problems. For example, Krzanowski et al. proposed a pseudo-inverse LDA to approximate the inverse covariance matrix, when the sample covariance matrix is singular. However, the accuracy of pseudo-inverse LDA is usually low and not well guaranteed [5]. Another technique to alleviate this problem is a two-stage algorithm, i.e., PCA+LDA [6,7]. More popularly, regularized LDA approaches are proposed to solve the problem and improve the performance [8]. For example, researchers proposed a series of algorithms to regularize the covariance matrix estimation [2, 5,9]. The regularized linear discriminant hyperplane was studied in [10,11,4]. All regularized LDA approaches intend to improve LDA through regularizing the estimation of key parameters used in LDA, such as the covariance matrix and/or the linear coefficients for discrimination.
One representative regularized LDA approach is Covariance Regularized Discriminant Analysis (CRDA) proposed in [9] based on the sparse inverse covariance estimation leveraging Graphical Lasso [12]. CRDA was originally proposed to estimate the inverse covariance matrix via a shrunken estimator, so as to achieve . Intuitively, through replacing the sample covariance matrix used in LDA with the regularized estimation, the HDLSS problem can be well addressed since the regularized estimators usually outperform the sample covariance matrix estimator [13]. To better elucidate the performance of LDA classifiers with uncertain covariance matrix estimates for Gaussian data classification, [14] studied a model of error rate by matching the estimated vs. true covariance matrices, and the estimated vs. true means. While it is reasonable to assume that
Fig. 1: Medicare visits and Electronic Health Records (EHRs). EHRs of a patient consist of the records of diagnoses and treatments. In this example, totally m medical visits have been placed. For every medical visit, the patient would receive a set of ICD/CPT codes [18] referring the diseases/treatments that have been diagnosed/carried out. One can enable the early diagnosis/detection of diseases through classifying the EHR data, with big data and machine learning techniques.
the estimated mean can easily converge to the population/true mean, the population/true covariance matrix is usually unknown and can be very different with the estimated one [13]. For example, the largest eigenvalue of the sample covariance matrix, which represents the principle component of the data distribution, is not consistent with the population one and the eigenvectors of the sample covariance matrix can be almost orthogonal to the truth under HDLSS [15,16]. Further, the data for classification is usually Non-Gaussian. Thus, it is highly desirable to develop a new analytical model to characterize the error rate for the data with arbitrary distribution (both Gaussian and Non-Gaussian). Two “known factors” of covariance matrix estimation are useful for developing such analytical models, one is the convergence rate and the other one is the sparsity/density of (inverse) covariance matrix estimators [13]. The sparsity/density is already known once the (inverse) covariance matrix is estimated. The convergence rates reflect the maximal error of estimation, and for some estimators, they are well bounded under certain assumptions, such as spectral-norm convergence rate of Graphical Lasso [17].
Among a wide range of HDLSS data classification tasks, in this work, we focus on the problem of using LDA to classify EHR [19] for personalized predictive analytics of target disease. EHRs play a critical role in modern health information management and service innovations. A patient’s EHR contains his/her histories of medical visit, medication, diagnoses, treatment plans, allergies and so on as shown in Fig 1. Per each visit a diagnosis record would be updated indicating the disease state, i.e., a set of codes referring to the diseases that diagnosed at a time of visit. One significant feature is the interchangeability of EHR, as a standard protocol for medical/health data generation, storage and communication. The health information is built and managed by authorized institutions in a unified digital format (e.g., ICD-9/10, CPT-9/10 used in EHR standards) such that researchers and scientists can share and analyze the EHR data to enable innovative health services, such as providing computer-assisted diagnosis and offering medication advice. Among these services, predictive analytics of diseases (or namely early detection of diseases) using patients’ past longitudinal health information of the EHR system, has recently attracted significant attention from research communities. There has been a series of works [20,19,21,22,23,24,25,26], which attempt to predict future disease of patients, through data mining techniques using EHR data. Prior literature usually first selected important features, such as diagnosisfrequencies [19], pairwise diagnosis transitions [22], and graphs of diagnosis sequences [24], to represent the EHR data of the patients. Then, a wide range of supervised learning algorithms were adopted to build predictive models for early disease detection, on top of well-represented EHR data.
In this paper, we first propose a novel analytical model for LDA error rate, based on the statistical convergence of (inverse) covariance matrix estimators and the divergence to the Gaussian distributions. Guided by the proposed analytical model, we propose a novel LDA classifier leveraging the (inverse) covariance matrix estimators with faster convergence rate. We apply our model to a large-scale EHR dataset for the predictive analytics of diseases and demonstrate the advantage of the proposed algorithms over other state-of-the-arts. Specifically, in this paper, we made contributions as follows.
1. We studied the problem of high-dimensional linear classification using LDA models and proposed a novel analytical model, derived from the existing LDA models on Gaussian data [27,14], to understand the LDA error rate for both Gaussian and Non-Gaussian data, with respect to the statistical error of covariance matrices estimation and the divergence between fitted Gaussian distribution and the data.
2. On top of the analytical model, we proposed De-Sparse, which extends the well-known baseline approach – Covariance Regularized Discrimiant Analysis (CRDA) [9,28], using De-sparsified Graphical Lasso [29]. Theoretical analysis based on the proposed analytical model shows that De-Sparse can bound the maximal error rate, under mild conditions. Compared to CRDA, De-Sparse could achieve lower error rate, due to the faster convergence rate of De-sparsified Graphical Lasso.
3. To show the practical contribution of the proposed method, we evaluated De-Sparse extensively through experiments with large-scale, real-world EHR datasets [30]. In the experiments, we used De-Sparse to predict the risk of mental health disorders in college students from ten U.S. universities, using their EHR data from primary care visits. We compared De-Sparse with seven baseline algorithms including other regularized LDA models and downstream classifiers. The evaluation result shows that De-Sparse outperforms all baselines, and further validates our theoretical analysis.
The paper is organized as follows. In Section 2, we review the problem of high-dimensional linear classification using LDA models and summarize the existing work on EHR-based predictive analytics of diseases. In Section 3, we first introduce the existing covariance-regularized discriminant analysis (CRDA) based on Graphical Lasso, then present de-sparsified covariance regularized LDA algorithms, based on novel de-sparsified inverse covariance matrix estimators, to classify EHR samples for the predictive analytics. In Section 4, we validate the proposed algorithms with real-world datasets through extensive experiments. Finally, we conclude the paper in Section 5.
2.1 LDA for Binary Classification
To use Fisher’s Linear Discriminant Analysis (FDA), given N labeled data pairs () and
, 1
refers to a d-dimensional vector, we first estimate the sample covariance matrix (an symmetric
matrix) using maximized likelihood estimator:
where ¯refers to the d-dimensional mean vector of all N training samples (
(
). Then, ¯
and ¯
are estimated as the mean vectors of the positive training samples and negative training samples in the N training samples, respectively.
Given all estimated parameters ¯
, the FDA model classifies a new data vector x as the result of Eq. 2 as follows.
where ) :
refers to the signal function,
refer to the (foreknown) frequencies of positive samples and negative samples in the whole population.
Apparently, the performance of LDA depends on (1) whether the realistic training/testing datasets follow Gaussian distributions and (2) how the mean vectors and inverse covariance matrices are estimated from the datasets.
2.2 Electronic Health Records and Predictive Analytic of Disease
Prior to learning a predictive analytic model for certain diseases, one needs to model the EHR data with a suitable data representation. The most simple yet effective way to represent EHR data is to use diagnosis-frequency vector [31,32,33], which is similar to Term Frequency (TF) or Term Frequency-Inverse Document Frequency (TF-IDF) approach that deals with traditional NLP data [34,35,36]. Given each patient’s EHR data (shown in Fig 1), this representation method first retrieves the diagnosis codes [37] recorded during each visit. Inspired by some Natural Language Processing (NLP) and text mining practices [38], researchers also proposed using some deep learning based NLP approaches to embed EHR records for data representation learning [39,40,41,42,43,44]. For example, [40] studied “Patient2Vec” which embeds patients’ past EHR records into vectors while preserving structural information for personalized predictive analysis. [42] focuses on the interpolation and interpretability of EHR representation learning, where authors well-balanced the performance of predictive analysis and the understanding to the longitude disease progress of each individual patient, both using the EHR data with the learned representation. Comprehensive surveys could be found in [45, 18,46].
In our work, we follow the line of research that uses diagnosis-frequency vector of each patient for EHR-based predictive analysis [31,32,33], as the diagnosis-frequency in a certain duration could well characterize the health status of patients while the coefficient of LDA can represent the significance of every diagnosis code. The frequency of each diagnosis appearing in all past visits (of the last two years) is counted here, followed by further transformation on the frequency of each diagnosis into a vector of frequencies. For example, a diagnosis-frequency vector can be illustrated as , where 0 means the second diagnosis does not exist in all past visits. Note that the dimension
15, 000 when using original ICD-9 codes, d = 295 even when using clustered ICD-9 codes [47], while the number of samples for training N in our experiment is significantly smaller than d.
2.3 Discussion on Preliminaries
In our work, we revisited the linear discriminant analysis as a classifier and learner for High-Dimensional and Low Sample Size (HDLSS) settings. Indeed, many efforts have been made in the literature for HDLSS data classification. For example, in addition to LDA-type methods, a number of feature extraction or variable selection methods have been studied [48,3]. Lin et al., [49] proposed a feature selection algorithm to classify the high-dimensional gene expression data through incorporating the neighborhood entropy-based uncertainty measures. Over the rough set, the same group of authors [50] adopted a joint feature selection approach that incorporates neighborhood entropy and the fisher scores, for tumor classification. Further, some automatic feature weighting paradigm has been proposed to select features for gene expression data classification [51]. These studies demonstrate that the feature selection algorithms could significantly improve the accuracy of HDLSS data classification, while avoiding the full set of features. The over-reduction problem of LDA has been studied in [52]. In addition to the EHR data, similar regularized projection methods have been used for early diagnosis of diseases for biomedical health data [53,54].
In terms of methodologies, the most close work to this study is covariance-regularized linear discriminant analysis (CRDA) [9], Graphical Lasso [17], and the de-sparsified Graphical Lasso [29,55]. CRDA regularizes the estimation of (inverse) covariance matrices inside the estimation of LDA, while improving the performance of LDA for both prediction and inference. Authors in [28] were the first to bring CRDA for EHR classification and early detection of diseases. We included the algorithms in [28] for comparison and found that De-Sparse outperformed CRDA with higher accuracy and F1-score. Compared to the Graphical Lasso estimator [17] that has been frequently used to enhance the inverse covariance matrices estimation, our work followed the ideas of de-biased estimator [56] and used de-sparsified Graphical Lasso estimator [29] to improve the LDA for EHR classification. We would provide a comprehensive discussion on the performance comparisons between Graphical Lasso and de-sparsified one from the perspectives of predictive analytics based on EHR data and LDA.
In this section, we first introduce the baseline algorithm based on CovarianceRegularized Discriminant Analysis (CRDA) that is derived from [28]. Then, we present the proposed algorithm De-Sparse, an extended Covariance Regularized Discriminant Analysis via De-sparsified Graphical Lasso [29]. Then, using our proposed analytical model of LDA error rate, we compare two methods and demonstrate the advantages of De-Sparse.
3.1 CRDA: The Baseline of Covariance-Regularized Discriminant Analysis via Graphical Lasso Inverse Covariance Matrix Estimator
Compared to the sample LDA introduced in Section 2, CRDA [9,28] was proposed to use -penalized inverse covariance matrix estimator to replace the inverse of sample covariance matrix. Given the labeled data pairs for training (
), (
), the algorithm first estimates the sample covariance matrix ¯
and the sample mean vectors ¯
, ¯
using the algorithms introduced in Section 2.1. With the sample covariance matrix ¯
, this method estimates a sparse inverse covariance matrix
using the Graphical Lasso estimator [12] as follows.
Corollary 2 (Graphical Lasso Estimator [12]) Given the sample estimation of the covariance matrix ¯, the Graphical Lasso estimator provides an
-regularized sparse positive-definite approximation to the inverse covariance matrix (denoted as
lows.
where 0 refers to the constraint of symmetric positive definiteness (SPD), the term tr( ¯
log det(
) refers to the negative log-likelihood of the optimization objective
over the sample estimate ¯
refers to the sum of absolute values of the non-diagonal elements in the matrix
(which is the same as the
without diagonal elements considered), and
refers to tuning parameter that makes trade-off between the sparsity and the fitness to samples. Please refer to [12] for the implementation of the algorithms.
Corollary 3 (Statistical Convergence of Graphical Lasso [17]) Suppose the random vector X is with d-dimensions and zero mean (i.e., where the population estimate of the covariance matrix is
) and the inverse of population covariance matrix is
randomly and independently drawn from X, the sample estimate of the covariance matrix here should be ¯
With the increasing number of samples (N) given and growing number of dimensions of the data (d), the graphical lasso estimate based on the sample covariance matrix converges to the population estimate
at the rate of Frobenius-norm under mild sparsity conditions, as follows [17].
where refers to the maximal degree of the graph in
to the
-norm of the input vector, and
refers to the
column vector (1
of the matrix
To the end, the classification rule of CRDA is written as follows
which can be viewed as an LDA classifier using as the covariance matrix. Apparently, the accuracy of CRDA depends on how the covariance matrices and the mean vectors are estimated. We are going to interpret the performance of CRDA in the Section 3.3.
3.2 De-Sparse: The Improved Algorithm of Covariance-Regularized LDA via De-Sparsified Graphical Lasso
As shown in Eq. 3, the estimator of sparse inverse covariance matrix induced -penalization and might hurt the estimation due to the over-penalization or over-sparsification. To address this issue, we proposed a de-sparsified Graphical Lasso estimator [29] to replace the vanilla Graphical Lasso.
Corollary 4 (De-sparsified Graohical Lasso [29]) Given the Graphical Lasso estimator and the sample estimation ¯
, we consider the inverse of Graphical Lasso
as an approximation to the covariance matrix. In this way, the bias of
, caused by the sparsity regularizer of Graphical Lasso, for covariance estimation could be written as follows.
Using the Kronecker product, authors in [29] consider the potential bias term of against the inverse of population covariance matrix as follows.
The de-sparsified Graphical Lasso estimator de-biases the Graphical Lasso estimator
through removing the potential bias term caused by the sparsity reguarlizer, as follows.
On top of the Graphical Lasso, the de-sparsified Graphical Lasso estimator can effi-ciently approximate an estimation of inverse covariance matrix using the data with faster convergence rate in a mild condition.
Corollary 5 (Statistical Convergence of De-sparsified Graphical Lasso [29]) Suppose the random vector X is with d-dimensions and zero mean (i.e., and E (X) = 0), where the population estimate of the covariance matrix is
) and the inverse of population covariance matrix is
samples
randomly and independently drawn from X, the sample estimate of the covariance matrix here should be ¯
. The Graphical Lasso estimator and the de-sparsified estimator are denoted as
, respectively.
With the increasing number of samples (N) given and growing number of dimensions of the data (d), the De-sparsified Graphical Lasso estimator converges to the population estimate
at the rate of
-norm under mild sparsity conditions, as follows [29].
Note that, above convergence rate of de-sparsified Graphical Lasso was obtained under similar sparsity assumptions as [17], while the -norm convergence rates of Graphical Lasso remain unknown yet.
Based on Notations, we denote the De-sparsified Covariance Regularized Discriminant Analysis (namely De-Sparse) as Desparse(x), using the De-sparsified Graphical Lasso and the mean vectors ¯
, ¯
.
With the de-sparsified inverse covariance matrix estimator enjoying better statistical properties, De-Sparse is expected to outperform CRDA with better classi-fication accuracy. Detailed comparison will be discussed in the following sections.
3.3 Performance Analysis of LDA, CRDA, and De-Sparse
In this section, we first review the previous studies on the LDA error rate estimation for Gaussian data [27,14], then we generalize LDA error rate from Gaussian data to non-Gaussian data. Finally, we provide a discussion on the classification accuracy comparison among vanilla LDA, CRDA, and De-Sparse.
3.3.1 LDA Error Rate for Gaussian Data via Random Matrix Theory
We first assume the data for binary classification follow two (unknown) Gaussian distributions with the same covariance matrix but two different means
and
, i.e.,
) for positive samples and
) for negative samples, respectively. Given the LDA classifier
) based on the sample estimated mean vectors ¯
, ¯
and a specific covariance matrix
, the expected error rate of a linear discriminant analysis (i.e., probability of
)) on the data of
),
) is modeled as follows.
Corollary 6 (RMT-based LDA Error Rate Estimation [14]) According to the random matrix theory, [14] models the expectation of classification error rate of LDA (using estimated parameters ¯) on Gaussian distributions
) and
) as
), as follows.
where refers to the CDF function of a standard normal distribution.
According to Corollary 6 and [14], we could conclude that the expected error rate is sensitive with the parameters and
, while the true parameters
are assumed unknown. Compared to the (inverse) covariance matrices estimation, the error of sample mean vector estimation is relatively small [57]. Thus, we adopt the settings in studies [27,2,5] as follows.
Lemma 1 Thus, based on Theorem 1 and the sample mean relaxation (Assumption 1), the expected error rate of ) can be reduced to
In this way, to improve the LDA classifier with the sample mean vectors, there needs an estimator to minimize or lower the expected error rate
(
).
Lemma 2 Furthermore, when the estimated covariance matrix is set to the oracle one
(the LDA is perfectly fitted with the data), the expected error rate reaches the optimal error rate
Above result suggests that when the covariance matrix is perfectly estimated
and
, the LDA classifier would approach to its optimal error rate.
The estimate in Eq. 12 reduces the estimation of LDA classification error rate to the divergence between the population covariance matrix and the estimated one
. On the other hand, Eq. 13 models the generalization error of a model with “perfectly-fitted” covariances [14].
3.3.2 Performance Comparisons
4.1 Experiment Setups
4.2 Design of Experiment
In our research, we focus on studying the improvement of LDA classification caused by (inverse) covariance matrix regularization, thus we don’t compare our method to linear-coefficient-regularized LDA classifiers [10,11,4] or heuristic LDA derivation [6].
4.3 Overall Comparison
4.3.1 Comparisons on Accuracy and F1-Score
the accuracy and F1-score of De-Sparse in most parameter settings are 0.3%– 18.9% higher and 0.14%–71.8% higher, respectively. Compared to SVM, Logistic Regression, and AdaBoost, De-Sparse can achieve 2.3%–19.4% higher accuracy and 7.5%–43.5% higher F1-score. In this case, we can conclude that the classic LDA model cannot perform as well as many other predictive models such as SVM and AdaBoost. However, De-Sparse significantly outperforms these methods in all settings. Thus, we can conclude that De-Sparse overall outperforms the baseline algorithms in all experimental settings. Note that, though De-Sparse outperforms CRDA marginally, De-Sparse enjoys a more tight upper bound of error rate.
4.3.2 Trade-off between Sensitivity and Specificity
4.3.3 Discussion on the Performance Comparison
Table 1: Performance Comparison with Training Set:502, Testing Set: 2000
2
Table 2: Performance Comparison with Training Set:1002, Testing Set: 2000
2
Table 3: Performance Comparison with Training Set:1502, Testing Set: 2000
2
Table 4: Performance Comparison with Training Set:2002, Testing Set: 2000
2
Table 5: Performance Comparison with Training Set:2502, Testing Set: 2000
2
4.4 Empirical Convergence of Parameter Estimation
using de-sparsified Graphical Lasso. Our analysis shows that the proposed algorithm could outperform the existing Covariance-regularized discriminant analysis (CRDA) based on common Graphical Lasso. The experimental results on real-world Electronic Health Record (EHR) datasets show De-Sparse outperforms all baseline algorithms. We interpret the comparison of results and demonstrate the advantage of proposed methods in medicare settings. Further, the empirical studies on estimator comparison validate our analysis.
20. Jimeng Sun, Fei Wang, Jianying Hu, and Shahram Edabollahi. Supervised patient simi- larity measure of heterogeneous patient records. ACM SIGKDD Explorations Newsletter, 14(1):16–24, 2012.
40. Jinghe Zhang, Kamran Kowsari, James H Harrison, Jennifer M Lobo, and Laura E Barnes. Patient2vec: A personalized interpretable deep representation of the longitudinal electronic health record. IEEE Access, 6:65333–65346, 2018.
59. Amin Zollanvari, Ulisses M Braga-Neto, and Edward R Dougherty. Analytic study of performance of error estimators for linear discriminant analysis. IEEE Transactions on Signal Processing, 59(9):4238–4255, 2011.