Sure Screening for Gaussian Graphical Models

2014·Arxiv

Abstract

Abstract

We propose graphical sure screening, or GRASS, a very simple and computationally-efficient screening procedure for recovering the structure of a Gaussian graphical model in the high-dimensional setting. The GRASS estimate of the conditional dependence graph is obtained by thresholding the elements of the sample covariance matrix. The proposed approach possesses the sure screening property: with very high probability, the GRASS estimated edge set contains the true edge set. Furthermore, with high probability, the size of the estimated edge set is controlled. We provide a choice of threshold for GRASS that can control the expected false positive rate. We illustrate the performance of GRASS in a simulation study and on a gene expression data set, and show that in practice it performs quite competitively with more complex and computationally-demanding techniques for graph estimation.

1 Introduction

In recent years, graphical modeling has been a topic of great interest in both the scientific and statistical communities. Applications of graphical modeling are widespread, from computer vision to natural language processing to genomics. In particular, in genomics, graphical models have been extensively used to model gene regulatory networks, composed of tens of thousands of genes. It is typically of interest to infer the structure of the graph based on hundreds, or at most thousands, of observations for which gene expression measurements are available. Consequently, the setting is high-dimensional, in the sense that there are many more features than observations.

Consider the random vector , and the conditional dependence graph G = (Γ, E). Here Γ = {1, . . . , p} is the set of nodes, and E is the set of edges in Γ Γ. A pair (a, b) is contained in the edge set E if and only if is conditionally dependent on , given all remaining variables

The conditional dependence graph takes a particularly simple form if we suppose that is a non-singular covariance matrix. In this setting, a pair of variables is conditionally independent if and only if the corresponding entry of the precision matrix equals zero (Lauritzen, 1996; Mardia et al., 1980). Consequently, in the Gaussian graphical model, recovering the edge set E is equivalent to recovering the sparsity pattern of the precision matrix

Recently, much attention has been devoted to the task of estimating and recovering the sparsity pattern of a large sparse precision matrix; we provide a brief review of the literature here. A number of authors have considered an -penalized likelihood approach in order to estimate a sparse precision matrix (Yuan and Lin, 2007; Friedman et al., 2008; Rothman et al., 2008; Ravikumar et al., 2009). To ameliorate the bias incurred by the use of an penalty, Lam and Fan (2009) considered the use of a non-convex SCAD penalty. Others have considered a neighborhood selection approach, which entails performing a sparse regression of each variable on all of the other variables in order to estimate the precision matrix (Meinshausen and B¨uhlmann, 2006; Yuan, 2010; Cai et al., 2011, 2012; Sun and Zhang, 2012). Yuan (2010) considered a Dantzig-type neighborhood selection approach. For many of the aforementioned approaches, statistical convergence results (in terms of various matrix norms) have been established for the high-dimensional setting.

Although various computationally efficient algorithms for estimating a sparse precision matrix have been proposed (see e.g. Friedman et al., 2008; Witten et al., 2011), the required computations can be burdensome when the number of variables is in the tens of thousands, or even higher. For example, the precision matrix for a problem with p = 25, 000 (the number of genes in the human genome) involves upwards of 300,000,000 parameters. In such a setting, existing algorithms tend to be infeasible. We are thus motivated to consider a computationally-efficient screening approach for Gaussian graphical models that possesses desirable statistical properties.

In recent years, computationally simple variable screening approaches have gained popularity in the context of high-dimensional modeling. Fan and Lv (2008) proposed the sure independence screening method for linear models. This approach possesses the sure screening property: with probability going to one, all of the important variables will be selected. Fan et al. (2009) and Fan and Song (2010) extended this approach to the context of generalized linear models. Other marginal screening methods include tilting methods (Hall et al., 2009), generalized correlation screening (Hall and Miller, 2009), nonparametric screening (Fan et al., 2011), partial likelihood screening (Zhao and Li, 2012), and robust rank correlation based screening (Li et al., 2012; Zhu et al., 2011). Most of the existing screening methods aim to select variables by ranking utilities such as the Pearson’s correlation between the marginal covariates and the response, where variables with strong marginal utilities are selected.

In this paper, we propose a novel screening procedure for recovering the structure of a Gaussian graphical model. We call this approach graphical sure screening (GRASS). Our approach is motivated by the fact that the ath column of the precision matrix can be obtained by regressing the ath feature onto the 1 other features (Mardia et al., 1980). This suggests that in order to estimate , the neighbourhood of the ath node, we can emulate the sure screening procedure of Fan and Lv (2008) for linear models: we simply threshold the sample correlations of the ath feature with the 1 other features. We show that under certain simple assumptions, the set of nodes for which the sample correlation with the ath node exceeds some threshold is guaranteed to contain the true neighborhood, , with very high probability. This property holds when the dimension p grows as an exponential function of the sample size n. Furthermore, we establish a surprising connection between GRASS and existing approaches for sparse precision matrix estimation using an -penalized log likelihood.

As far as we know, this work is the first time that a sure screening procedure has been applied in an unsupervised context. The proposed method is conceptually very simple, and can be easily implemented in very high dimensions. In contrast to existing methods for estimating a sparse precision matrix, which typically require ) computations (Friedman et al., 2008), our procedure requires only ) operations, and hence can be easily scaled to large-p settings.

The rest of this article is organized as follows. In Section 2, we present the GRASS procedure. In Section 3, we establish the theoretical properties for GRASS; these include the sure screening property, size control of the selected edge set, and control of the theoretical false positive rate. A surprising connection between GRASS and existing -penalized approaches for sparse precision matrix estimation is explored in Section 4. Simulation studies are presented in Section 5, and a real data application is in Section 6. We close with a discussion in Section 7.

2 Graphical Sure Screening

Consider the random vector has unit diagonals, i.e. E() = 1 for all data matrix i.i.d. draws from 0 be some pre-specified threshold.

We propose to obtain a candidate edge set, , and a candidate neighbourhood for the ath

node, , by thresholding the sample correlation matrix by . That is, we define

We refer to (2.1) and (2.2) as the graphical sure screening (GRASS) estimates.

We will show in Section 3 that for an appropriate choice of is contained in high probability, when p grows exponentially with n. We refer to this as the sure screening property for the graphical model. This property holds even when the size of the selected neighbourhood is a polynomial order of the sample size, leading to a drastic decrease in the dimension. Furthermore, under certain conditions, GRASS can also control the false positive rate.

3 Theoretical Properties

Here we present some theoretical properties of the GRASS procedure. Proofs are in the Appendix. In what follows, we use the notation

To begin, we introduce an assumption on the minimum correlation between two nodes connected by an edge.

Assumption 1. For some constants

Assumption 1 allows us to establish the sure screening property of GRASS, which is presented in Theorem 1.

Theorem 1. Assume that Assumption 1 holds, and that log(for some constants

and . Then, there exist constants

Theorem 1 guarantees that with very high probability, the true edge set is contained in the edge set estimate from the GRASS procedure. In other words, with very high probability, GRASS will not result in false negatives. This raises the following question: how large is the estimated neighbourhood for the ath node? In order to answer this question, we must first introduce an additional assumption.

Assumption 2. There exist constants

where is the maximal eigenvalue of

Assumption 2 indicates that the largest eigenvalue of the population covariance matrix allowed to diverge as n grows; however, it cannot diverge too quickly. This condition naturally appears in many applications. For example, it holds for the covariance matrix of a stationary time series (Fan and Lv, 2008).

We now present Theorem 2, which allows us to control the size of

then

where the constants are as in Theorem 1.

Next we propose a choice of that allows us to control the expected false positive rate at a

prespecified value. The false positive rate is defined as

We would like the false positive rate to decrease to 0 as increases with n. As in Zhao and Li (2012), we do this by first fixing the number of false positives f that we are willing to tolerate; this corresponds to a false positive rate of In order to control the expected false positive rate, we introduce an additional assumption.

Assumption 3. For the same as in Theorem 1,

Theorem 3. Under Assumptions 1-3, if log(as in Theorem 1, then we can control the asymptotic expected false positive rate at by choosing Furthermore, with this threshold, the sure screening property of Theorem 1 still holds.

4 Connection with the Graphical Lasso

In recent years, many quite sophisticated approaches for obtaining a sparse estimate of been proposed. Perhaps the best-known among these is the graphical lasso (Friedman et al., 2008;

Yuan and Lin, 2007), which is the solution to the optimization problem

Recently, Witten et al. (2011) and Mazumder and Hastie (2012) established a surprising result: the connected components of the graphical lasso estimator are exactly the same as the connected components that result from hard-thresholding the matrix In other words, the

connected components of the graphical lasso estimator are the same as the connected components

of the GRASS edge set estimate,

Does this mean that GRASS and the graphical lasso are identical? Not quite. Though the connected components of the graphical lasso and of GRASS are the same, their entire sparsity patterns are, in general, not the same. The graphical lasso can be thought as a two-stage procedure, in which we first perform GRASS with , and then perform a smaller graphical lasso problem on each connected component of

Under a set of assumptions explored by Ravikumar et al. (2011), the graphical lasso is known to be model selection consistent. Since the connected components of GRASS and the connected components of the graphical lasso are the same provided that , this means that the model selection consistency results of Ravikumar et al. (2011) are inherited by the connected components of GRASS. In other words, the connected components of GRASS are selected consistently.

5 Simulation Studies

5.1 Data Generation

Let p denote the number of features, and n the number of observations. We considered three ways of generating the edge set E:

with probability 0.01.

Simulation B: A graph with ten densely connected components. We partitioned the p features into

2 we set (. Otherwise, (

Once the edge set E was generated, we created a precision matrix via the following steps:

Then the covariance matrix was rescaled to have diagonal elements equal to 1. Finally, we generated n observations i.i.d. from a ) distribution.

5.2 Control of False Positive Rate

Theorem 3 states that under certain conditions, performing GRASS with leads to control of the expected false positive rate (FPR) at level . We now investigate the extent to which GRASS controls the FPR in practice. Results for Simulations A-C are shown in Table 1.

Assumption 3, required for Theorem 3 to hold, states that maxSimulation B satisfies this assumption, since is block diagonal with completely dense blocks (and thus the same is true of ). As expected, the FPR is controlled successfully in Simulation B (Table 1).

However, Assumption 3 seems not to be satisfied by Simulations A and C. But Table 1 reveals that the FPR is approximately controlled in these settings, especially for larger values of q. How

can this be?

In order to investigate this, we consider the off-diagonal elements of under Simulations A-C. These are displayed in Figure 1. We see that even for Simulations A and C, the vast majority of the large off-diagonal elements of correspond to non-zero elements of

Furthermore, in Simulation A, there is a very pronounced relationship between the values of the non-zero elements of , and the corresponding values of . This is the case because in Simulation A, was generated to be so sparse (Section 5.1) that, with high probability, a given column of contains no more than one non-zero off-diagonal element. Consequently, (approximately) a block-diagonal matrix with blocks containing no more than two features. And consequently the sparsity patterns of are almost identical. Furthermore, there is a simple monotone relationship between most of the non-zero elements of the two matrices, which can be easily derived using the standard formula for the inverse of a 2

Figure 1: For Simulations A-C with n = 100 and p = 50, the off-diagonal elements of and ) are shown. The 0.5% of largest absolute off-diagonal elements of are shown in red; the rest are in black. For all three setups, the vast majority of large off-diagonal elements of correspond to non-zero elements of . The pronounced relationships seen for Simulation A are due to the extreme sparsity of , as is discussed in the text.

5.3 Comparison to Existing Approaches

We now compare the performances of the graphical lasso (Friedman et al., 2008), neighborhood selection (Meinshausen and B¨uhlmann, 2006), and GRASS on Simulations A-C, with n = 50 and p = 200. Results are displayed in Figure 2.

In Simulation B, recall that the sparsity patterns of are identical. In this setting, GRASS outperforms the graphical lasso and neighborhood selection, since it correctly assumes that the sparsity patterns of the covariance and precision matrices are similar. In Simulations A and C, even though this assumption does not hold exactly, the performance of GRASS is quite competitive.

Overall, in Figure 2, there is little difference in performance between GRASS, neighbourhood selection, and the lasso. That is, even though GRASS is an extremely simple approach, in practice GRASS performs competitively with specialized and computationally-intensive procedures for estimating a precision matrix.

Figure 3 displays the adjacency matrices corresponding to the true edge set, as well as the edge sets estimated using the graphical lasso and GRASS, for Simulations B and C. We find once again that the edge sets estimated by the graphical lasso and GRASS appear to be quite similar.

Table 1: The false positive rate (FPR; defined as FP/(FP+TN)) and false negative rate (FNR; defined as FN/(TP+FN)) are reported for various values of q, the level of desired FPR control. Results for two values of n and p, and for each of Simulations A-C, are reported. The value of — the number of elements in the GRASS estimate corresponding to the value of reported. Results are averaged over 250 simulated data sets.

6 Analysis of Gene Expression Data

We examined a gene expression data set from Spira et al. (2007), previously studied in Danaher et al. (2013), and publicly available from the Gene Expression Omnibus (Barrett et al., 2007) at accession number GDS2771. The data consist of 22,283 microarray-derived gene expression measurements from large airway epithelial cells sampled from 97 patients with lung cancer and

Figure 2: For Simulations A-C with p = 200 and n = 50, the number of false and true positive edges detected is displayed as the tuning parameter is varied. Results are shown for graphical lasso (—), neighborhood selection (—), and GRASS (—).

90 controls. We limited our analysis to the 1,778 genes with the highest marginal variance. Each feature was standardized within each class, in order to have mean zero and standard deviation one among the cancer samples, and among the controls.

Our goal is to compare the performances of GRASS and graphical lasso in terms of edge set recovery. Unfortunately, as is typically the case for a high-dimensional biological data set, the true underlying conditional dependence relationships in this data are unknown. In other words, no gold standard is available.

Given the absence of a gold standard, we took the following approach. We split the control samples into two equally-sized sets, Set 1 and Set 2. We applied both graphical lasso and GRASS to each set. We refer to the resulting estimated edge sets as parameters were chosen such that . In order to quantify the accuracy of the edges estimated on Set 2 by GRASS and the graphical lasso, we first treated the edges estimated by the graphical lasso on Set 1 as the gold standard, and then we treated the edges estimated by GRASS on Set 1 as the gold standard. In greater detail, we calculated:

Figure 3: For Simulations B (panels (a)-(c)) and C (panels (d)-(f)) with p = 100 and n = 50, the adjacency matrices corresponding to the true edge set (panels (a) and (d)), the graphical lasso estimate (panels (b) and (e)), and the GRASS estimate (panels (c) and (f)) are shown. The adjacency matrices for graphical lasso and GRASS are averaged over 10 simulated data sets; the color of a particular cell in the heatmap corresponds to the fraction of these 10 data sets for which the corresponding edge is estimated to be present. Results for Simulation A are not shown, since in that setting the true edge set is not fixed across the simulated data sets.

Note that in calculating these accuracies, we only considered feature pairs (i, j) for which GRASS and graphical lasso disagree over whether an edge is present in Set 2 (since feature pairs for which graphical lasso and GRASS agree on Set 2 are uninformative for our purposes).

The results, averaged over 20 splits of the data into Set 1 and Set 2, are summarized in Table 2. They indicate that regardless of whether graphical lasso or GRASS on Set 1 is treated as the gold standard, the results obtained by GRASS on Set 2 have better agreement with the gold standard than do the results obtained by the graphical lasso on Set 2. In other words, independent data provides greater evidence for edges estimated by GRASS than for edges estimated by the graphical lasso, regardless of whether the independent data is evaluated using GRASS or the graphical lasso.

Table 2: Mean (and standard error) of accuracy of graphical lasso (GL) and GRASS on gene expression data, over 20 splits of the observations into Set 1 and Set 2. , the size of the estimated edge set, is also reported. Regardless of whether GRASS or graphical lasso is treated as the gold standard, GRASS yields more accurate edge set recovery than does the graphical lasso. These results are based on an analysis of the control observations. Similar results are obtained from the cases (results not shown).

7 Discussion

In this paper, we have proposed graphical sure screening (GRASS), a simple and efficient procedure for recovering the structure of a high-dimensional Gaussian graphical model. GRASS is a natural extension of sure screening approaches from the regression and classification frameworks into the

setting of Gaussian graphical modeling.

The theoretical results presented in Section 3 for GRASS require a very simple set of assumptions. In particular, Assumption 1, which guarantees that the covariance corresponding to an edge in the graph is not too small, suffices to ensure that GRASS has the sure screening property: that is, GRASS can dramatically reduce the size of the potential edge set while still containing the true edge set with probability tending to one. Unlike Ravikumar et al. (2011) and Meinshausen and B¨uhlmann (2006), no irrepresentablility conditions are required. And unlike Zhou et al. (2009), no restricted eigenvalue condition is required on the precision matrix.

The computational advantages of the GRASS framework over existing approaches to estimate a sparse precision matrix are dramatic: while approaches such as the graphical lasso typically require ) operations, GRASS requires only ) operations.

In practice, in order for GRASS to perform well, the non-zero elements of the precision matrix must tend to be non-zero in the covariance matrix. We have shown that this assumption typically holds in a range of simulation settings. Furthermore, GRASS outperforms the graphical lasso on a gene expression data set.

Acknowledgments

This work was supported in part by NSF Grants DMS 1007698 (R.S.) and DMS CAREER 1252624 (D.W.), NIH Grant 5DP5OD009145 (D.W.), and a Sloan Research Fellowship (D.W.).

A Appendix

First, we reproduce a result from van der Vaart and Wellner (1996) for the sake of readability.

Lemma 1 (Bernstein’s inequality, Lemma 2.2.11, van der Vaart and Wellner (1996))

be independent random variables with zero mean such that , for every

(and all i) and some constants

The following lemma, needed for the proof of Theorem 1, shows that a -distributed random variable satisfies the moment condition in Lemma 1.

. Then for some constant C, we have for all

This is not greater than for some constant C.

Proof of Theorem 1

Proof. To begin, we will show that there exist constants such that for any

By definition,

This can be bounded above by

where are independent standard normal random variables. Hence distribution. Together, Lemmas 1 and 2 guarantee that there exists positive

constants

Next, we notice that

Now we note that , and that for (, Assumption 1 implies that

Hence,

This implies the first part of Theorem 1. The second part of Theorem 1 can be established in a similar way, and hence we omit the details.

The following lemma will be used in the proof of Theorem 2.

-dimensional random vector with mean zero and variance Σ, and Y be a random variable with . Assume that X and Y satisfy the

linear model is uncorrelated with X. Define

for some constant C. Then |S|, the cardinality of S, satisfies

Proof. Left-multiplying both sides by X and taking the expectation, ). Therefore

th element of the vector Σ. Consequently,

Furthermore,

Moreover, recalling that are uncorrelated, we have that

Proof of Theorem 2

Proof. Let

and

By definition, . Then on the set to , it has to belong to . Thus, we conclude that ). Moreover, an

argument similar to that in the proof of Theorem 1 can be used to show that

This implies that

Finally, applying Lemma 3 in conjunction with Assumption 2 yields the desired result.

Proof of Theorem 3

Proof. First, we will show that the assumptions of Theorem 1 are satisfied, so that the sure screening property applies. We must simply show that the new threshold,

greater than 2, the threshold used in Theorem 1. In other words, we must show that

From the fact that

we have that 1 ). Furthermore, since log(we have that ). Using the fact that , (A.3) follows directly.

Next, we show that using a threshold value of leads to control of the asymptotic expected false positive rate at 1)]. Recall that the false positive rate is defined

as

Because E(, it follows that

Furthermore, for any (

where the asymptotic equivalence in the previous line follows from the fact that

Consequently, the expectation of fpris controlled as desired,

where the last inequality is due to the fact that

References

Barrett, T., Troup, D. B., Wilhite, S. E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I. F., Soboleva, A., Tomashevsky, M., and Edgar, R. (2007). NCBI GEO: mining tens of millions of expression profiles – database and tools update. Nucleic Acids Research, 35(suppl 1):D760–D765.

Cai, T., Liu, W., and Luo, X. (2011). A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494).

Cai, T. T., Liu, W., and Zhou, H. H. (2012). Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. arXiv preprint arXiv:1212.2882.

Danaher, P., Wang, P., and Witten, D. M. (2013). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Fan, J., Feng, Y., and Song, R. (2011). Nonparametric independence screening in sparse ultra- high-dimensional additive models. Journal of the American Statistical Association, 106(494).

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911.

Fan, J., Samworth, R., and Wu, Y. (2009). Ultrahigh dimensional feature selection: beyond the linear model. The Journal of Machine Learning Research, 10:2013–2038.

Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with np- dimensionality. The Annals of Statistics, 38(6):3567–3604.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.

Hall, P. and Miller, H. (2009). Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics, 18(3).

Hall, P., Titterington, D., and Xue, J.-H. (2009). Tilting methods for assessing the influence of components in a classifier. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(4):783–803.

Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics, 37(6B):4254.

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.

Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129–1139.

Mardia, K. V., Kent, J. T., and Bibby, J. M. (1980). Multivariate analysis.

Mazumder, R. and Hastie, T. (2012). Exact covariance thresholding into connected components for large-scale graphical lasso. The Journal of Machine Learning Research, 13(1):781–794.

Meinshausen, N. and B¨uhlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462.

Ravikumar, P., Lafferty, J., Liu, H., and Wasserman, L. (2009). Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030.

Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980.

Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515.

Spira, A., Beane, J. E., Shah, V., Steiling, K., Liu, G., Schembri, F., Gilman, S., Dumas, Y.-M., Calner, P., Sebastiani, P., et al. (2007). Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nature Medicine, 13(3):361–366.

Sun, T. and Zhang, C.-H. (2012). Sparse matrix inversion with scaled lasso. arXiv preprint arXiv:1202.2723.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.

Witten, D. M., Friedman, J. H., and Simon, N. (2011). New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20(4):892–900.

Yuan, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research, 99:2261–2286.

Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35.

Zhao, S. D. and Li, Y. (2012). Principled sure independence screening for Cox models with ultra- high-dimensional covariates. Journal of Multivariate Analysis, 105(1):397–411.

Zhou, S., van de Geer, S., and B¨uhlmann, P. (2009). Adaptive lasso for high dimensional regression and gaussian graphical modeling. arXiv preprint arXiv:0903.2515.

Zhu, L.-P., Li, L., Li, R., and Zhu, L.-X. (2011). Model-free feature screening for ultrahighdimensional data. Journal of the American Statistical Association, 106(496).

designed for accessibility and to further open science