In recent years, graphical modeling has been a topic of great interest in both the scientific and statistical communities. Applications of graphical modeling are widespread, from computer vision to natural language processing to genomics. In particular, in genomics, graphical models have been extensively used to model gene regulatory networks, composed of tens of thousands of genes. It is typically of interest to infer the structure of the graph based on hundreds, or at most thousands, of observations for which gene expression measurements are available. Consequently, the setting is high-dimensional, in the sense that there are many more features than observations.
Consider the random vector , and the conditional dependence graph G = (Γ, E). Here Γ = {1, . . . , p} is the set of nodes, and E is the set of edges in Γ
Γ. A pair (a, b) is contained in the edge set E if and only if
is conditionally dependent on
, given all remaining variables
The conditional dependence graph takes a particularly simple form if we suppose that is a non-singular covariance matrix. In this setting, a pair of variables is conditionally independent if and only if the corresponding entry of the precision matrix
equals zero (Lauritzen, 1996; Mardia et al., 1980). Consequently, in the Gaussian graphical model, recovering the edge set E is equivalent to recovering the sparsity pattern of the precision matrix
Recently, much attention has been devoted to the task of estimating and recovering the sparsity pattern of a large sparse precision matrix; we provide a brief review of the literature here. A number of authors have considered an -penalized likelihood approach in order to estimate a sparse precision matrix (Yuan and Lin, 2007; Friedman et al., 2008; Rothman et al., 2008; Ravikumar et al., 2009). To ameliorate the bias incurred by the use of an
penalty, Lam and Fan (2009) considered the use of a non-convex SCAD penalty. Others have considered a neighborhood selection approach, which entails performing a sparse regression of each variable on all of the other variables in order to estimate the precision matrix (Meinshausen and B¨uhlmann, 2006; Yuan, 2010; Cai et al., 2011, 2012; Sun and Zhang, 2012). Yuan (2010) considered a Dantzig-type neighborhood selection approach. For many of the aforementioned approaches, statistical convergence results (in terms of various matrix norms) have been established for the high-dimensional setting.
Although various computationally efficient algorithms for estimating a sparse precision matrix have been proposed (see e.g. Friedman et al., 2008; Witten et al., 2011), the required computations can be burdensome when the number of variables is in the tens of thousands, or even higher. For example, the precision matrix for a problem with p = 25, 000 (the number of genes in the human genome) involves upwards of 300,000,000 parameters. In such a setting, existing algorithms tend to be infeasible. We are thus motivated to consider a computationally-efficient screening approach for Gaussian graphical models that possesses desirable statistical properties.
In recent years, computationally simple variable screening approaches have gained popularity in the context of high-dimensional modeling. Fan and Lv (2008) proposed the sure independence screening method for linear models. This approach possesses the sure screening property: with probability going to one, all of the important variables will be selected. Fan et al. (2009) and Fan and Song (2010) extended this approach to the context of generalized linear models. Other marginal screening methods include tilting methods (Hall et al., 2009), generalized correlation screening (Hall and Miller, 2009), nonparametric screening (Fan et al., 2011), partial likelihood screening (Zhao and Li, 2012), and robust rank correlation based screening (Li et al., 2012; Zhu et al., 2011). Most of the existing screening methods aim to select variables by ranking utilities such as the Pearson’s correlation between the marginal covariates and the response, where variables with strong marginal utilities are selected.
In this paper, we propose a novel screening procedure for recovering the structure of a Gaussian graphical model. We call this approach graphical sure screening (GRASS). Our approach is motivated by the fact that the ath column of the precision matrix can be obtained by regressing the ath feature onto the
1 other features (Mardia et al., 1980). This suggests that in order to estimate
, the neighbourhood of the ath node, we can emulate the sure screening procedure of Fan and Lv (2008) for linear models: we simply threshold the sample correlations of the ath feature with the
1 other features. We show that under certain simple assumptions, the set of nodes for which the sample correlation with the ath node exceeds some threshold is guaranteed to contain the true neighborhood,
, with very high probability. This property holds when the dimension p grows as an exponential function of the sample size n. Furthermore, we establish a surprising connection between GRASS and existing approaches for sparse precision matrix estimation using an
-penalized log likelihood.
As far as we know, this work is the first time that a sure screening procedure has been applied in an unsupervised context. The proposed method is conceptually very simple, and can be easily implemented in very high dimensions. In contrast to existing methods for estimating a sparse precision matrix, which typically require ) computations (Friedman et al., 2008), our procedure requires only
) operations, and hence can be easily scaled to large-p settings.
The rest of this article is organized as follows. In Section 2, we present the GRASS procedure. In Section 3, we establish the theoretical properties for GRASS; these include the sure screening property, size control of the selected edge set, and control of the theoretical false positive rate. A surprising connection between GRASS and existing -penalized approaches for sparse precision matrix estimation is explored in Section 4. Simulation studies are presented in Section 5, and a real data application is in Section 6. We close with a discussion in Section 7.
Consider the random vector has unit diagonals, i.e. E(
) = 1 for all
data matrix
i.i.d. draws from
0 be some pre-specified threshold.
We propose to obtain a candidate edge set, , and a candidate neighbourhood for the ath
node, , by thresholding the sample correlation matrix by
. That is, we define
We refer to (2.1) and (2.2) as the graphical sure screening (GRASS) estimates.
We will show in Section 3 that for an appropriate choice of is contained in
high probability, when p grows exponentially with n. We refer to this as the sure screening property for the graphical model. This property holds even when the size of the selected neighbourhood is a polynomial order of the sample size, leading to a drastic decrease in the dimension. Furthermore, under certain conditions, GRASS can also control the false positive rate.
Here we present some theoretical properties of the GRASS procedure. Proofs are in the Appendix. In what follows, we use the notation
To begin, we introduce an assumption on the minimum correlation between two nodes connected by an edge.
Assumption 1. For some constants
Assumption 1 allows us to establish the sure screening property of GRASS, which is presented in Theorem 1.
Theorem 1. Assume that Assumption 1 holds, and that log(for some constants
and . Then, there exist constants
Theorem 1 guarantees that with very high probability, the true edge set is contained in the edge set estimate from the GRASS procedure. In other words, with very high probability, GRASS will not result in false negatives. This raises the following question: how large is
the estimated neighbourhood for the ath node? In order to answer this question, we must first introduce an additional assumption.
Assumption 2. There exist constants
where is the maximal eigenvalue of
Assumption 2 indicates that the largest eigenvalue of the population covariance matrix allowed to diverge as n grows; however, it cannot diverge too quickly. This condition naturally appears in many applications. For example, it holds for the covariance matrix of a stationary time series (Fan and Lv, 2008).
We now present Theorem 2, which allows us to control the size of
then
where the constants are as in Theorem 1.
Next we propose a choice of that allows us to control the expected false positive rate at a
prespecified value. The false positive rate is defined as
We would like the false positive rate to decrease to 0 as increases with n. As in Zhao and Li (2012), we do this by first fixing the number of false positives f that we are willing to tolerate; this corresponds to a false positive rate of
In order to control the expected false positive rate, we introduce an additional assumption.
Assumption 3. For the same as in Theorem 1,
Theorem 3. Under Assumptions 1-3, if log(as in Theorem 1, then we can control the asymptotic expected false positive rate at
by choosing
Furthermore, with this threshold, the sure screening property of Theorem 1 still holds.
In recent years, many quite sophisticated approaches for obtaining a sparse estimate of been proposed. Perhaps the best-known among these is the graphical lasso (Friedman et al., 2008;
Yuan and Lin, 2007), which is the solution to the optimization problem
Recently, Witten et al. (2011) and Mazumder and Hastie (2012) established a surprising result: the connected components of the graphical lasso estimator are exactly the same as the connected components that result from hard-thresholding the matrix In other words, the
connected components of the graphical lasso estimator are the same as the connected components
of the GRASS edge set estimate,
Does this mean that GRASS and the graphical lasso are identical? Not quite. Though the connected components of the graphical lasso and of GRASS are the same, their entire sparsity patterns are, in general, not the same. The graphical lasso can be thought as a two-stage procedure, in which we first perform GRASS with , and then perform a smaller graphical lasso problem on each connected component of
Under a set of assumptions explored by Ravikumar et al. (2011), the graphical lasso is known to be model selection consistent. Since the connected components of GRASS and the connected components of the graphical lasso are the same provided that , this means that the model selection consistency results of Ravikumar et al. (2011) are inherited by the connected components of GRASS. In other words, the connected components of GRASS are selected consistently.
5.1 Data Generation
Let p denote the number of features, and n the number of observations. We considered three ways of generating the edge set E:
with probability 0.01.
Simulation B: A graph with ten densely connected components. We partitioned the p features into
2 we set (
. Otherwise, (
Once the edge set E was generated, we created a precision matrix via the following steps:
Then the covariance matrix was rescaled to have diagonal elements equal to 1. Finally, we generated n observations i.i.d. from a
) distribution.
5.2 Control of False Positive Rate
Theorem 3 states that under certain conditions, performing GRASS with leads to control of the expected false positive rate (FPR) at level
. We now investigate the extent to which GRASS controls the FPR in practice. Results for Simulations A-C are shown in Table 1.
Assumption 3, required for Theorem 3 to hold, states that maxSimulation B satisfies this assumption, since
is block diagonal with completely dense blocks (and thus the same is true of
). As expected, the FPR is controlled successfully in Simulation B (Table 1).
However, Assumption 3 seems not to be satisfied by Simulations A and C. But Table 1 reveals that the FPR is approximately controlled in these settings, especially for larger values of q. How
can this be?
In order to investigate this, we consider the off-diagonal elements of under Simulations A-C. These are displayed in Figure 1. We see that even for Simulations A and C, the vast majority of the large off-diagonal elements of
correspond to non-zero elements of
Furthermore, in Simulation A, there is a very pronounced relationship between the values of the non-zero elements of , and the corresponding values of
. This is the case because in Simulation A,
was generated to be so sparse (Section 5.1) that, with high probability, a given column of
contains no more than one non-zero off-diagonal element. Consequently,
(approximately) a block-diagonal matrix with blocks containing no more than two features. And consequently the sparsity patterns of
are almost identical. Furthermore, there is a simple monotone relationship between most of the non-zero elements of the two matrices, which can be easily derived using the standard formula for the inverse of a 2
Figure 1: For Simulations A-C with n = 100 and p = 50, the off-diagonal elements of and
) are shown. The 0.5% of largest absolute off-diagonal elements of
are shown in red; the rest are in black. For all three setups, the vast majority of large off-diagonal elements of
correspond to non-zero elements of
. The pronounced relationships seen for Simulation A are due to the extreme sparsity of
, as is discussed in the text.
5.3 Comparison to Existing Approaches
We now compare the performances of the graphical lasso (Friedman et al., 2008), neighborhood selection (Meinshausen and B¨uhlmann, 2006), and GRASS on Simulations A-C, with n = 50 and p = 200. Results are displayed in Figure 2.
In Simulation B, recall that the sparsity patterns of are identical. In this setting, GRASS outperforms the graphical lasso and neighborhood selection, since it correctly assumes that the sparsity patterns of the covariance and precision matrices are similar. In Simulations A and C, even though this assumption does not hold exactly, the performance of GRASS is quite competitive.
Overall, in Figure 2, there is little difference in performance between GRASS, neighbourhood selection, and the lasso. That is, even though GRASS is an extremely simple approach, in practice GRASS performs competitively with specialized and computationally-intensive procedures for estimating a precision matrix.
Figure 3 displays the adjacency matrices corresponding to the true edge set, as well as the edge sets estimated using the graphical lasso and GRASS, for Simulations B and C. We find once again that the edge sets estimated by the graphical lasso and GRASS appear to be quite similar.
Table 1: The false positive rate (FPR; defined as FP/(FP+TN)) and false negative rate (FNR; defined as FN/(TP+FN)) are reported for various values of q, the level of desired FPR control. Results for two values of n and p, and for each of Simulations A-C, are reported. The value of — the number of elements in the GRASS estimate corresponding to the value of
reported. Results are averaged over 250 simulated data sets.
We examined a gene expression data set from Spira et al. (2007), previously studied in Danaher et al. (2013), and publicly available from the Gene Expression Omnibus (Barrett et al., 2007) at accession number GDS2771. The data consist of 22,283 microarray-derived gene expression measurements from large airway epithelial cells sampled from 97 patients with lung cancer and
Figure 2: For Simulations A-C with p = 200 and n = 50, the number of false and true positive edges detected is displayed as the tuning parameter is varied. Results are shown for graphical lasso (—), neighborhood selection (—), and GRASS (—).
90 controls. We limited our analysis to the 1,778 genes with the highest marginal variance. Each feature was standardized within each class, in order to have mean zero and standard deviation one among the cancer samples, and among the controls.
Our goal is to compare the performances of GRASS and graphical lasso in terms of edge set recovery. Unfortunately, as is typically the case for a high-dimensional biological data set, the true underlying conditional dependence relationships in this data are unknown. In other words, no gold standard is available.
Given the absence of a gold standard, we took the following approach. We split the control samples into two equally-sized sets, Set 1 and Set 2. We applied both graphical lasso and GRASS to each set. We refer to the resulting estimated edge sets as parameters were chosen such that
. In order to quantify the accuracy of the edges estimated on Set 2 by GRASS and the graphical lasso, we first treated the edges estimated by the graphical lasso on Set 1 as the gold standard, and then we treated the edges estimated by GRASS on Set 1 as the gold standard. In greater detail, we calculated:
Figure 3: For Simulations B (panels (a)-(c)) and C (panels (d)-(f)) with p = 100 and n = 50, the adjacency matrices corresponding to the true edge set (panels (a) and (d)), the graphical lasso estimate (panels (b) and (e)), and the GRASS estimate (panels (c) and (f)) are shown. The adjacency matrices for graphical lasso and GRASS are averaged over 10 simulated data sets; the color of a particular cell in the heatmap corresponds to the fraction of these 10 data sets for which the corresponding edge is estimated to be present. Results for Simulation A are not shown, since in that setting the true edge set is not fixed across the simulated data sets.
Note that in calculating these accuracies, we only considered feature pairs (i, j) for which GRASS and graphical lasso disagree over whether an edge is present in Set 2 (since feature pairs for which graphical lasso and GRASS agree on Set 2 are uninformative for our purposes).
The results, averaged over 20 splits of the data into Set 1 and Set 2, are summarized in Table 2. They indicate that regardless of whether graphical lasso or GRASS on Set 1 is treated as the gold standard, the results obtained by GRASS on Set 2 have better agreement with the gold standard than do the results obtained by the graphical lasso on Set 2. In other words, independent data provides greater evidence for edges estimated by GRASS than for edges estimated by the graphical lasso, regardless of whether the independent data is evaluated using GRASS or the graphical lasso.
Table 2: Mean (and standard error) of accuracy of graphical lasso (GL) and GRASS on gene expression data, over 20 splits of the observations into Set 1 and Set 2. , the size of the estimated edge set, is also reported. Regardless of whether GRASS or graphical lasso is treated as the gold standard, GRASS yields more accurate edge set recovery than does the graphical lasso. These results are based on an analysis of the control observations. Similar results are obtained from the cases (results not shown).
In this paper, we have proposed graphical sure screening (GRASS), a simple and efficient procedure for recovering the structure of a high-dimensional Gaussian graphical model. GRASS is a natural extension of sure screening approaches from the regression and classification frameworks into the
setting of Gaussian graphical modeling.
The theoretical results presented in Section 3 for GRASS require a very simple set of assumptions. In particular, Assumption 1, which guarantees that the covariance corresponding to an edge in the graph is not too small, suffices to ensure that GRASS has the sure screening property: that is, GRASS can dramatically reduce the size of the potential edge set while still containing the true edge set with probability tending to one. Unlike Ravikumar et al. (2011) and Meinshausen and B¨uhlmann (2006), no irrepresentablility conditions are required. And unlike Zhou et al. (2009), no restricted eigenvalue condition is required on the precision matrix.
The computational advantages of the GRASS framework over existing approaches to estimate a sparse precision matrix are dramatic: while approaches such as the graphical lasso typically require ) operations, GRASS requires only
) operations.
In practice, in order for GRASS to perform well, the non-zero elements of the precision matrix must tend to be non-zero in the covariance matrix. We have shown that this assumption typically holds in a range of simulation settings. Furthermore, GRASS outperforms the graphical lasso on a gene expression data set.
This work was supported in part by NSF Grants DMS 1007698 (R.S.) and DMS CAREER 1252624 (D.W.), NIH Grant 5DP5OD009145 (D.W.), and a Sloan Research Fellowship (D.W.).
First, we reproduce a result from van der Vaart and Wellner (1996) for the sake of readability.
Lemma 1 (Bernstein’s inequality, Lemma 2.2.11, van der Vaart and Wellner (1996))
be independent random variables with zero mean such that , for every
(and all i) and some constants
The following lemma, needed for the proof of Theorem 1, shows that a -distributed random variable satisfies the moment condition in Lemma 1.
. Then for some constant C, we have for all
This is not greater than for some constant C.
Proof of Theorem 1
Proof. To begin, we will show that there exist constants such that for any
By definition,
This can be bounded above by
where are independent standard normal random variables. Hence
distribution. Together, Lemmas 1 and 2 guarantee that there exists positive
constants
Next, we notice that
Now we note that , and that for (
, Assumption 1 implies that
Hence,
This implies the first part of Theorem 1. The second part of Theorem 1 can be established in a similar way, and hence we omit the details.
The following lemma will be used in the proof of Theorem 2.
-dimensional random vector with mean zero and variance Σ, and Y be a random variable with
. Assume that X and Y satisfy the
linear model is uncorrelated with X. Define
for some constant C. Then |S|, the cardinality of S, satisfies
Proof. Left-multiplying both sides by X and taking the expectation, ). Therefore
th element of the vector Σ
. Consequently,
Furthermore,
Moreover, recalling that are uncorrelated, we have that
Proof of Theorem 2
Proof. Let
and
By definition, . Then on the set
to
, it has to belong to
. Thus, we conclude that
). Moreover, an
argument similar to that in the proof of Theorem 1 can be used to show that
This implies that
Finally, applying Lemma 3 in conjunction with Assumption 2 yields the desired result.
Proof of Theorem 3
Proof. First, we will show that the assumptions of Theorem 1 are satisfied, so that the sure screening property applies. We must simply show that the new threshold,
greater than 2, the threshold used in Theorem 1. In other words, we must show that
From the fact that
we have that 1 ). Furthermore, since log(
we have that
). Using the fact that
, (A.3) follows directly.
Next, we show that using a threshold value of leads to control of the asymptotic expected false positive rate at
1)]. Recall that the false positive rate is defined
as
Because E(, it follows that
Furthermore, for any (
where the asymptotic equivalence in the previous line follows from the fact that
Consequently, the expectation of fpris controlled as desired,
where the last inequality is due to the fact that
Barrett, T., Troup, D. B., Wilhite, S. E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I. F., Soboleva, A., Tomashevsky, M., and Edgar, R. (2007). NCBI GEO: mining tens of millions of expression profiles – database and tools update. Nucleic Acids Research, 35(suppl 1):D760–D765.
Cai, T., Liu, W., and Luo, X. (2011). A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494).
Cai, T. T., Liu, W., and Zhou, H. H. (2012). Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. arXiv preprint arXiv:1212.2882.
Danaher, P., Wang, P., and Witten, D. M. (2013). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Fan, J., Feng, Y., and Song, R. (2011). Nonparametric independence screening in sparse ultra- high-dimensional additive models. Journal of the American Statistical Association, 106(494).
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911.
Fan, J., Samworth, R., and Wu, Y. (2009). Ultrahigh dimensional feature selection: beyond the linear model. The Journal of Machine Learning Research, 10:2013–2038.
Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with np- dimensionality. The Annals of Statistics, 38(6):3567–3604.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
Hall, P. and Miller, H. (2009). Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics, 18(3).
Hall, P., Titterington, D., and Xue, J.-H. (2009). Tilting methods for assessing the influence of components in a classifier. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(4):783–803.
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics, 37(6B):4254.
Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.
Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129–1139.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1980). Multivariate analysis.
Mazumder, R. and Hastie, T. (2012). Exact covariance thresholding into connected components for large-scale graphical lasso. The Journal of Machine Learning Research, 13(1):781–794.
Meinshausen, N. and B¨uhlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462.
Ravikumar, P., Lafferty, J., Liu, H., and Wasserman, L. (2009). Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030.
Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980.
Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515.
Spira, A., Beane, J. E., Shah, V., Steiling, K., Liu, G., Schembri, F., Gilman, S., Dumas, Y.-M., Calner, P., Sebastiani, P., et al. (2007). Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nature Medicine, 13(3):361–366.
Sun, T. and Zhang, C.-H. (2012). Sparse matrix inversion with scaled lasso. arXiv preprint arXiv:1202.2723.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
Witten, D. M., Friedman, J. H., and Simon, N. (2011). New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20(4):892–900.
Yuan, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research, 99:2261–2286.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35.
Zhao, S. D. and Li, Y. (2012). Principled sure independence screening for Cox models with ultra- high-dimensional covariates. Journal of Multivariate Analysis, 105(1):397–411.
Zhou, S., van de Geer, S., and B¨uhlmann, P. (2009). Adaptive lasso for high dimensional regression and gaussian graphical modeling. arXiv preprint arXiv:0903.2515.
Zhu, L.-P., Li, L., Li, R., and Zhu, L.-X. (2011). Model-free feature screening for ultrahighdimensional data. Journal of the American Statistical Association, 106(496).