When is Nontrivial Estimation Possible for Graphons and Stochastic Block Models?

2016·Arxiv

Abstract

Abstract

Block graphons (also called stochastic block models) are an important and widely-studied class of models for random networks. We provide a lower bound on the accuracy of estimators for block graphons with a large number of blocks. We show that, given only the number k of blocks and an upper bound on the values (connection probabilities) of the graphon, every estimator

graphons. In particular, our bound rules out any nontrivial estimation (that is, with substantially less than . Combined with previous upper and lower bounds, our results characterize, up to logarithmic terms, the accuracy of graphon estimation in the metric. A similar lower bound to ours was obtained independently by Klopp, Tsybakov, and Verzelen [17].

1. Introduction

Networks and graphs arise as natural modeling tools in many areas of science. In many settings, especially ones where edges in a network represent social ties, observed networks display some type of community structure, where the connectivity between nodes depends heavily on the communities they belong to. This type of structure is captured in the k-block graphon model, also known as the stochastic block models. The more communities we allow in the model (or “types” of nodes we consider), the richer the model becomes and the better we can hope to describe the real world.

Given an observed network, graphon estimation is the problem of finding a graphon model that approximates the process that gave rise to the network. In this paper, we are concerned with the fundamental limits of graphon estimation for block graphons. That is, given a n-node network that was generated from a k-block graphon, how accurately can you recover the graphon? We consider the “nonparametric” setting, where k may depend on n. Our lower bounds apply even to estimation algorithms that know the true number of blocks k (though this quantity typically needs to be estimated).

Many real world networks display the property that the average degree of the network is small compared to the number of nodes in the network. Graphons whose expected average degree is linear in n are called dense, while graphons whose expected average degree is sublinear in n are referred to as sparse. In this work, we prove a new lower bound for graphon estimation for sparse networks. In particular, our results rule out nontrivial estimation for very sparse networks (roughly, where )). An estimator is nontrivial if its expected error is significantly better than an estimator which ignores the input and always outputs the same model. It follows from recent work [21, 22, 23] that nontrivial estimation is impossible when ). Ours is the first lower bound that rules out nontrivial graphon estimation for large k. Previous work by Klopp, Tsybakov, and Verzelen [16] provides other lower bounds on graphon estimation that are tight in several regimes. In very recent work [17] that is concurrent to ours, the same authors provide a similar bound to the one presented here.

Block graphon models were introduced by Hoff et al. [15] under the name latent position graphs. Graphons play an important role in the theory of graph limits (see [20] for a survey) and the connection between the graph model and convergent graph sequences has been studied in both the dense and sparse settings [7, 8, 10, 9, 10]. Estimation for stochastic block models with a fixed number of blocks was introduced by Bickel and Chen [5], while the first estimation of the general model was proposed by Bickel, Chen, and Levina [6]. Many graphon estimation methods, with an array of assumptions on the graphon, have been proposed since; [19, 24, 18, 25, 12, 3, 26, 14, 2, 13, 1]. Gao et al. [14] provide the best known upper bounds in the dense setting while Wolfe and Olhede [25], Borgs et al. [11], Klopp et al. [16] give upper bounds for the sparse case.

1.1. Graphons.

Definition 1 (Bounded Graphons and W-random graphs). A (bounded) graphon W is a symmetric, measurable function W : [0, 1][0, 1]. Here, symmetric means that W(x, y) = W(y, x) for all ([0, 1].

For any integer n, a graphon W defines a distribution on graphs on n vertices as follows: First, select n labels uniformly and independently from [0, 1], and form an matrix H where ). We obtain an unlabeled, undirected graph G by connecting the ith and jth nodes with probability independently for each (i, j). The resulting random variable is called a W-random graph, and denoted ).

We denote the set of graphs with n nodes by , the set of graphons by W and the set of -bounded graphons by . If W is -bounded, then the expected number of edges in ) is at most ). In the case that is parametrised by n and lim0, we obtain a sparse graphon.

We consider the estimation problem: given parameters n and , as well as a graph ) generated from an unknown -bounded graphon W, how well can we estimate W?

A natural goal is to design estimators that produce a graphon ˆW that is close to W in a metric such as . This is not possible, since there are many graphons that are far apart in , but that generate the same probability distribution on graphs. If there exists a measure preserving map : [0, 1] [0, 1] such that )) = ) for all [0, 1], then ) and ) are identically distributed. The converse is true if we instead only require )) = ) almost everywhere. Thus, we wish to say that ˆW approaches the class of graphons that generate ). To this end, we use the following metric on the set of graphons,

where ) = )) and ranges over all measurable, measure-preserving maps. Two graphons W and generate the same probability distribution on the set of graphs if and only if ) = 0 (see, e.g. [20]).

Existing upper bounds for graphon estimation are based on algorithms that produce graphons of a particular form, namely block graphons, also called stochastic block models (even when it is not known that the true graphon is a block graphon).

Definition 2 (k-block graphon (stochastic block models)). For , a graphon is a k-block graphon if there exists a partition of [0, 1] into k measurable sets such that W is constant on for all i and j.

Graphons of this form can be generated from matrices. Given a matrix M, we can assign a k-block graphon W[M] with blocks = ( ] that takes the value on .

1.2. Main result. We are concerned with the problem of estimating a graphon, W, given a graph sampled from ). A graphon estimator is a function ˆthat takes as input a n node graph, that is generated according to W, and attempts to output a graphon that is close to

W. The main contribution of this paper is the development of the lower bound

Combined with previous work we can give the following lower bound on the error of graphon estimators.

where inf is the infimum over all estimators ˆand supis the supremum over all k-block, -bounded graphons.

Note that k and may depend on n. That is, the theorem holds if we consider sequences and . Our result improves on previously known results when and , that is, when the graphs produced by the graphon are sparse and k is large. The upper bound

by Klopp et al. [16] implies that our lower bound is almost tight. In particular, if k is constant then the lower bound in Theorem 3 is tight. In all cases, it is within a factor of at most max( 1) of the correct bound.

When , Theorem 3 implies that the error is Ω(), which is the error achieved

by the trivial estimator ˆW = 0. That is, in the sparse setting, the trivial estimator achieves the optimal error. To the authors’ knowledge this is the first result that completely rules out nontrivial estimation in the case where k is large. Recent concurrent work (Klopp et al. [17]) provides similar bounds.

The bound

is due to previous work of Klopp et al. [16] and the bound

follows from a result of Neeman and Netrapalli [23]. We give details on how to derive (3) from their results in Section 4.

1.3. Techniques: Combinatorial Lower Bounds for δp. Our proof of the main theorem will

involve Fano’s lemma. As such, during the course of the proof we will need to lower bound the packing number, with respect to , of a large set of k-block graphons. Whilst easily upper bounded, little is known about lower bounds on . To the authors’ knowledge, this work gives the first lower bound for the packing number of with respect to . We will also give a combinatorial lower bound for the metric that is easier to handle than the metric itself.

To understand our technical contributions, it helps to first understand a problem related to graphon estimation, namely that of estimating the matrix of probabilities H. Existing algorithms for graphon estimation are generally analyzed in two phases: first, one shows that the estimator ˆW is close to the matrix H (in an appropriate version of the metric), and then uses (high probability) bounds on ]) to conclude that ˆW is close to W. Klopp et al. [16] show tight upper and lower bounds on estimation of H. One can think of our lower bound as showing that the lower bounds on estimation of H can be transferred to the problem of estimating W.

The main technical difficulty lies in showing that a given pair of matrices A, B lead to graphons that are far apart in the metric. Even if A, B are far apart in, say, , they may lead to graphons that are close in . For consistency with the graphon formalism, we normalize the metric on matrices so that it agrees with the metric on the corresponding graphons. For a matrix A,

As an example of the discrepancy between the and metrics, consider the matrices

The matrices A and B have positive distance in the metric, = , but ]) = 0.

One can get an upper bound on ]) by restricting attention in the definition of to functions that permute the blocks . This leads to the following metric on matrices which minimizes over permutations of the rows and columns of one of the matrices:

where is the matrix with entries (. This metric arises in other work (e.g. [20]), and it is well known that

To prove lower bounds, we consider a new metric on matrices, in which we allow the rows and columns to be permuted separately. Specifically, let

where is the matrix with entries (.

Because ˆˆis defined “combinatorially” (that is, it involves minimization over a discrete set of size about 2, instead of over all measure-preserving injections), it is fairly easy to lower bound ˆˆ) for random matrices A, B using the union bound.

In particular, it allows us to give bounds on the packing number of with respect to the metric. If (Ω, d) is a metric space, 0 and Ω, then we define the -packing number of T to be the largest number of disjoint balls of radius that can fit in T , denoted by ). The following Proposition will be proved after the proof of Theorem 3.

Proposition 5. There exists C > 0 such that the -packing number of , equipped with , is 2, that is ) = 2.

Finally, we note that these techniques extend directly to the metric, for [1]. That is, we may define ˆand ˆˆanalogously to the definitions above, and obtain the bounds

along with similar lower bounds on the packing number.

1.4. Related work.

Work on graphon estimation falls broadly into two categories; estimating the matrix H and estimating the graphon W. When estimating H, the aim is to produce a matrix that is close in the metric to the true matrix of probabilities H that was used to generate the graph G. When estimating the graphon, our aim is the minimise the distance between the estimate and the true underlying graphon W that was used to generate G.

Gao et al. studied the problem of estimating the matrix of probabilities H given an instance chosen from W when = 1. They proved the following minimax rate for this problem when W is a k-block graphon;

where the infinimum is over all estimators ˆM from to the set of symmetric matrices, the supremum is over all probability matrices H generated from k-block graphons. Klopp et al. extended this result to the sparse case, proving that for all and 0 1,

where the supremum is over all probability matrices H generated from k-block, -bounded graphons.

Klopp et al. [16, Corollary 3] also studied the problem of estimating the graphon W. They proved that Equation (1) holds for any k-block, -bounded graphon, W, with . They also exhibited the first lower bound (known to us) for graphon estimation using the metric. They proved that Equation (2) holds for 0 and .

The related problems of distinguishing a graphon with k > 1 from an Erd¨os-R´enyi model with the same average degree (called the distinguishability problem) and reconstructing the communities of a given network (called the reconstruction problem) have also been widely studied. This problem is closely related to the problem of estimating H. Recent work by Mossel et al. [21] and Neeman and Netrapalli [23] establish conditions under which a k-block graphon is mutually contiguous to the Erd¨os-R´enyi model with the same average degree. Contiguity is essentially a condition that implies that no test could ever definitely determine which of the two graphons a given sample came from. There is a large body of work on algorithmic and statistical problems in this area and we have only cited work that is directly relevant here.

2. Lower Bound for the δ2 Metric

As mentioned earlier, the main technical contribution of this paper is lower bounding the metric by the more combinatorial ˆˆmetric. In this section we will prove the inequality given in Lemma 4.

Proposition 6. Let be k-block graphons with blocks = [ ) and : [0, 1] [0, 1] be a measure-preserving bijection. Then there exists a probability distribution P on such that

Proof. Let = and )). Now, consider a matrix P with . Noting that ) = 1/k and )) = 1/k, we can see that P is doubly stochastic, that is, the rows and columns of P sum to 1. Berkhoff’s theorem states that any doubly stochastic matrix can be written as a convex combination of permutation matrices. So

P = where ) = 1. Therefore, we have a probability distribution P on and

Now,

Proof of Lemma 4. Proposition 6 implies that for all measure preserving bijections : [0, 1] [0, 1] and matrices A and B we have

Since this is true for any , we have ˆˆ

3. Proof of Main Theorem

To prove the main theorem we will use Fano’s lemma to find a constant that lower bounds the

bound on the expected error. To that end, we aim to find a large set, T , of k-block graphons whose KL-diameter and -packing number with respect to with = min

bounded. Our proof is inspired by that of Gao et al.

Suppose p, q are probability distributions on the same space. Then the Kullback-Leibler (KL) divergence of p and q is defined by ) = (log )dp. For a collection T of probability distributions, the KL diameter is defined by

The following version of Fano’s lemma is found in [27].

Lemma 7 (Fano’s Inequality.). Let (Ω, d) be a metric space and be a collection of probability measures. For any totally bounded Ω and 0,

The following lemma gives us a way to easily upper bound the KL divergence between the distributions induced by two different graphons.

Proof. Let T be a variable denoting the choice of labels, so

Now,

where the inequality follows from the log-sum inequality. Now, the probability density function of T is the constant function 1 so it follows from Gao et al. [14, Proposition 4.2] that

Recall that we are aiming to define a large set of k-block matrices whose KL diameter can be

upper bounded and packing number with respect to (with = min(

)) can be lower bounded. The following lemma shows that there exists a large set of matrices who are pairwise far in Hamming distance, even after every possible permutation of the rows and columns. We will use this in the proof of Theorem 3 to define a large class of k-block graphons who are pairwise far in the ˆˆmetric and hence the metric. This gives us a bound on packing number.

Lemma 9. There exists a set S of symmetric binary matrices such that |S| = 2and if and then Ham() = Ω().

Proof. Consider two randomly chosen symmetric binary matrices and permutations and . For , let = 1 if and 0 otherwise so is a Bernoulli random variable with ] = . Thus, by a Chernoff bound,

Therefore, for randomly chosen ,

For a constant c > 0, consider the process that selects 2binary matrices uniformly at random uniformly at random. The probability that all pairs are at Hamming distance at least 6 is at least 1 . Selecting c sufficiently small, we get that at least one such set S exists.

We are not aware of an explicit construction of a large family of matrices that are far apart in ˆˆmetric; we leave such a construction as an open problem.

We now proceed to the proof of Theorem 3. We will use Lemma 9 to define a set T with packing number 2. The elements of T are all close in norm, so using Lemma 8 we get a bound on the KL diameter. We then directly apply these bounds to Fano’s lemma.

where inf is the infimum over all estimators ˆand supis the supremum over all k-block, -bounded graphons.

Proof. Let S be a set satisfying the conditions of Lemma 9 and let = min(1). For , define

where 1 is the all 1’s matrix and c is some constant that we will choose later. That is, (= [ + ] if = 1 and (] if = 0. Let then using Lemma 8 we conclude that for all , we have

Thus by Corollary 2,

Therefore, there exists D > 0 such that if min, we have log ) =

Ω(). Then, Fano’s lemma implies

We can choose c small enough that the right hand side is larger than a fixed constant for all k and n. Therefore, using Markov’s inequality we have

Proof of Proposition 5. During the course of the proof of Theorem 3 we construct 2graphons in that are pairwise at least Ω() apart in the distance for any c > 0 such that . Therefore, for some C > 0, the -packing number of is at least 2

4. Appendix

We show here how to derive the lower bound in (3) from the results of Neeman and Netrapalli [23].

where inf is the infimum over all estimators ˆand supis the supremum over all k-block, -bounded graphons.

Proof. Let = min

with equally sized blocks and edge probabilities between nodes with the same label, p, and with different labels, q. Let be the Erd¨os-R´enyi model with the same expected degree, d, as . Note that 0 ) so and

= ΩLet (Ω) be a sequence of measurable spaces, each equipt with two probability measures, and . We say and are mutually contiguous if for any sequence of events , we have lim0 if and only if lim0. Let = . A result of Neeman et al. [23], as presented by Banks et al. [4], implies that and are mutually contiguous if

A short calculation shows that this holds so we have that and are mutually contiguous. Choose C > 0 such that . Suppose, for sake of contradiction, that there did exist ˆW such that

Then due to the contiguity of and , lim( ˆ0 and lim( ˆ0. Therefore, for large enough n, there ex- ists a graph G such that ( ˆand ( ˆ, which implies that , which is a contradiction.

graphon for any so Lemma 12 implies

If then

If then the functions and

some i such that so

References

[1] E. Abbe and C. Sandon. Recovering communities in the general stochastic block model without knowing the parameters. arXiv:1503.00609, 2015.

[2] E. Abbe, A. S. Bandeira, and G. Hall. Exact recovery in the stochastic block model. arXiv:1405.3267, 2014.

[3] E. M. Airoldi, T. Costa, and S. Chan. A non-parametric perspective on network analysis: Theory and consistent estimation. In Advances in Neural Information Processing Systems (NIPS), volume 26, pages 692–700, 2013.

[4] J. Banks and C. Moore. Information-theoretic thresholds for community detection in sparse networks. arXiv preprint arXiv:1601.02658, Jan 2016.

[5] P. J. Bickel and A. Chen. A nonparametric view of network models and newman-girvan and other modularities. Proceedings of the National Academy of Sciences of the United States of America, 106:21068–21073, 2009.

[6] P. J. Bickel, A. Chen, and E. Levina. The method of moments and degree distributions for network models. Annals of Statistics, 39(5):2280–2301, 2011.

[7] C. Borgs, J. T. Chayes, L. Lov´asz, V. S´os, and K. Vesztergombi. Counting graph homo- morphisms. In Topics in Discrete Mathematics (eds. M. Klazar, J. Kratochvil, M. Loebl, J. Matousek, R. Thomas, P.Valtr), pages 315–371. Springer, 2006.

[8] C. Borgs, J. T. Chayes, L. Lov´asz, V. S´os, and K. Vesztergombi. Convergent graph sequences I: Subgraph frequencies, metric properties, and testing. Advances in Math., 219:1801–1851, 2008.

[9] C. Borgs, J. T. Chayes, H. Cohn, and Y. Zhao. An theory of sparse graph convergence I: limits, sparse random graph models, and power law distributions. arXiv:1401.2906, 2014.

[10] C. Borgs, J. T. Chayes, H. Cohn, and Y. Zhao. An theory of sparse graph convergence II: LD convergence, quotients, and right convergence. arXiv:1408.0744, 2014.

[11] C. Borgs, J. T. Chayes, and A. Smith. Private graphon estimation for sparse graphs. arXiv:1506.06162 [math.ST], 2015.

[12] S. H. Chan and E. M. Airoldi. A consistent histogram estimator for exchangeable graph models. Journal of Machine Learning Research Workshop and Conference Proceedings, 32: 208–216, 2014.

[13] S. Chatterjee. Matrix estimation by universal singular value thresholding. Annals of Statistics, 43(1):177–214, 2015.

[14] C. Gao, Y. Lu, and H. H. Zhou. Rate-optimal graphon estimation. arXiv:math/1410.5837v2, October 2014.

[15] P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.

[16] O. Klopp, A. Tsybakov, and N. Verzelen. Oracle inequalities for network models and sparse graphon estimation (version 1). arXiv:1507.04118v1, July 2015.

[17] O. Klopp, A. Tsybakov, and N. Verzelen. Oracle inequalities for network models and sparse graphon estimation (version 3). arXiv:1507.04118v3, March 2016.

[18] P. Latouche and S. Robin. Bayesian model averaging of stochastic block models to estimate the graphon function and motif frequencies in a w-graph model. ArXiv:1310.6150, 2013.

[19] J. R. Lloyd, P. Orbanz, Z. Ghahramani, and D. M. Roy. Random function priors for ex- changeable arrays with applications to graphs and relational data. In Advances in Neural Information Processing Systems (NIPS), volume 25, pages 1007–1015, 2012.

[20] L. Lov´asz. Large networks and graph limits. American Mathematical Society Colloquium Publications., 60, 2012.

[21] E. Mossel, J. Neeman, and A. Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3):431–461, 2014. ISSN 1432-2064. doi: 10.1007/s00440-014-0576-6. URL http://dx.doi.org/10.1007/s00440-014-0576-6.

[22] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for the planted bisection model. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC ’15, pages 69–75, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3536-2. doi: 10.1145/2746539.2746603. URL http://doi.acm.org/10.1145/2746539.2746603.

[23] J. Neeman and P. Netrapalli. Non-reconstructability in the stochastic block model. arXiv preprint arXiv:1404.6304, 2014.

[24] M. Tang, D. L. Sussman, and C. E. Priebe. Universally consistent vertex classification for latent positions graphs. Ann. Statist., 41(3):1406–1430, 06 2013. doi: 10.1214/13-AOS1112. URL http://dx.doi.org/10.1214/13-AOS1112.

[25] P. Wolfe and S. C. Olhede. Nonparametric graphon estimation. arXiv:1309.5936, 2013.

[26] J. J. Yang, Q. Han, and E. M. Airoldi. Nonparametric estimation and testing of exchangeable graph models. In Proceedings of 17th AISTATS (JMLR: W&CP volume 33), 2014.

[27] B. Yu. Assouad, fano, and le cam. In D. Pollard, E. Torgersen, and G. Yang, editors, Festschrift for Lucien Le Cam, pages 423–435. Springer New York, 1997. ISBN 978-1-4612-7323-3. doi: 10.1007/978-1-4612-1880-729. URL http://dx.doi.org/10.1007/978-1-4612-1880-7_29.

designed for accessibility and to further open science