Robust and computationally feasible community detection in the presence of arbitrary outlier nodes

2014·Arxiv

Abstract

Abstract

ROBUST AND COMPUTATIONALLY FEASIBLE COMMUNITY DETECTION IN THE PRESENCE OF ARBITRARY OUTLIER NODES

Received April 2014; revised November 2014. Supported in part by NSF Grants DMS-12-08982 and DMS-14-03708, NIH Grant R01 CA127334-05, and the Wharton Dean’s Fund for Post-Doctoral Research.

AMS 2000 subject classifications. 62H30, 91C20. Key words and phrases. Robust community detection, SDP relaxation, dual certificate, k-means clustering.

1. Introduction. Driven by applications in a wide range of ﬁelds, including engineering, genomics, sociology, psychology and computer science, analysis of graph and network data has drawn signiﬁcant recent interest. Random graph models have been introduced to characterize the structure of the networks and a large number of algorithmic approaches have been proposed for various applications. See, for example, Fienberg (2010, 2012), Goldenberg et al. (2010), and the references therein for overviews and recent work. An important problem in the analysis of network data is that of community detection which aims to cluster the nodes in a given graph into distinct groups or communities based on the observed undirected edges. Community detection has proven to be both technically and computationally challenging. It also has deep connections to other ﬁelds such as spin-glass theory and signal processing. In terms of statistical modeling, the most well-known model for community detection is perhaps the stochastic block model (SBM) proposed in Holland, Laskey and Leinhardt (1983). Under the SBM, the graph of interest is assumed to be a random one with independent edges, and the within-group edge density is assumed to be greater than the between-group edge density. To be speciﬁc, suppose G = (V,E) is a random graph where V is a ﬁxed set of vertices consisting of n nodes, and E is a random set of edges. Assume that the n nodes are indexed by [n] := {1,...,n} and each of these nodes belongs to one and only one of the r nonoverlapping groups. This amounts to assigning each node j ∈ [n] a group label by a labeling function φ(j) ∈ {1,... ,r}. We denote by A = (Aij)1≤i,j≤n the random adjacency matrix of this random graph. Then for each pair (i,j), 1 ≤ i,j ≤ n, Aij = 0 or 1, indicating whether the nodes i and j are connected or not, respectively. We only consider undirected graph with no self loops, so A is symmetric, and all its diagonal entries are 0. For pairs (i,j) with 1 ≤ i < j ≤ n, Aij’s are assumed to be independent Bernoulli random variables with parameters Bφ(i)φ(j), where the symmetric matrix B ∈ Rr×r is referred to as the connectivity matrix. In a basic model, denote by q+ and p− the maximum cross-group density and the minimum within-group density, namely q+ := max1≤i<j≤r Bij, p− := min1≤i≤r Bii.(1.1) Moreover, the within-group densities are assumed to be greater than the cross-group densities, that is, p− − q+ := δ > 0.(1.2) This is a common assumption in the literature of community detection under the SBM; see, for example, Rohe, Chatterjee and Yu (2011), Chaudhuri, Chung and Tsiatas (2012). Denote the minimum community size by nmin :=

min1≤l≤r |φ−1(l)|, where |S| denotes the cardinality of the set S. Then the diﬃculty of the community detection problem is determined by the tuple

Under the SBM, various community detection algorithms have been proposed and studied in the literature, with diﬀerent emphases on computational complexity and statistical accuracy. These include greedy algorithms, such as hierarchical agglomeration [see, e.g., Clauset, Newman and Moore (2004)]; greedy methods guided by global criterion maximization, such as modularity function maximization [see, e.g., Newman and Girvan (2004)] and proﬁle likelihood function maximization [see, e.g., Bickel and Chen (2009), Zhao, Levina and Zhu (2012)]; stochastic model based methods, such as variational likelihood methods [see, e.g., Bickel et al. (2013), Celisse, Daudin and Pierre (2012)], pseudo-likelihood methods with EM algorithm [see, e.g., Amini et al. (2013)], Bayesian methods with Gibbs sampling, Markov chain Monte Carlo and belief propagation [see, e.g., Snijders and Nowicki (1997), Nowicki and Snijders (2001), Decelle et al. (2011)]; graph distance methods [see, e.g., Bhattacharyya and Bickel (2014)]; spectral clustering, its variations and other spectral methods [see, e.g., McSherry (2001), Giesen and Mitsche (2005), Rohe, Chatterjee and Yu (2011), Chaudhuri, Chung and Tsiatas (2012), Coja-Oghlan and Lanka (2009/10), Balakrishnan et al. (2011), Sussman et al. (2012), Fishkind et al. (2013), Jin (2015), Joseph and Yu (2013), Sarkar and Bickel (2013), Lei and Rinaldo (2015)]; and convex optimization methods [see, e.g., Mathieu and Schudy (2010), Oymak and Hassibi (2011), Jalali et al. (2014), Ames and Vavasis (2014), Chen, Sanghavi and Xu (2012), Ames (2014)]. Among these methods, greedy methods are usually computationally feasible, while their statistical accuracy has not been fully established in theory. Modularity or proﬁle likelihood methods are proven to be consistent when the number of groups is ﬁxed. However, they are in principle computationally NP hard. Similarly, stochastic model based methods are usually computationally diﬃcult and not fully justiﬁed in theory. Spectral clustering is a popular algorithm for community detection, since it is fast in computation and easy to implement. It has been proven that spectral clustering is consistent even when the number of groups r grows on the order of O(√n). Although in practice spectral clustering is believed to work well only for dense graphs, several recent papers, Amini et al. (2013), Sarkar and Bickel (2013), Joseph and Yu (2013), Lei and Rinaldo (2015), have shown that spectral clustering or its variations also work well for sparse graphs. The SBM is admittedly an oversimpliﬁed model for many applications, and diﬀerent generalizations have been proposed in the literature, which encompass mixture model [see Newman and Leicht (2007)], where the parametric model for the connectivity probabilities is based on the relationship between vertices and groups, instead of between diﬀerent groups; degree

corrected model [see Coja-Oghlan and Lanka (2009/10), Karrer and Newman (2011), Zhao, Levina and Zhu (2012)]; Latent variable method [see Handcock, Raftery and Tantrum (2007)] and mixed membership model [see Airoldi et al. (2008)]. However, each of these GSBMs focuses on a single latent graph structure, while in practice, due to lack of information, this additional structure is not easy to detect if it only applies to a few nodes of the graph. Diﬀerent types of outliers may appear in a single graph, and it is diﬃcult to use a complex generalization of the SBM to model multiple types of outlier nodes. The SBM is usually the ﬁrst model to ﬁt the data because of its simple form, even if it is believed that there is possibly a small portion of nodes which are not modeled well. Robustness in presence of arbitrary outliers is an important property for given community detection algorithms. In this paper we consider robust community detection in the presence of arbitrary outlier nodes, and the main question we wish to answer is the following:

Does there exist a computationally fast community detection method that is robust to a portion of arbitrary outlier nodes with theoretical guarantees?

Our answer is aﬃrmative, and we will introduce our model, methodology, numerical results and theoretical guarantees with rigorous proofs in this paper. We begin by formalizing the GSBM which allows for a small portion of arbitrary nodes.

for community detection which covers a range of settings in practice where the usual SBM is not suitable. More speciﬁcally, we assume the undirected graph G = (V,E) has N := n + m nodes, among which there are n “inliers” obeying the SBM described above, while the other m nodes are “outliers” which are connected with the other nodes in an arbitrary way. We refer to

this model as generalized stochastic block model (GSBM ). Denote V = [N] =

I ∪O, where I is the set of indices of the inliers, while O is the set of indices of outliers. Each inlier node i ∈ I is assigned a label φ(i) ∈ {1,...,r}, while all outliers are simply labeled φ(i) = r + 1. For any two nodes i,j ∈ I, P((i,j) ∈ E) = Bφ(i)φ(j), and moreover we assume the event {(i,j) ∈ E}, i < j ∈ I are independent. The r×r symmetric connectivity matrix B only represents the likelihood of connectivity of the inlier nodes. The connectivity between the outliers and the inliers and the connectivity among the outliers themselves are arbitrary. The only restriction of the connectivity of the outliers is that there is no self-loop.

The GSBM can be equivalently expressed in terms of its adjacency matrix A. To be speciﬁc, deﬁne

... ... ... ...

P⊺,(1.3) where W ∈ Rm×m is an arbitrary symmetric 0–1 matrix with all diagonal entries being 0, Z ∈ Rn×m is an arbitrary 0–1 matrix, P is an unknown N × N permutation matrix, in which there is only one 1 in each row and column, while all other entries are 0’s, and K is an n × n symmetric matrix which captures the connectivity of the inliers, thus corresponding to the usual SBM. The oﬀ-diagonal entries of K are independent Bernoulli variables, with parameter Bij if the entry belongs to the submatrix Kij. Denote the dimension of Kii to be li for i = 1,...,r. Then n = �ri=1 li. Similar to SBM, nmin = min1≤i≤r li. The parameters p− and q+ are deﬁned as in (1.1) and δ in (1.2). Then the diﬃculty of community detection under the GSBM is parameterized by the tuple (n,m,r,p−,q+,nmin). Here we emphasize that Z and W are not necessarily ﬁxed with respect to the randomness of K. Both Z and W can depend on K in arbitrary forms. In other words, the connectivity between the outliers and the inliers is allowed to depend on the connectivity among the inlier nodes. This is also a generalization of standard SBM, where the connectivity between each pair of nodes is stochastically independent of the connectivity between other pairs. The GSBM is a ﬂexible model and is widely applicable. It covers various types of outliers which are common in practice, and we name a few as follows: • Mixed membership. The SBM assumes that each node belongs to one and only one predetermined cluster. If most nodes obey this property, while there is a small portion of nodes each belonging to more than one clusters, these nodes are referred to as having mixed membership. When only a small portion of nodes have mixed membership, it is natural to treat them as outliers in an ordinary SBM. • Hubs. In social networks and others, it is natural that some nodes have many more connections than most of others. Moreover, it is possible that these nodes belong to several groups without obvious bias to any speciﬁc one. These nodes are referred to as hubs, and can be treated as outliers in our GSBM. • Small clusters. The SBMs are usually employed to model big and significant clusters, while small clusters are diﬃcult to detect. Small clusters are often not detectable because they are too small and possibly weak.

The number of small clusters is also diﬃcult to estimate; however, this information is essential for most popular algorithms in the literature, such as spectral clustering and modularity methods. The nodes in the small clusters can be treated as outliers in our GSBM. • Independent neutral nodes. In a given graph, in addition to the well-classiﬁed nodes, there might be some nodes which do not belong to any signiﬁcant groups, and also have fewer connections than most other nodes. We refer to these objects as independent neutral nodes. For example, in the political blogs data set introduced later, a small portion of blogs have very few connections. Such blogs may have strong preference in politics; however, this cannot be seen from only the graph representation. Therefore, these nodes are regarded as independent neutral nodes, which are naturally taken as outliers. In practice, it is diﬃcult or even impossible to modify the usual SBM to model precisely the possible combinations of mixed membership, hubs, small clusters, independent neutral nodes and other types of settings. Moreover, complex statistical models may also result in overﬁtting and high computational complexity in clustering. Therefore, the SBM is usually set up based on the basic properties of the graph. For example, in the political blogs network application discussed in Section 4.2, an SBM with 2 clusters is preferred, since it is known that there are mainly two signiﬁcant clusters: liberals and conservatives. However, it is also known that there are many independent groups advocating various causes that lie outside of the two main clusters. The GSBM can also be taken as a criterion to evaluate the robustness of community detection algorithms. When an SBM is adopted based on the properties of most nodes of a given graph, or equivalently, most nodes can be well modeled by an SBM in use, the robustness of a given community detection algorithm depends on whether a small portion of outliers will completely change the clustering result, or most nodes can be still well clustered. Therefore, a graph clustering algorithm is robust if it is guaranteed to have good performance under the GSBM. 1.2. Organization of the paper. The rest of the paper is organized as follows. In Section 2 the method of convexiﬁed likelihood method is introduced, followed by a detailed alternating directional augmented Lagrangian algorithm. Section 3 is focused on the theoretical consistency of the convex optimization method in the inference of the underlying groups speciﬁed by the GSBM. Numerical results on the analyses of the simulated data and a real data set about political blogs are presented in Section 4. A discussion is given in Section 5, and the proofs of the main theoretical results are contained in Section 6. Additional technical proofs are given in the Supplementary Material.

2. Methodology. In this section we propose a community detection algorithm which is robust and computationally feasible with theoretical guarantee of consistency. In the literature, greedy algorithms such as hierarchical clustering are not fully justiﬁed in theory, while modularity and pro-ﬁle maximum likelihood methods are computationally NP hard. Stochastic model based methods, such as maximum likelihood or variational likelihood method, have been proven to have certain consistency when the number of blocks is ﬁxed as the number of nodes going to inﬁnity. However, they are also computationally diﬃcult. EM algorithm is naturally proposed for solving relevant maximum likelihood formulation, but there is no theoretical guarantee of convergence with reasonable rate. Bayesian methods such as Gibbs sampling and belief propagation have also been proposed in the literature without rigorous theoretical justiﬁcations. Unlike the aforementioned methods, the spectral clustering methods have the advantage of fast algorithms. Spectral clustering algorithms are easy to implement because there is no tuning parameter. Moreover, strong theoretical results have been established under various conditions; see the references mentioned in the previous section. However, as indicated in Joseph and Yu (2013), ordinary spectral clustering applied to the graph Laplacian may not work due to the existence of small and weak clusters. We use a simulated data set to illustrate that ordinary spectral clustering applied to the graph Laplacian or the adjacency matrix is not consistent under the GSBM. Other types of numerical examples can be found in Joseph and Yu (2013). First, we create a data set of n = 1000 nodes obeying the ordinary SBM with r = 2 clusters. We also assume that the two clusters are perfectly balanced; that is, there are 500 nodes in each cluster. The within-group probability is p = 0.17, while the cross-group probability is q = 0.11. Under this set-up, the adjacency matrix is shown as in Figure 1. Spectral clustering applied directly to either the graph Laplacian or the adjacency matrix of this graph data has good performance of clustering. To illustrate this, we plot the eigenvectors corresponding to the top two eigenvalues (in absolute value) of the graph Laplacian and the adjacency matrix, respectively, in Figure 1. In each case, the two eigenvectors combined are capable of discriminating between the two clusters. Therefore, spectral clustering methods work for our data set when there are no outliers. Now we consider the GSBM by adding only m = 30 outliers into the above model with r = 2 clusters. To specify this GSBM, it suﬃces to explain Z and W in (1.3). We assume W is the adjacency matrix of a random graph with 30 nodes and independent edges. Moreover, we assume the probability of connectivity is 0.7. We deﬁne Z as a 1000 × 30 with independent Bernoulli entries. We also let EZ = β1⊺ = [β,β,...,β]. The components of β are 1000 i.i.d. copies of U 2, and U is a uniform random variable on [0,1]. The unordered and ordered adjacency matrices are given in Figure 2.

Fig. 1. The upper left panel illustrates the adjacency matrix of 1000 nodes satisfying the ordinary SBM. The upper right panel is the adjacency matrix obtained by permuting the adjacency matrix such that nodes 1 to 500 belong to the same cluster while the remaining ones constitute another cluster. The lower left panel plots the eigenvectors of the graph Laplacian corresponding to the top 2 eigenvalues in absolute value (red for the first and black for the second), while those for the adjacency matrix are plotted in the lower right panel. In both cases, these two eigenvectors are capable of discriminating between the two communities.

Suppose the data set is still modeled approximately by the SBM with r = 2. For this new data set, the two eigenvectors corresponding to the top two eigenvalues (in absolute value) of the graph Laplacian or the adjacency matrix cannot discriminate between the two major clusters. Even if we treat the 30 outliers as a single group due to their homogeneous behavior in the graph and thereby use r = 3, the third eigenvector of the adjacency matrix is still unable to distinguish the two major clusters. The third eigenvector of the graph Laplacian can only discriminate a part of nodes in the two major clusters. Actually, our numerical simulation shows that after applying spectral clustering on the graph Laplacian with r = 3, the misclassiﬁcation rate among the inliers is above 30 percent. These three eigenvectors are plotted in Figure 2 for both cases. The ﬁgures indicate that standard spectral clustering is not a robust community detection method in the presence of very few adversarial outliers. It was shown in Joseph and Yu (2013) that under certain conditions, penalized spectral clustering may reduce the eﬀects of the small weak clus-

Fig. 2. The upper left panel illustrates the adjacency matrix of 1030 nodes satisfying the GSBM with two major clusters and 30 outliers. The upper right panel is obtained by permuting the nodes such that nodes belonging to the same group are consecutive. The lower left panel plots the eigenvectors of the graph Laplacian corresponding the top 3 eigenvalues in absolute value (red for the first, black for the second and green for the third), while those for the adjacency matrix are plotted in the lower right panel. Ordinary spectral clustering with r = 2 or r = 3 is ineffective or even powerless on this data set since the top three eigenvectors cannot clearly discriminate between the two main communities.

ters, but it is not clear whether penalized spectral clustering applied to the graph Laplacian can diminish the inﬂuence of other types of outliers. Another method to improve standard spectral clustering methods is to detect outlier nodes based on the ﬁrst several eigenvectors. However, it is not clear whether there exists an approach which can uniformly detect all kinds of outliers with a theoretical guarantee. In order to ﬁnd in one shot the major clusters among the inlier nodes, we introduce in Section 2.1 a convex optimization method as well as a detailed algorithm which is implementable. It will be shown in Section 3 that the proposed procedure is robust against a small portion of arbitrary outliers with theoretical guarantees. 2.1. Convex optimization. In this section, we will choose the method of semideﬁnite programming (SDP) to ﬁt the GSBM, followed by a k-means

clustering. Numerically, SDP is well known to be computationally feasible, and various eﬃcient algorithms were proposed for solving diﬀerent types of SDP. Theoretically, under the ordinary SBM, SDP methods are shown to be capable in detecting communities; see Mathieu and Schudy (2010), Oymak and Hassibi (2011), Jalali et al. (2014), Ames and Vavasis (2014), Chen, Sanghavi and Xu (2012). We propose a new convex optimization method inspired by existing SDP methods in the literature. The signiﬁcantly novel part is that we will prove that this SDP method can consistently cluster the nodes when there is a portion of arbitrary type of outliers. The formal statement is given in Section 3, and all the proofs are deferred to Section 6 and the supplemental article Cai and Li (2015). First, we derive the convex optimization from the viewpoint of ﬁtting a parametric model. This viewpoint was originally proposed in Chen, Sanghavi and Xu (2012), but we are going to derive a diﬀerent convex optimization. For now we only consider the ordinary SBM, which implies that m = 0 and N = n. By the deﬁnition of SBM, for all 1 ≤ i < j ≤ n, the events {Aij = 1} are independent. Recall that here A is the observed adjacency matrix. Moreover, we deﬁne a symmetric matrix X with all diagonal entries equal to 1. For any 1 ≤ i < j ≤ n, we let Xij = 0 if the labeling functions φ(i) ̸= φ(j), while Xij = 1 if φ(i) = φ(j). Obviously, this matrix X is of rank r since there are r groups. Moreover, we consider a special case of the ordinary SBM. Suppose 1 > p > q > 0. For any 1 ≤ i < j ≤ N, when Xij = 0 let P(Aij = 1) = q; otherwise let P(Aij = 1) = p. This gives log P(Aij = 1|Xij) = Xij log p + (1 − Xij)log q and log P(Aij = 0|Xij) = Xij log(1 − p) + (1 − Xij)log(1 − q). Since {Aij = 1} are independent events, we have the log-likelihood function ℓ(A|X) = �

[Aij(Xij log p + (1 − Xij)log q) + (1 − Aij)(Xij log(1 − p) + (1 − Xij)log(1 − q))]. For any ﬁxed p and q, given A, we would like to choose an appropriate X to maximize ℓ(A|X). If we let λ = log(1 − q) − log plog p − log q + log(1 − q) − log(1 − p),(2.1) since the diagonal entries of A are all equal to 0, the maximization is equivalent to

where JN is the N × N matrix with all entries 1. Now let us ﬁgure out the constraint of X. By the SBM, it is easy to check that X must have the following form:

...

P⊺,(2.2) where P is some unknown permutation matrix, while Js is an s × s matrix with all entries 1’s. Solving optimization (2.2) under such constraint is computationally infeasible, so we seek for some relaxed form. Here we notice there are three major features of X. First, it is positive semideﬁnite; second, all its entries are between 0 and 1; third, it is of rank-r, which is relatively low. If we convexify the second integer constraint and neglect the third requirement, the relaxed maximum likelihood method becomes

subject to �X ⪰ 0, 0 ≤ �Xij ≤ 1 for 1 ≤ i,j ≤ N. The above optimization method is diﬀerent from that in Chen, Sanghavi and Xu (2012), where the relaxation is based on the observation that X is of low rank and hence a nuclear norm penalization is added up to the original objective function. On the contrary, our convex relaxation is derived from the observation that X is both low-rank and positive semideﬁnite, and consequently we impose constraint of the positive semideﬁnite cone. Now let us come back to the robust community detection under the GSBM. To control the possible outliers as formalized in the GSBM model, for the convenience of theoretical analysis, we add an additional term in the objective function to penalize the trace min ⟨ �X,αIN − (1 − λ)A + λ(JN − IN − A)⟩ subject to �X ⪰ 0, 0 ≤ �Xij ≤ 1 for 1 ≤ i,j ≤ N, which is equivalent to min ⟨ �X,E⟩ subject to �X ⪰ 0,(2.3) 0 ≤ �Xij ≤ 1 for 1 ≤ i,j ≤ N, where E := αIN − (1 − λ)A + λ(JN − IN − A).(2.4)

Remark 2.1. At ﬁrst glance, there are seemingly two tuning parameters: α and λ. In our theoretical result as shown later in Section 3, the parameter α is required to be much greater than the number of outlier nodes m. The introduction of α amounts to the trace penalization of �X, which is usually adopted in the literature of SDP relaxation in order to recover a low-rank structure; see, for example, Cand`es, Strohmer and Voroninski (2013), Li and Voroninski (2013). In our problem, we intend to use (2.3) to solve for a low-rank matrix to reveal the clustering structure of the GSBM, so this trace penalization is possibly a natural heuristic. However, in our numerical simulations in Section 4, the clustering eﬀectiveness of the convex optimization method (2.3) is not signiﬁcantly improved by choosing a positive α. Instead, (2.3) works even by letting α be a small constant or zero. On the contrary, there is a risk for choosing a large α, which may result in a positive deﬁnite E. If so, the solution to (2.3) must be 0, which is useless in analyzing the networking data. Therefore, we only need to tune the parameter λ in practice, and it has a clear statistical meaning as indicated in (2.1) in a special case of the ordinary SBM. In Section 3, it is shown that if λ lies in an interval determined by p− and q+ as deﬁned in (1.1), under mild technical conditions, any solution �X to (2.3) is capable of detecting the underlying group structure among the inliers. A simple and heuristic data dependent choice of λ is given in Section 4, where we also show numerically that the performance of our method for clustering is robust to the choice of λ. When �X is obtained, in the pursuit of an explicit clustering solution, a further step of k-means clustering is conducted to the normalized column vectors of �X with k = r, provided the number of major clusters r is assumed known. Furthermore, in Section 3 it is shown that the misclassiﬁcation rate after the k-means clustering can be tightly controlled. In summary, our proposed community detection procedure consists of the following two steps: Step 1. Choose an appropriate tuning parameter λ, and then solve (2.3). The solution is denoted as �X. Step 2. Conduct k-means clustering algorithm to the normalized column vectors of �X with k = r, so that we can solve for the assigning function ˆφ that maps from {1 ≤ i ≤ N} to {1,...,r}. Finally, we introduce the augmented Lagrange multiplier algorithm to solve (2.3). Augmented Lagrange multiplier algorithms have been employed in a variety of SDP optimizations in order to recover the underlying low-rank matrix structure; see, for example, Lin, Liu and Su (2011), Cand`es et al. (2011), Jalali et al. (2014), Chen, Sanghavi and Xu (2012) and a nice

review paper on alternating direction method of multipliers (ADMM) Boyd et al. (2010). Notice that (2.3) can be rewritten as min

subject to Y = Z, where the indicator function ι(a ∈ A) is deﬁned as

By this deﬁnition, we can easily conclude that ι(a ∈ A) is a convex function if and only if A is a convex set. Deﬁne the augmented Lagrangian of this optimization problem as Lρ(Y,Z;Λ) := ι(Y ⪰ 0) + ι(0 ≤ Z ≤ JN) + ⟨Y,E⟩ + ρ2∥Y − Z + Λ∥2F . If both Λ and Z are ﬁxed, and we aim to minimize Lρ(Y,Z;Λ) with respect to Y, it is equivalent to minimizing ι(Y ⪰ 0) + ρ2

For any symmetric matrix X whose eigenvalue decomposition is VΣV⊺, deﬁne X+ := VΣ+V⊺. Then the solution to the above minimization has an explicit form argmin

Lρ(Y,Z;Λ) =

Remark 2.2. This step has dominating computational complexity in each iteration of ADMM. In fact, an exact implementation of this subproblem of optimization requires a full SVD of Z − Λ − Eρ , whose computational complexity is O(N 3). When N is as large as hundreds of thousands, the full SVD has scalability issue. An open question is how to facilitate the implementation, or whether there exists a surrogate that is computationally inexpensive. A possible remedy is applying the low-rank iterative method, which means in each iteration of ADMM, the full SVD is replaced by a partial SVD where only the leading eigenvalues and eigenvectors are computed. Although this type of method may be stuck in local minimizers, given the fact that SDP implementation can be viewed as a preprocessing before kmeans clustering, such a low-rank iterative method might be helpful. We leave this large-scale computing problem as a future research project.

On the other hand, if both Λ and Y are ﬁxed, to minimize Lρ(Y,Z;Λ) with respect to Z is equivalent to minimizing

Again, we have a closed-form solution argmin

Lρ(Y,Z;Λ) := min(max(Y + Λ,0),JN), which changes the negative entries of Y + Λ into zeros and those greater than one into one. As to the Lagrange multiplier, as the convention in the literature of augmented Lagrange multiplier algorithms, Λ is updated to Λ + (Y − Z). The above augmented Lagrange multiplier method derives an iterative algorithm for solving the convex optimization (2.3), which is summarized in Algorithm 1. In numerical simulations, we let Z0 = 0 and Λ0 = 0 for initialization, and simply choose ρ = 1 and run the algorithm for iter = 100 iterations. Numerical analyses of the algorithm applied to simulated data and a real data set of political blogs are deferred to Section 4, where its eﬃciency and eﬀectiveness are clearly demonstrated. Moreover, for the purpose of comparison, we also implement ordinary spectral clustering methods on the synthetic data sets. The numerical simulations clearly show that our method outperforms spectral clustering methods in terms of robustness against outliers. 3. Theoretical guarantees. In this section, we will introduce our main theoretical results that guarantee that the clustering procedure derived in the previous section can detect the underlying communities under the GSBM.

The following theorem provides an explicit condition of the parameters n, m, p−, q+ and nmin, as well as the tuning parameters (α,λ), under which the solution to (2.3) is capable of unveiling the underlying group structures among the inliers in presence of a portion of outlier confounders.

Theorem 3.1. Let A be the adjacency matrix of the semi-random graph under the GSBM, as defined in (1.3). Let be a solution to the semidefinite program (2.3) and the density gap be defined as in (1.2), and the minimum within-group density and the maximum cross-group density as in (1.1). As defined in Section 1, the integer n denotes the number of inlier nodes, m denotes the number of outlier nodes and denotes the minimum community size among the inliers. Suppose that 3m and

p− log n

+

+

+

(α − 2m)nmin

(3.1)

for some sufficiently large numerical constant C, and the tuning parameter

Theorem 3.1 guarantees that any solution to (2.3) �X satisﬁes �Xjk = 1 for φ(j) = φ(k) ≤ r, and �Xjk = 0 for φ(j) ̸= φ(k) and φ(j) ≤ r,φ(k) ≤ r. In other words, for each pair of inlier nodes j and k, whether they belong to the same group or not solely depends on whether �Xjk equals 1 or 0. It is noteworthy that condition (3.2) is similar to the tuning parameter condition imposed in Chen, Sanghavi and Xu (2012). To interpret condition (3.1), it is helpful to consider two examples. First, let us consider the very sparse case where p− ≃ q+ ≃ δ ≃ O(log nn ), nmin ≃ O(n) and hence r ≃ O(1). This condition implies that our procedure works even for a graph whose average degree of inlier nodes is on the oder of

O(log n). This is consistent with the best-known result in the literature of community detection without outliers by spectral clustering based on the adjacency matrices or graph Laplacians [see Lei and Rinaldo (2015)], although the log n barrier could be resolved by more sophisticated nonbacktracking matrix methods; see Krzakala et al. (2013). In this case, condition (3.1) becomes

�log n

Then by letting α = log N, m = log n outliers are allowed. In the second example, we assume δ ≃ p− ≃ q+ ≃ O(1), and the number of clusters r grows with n. As a speciﬁc example, we let r ≃ n1/4. Moreover, we assume nmin ≃ n3/4. Then condition (3.1) becomes

Then by letting α = N 3/4, m = O(n1/2−ε) outliers are allowed for any ε > 0. A prominent feature of Theorem 3.1 is its consistency with the state-of-the-art community detection under the ordinary SBM in the literature. Assume there is no outlier node, that is, m = 0, and we simply let α = O(1) or just α = 0. Then condition (3.1) becomes

p− log n

+

If the number of clusters is ﬁxed, that is, r = O(1), we also assume the size of the smallest community nmin = O(n). As mentioned above, this condition is guaranteed by letting the minimum within-group density p− to be

as low as ) and the density gap ). In another example

where p− = O(1), q+ = O(1) and δ = O(1), condition (3.1) is equivalent to nmin ≥ O(√nlogn). By modifying Lemma 6.7 as discussed in Section 6, this condition can be relaxed to nmin ≥ O(√n). This is consistent with the state-of-the-art result in the community detection literature by spectral clustering [see, e.g., Chaudhuri, Chung and Tsiatas (2012)], and planted partition [see, e.g., Giesen and Mitsche (2005), Shamir and Tsur (2007), Oymak and Hassibi (2011), Ames (2014), Chen, Sanghavi and Xu (2012)] where the within-group densities are usually assumed to be the same, so do the cross-group densities. The O(√n) barrier of the small cluster size is well known in the literature of planted clique problems; see Deshpande and Montanari (2015) and the references therein. Remark 3.1. The proof of Theorem 3.1 is involved, and the details are given in Section 6. It is helpful to understand the intuition behind the proof.

The optimization (2.3) consists of two parts: a linear objective function and a constraint set which is the intersection of a polytope and the semideﬁnite cone. In order to show that the solution of (2.3) has the form of (3.3), we ﬁnd a point on the boundary of the constraint set such that this point has the form of (3.3). Moreover, we prove that a level set of the linear objective function is tangent to the tangent cone of the constraint set at the selected point. This shows that the selected point is the solution of (2.3). It is noteworthy that the level set of the linear objective function is in fact a hyperplane with co-dimension 1, so the selected point is a sharp vertex of the constraint set. For more details, see the remark before the proof Lemma 6.10 in the supplemental article Cai and Li (2015). Theorem 3.1 shows that the semideﬁnite programming (2.3) can discriminate the diﬀerent groups among the inlier nodes. However, the clustering result is not clear by only the observation of �X, and it is not clear how the outliers could aﬀect the ﬁnal clustering result. Given the extra knowledge of the number of clusters, we propose to cluster the normalized column vectors of �X by k-means with parameter r. To be speciﬁc, without loss of generality, let us assume P = I, and deﬁne �X =

... ...

Moreover, deﬁne yi = xi/∥xi∥2. Then all yi’s belong to the set of Ndimensional vectors with two-norm 1 and all coordinates being nonnegative. Notice that if xi = 0, we then deﬁne yi as an arbitrary nonnegative vector with norm 1. Then, for any inlier indices i,j ∈ I and φ(i) ̸= φ(j), we have

and for any i,j ∈ I and φ(i) = φ(j) = k, we have

Moreover, for any yi and yj, since both of them are nonnegative, we have

By deﬁnition, the solution to the k-means applied to {y1,...,yN} is argmin

∥yj − µk∥2,(3.4)

where S = {S1,...,Sr} is all r nonoverlapping partitions of [N]. It is obvious

choose ˜µk as any vector yi belonging to the kth community, that is, φ(i) = k. Then there holds min

∥yj − ˜µr∥2(3.5)

2m

+ 2m = 2mr + 2m. Suppose the solution to the k-means clustering is �S1,..., �Sr and ˆµk =1 | �Sk| �yj∈ �Sk yj. For each j ∈ �Sk, deﬁne ˆφ(j) := k. Now we show that if m <

must account for more than 50 percent in some

cluster �Sk. Assume this is not true. Then there is a Di being minority in each �Sk, and hence for each yaj ∈ Di, there exists a ybj /∈ Di, but ˆφ(yaj) = ˆφ(ybj). Moreover, all these 2li indices are distinct. This implies

1

the assumption m < nmin2r+4. Since each Di is the majority of some estimated community �Sk, we can give the deﬁnition of misclassiﬁcation rate among the inliers: suppose there are p pairs (ya1,yb1),...,(yap,ybp) such that all 2p indices are distinct, 1 ≤ φ(yaj) < φ(ybj) ≤ r for all j = 1,...,p but ˆφ(yaj) = ˆφ(ybj). The mis-classiﬁcation rate among the inliers is deﬁned as max pn for all possible p satisfying the above property. Now we give an example showing why this deﬁnition of misclassiﬁcation rate is appropriate. Suppose n balls have r colors as well as m uncolored balls, and we assign them into r boxes. In the ith box, we assume there are si balls having color i, while there are ti balls which are colored other than i. Moreover, we also assume the assignment is acceptable in the sense that si > ti. In the ith box, there are at most ti distinct pairs of colored balls such that in each pair the colors are diﬀerent. By our deﬁnition, the misclassiﬁcation rate is t1+···+trn , which is the natural deﬁnition.

Back to our robust community detection problem, if we assume the mis-classiﬁcation rate among the inliers is pn, we have

Therefore, we have

2mr + 2m(1 − (m/nmin))n ≤ (2r + 3)mn

provided . In summary, we have proven the following theorem,

which guarantees that the misclassiﬁcation rate among the inliers can be well controlled:

Theorem 3.2. Suppose the assumptions in Theorem 3.1 hold as well as . Then, with high probability, the misclassification rate among the inlier nodes is less than or equal to

Rigorously speaking, k-means minimization is computationally NP-hard, although in practice it is often easy and fast to implement with a number of repetitions. However, as shown in Kumar, Sabharwal and Sen [(2011), Theorem 4.9], there is a (1 + ε) approximate k-means clustering for (3.4) with computational time O(2(r/ε)O(1)N 2), which is polynomial time when r is a constant. Suppose { ˇS1,..., ˇSr} is a polynomial time approximate kmeans solution, such that

∥yj − ˇµk∥2 ≤ (1 + ε) minS,µ1,...,µr

≤ (1 + ε)(2mr + 2m). Then if within the inliers there are p misclassiﬁed nodes by { ˇS1,..., ˇSr}, similarly to the previous argument, we get pn ≤ (1+ε)(2r+3)mn . When r grows with N, one can also cluster the rows of �X in (3.3) based on the ℓ1 distance. If two inlier nodes belong to the same community, their corresponding rows in �X have ℓ1 distance less than m; on the other hand, if two inlier nodes belong to diﬀerent communities, their corresponding rows have ℓ1 distance greater than 2nmin. If the number of outliers is far less than the minimum size of the major clusters, for example, nmin > O(m2), a pairwise comparison between the rows of �X can detect the inlier communities accurately even without the knowledge of r. However, this method does not work as eﬀectively as k-means clustering in numerical simulations. An

interesting direction for future research is to ﬁgure out whether there is a polynomial time (1 + ε) approximation k-means clustering for (3.4) when r grows with N. 4. Numerical results. In this section, synthetic data and a real-world network data are employed to demonstrate the eﬃciency and eﬀectiveness of our community detection procedure: convex optimization (2.3) followed by k-means. As discussed in Section 2, throughout all numerical simulations of the augmented Lagrange multiplier method Algorithm 1, we ﬁx α = 0. All simulations were carried out with MATLAB R2014b on a MacBook Pro with a 2.66 GHz Intel Core i7 Processor and 4GB 1067 MHz DDR3 memory. As indicated in Algorithm 1, for the initialization, let Λ0 = Z0 = 0. Also, we ﬁx iter = 100 and ρ = 1. As to the k-means clustering to the normalized columns of �X, we use the “kmeans” function in MATLAB with 100 replicates. 4.1. Synthesized data simulations. We consider again the synthetic data set used in Section 2. Figure 2 illustrates the adjacency matrix of a concrete realization of the original network. Suppose one knows that there are 2 major clusters, and a GSBM with r = 2 clusters is adopted. We now explain in detail our implementation of Algorithm 1. First, we need to choose the tuning parameter λ between the maximum cross-group density q+ and the minimum within-group density p−. Ideal choices of λ are formalized by condition (3.2) in Theorem 3.1. In practice, we propose a simple method to choose λ as the mean connectivity density in a subgraph determined by nodes with “moderate” degrees. If the adjacency matrix of the graph is denoted as A, and 1N represents the N-dimensional vector with all coordinates equal to 1, then A1N represents the degrees of the N nodes. The nodes with degrees above the 80th percentile or below the 20th percentile are eliminated from the graph, and λ is chosen as the mean density of the subgraph determined by the remaining nodes. The purpose of choosing nodes with moderate degrees is to diminish the inﬂuence of the outliers. Notice that the mean density of the subgraph may be very diﬀerent from the mean of A1N, which is usually signiﬁcantly aﬀected by the outliers. The convex method is implemented with λ mentioned above. As an illustration, in one realization of the synthetic data set, the solution to convex optimization (2.3), and the community detection result by further implementing k-means clustering with k = r = 2 are plotted in Figure 3. We generated 10 independent graphical data sets, and correspondingly implemented 10 trials of Algorithm 1 followed by k-means clustering, as well as spectral clustering on the graph Laplacians and adjacency matrices. The average misclassiﬁcation rate among the 1000 inlier nodes of our convex optimization method is 0.0063, which is much smaller than 1 percent. The average time cost for running Algorithm 1 followed by k-means

Fig. 3. On the right is the plot of the solution to convex optimization (2.3). Based on it, the community detection result followed by k-means algorithm is shown on the left.

clustering is 87.65 seconds. In contrast, if we apply spectral clustering to the graph Laplacians and adjacency matrices with k = 2, respectively, the average misclassiﬁcation rates among the 1000 inlier nodes are 0.4792 and 0.5000, which are almost equivalent to random guessing. If we treat the 30 outliers as an additional group, and apply spectral clustering to the graph Laplacians and adjacency matrices with k = 3, the misclassiﬁcation rates among all 1030 nodes are correspondingly 0.3083 and 0.4730. Consequently, the misclassiﬁcation rates are high in terms of detecting the two major clusters. Now let us study the sensitivity of our algorithm to the choice of λ. To be sure that λ is between q = 0.11 and p = 0.17, in Figure 4 the community detection results are plotted with λ = 0.11,0.12,... ,0.16. It is obvious that for our data set the clustering power is robust to λ, unless λ is too close to p. To our surprise, even when λ = q, the two major clusters are well clustered. This is possibly due to the facts that the graph is relatively sparse and the solution after 100 iterations is still not exactly the solution to (2.3). On the right of Figure 3, we see that the solution to (2.3) is close to but not exactly equal to what Theorem 3.1 predicts. A possible reason is that the density gap in our synthetic data is not large enough. It is interesting that although the solution does not have exactly the same form as in Theorem 3.1, the k-means in the second step can still successfully cluster the two groups of nodes. We replace the within density p = 0.17 with 0.19,0.21,... ,0.29, and the solutions to (2.3) are plotted in Figure 5, respectively. The solutions appear to be closer to the form in Theorem 3.1 as the density gap increases. 4.2. Real data application. In this section, our robust community detection procedure is tested by implementing a modiﬁed version of convex optimization (2.3) on a political blogs network data set analyzed in Adamic and Glance (2005). This network data set collected in 2005 is composed of political blogs and their connections by hyperlinks, and it demonstrates the

Fig. 4. The performance of convex community detection with different values of

division and interaction between the liberal and conservative blogs prior to the 2004 presidential election. By ignoring the directions of the hyperlinks and selecting the largest connected component, there are totally 1222 nodes and 16,714 edges, which implies that the average degree is about 27. As indicated in Zhao, Levina and Zhu (2012), the distribution of the degrees is highly skewed to the right and has high variability. Also, the political memberships of all blogs are clearly studied and labeled manually in Adamic and

Fig. 5. The solutions of (2.3) with different values of p.

Fig. 6. Political blogs data of two clusters of conservatives and liberals, along with the performance of convex optimization.

Glance (2005), and are treated as the truth for the purpose of evaluating the clustering eﬃcacy of diﬀerent algorithms. The upper left panel of Figure 6 plots the adjacency matrix of the observed political blogs network. Since the degrees in this real-world network data have high variability, most community detection methods derived from the simple SBM do not perform well. Instead, algorithms based on the so-called degree-corrected SBM are proposed and proven to work well. For instance, a polynomial time spectral method based on such a model is introduced in Coja-Oghlan and Lanka (2009/10). Back to convex optimization (2.3), modiﬁcation of the matrix E is needed to adapt to the heterogeneity of the degrees. As mentioned earlier in Section 4.1 on the synthetic data simulation, λ is chosen data dependently as the mean of the degrees in a trimmed graph. When the degrees have high variability, we propose to change the scalar matrix λIN to the diagonal matrix D = Diag(A1N)/N, the diagonal entries of which are the degrees of all nodes divided by N. In brief, the modiﬁed convex

optimization is (2.3) with

In the second step of our proposed community detection procedure, we choose k = r = 2 in the k-means clustering. As a result, our community detection procedure applied to the real-world network data set only costs 137.16 seconds to accurately cluster these 1222 nodes with a misclassiﬁca-tion rate about 63/1222 ≈ 0.052. The lower right panel of Figure 6 shows this clustering result by plotting the adjacency matrix of the clustered graph, in which two nodes are connected if and only if they are clustered in the same group. The misclassiﬁcation rate is comparable to the best-known results in the literature. The SCORE method proposed in Jin (2015) leads to a misclas-siﬁcation rate of 58/1222. Proﬁle likelihood method under degree-corrected SBM [Karrer and Newman (2011)] and Newman–Girvan modularity method [Newman and Girvan (2004), Zhao, Levina and Zhu (2012)] usually have misclassiﬁcation rates about 0.05. However, as indicated in Jin (2015), the tabu algorithm implemented to maximize these criteria is computationally expensive and is numerically unstable due to bad initializations. It is shown in Jin (2015) that the average misclassiﬁcation of the modularity method is about 105/1222 based on 100 independent repetitions. As to classical spectral clustering, the upper right and lower left panels of Figure 6 show that the two eigenvectors of the graph Laplacian/adjacency matrix corresponding to the top two eigenvalues are not capable in detecting and distinguishing the liberal and conservative political blogs. Hence, ordinary spectral clustering does not work when applied to this data set. A data-dependent penalized spectral clustering applied to the graph Laplacian was proposed in Joseph and Yu (2013), but the misclassiﬁcation rate is nearly 0.2, which is much worse than our result. 5. Discussion. In this paper we introduce the GSBM for robust community detection in the presence of arbitrary outlier nodes, and propose a computationally feasible method using convex optimization. Strong theoretical guarantees are established under mild technical conditions. In particular, when the number of clusters is ﬁxed and the edge density within the inliers

is ) outliers are allowed; when the edge density within the

inliers is on the order of O(1), and the number of clusters grows with n, for example, O(n1/4), our method is robust against O(n1/2−ε) adversarial outliers. Under the special case when there is no outlier node, our theoretical result is also consistent with the state-of-the-art results in the literature of computationally feasible community detection under the SBM.

There are a number of possible extensions to the current results. The proposed community detection procedure as well as the theoretical guarantees depend on the assumption δ = p− − q+ > 0. Although this assumption is common in the literature of community detection, it is actually a strong assumption which sometimes does not hold in real-world network data applications. For example, suppose there are r = 3 clusters, and the connectivity matrix is

For each node, its associated within-group density is bigger than its associated cross-group densities; however,

Therefore, in the current framework no choice of the tuning parameter λ is capable of the consistent community detection, which implies the matrix E in the convex optimization step must be modiﬁed. In fact, in our simulations, λ is replaced by a data-dependent diagonal matrix based on the degrees of all nodes in order to adapt to high-degree variation. We are interested in justifying this choice under the degree-corrected SBM proposed in CojaOghlan and Lanka (2009/10) and analyzed in Karrer and Newman (2011), Zhao, Levina and Zhu (2012), Chaudhuri, Chung and Tsiatas (2012), Jin (2015), Lei and Rinaldo (2015). In our numerical simulations, contrary to the established theoretical guarantees, the choices of α are much smaller than the number of outlier nodes m. In fact, the procedure works well with the choice α = 0. An open question is whether this tuning parameter is actually redundant. In addition, in the second step of our procedure, the number of major inlier clusters r is needed. Since the solution of the convex optimization usually increases the connections within the major groups and diminishes the connections across them, it is natural and interesting to investigate whether r can be inferred exactly from the data. For reasons of space, we leave these as future work.

6.1. Notation. Throughout the proofs we will use the following notation: the ℓ × ℓ identity matrix is denoted by Iℓ. An ℓ1 × ℓ2 matrix whose entries all equal to 1 is denoted as J(ℓ1,ℓ2). For square matrices, we write Jℓ := J(ℓ,ℓ). An ℓ-dimensional vector whose coordinates all equal to 1 is denoted as 1ℓ. If all coordinates of a vector v are nonnegative, we write v ≥ 0. When all coordinates of v are positive, we write v > 0. We use u ≥ v to denote

u − v ≥ 0, and similarly, u > v denotes u − v > 0. We also denote by ∥x∥∞ the maximum absolute values over all coordinates of x. Similarly, if all entries of the matrix M are nonnegative, we write M ≥ 0. When all entries of M are positive, we write M > 0. The inequality M1 ≥ M2 denotes M1 − M2 ≥ 0, while M1 > M2 denotes M1 − M2 > 0. Denote by ∥M∥∞ the maximum absolute value over all entries of M. The norms ∥ · ∥ and ∥ · ∥F represent the operator and Frobenius norms, respectively. We use M ≻ 0 to denote that the symmetric matrix M is positive deﬁnite and use M ⪰ 0 to denote that M is positive semideﬁnite. Similarly M1 ≻ M2 and M1 ⪰ M2 represent that M1 − M2 is positive deﬁnite and positive semideﬁnite, respectively. For any vector v ∈ Rn, we denote by Diag(v) the n × n diagonal matrix whose diagonal entries are correspondingly the coordinates of v. Denote by C,C0,c, etc. numerical constants, whose values could change from line to line. 6.2. Preliminaries. Before proving Theorem 3.1, we introduce several well-known theorems in linear algebra and probability theory. Lemma 6.1 (Weyl [Horn and Johnson (2013), Theorem 4.3.1]). Let H

and Hermitian matrices. Suppose that H+P, H and P have real eigenvalues , each arranged in algebraically nonincreasing order. Then for i = 1,...,n we have

λi(H) + λn(P) ≤ λi(H + P) ≤ λi(H) + λ1(P). Lemma 6.2 (Cauchy’s interlacing theorem [Horn and Johnson (2013),

Theorem 4.3.28]). Let Hermitian matrix and principal submatrix. Suppose that H and G have real eigenvalues and , each arranged in algebraically nonincreasing order. Then for j = 1,...,k we have

Lemma 6.3 (Chernoﬀ’s inequality [Chernoﬀ (1981)]). Let X1,...,Xn be

P(Xi = 1) = pi, P(Xi = 0) = 1 − pi.

Finally, we consider the following problem: suppose that A = (aij)1≤i,j≤n is a random symmetric matrix, whose diagonal entries are all zeros, while aij,1 ≤ i < j ≤ n are independent zero-mean Bernoulli random variables obeying |aij| ≤ 1 and Var(aij) ≤ σ2. Can we prove that with high probability, ∥A∥ ≤ C(σ√nlogn + log n) for some numerical constant C? In the sequel, this upper bound is derived by applying the following matrix Bernstein inequality, which is an improvement of Ahlswede and Winter (2002): Lemma 6.4 [Tropp (2012), Theorem 6.1]. Consider a ﬁnite sequence

of independent, random, self-adjoint matrices with dimension d. Assume that

be a symmetric random matrix whose diagonal entries are all zeros. Moreover, suppose independent zero-mean random variables satisfying . Then, with probability at least 1

∥A∥ ≤ C0(σ�nlogn + log n)

Proof. For each pair (i,j):1 ≤ i < j ≤ n, let Xij be the matrix whose (i,j) and (j,i) entries are both aij, whereas other entires are zeros. Then we have A = �

Moreover, we can easily have EXij = 0, ∥Xij∥ ≤ 1 and

They by applying Lemma 6.4, the proof is complete. □

6.3. Supporting lemmas. Notice that optimization (2.3) is determined by the adjacency matrix A. Here we derive some properties of A and leave the detailed proofs in the supplemental article Cai and Li (2015). More precisely, we give some properties of the random matrix K, which is a principal submatrix of A; see (1.3).

q+ log n

,(6.1)

for some sufficiently large numerical constant C, then with probability at least 1

Kii1li ≥ ((li − 1)Bii − 2�(li − 1)Bii log n)1li,(6.2)

�Bjk + δ16�lk1lj,(6.3) K⊺jk1lj ≤�Bjk + δ16�lj1lk,(6.4)

. With probability at least 1 we have

∥U∥ ≤ C0(�nq+ log n + log n),(6.7)

U :=

whose diagonal blocks are all are some numerical constants.

It is worth noting that by applying a very recent result Vu [(2014), Lemma 8], which is an improvement of F¨uredi and Koml´os (1981), Vu (2007),

we can prove ∥U∥ ≤ C0(�nq+ + √log n). Condition (3.1) in Theorem 3.1 can then be relaxed to

p− log n

+

+

+

(α − 2m)nmin

The beneﬁt is that when m = O(1), p− = O(1), q+ = O(1) and δ = O(1), nmin can be as small as O(√N) by letting α =√N. In particular, if there is no outlier node, that is, the ordinary SBM, this is consistent with the state-of-the-art result in the literature of computationally feasible community detection. 6.4. Proof of Theorem 3.1. In this section, we will rigorously prove Theorem 3.1. First, to simplify the calculations, we can assume the permutation matrix P to be the identity matrix IN. This suggestion is formalized by the following lemma:

Lemma 6.8. If Theorem 3.1 is true for , it is also true for any permutation matrix P.

The proof is given in the supplemental article Cai and Li (2015). Lemma 6.8 guarantees that in order to prove Theorem 3.1, we can assume without loss of generality that P = I, that is, A = [ K ZZ⊺ W]. In the following, we will prove Theorem 3.1 based on the following idea: In order to analyze a solution �X to (2.3), we need to explore several inequalities that it satisﬁes. The obvious ones are �X ⪰ 0 and 0 ≤ �X ≤ JN as the feasibility conditions in (2.3). However, the optimality condition of �X implies that for any feasible �X, we have ⟨ �X,E⟩ ≤ ⟨ �X,E⟩. To suﬃciently utilize this condition, we need to construct a feasible matrix X, such that ⟨ �X,E⟩ ≤ ⟨X,E⟩ is a tight constraint. In Section 6.4.1 we will show how to construct this X. After establishing these inequalities for any solution �X, we give in Section 6.4.2 a suﬃcient condition which guarantees that �X has the form (3.3) (with P = I), and then in Section 6.4.3 we prove that with high probability this suﬃcient condition is true by using the supporting lemmas proven previously. Consequently, these three steps imply Theorem 3.1. 6.4.1. Solution candidate. In this section, we will construct a candidate solution X feasible to (2.3). Denote

:=

... ... ... ...

which is equivalent to deﬁning �Zi = λJ(li,m) − Zi, i = 1,...,r,(6.8)

W = (α − λ)Im + λJm − W.(6.9) The following lemma, the proof of which is given in the supplemental article Cai and Li (2015), guarantees the existence of r vectors x1,...,xr ∈ Rm, which will be employed to construct a candidate solution:

min

exists uniquely. Moreover, denote the solutions by definition satisfy . Then there are nonnegative vectors nonnegative diagonal matrix

Ξ = diag(ξ1,...,ξm),

Wxi + �Z⊺i 1li = βi − Ξxi,(6.11)

= 0, j = 1,...,m(6.12)

⟨xi,βi⟩ = 0, i = 1,...,r.(6.13)

x⊺j(�W + Ξ)xk ≤ m�ljlk.(6.14)

xik.(6.15)

0 ≤ βi ≤ (m + li − 1)1m.(6.16) Throughout the paper, we deﬁne V := [v1,...,vr] :=

... ... ... ...

and

... ... ... ...

Since xi’s are feasible to optimization (6.10), we can easily see that X is feasible to optimization (2.3). We aim to prove that under mild technical conditions, X is actually a solution to optimization (2.3).

propose a condition which guarantees that any solution �X to (2.3) must be in the form of (3.3) with P = IN. This suﬃcient condition is equivalent to constructing a matrix Λ satisfying a series of equalities and inequalities as indicated in the following lemma. We call it a dual certiﬁcate. In Section 6.4.3, we will show that with high probability, this dual certiﬁcate can be constructed in an explicit way.

are defined as in Lemma 6.9. If there exist symmetric matrices

Λ =

(6.17)

satisfies , then any minimizer (2.3) must be of the form

�X =

An intuition behind the theorem and the rigorous proof are given in the supplemental article Cai and Li (2015). It is noteworthy that the condition on Λ is weaker if the number of clusters r gets smaller. The reason is that the equality condition is ΛV = 0. Obviously when r gets smaller, V has fewer columns, and hence the equality constraint becomes milder. We emphasize that the choices of Ψii and Φij are intended to ﬁt the equality constraint of Λ, that is, ΛV = 0. To make sure Λ ⪰ 0, we need to ﬁrst project Λ onto the orthogonal compliment of V, and then show the projection is positive deﬁ-nite. This is based on the spectral norm bound as indicated in Lemma 6.7, which provides a concentration inequality for a random matrix. 6.4.3. Construction of dual certiﬁcate. It suﬃces to construct a matrix Λ in the form of (6.17) in Lemma 6.10, which satisﬁes ΛV = 0, Ψii > 0, Φjk > 0 and Λ ⪰ 0. The following lemma guarantees the existence of such Λ, and its proof is given in the supplemental article Cai and Li (2015).

Moreover, assume

p− log n

+

+

+

(α − 2m)nmin

(6.18)

for some sufficiently large numerical constant C. Then, with probability at least 1, there exist matrices ’s satisfying and the matrix

SUPPLEMENTARY MATERIAL

Supplemental materials to “Robust and computationally feasible community detection in the presence of arbitrary outliers nodes”

(DOI: 10.1214/14-AOS1290SUPP; .pdf). We give in the supplement proofs

REFERENCES

Adamic, A. and Glance, N. (2005). The political blogosphere and the 2004 US election: Divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery 36–43. ACM, New York.

Ahlswede, R. and Winter, A. (2002). Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory 48 569–579. MR1889969

Airoldi, E., Blei, M., Fienberg, S. and Xing, E. (2008). Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9 1981–2014.

Ames, B. P. W. (2014). Guaranteed clustering and biclustering via semidefinite programming. Math. Program. 147 429–465. MR3258531

Ames, B. P. W. and Vavasis, S. A. (2014). Convex optimization for the planted k-disjoint-clique problem. Math. Program. 143 299–337. MR3152071

Amini, A. A., Chen, A., Bickel, P. J. and Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist. 41 2097–2122. MR3127859

Balakrishnan, S., Xu, M., Krishnamurthy, A. and Singh, A. (2011). Noise thresholds for spectral clustering (NIPS 2011). Adv. Neural Inf. Process. Syst. 25 954–962.

Bhattacharyya, S. and Bickel, P. J. (2014). Community detection in networks using graph distance. Available at arXiv:1401.3915.

Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl. Acad. Sci. USA 106 21068–21073.

Bickel, P., Choi, D., Chang, X. and Zhang, H. (2013). Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann. Statist. 41 1922–1943. MR3127853

Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Faund. Trends Mach. Learn. 3 1–122.

computationally feasible community detection in the presence of arbitrary outlier nodes.” DOI:10.1214/14-AOS1290SUPP.

Cand`es, E. J., Strohmer, T. and Voroninski, V. (2013). PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math. 66 1241–1274. MR3069958

Cand`es, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? J. ACM 58 Art. 11, 37. MR2811000

Celisse, A., Daudin, J.-J. and Pierre, L. (2012). Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat. 6 1847–1899. MR2988467

Chaudhuri, K., Chung, F. and Tsiatas, A. (2012). Spectral clustering of graphs with general degrees in the extended planted partition model. J. Mach. Learn. Res. 23 35.1– 35.23.

Chen, Y., Sanghavi, S. and Xu, H. (2012). Clustering sparse graphs. Adv. Neural Inf. Process. Syst. 25 2213–2221.

Chernoff, H. (1981). A note on an inequality involving the normal distribution. Ann. Probab. 9 533–535. MR0614640

Clauset, A., Newman, M. and Moore, C. (2004). Finding community structure in very large networks. Phys. Rev. E 70 066111.

Coja-Oghlan, A. and Lanka, A. (2009/10). Finding planted partitions in random graphs with general degree distributions. SIAM J. Discrete Math. 23 1682–1714. MR2570199

(2011). Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 84 066106.

Deshpande, Y. and Montanari, A. (2015). Finding hidden cliques of sizenearly linear time. Found. Comput. Math. DOI:10.1007/s10208-014-9125-y. To appear.

Fienberg, S. E. (2010). Introduction to papers on the modeling and analysis of network data. Ann. Appl. Stat. 4 1–4. MR2758081

Fienberg, S. E. (2012). A brief history of statistical models for network analysis and open challenges. J. Comput. Graph. Statist. 21 825–839. MR3005799

Fishkind, D. E., Sussman, D. L., Tang, M., Vogelstein, J. T. and Priebe, C. E. (2013). Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM J. Matrix Anal. Appl. 34 23–39. MR3032990

(1981). The eigenvalues of random symmetric matrices. Combinatorica 1 233–241. MR0637828

Giesen, J. and Mitsche, D. (2005). Reconstructing many partitions using spectral techniques. In Fundamentals of Computation Theory. Lecture Notes in Computer Science 3623 433–444. Springer, Berlin. MR2194866

Goldenberg, A., Zheng, A. X., Fienberg, S. E. and Airoldi, E. M. (2010). A survey of statistical network models. Foundations and Trends in Machine Learning 2 129–233.

Handcock, M. S., Raftery, A. E. and Tantrum, J. M. (2007). Model-based clustering for social networks. J. Roy. Statist. Soc. Ser. A 170 301–354. MR2364300

Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Soc. Netw. 5 109–137. MR0718088

Horn, R. A. and Johnson, C. R. (2013). Matrix Analysis, 2nd ed. Cambridge Univ. Press, Cambridge. MR2978290

Jalali, A., Chen, Y., Sanghavi, S. and Xu, H. (2014). Clustering partially observed graphs via convex optimization. J. Mach. Learn. Res. 15 2213–2238.

Jin, J. (2015). Fast network community detection by SCORE. Ann. Statist. 43 57–89.

Joseph, A. and Yu, B. (2013). Impact of regularization on spectral clustering. Available at arXiv:1312.1733.

Karrer, B. and Newman, M. E. J. (2011). Stochastic blockmodels and community structure in networks. Phys. Rev. E (3) 83 016107, 10. MR2788206

Zhang, P. (2013). Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. USA 110 20935–20940. MR3174850

Kumar, A., Sabharwal, Y. and Sen, S. (2011). A simple linear time (1 + approximation algorithm for k-means clustering in any dimensions. J. ACM 58 11.

Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. Ann. Statist. 43 215–237. MR3285605

Li, X. and Voroninski, V. (2013). Sparse signal recovery from quadratic measurements via convex programming. SIAM J. Math. Anal. 45 3019–3033. MR3106479

Lin, Z., Liu, R. and Su, Z. (2011). Linearized alternating direction method with adaptive penalty for low rank representation. In Advances in Neural Information Processing Systems (NIPS) 612–620.

Mathieu, C. and Schudy, W. (2010). Correlation clustering with noisy input. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms 712–728. SIAM, Philadelphia, PA. MR2768627

McSherry, F. (2001). Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science (Las Vegas, NV, 2001) 529–537. IEEE Computer Soc., Los Alamitos, CA. MR1948742

Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69 026113.

Newman, M. and Leicht, E. (2007). Mixture models and exploratory analysis in networks. Proc. Natl. Acad. Sci. USA 104 9564–9569.

Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc. 96 1077–1087. MR1947255

Oymak, S. and Hassibi, B. (2011). Finding dense clusters via low rank + sparse decomposition. Available at arXiv:1104.5186.

Rohe, K., Chatterjee, S. and Yu, B. (2011). Spectral clustering and the highdimensional stochastic blockmodel. Ann. Statist. 39 1878–1915. MR2893856

Sarkar, P. and Bickel, P. J. (2013). Role of normalization in spectral clustering for stochastic blockmodels. Available at arXiv:1310.1495.

Shamir, R. and Tsur, D. (2007). Improved algorithms for the random cluster graph model. Random Structures Algorithms 31 418–449. MR2362638

Snijders, T. A. B. and Nowicki, K. (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classification 14 75–100. MR1449742

Sussman, D. L., Tang, M., Fishkind, D. E. and Priebe, C. E. (2012). A consistent adjacency spectral embedding for stochastic blockmodel graphs. J. Amer. Statist. Assoc. 107 1119–1128. MR3010899

Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12 389–434. MR2946459

Vu, V. H. (2007). Spectral norm of random matrices. Combinatorica 27 721–736. MR2384414

Vu, V. (2014). A simple SVD algorithm for finding hidden partitions. Available at arXiv:1404.3918.

Zhao, Y., Levina, E. and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist. 40 2266–2292. MR3059083

designed for accessibility and to further open science