b

DiscoverSearch
About
My stuff
Robust and computationally feasible community detection in the presence of arbitrary outlier nodes
2014·arXiv
Abstract
Abstract

ROBUST AND COMPUTATIONALLY FEASIBLE COMMUNITY DETECTION IN THE PRESENCE OF ARBITRARY OUTLIER NODES1

image

Received April 2014; revised November 2014. 1Supported in part by NSF Grants DMS-12-08982 and DMS-14-03708, NIH Grant R01 CA127334-05, and the Wharton Dean’s Fund for Post-Doctoral Research.

AMS 2000 subject classifications. 62H30, 91C20. Key words and phrases. Robust community detection, SDP relaxation, dual certificate, k-means clustering.

image

image

Does there exist a computationally fast community detection method that is robust to a portion of arbitrary outlier nodes with theoretical guarantees?

image

this model as generalized stochastic block model (GSBM ). Denote V = [N] =

image

image

Fig. 1. The upper left panel illustrates the adjacency matrix of 1000 nodes satisfying the ordinary SBM. The upper right panel is the adjacency matrix obtained by permuting the adjacency matrix such that nodes 1 to 500 belong to the same cluster while the remaining ones constitute another cluster. The lower left panel plots the eigenvectors of the graph Laplacian corresponding to the top 2 eigenvalues in absolute value (red for the first and black for the second), while those for the adjacency matrix are plotted in the lower right panel. In both cases, these two eigenvectors are capable of discriminating between the two communities.

image

Fig. 2. The upper left panel illustrates the adjacency matrix of 1030 nodes satisfying the GSBM with two major clusters and 30 outliers. The upper right panel is obtained by permuting the nodes such that nodes belonging to the same group are consecutive. The lower left panel plots the eigenvectors of the graph Laplacian corresponding the top 3 eigenvalues in absolute value (red for the first, black for the second and green for the third), while those for the adjacency matrix are plotted in the lower right panel. Ordinary spectral clustering with r = 2 or r = 3 is ineffective or even powerless on this data set since the top three eigenvectors cannot clearly discriminate between the two main communities.

image

image

image

image

image

image

image

image

Theorem 3.1. Let A be the adjacency matrix of the semi-random graph under the GSBM, as defined in (1.3). Let �Xbe a solution to the semidefinite program (2.3) and the density gap  δbe defined as in (1.2), and the minimum within-group density  p− and the maximum cross-group density  q+ be definedas in (1.1). As defined in Section 1, the integer n denotes the number of inlier nodes, m denotes the number of outlier nodes and  nmindenotes the minimum community size among the inliers. Suppose that  p− ≥ C log nnmin , α ≥3m and

image

for some sufficiently large numerical constant C, and the tuning parameter

image

image

image

image

as low as  O(log nn ) and the density gap  δ = O(log nn ). In another example

image

image

image

image

image

image

image

image

2r+4, each Di, i = 1,...,rmust account for more than 50 percent in some

image

image

image

provided  m < nmin2r+4. In summary, we have proven the following theorem,

Theorem 3.2. Suppose the assumptions in Theorem 3.1 hold as well as m < nmin2r+4. Then, with high probability, the misclassification rate among the inlier nodes  i ∈ Iis less than or equal to (2r+3)mn .

image

Fig. 3. On the right is the plot of the solution to convex optimization (2.3). Based on it, the community detection result followed by k-means algorithm is shown on the left.

image

Fig. 4. The performance of convex community detection with different values of  λ.

image

Fig. 5. The solutions of (2.3) with different values of p.

image

Fig. 6. Political blogs data of two clusters of conservatives and liberals, along with the performance of convex optimization.

image

is  O(log nn ), O(log n) outliers are allowed; when the edge density within the

image

image

image

and  P be two n×nHermitian matrices. Suppose that H+P, H and P have real eigenvalues  {λi(H+P)}ni=1, {λi(H)}ni=1 and {λi(P)}ni=1, each arranged in algebraically nonincreasing order. Then for i = 1,...,n we have

Theorem 4.3.28]). Let  H be an n × nHermitian matrix and  G its k × kprincipal submatrix. Suppose that H and G have real eigenvalues  {λi(H)}ni=1and  {λi(G)}ki=1, each arranged in algebraically nonincreasing order. Then for j = 1,...,k we have

image

image

image

{Xk}of independent, random, self-adjoint matrices with dimension d. Assume that

image

Corollary 6.5. Let A = (aij)1≤i,j≤nbe a symmetric random matrix whose diagonal entries are all zeros. Moreover, suppose  aij, 1 ≤ i < j ≤ n areindependent zero-mean random variables satisfying  |aij| ≤ 1 and Var(aij) ≤σ2. Then, with probability at least 1  − cn4 , we have

image

image

image

image

image

for some sufficiently large numerical constant C, then with probability at least 1  − 2n − 2rn2 , for all i = 1,...,r and 1 ≤ j < k ≤ r, we have

image

Lemma 6.7. Suppose p− ≥ C( log nnmin). With probability at least 1  − c rn4min ,we have

image

image

image

whose diagonal blocks are all  0’s. Here C, C0 and care some numerical constants.

Lemma 6.8. If Theorem 3.1 is true for  P = IN, it is also true for any permutation matrix P.

image

image

image

image

image

exists uniquely. Moreover, denote the solutions by  x1,...,xr ∈ Rm, which bydefinition satisfy  ∥xi∥∞ ≤ 1. Then there are nonnegative vectors  β1,...,βr ∈Rm and an m × mnonnegative diagonal matrix

image

image

image

image

image

image

image

image

image

image

Lemma 6.10. Suppose Ξ and β1,...,βrare defined as in Lemma 6.9. If there exist symmetric matrices  Λ ∈ RN×N, Ψjj ∈ Rlj×lj (1 ≤ j ≤ r) and

image

image

satisfies  Ψii > 0, Φjk > 0, ΛV = 0 and Λ ⪰ 0, then any minimizer �X to(2.3) must be of the form

image

image

Lemma 6.11. Suppose p− ≥ C( log nnmin), q+ + δ4 < λ < p− − δ4 and α ≥ 3m.Moreover, assume

image

for some sufficiently large numerical constant C. Then, with probability at least 1− 1n − 2rn2 − crn4min , there exist matrices  Ψii’s and Φjk’s satisfying  Ψii > 0,Φjk > 0and the matrix  Λ defined by Ψii’s and Φjk’s obey ΛV = 0 andΛ ⪰ 0.

Supplemental materials to “Robust and computationally feasible community detection in the presence of arbitrary outliers nodes”

image

Adamic, A. and Glance, N. (2005). The political blogosphere and the 2004 US election: Divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery 36–43. ACM, New York.

Ahlswede, R. and Winter, A. (2002). Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory 48 569–579. MR1889969

Airoldi, E., Blei, M., Fienberg, S. and Xing, E. (2008). Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9 1981–2014.

Ames, B. P. W. (2014). Guaranteed clustering and biclustering via semidefinite programming. Math. Program. 147 429–465. MR3258531

Ames, B. P. W. and Vavasis, S. A. (2014). Convex optimization for the planted k-disjoint-clique problem. Math. Program. 143 299–337. MR3152071

Amini, A. A., Chen, A., Bickel, P. J. and Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist. 41 2097–2122. MR3127859

Balakrishnan, S., Xu, M., Krishnamurthy, A. and Singh, A. (2011). Noise thresholds for spectral clustering (NIPS 2011). Adv. Neural Inf. Process. Syst. 25 954–962.

Bhattacharyya, S. and Bickel, P. J. (2014). Community detection in networks using graph distance. Available at arXiv:1401.3915.

Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl. Acad. Sci. USA 106 21068–21073.

Bickel, P., Choi, D., Chang, X. and Zhang, H. (2013). Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann. Statist. 41 1922–1943. MR3127853

Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Faund. Trends Mach. Learn. 3 1–122.

Cai, T. and Li, X. (2015). Supplement to “Robust andcomputationally feasible community detection in the presence of arbitrary outlier nodes.” DOI:10.1214/14-AOS1290SUPP.

Cand`es, E. J., Strohmer, T. and Voroninski, V. (2013). PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math. 66 1241–1274. MR3069958

Cand`es, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? J. ACM 58 Art. 11, 37. MR2811000

Celisse, A., Daudin, J.-J. and Pierre, L. (2012). Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat. 6 1847–1899. MR2988467

Chaudhuri, K., Chung, F. and Tsiatas, A. (2012). Spectral clustering of graphs with general degrees in the extended planted partition model. J. Mach. Learn. Res. 23 35.1– 35.23.

Chen, Y., Sanghavi, S. and Xu, H. (2012). Clustering sparse graphs. Adv. Neural Inf. Process. Syst. 25 2213–2221.

Chernoff, H. (1981). A note on an inequality involving the normal distribution. Ann. Probab. 9 533–535. MR0614640

Clauset, A., Newman, M. and Moore, C. (2004). Finding community structure in very large networks. Phys. Rev. E 70 066111.

Coja-Oghlan, A. and Lanka, A. (2009/10). Finding planted partitions in random graphs with general degree distributions. SIAM J. Discrete Math. 23 1682–1714. MR2570199

Decelle, A., Krzakala, F., Moore, C. and Zdeborov´a, L.(2011). Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 84 066106.

Deshpande, Y. and Montanari, A. (2015). Finding hidden cliques of size�N/e innearly linear time. Found. Comput. Math. DOI:10.1007/s10208-014-9125-y. To appear.

Fienberg, S. E. (2010). Introduction to papers on the modeling and analysis of network data. Ann. Appl. Stat. 4 1–4. MR2758081

Fienberg, S. E. (2012). A brief history of statistical models for network analysis and open challenges. J. Comput. Graph. Statist. 21 825–839. MR3005799

Fishkind, D. E., Sussman, D. L., Tang, M., Vogelstein, J. T. and Priebe, C. E. (2013). Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM J. Matrix Anal. Appl. 34 23–39. MR3032990

F¨uredi, Z. and Koml´os, J.(1981). The eigenvalues of random symmetric matrices. Combinatorica 1 233–241. MR0637828

Giesen, J. and Mitsche, D. (2005). Reconstructing many partitions using spectral techniques. In Fundamentals of Computation Theory. Lecture Notes in Computer Science 3623 433–444. Springer, Berlin. MR2194866

Goldenberg, A., Zheng, A. X., Fienberg, S. E. and Airoldi, E. M. (2010). A survey of statistical network models. Foundations and Trends in Machine Learning 2 129–233.

Handcock, M. S., Raftery, A. E. and Tantrum, J. M. (2007). Model-based clustering for social networks. J. Roy. Statist. Soc. Ser. A 170 301–354. MR2364300

Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Soc. Netw. 5 109–137. MR0718088

Horn, R. A. and Johnson, C. R. (2013). Matrix Analysis, 2nd ed. Cambridge Univ. Press, Cambridge. MR2978290

Jalali, A., Chen, Y., Sanghavi, S. and Xu, H. (2014). Clustering partially observed graphs via convex optimization. J. Mach. Learn. Res. 15 2213–2238.

Jin, J. (2015). Fast network community detection by SCORE. Ann. Statist. 43 57–89.

Joseph, A. and Yu, B. (2013). Impact of regularization on spectral clustering. Available at arXiv:1312.1733.

Karrer, B. and Newman, M. E. J. (2011). Stochastic blockmodels and community structure in networks. Phys. Rev. E (3) 83 016107, 10. MR2788206

Krzakala, F., Moore, C., Mossel, E., Neeman, J., Sly, A., Zdeborov´a, L. andZhang, P. (2013). Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. USA 110 20935–20940. MR3174850

Kumar, A., Sabharwal, Y. and Sen, S. (2011). A simple linear time (1 +  ǫ)-approximation algorithm for k-means clustering in any dimensions. J. ACM 58 11.

Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. Ann. Statist. 43 215–237. MR3285605

Li, X. and Voroninski, V. (2013). Sparse signal recovery from quadratic measurements via convex programming. SIAM J. Math. Anal. 45 3019–3033. MR3106479

Lin, Z., Liu, R. and Su, Z. (2011). Linearized alternating direction method with adaptive penalty for low rank representation. In Advances in Neural Information Processing Systems (NIPS) 612–620.

Mathieu, C. and Schudy, W. (2010). Correlation clustering with noisy input. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms 712–728. SIAM, Philadelphia, PA. MR2768627

McSherry, F. (2001). Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science (Las Vegas, NV, 2001) 529–537. IEEE Computer Soc., Los Alamitos, CA. MR1948742

Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69 026113.

Newman, M. and Leicht, E. (2007). Mixture models and exploratory analysis in networks. Proc. Natl. Acad. Sci. USA 104 9564–9569.

Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc. 96 1077–1087. MR1947255

Oymak, S. and Hassibi, B. (2011). Finding dense clusters via low rank + sparse decomposition. Available at arXiv:1104.5186.

Rohe, K., Chatterjee, S. and Yu, B. (2011). Spectral clustering and the highdimensional stochastic blockmodel. Ann. Statist. 39 1878–1915. MR2893856

Sarkar, P. and Bickel, P. J. (2013). Role of normalization in spectral clustering for stochastic blockmodels. Available at arXiv:1310.1495.

Shamir, R. and Tsur, D. (2007). Improved algorithms for the random cluster graph model. Random Structures Algorithms 31 418–449. MR2362638

Snijders, T. A. B. and Nowicki, K. (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classification 14 75–100. MR1449742

Sussman, D. L., Tang, M., Fishkind, D. E. and Priebe, C. E. (2012). A consistent adjacency spectral embedding for stochastic blockmodel graphs. J. Amer. Statist. Assoc. 107 1119–1128. MR3010899

Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12 389–434. MR2946459

Vu, V. H. (2007). Spectral norm of random matrices. Combinatorica 27 721–736. MR2384414

Vu, V. (2014). A simple SVD algorithm for finding hidden partitions. Available at arXiv:1404.3918.

Zhao, Y., Levina, E. and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist. 40 2266–2292. MR3059083

image


Designed for Accessibility and to further Open Science