Does there exist a computationally fast community detection method that is robust to a portion of arbitrary outlier nodes with theoretical guarantees?
this model as generalized stochastic block model (GSBM ). Denote V = [N] =
Fig. 1. The upper left panel illustrates the adjacency matrix of 1000 nodes satisfying the ordinary SBM. The upper right panel is the adjacency matrix obtained by permuting the adjacency matrix such that nodes 1 to 500 belong to the same cluster while the remaining ones constitute another cluster. The lower left panel plots the eigenvectors of the graph Laplacian corresponding to the top 2 eigenvalues in absolute value (red for the first and black for the second), while those for the adjacency matrix are plotted in the lower right panel. In both cases, these two eigenvectors are capable of discriminating between the two communities.
Fig. 2. The upper left panel illustrates the adjacency matrix of 1030 nodes satisfying the GSBM with two major clusters and 30 outliers. The upper right panel is obtained by permuting the nodes such that nodes belonging to the same group are consecutive. The lower left panel plots the eigenvectors of the graph Laplacian corresponding the top 3 eigenvalues in absolute value (red for the first, black for the second and green for the third), while those for the adjacency matrix are plotted in the lower right panel. Ordinary spectral clustering with r = 2 or r = 3 is ineffective or even powerless on this data set since the top three eigenvectors cannot clearly discriminate between the two main communities.
Theorem 3.1. Let A be the adjacency matrix of the semi-random graph under the GSBM, as defined in (1.3). Let be a solution to the semidefinite program (2.3) and the density gap
be defined as in (1.2), and the minimum within-group density
and the maximum cross-group density
as in (1.1). As defined in Section 1, the integer n denotes the number of inlier nodes, m denotes the number of outlier nodes and
denotes the minimum community size among the inliers. Suppose that
3m and
for some sufficiently large numerical constant C, and the tuning parameter
as low as ) and the density gap
). In another example
must account for more than 50 percent in some
provided . In summary, we have proven the following theorem,
Theorem 3.2. Suppose the assumptions in Theorem 3.1 hold as well as . Then, with high probability, the misclassification rate among the inlier nodes
is less than or equal to
Fig. 3. On the right is the plot of the solution to convex optimization (2.3). Based on it, the community detection result followed by k-means algorithm is shown on the left.
Fig. 4. The performance of convex community detection with different values of
Fig. 5. The solutions of (2.3) with different values of p.
Fig. 6. Political blogs data of two clusters of conservatives and liberals, along with the performance of convex optimization.
is ) outliers are allowed; when the edge density within the
and Hermitian matrices. Suppose that H+P, H and P have real eigenvalues
, each arranged in algebraically nonincreasing order. Then for i = 1,...,n we have
Theorem 4.3.28]). Let Hermitian matrix and
principal submatrix. Suppose that H and G have real eigenvalues
and
, each arranged in algebraically nonincreasing order. Then for j = 1,...,k we have
of independent, random, self-adjoint matrices with dimension d. Assume that
be a symmetric random matrix whose diagonal entries are all zeros. Moreover, suppose
independent zero-mean random variables satisfying
. Then, with probability at least 1
for some sufficiently large numerical constant C, then with probability at least 1
. With probability at least 1
we have
whose diagonal blocks are all are some numerical constants.
Lemma 6.8. If Theorem 3.1 is true for , it is also true for any permutation matrix P.
exists uniquely. Moreover, denote the solutions by definition satisfy
. Then there are nonnegative vectors
nonnegative diagonal matrix
are defined as in Lemma 6.9. If there exist symmetric matrices
satisfies , then any minimizer
(2.3) must be of the form
Moreover, assume
for some sufficiently large numerical constant C. Then, with probability at least 1, there exist matrices
’s satisfying
and the matrix
Supplemental materials to “Robust and computationally feasible community detection in the presence of arbitrary outliers nodes”
Adamic, A. and Glance, N. (2005). The political blogosphere and the 2004 US election: Divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery 36–43. ACM, New York.
Ahlswede, R. and Winter, A. (2002). Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory 48 569–579. MR1889969
Airoldi, E., Blei, M., Fienberg, S. and Xing, E. (2008). Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9 1981–2014.
Ames, B. P. W. (2014). Guaranteed clustering and biclustering via semidefinite programming. Math. Program. 147 429–465. MR3258531
Ames, B. P. W. and Vavasis, S. A. (2014). Convex optimization for the planted k-disjoint-clique problem. Math. Program. 143 299–337. MR3152071
Amini, A. A., Chen, A., Bickel, P. J. and Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist. 41 2097–2122. MR3127859
Balakrishnan, S., Xu, M., Krishnamurthy, A. and Singh, A. (2011). Noise thresholds for spectral clustering (NIPS 2011). Adv. Neural Inf. Process. Syst. 25 954–962.
Bhattacharyya, S. and Bickel, P. J. (2014). Community detection in networks using graph distance. Available at arXiv:1401.3915.
Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl. Acad. Sci. USA 106 21068–21073.
Bickel, P., Choi, D., Chang, X. and Zhang, H. (2013). Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann. Statist. 41 1922–1943. MR3127853
Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Faund. Trends Mach. Learn. 3 1–122.
computationally feasible community detection in the presence of arbitrary outlier nodes.” DOI:10.1214/14-AOS1290SUPP.
Cand`es, E. J., Strohmer, T. and Voroninski, V. (2013). PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math. 66 1241–1274. MR3069958
Cand`es, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? J. ACM 58 Art. 11, 37. MR2811000
Celisse, A., Daudin, J.-J. and Pierre, L. (2012). Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat. 6 1847–1899. MR2988467
Chaudhuri, K., Chung, F. and Tsiatas, A. (2012). Spectral clustering of graphs with general degrees in the extended planted partition model. J. Mach. Learn. Res. 23 35.1– 35.23.
Chen, Y., Sanghavi, S. and Xu, H. (2012). Clustering sparse graphs. Adv. Neural Inf. Process. Syst. 25 2213–2221.
Chernoff, H. (1981). A note on an inequality involving the normal distribution. Ann. Probab. 9 533–535. MR0614640
Clauset, A., Newman, M. and Moore, C. (2004). Finding community structure in very large networks. Phys. Rev. E 70 066111.
Coja-Oghlan, A. and Lanka, A. (2009/10). Finding planted partitions in random graphs with general degree distributions. SIAM J. Discrete Math. 23 1682–1714. MR2570199
(2011). Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 84 066106.
Deshpande, Y. and Montanari, A. (2015). Finding hidden cliques of sizenearly linear time. Found. Comput. Math. DOI:10.1007/s10208-014-9125-y. To appear.
Fienberg, S. E. (2010). Introduction to papers on the modeling and analysis of network data. Ann. Appl. Stat. 4 1–4. MR2758081
Fienberg, S. E. (2012). A brief history of statistical models for network analysis and open challenges. J. Comput. Graph. Statist. 21 825–839. MR3005799
Fishkind, D. E., Sussman, D. L., Tang, M., Vogelstein, J. T. and Priebe, C. E. (2013). Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM J. Matrix Anal. Appl. 34 23–39. MR3032990
(1981). The eigenvalues of random symmetric matrices. Combinatorica 1 233–241. MR0637828
Giesen, J. and Mitsche, D. (2005). Reconstructing many partitions using spectral techniques. In Fundamentals of Computation Theory. Lecture Notes in Computer Science 3623 433–444. Springer, Berlin. MR2194866
Goldenberg, A., Zheng, A. X., Fienberg, S. E. and Airoldi, E. M. (2010). A survey of statistical network models. Foundations and Trends in Machine Learning 2 129–233.
Handcock, M. S., Raftery, A. E. and Tantrum, J. M. (2007). Model-based clustering for social networks. J. Roy. Statist. Soc. Ser. A 170 301–354. MR2364300
Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Soc. Netw. 5 109–137. MR0718088
Horn, R. A. and Johnson, C. R. (2013). Matrix Analysis, 2nd ed. Cambridge Univ. Press, Cambridge. MR2978290
Jalali, A., Chen, Y., Sanghavi, S. and Xu, H. (2014). Clustering partially observed graphs via convex optimization. J. Mach. Learn. Res. 15 2213–2238.
Jin, J. (2015). Fast network community detection by SCORE. Ann. Statist. 43 57–89.
Joseph, A. and Yu, B. (2013). Impact of regularization on spectral clustering. Available at arXiv:1312.1733.
Karrer, B. and Newman, M. E. J. (2011). Stochastic blockmodels and community structure in networks. Phys. Rev. E (3) 83 016107, 10. MR2788206
Zhang, P. (2013). Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. USA 110 20935–20940. MR3174850
Kumar, A., Sabharwal, Y. and Sen, S. (2011). A simple linear time (1 + approximation algorithm for k-means clustering in any dimensions. J. ACM 58 11.
Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. Ann. Statist. 43 215–237. MR3285605
Li, X. and Voroninski, V. (2013). Sparse signal recovery from quadratic measurements via convex programming. SIAM J. Math. Anal. 45 3019–3033. MR3106479
Lin, Z., Liu, R. and Su, Z. (2011). Linearized alternating direction method with adaptive penalty for low rank representation. In Advances in Neural Information Processing Systems (NIPS) 612–620.
Mathieu, C. and Schudy, W. (2010). Correlation clustering with noisy input. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms 712–728. SIAM, Philadelphia, PA. MR2768627
McSherry, F. (2001). Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science (Las Vegas, NV, 2001) 529–537. IEEE Computer Soc., Los Alamitos, CA. MR1948742
Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69 026113.
Newman, M. and Leicht, E. (2007). Mixture models and exploratory analysis in networks. Proc. Natl. Acad. Sci. USA 104 9564–9569.
Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc. 96 1077–1087. MR1947255
Oymak, S. and Hassibi, B. (2011). Finding dense clusters via low rank + sparse decomposition. Available at arXiv:1104.5186.
Rohe, K., Chatterjee, S. and Yu, B. (2011). Spectral clustering and the highdimensional stochastic blockmodel. Ann. Statist. 39 1878–1915. MR2893856
Sarkar, P. and Bickel, P. J. (2013). Role of normalization in spectral clustering for stochastic blockmodels. Available at arXiv:1310.1495.
Shamir, R. and Tsur, D. (2007). Improved algorithms for the random cluster graph model. Random Structures Algorithms 31 418–449. MR2362638
Snijders, T. A. B. and Nowicki, K. (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classification 14 75–100. MR1449742
Sussman, D. L., Tang, M., Fishkind, D. E. and Priebe, C. E. (2012). A consistent adjacency spectral embedding for stochastic blockmodel graphs. J. Amer. Statist. Assoc. 107 1119–1128. MR3010899
Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12 389–434. MR2946459
Vu, V. H. (2007). Spectral norm of random matrices. Combinatorica 27 721–736. MR2384414
Vu, V. (2014). A simple SVD algorithm for finding hidden partitions. Available at arXiv:1404.3918.
Zhao, Y., Levina, E. and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist. 40 2266–2292. MR3059083