Learning Network Structures from Contagion

2017·Arxiv

Abstract

Abstract

In 2014, Amin, Heidari, and Kearns proved that tree networks can be learned by observing only the infected set of vertices of the contagion process under the independent cascade model, in both the active and passive query models. They also showed empirically that simple extensions of their algorithms work on sparse networks. In this work, we focus on the active model. We prove that a simple modification of Amin et al.’s algorithm works on more general classes of networks, namely (i) networks with large girth and low path growth rate, and (ii) networks with bounded degree. This also provides partial theoretical explanation for Amin et al.’s experiments on sparse networks.

Keywords: Graph Algorithms, Learning, Contagion, Network Structures, Large Girth

1 Introduction

Edge information of a network is essential to many network analysis applications, e.g., in social network analysis [10, 16, 15], in epidemiology [3], and in viral marketing [12]. However, in some type of networks, e.g., online social networks or the network of customers of a rival company, it is not easy to obtain this information. Therefore, in these settings, it is more practical to infer the network structures from observable data.

This paper considers a problem of inferring the network structure from a contagion process, known as the independent cascade model (defined by Goldenberg et al. [6, 7] and Kempe et al. [11]). The problem was first considered by Gomez-Rodriguez, Leskovec, and Krause [9]. While most results utilize the orders of infections to infer the network structures (e.g., [9, 13, 8, 14, 1]), the set of infected vertices are clearly easier to observe. Thus, we follow a recent work by Amin, Heidari, and Kearns [2] who introduced a problem of learning unknown network structure by observing only sets of initial infected vertices and sets of final infected vertices. Amin et al. considered both the active model where the algorithm can make active seed queries and the passive model where the algorithm only observes the seed and the set of infected vertices, and proposed algorithms for exactly learning tree networks under these two models. They also proposed another algorithm for learning general networks, called the K-lifts algorithm, that works well empirically under real networks and random networks. Amin et al. proved that the K-lifts algorithm can learn cycle networks and provided an example which is a union of a star and a large cycle that the algorithm fails to learn.

In this paper, we consider the active model and extend the approach of Amin et al. [2], to two other classes of networks, as described below.

Networks with large girth and low path growth rate. Essentially, these are networks that behave almost like trees. We consider two parameters of the networks: (1) the girth, which is the size of the smallest cycle in the network and (2) the growth rate on the number of paths between two vertices over the length of the paths. We show that if the girth of a network is large enough and the growth rate is small enough, the network can be exactly learned by an algorithm that uses only a polynomial number of queries in term of the number of vertices and ∆, the contagion parameter (to be defined in the next section). Since a tree does not contain any cycle, its girth is ; this class of networks generalizes the tree networks considered in [2]. This class also includes a counter example for the K-lifts algorithm provided in [2]. While the focus of this work is extending the theoretical limitation of Amin et al. [2]’s work, this type of networks may appear in advisor-advisee networks where cross-field advising happens on rare occasions or in organizational networks, networks that represent ranks and relations of people in organizations, where few low-ranking workers may report to more than one middle managers, resulting in large cycles that span the people at the top-most levels of the organizations.

Bounded degree networks. We also show that if the maximum degree of vertices in the network is bounded above by a constant, the network can be exactly learned with polynomial queries as well.

The paper is organized as follows. In the next section, we state problem definitions and discuss previous results. Section 3 shows that a large girth network with low path growth can be learned in the contagion model. In section 4, we show that if the maximum degree of a network is bounded by a constant, the network can also be learned.

2 Deﬁnitions and Results

We formally describe the contagion process. Let G = (V, E) be an undirected network whose edges are unknown with n vertices. Let , the set of initially infected vertices. From S, a contagion proceeds in discrete steps under the independent cascade model defined by Goldenberg et al. [6, 7] and Kempe et al. [11], as follows. We assume that every vertex in S becomes infected at step t = 0. At each step t = 1, 2, 3, . . ., every vertex became infected at step 1 tosses a coin to spread the disease to each of its uninfected adjacent vertices with the infectious probability receives the disease from u, we say that v becomes infected at step t. In this case, we say that the edge (u, v) is active, otherwise (u, v) is inactive. Note that when referring to an edge as active or inactive, the order of its end points in the tuple is important (e.g., when (u, v) is active, (v, u) is inactive). The contagion proceeds until there are no newly infected vertices. Note that spreading of disease through edge (u, v) occurs only once at the first step when u or v become infected. The minimum and maximum infectious probabilities are denoted by , respectively. We define the contagion parameter ∆ = min

The problem of learning network structure from contagion is defined as follows. For a network G = (V, E), we are given the set of vertices V and the contagion parameter ∆, but for all edges (are unknown. We will describe the version for the active model here and refer to the Previous Results section for the description of the passive model. Under the active query model, for each round i = 1, 2, . . . , M, the algorithm can perform a query by choosing a seed set . The contagion process described above then starts from and returns the set of infected vertices . The goal of the problem is to find an algorithm that uses a small number of rounds M, and correctly returns the edge set E.

Previous Results. Amin, Heidari, and Kearns [2] considered the problem in both active model and passive model. They proved that tree networks can be exactly learned in both models. In addition, they also considered the problem for learning non-tree networks.

Since our focus is on the active model, we start by describing their algorithm for learning trees in this model, later referred to as the AHK algorithm. For any vertex , the algorithm repeatedly queries with the seed set containing only a single vertex u in order to infer the set of vertices Γ(u) adjacent to ) be the set of rounds that v becomes infected while {u} be the seed set, i.e., . The algorithm infers that u and v are adjacent if and only if there does not exist a vertex The algorithm needs ) queries to learn the tree with probability at least 1 note that the AHK algorithm could fail when applying to non-tree networks because a vertex can be infected from the seed set through many possible paths.

For the passive model, the seed sets are chosen randomly from a distribution where each vertex is chosen independently. The algorithm presented by Amin et al. for this model employs the lift function L(v|u) which is the increase in the probability that vertex v becomes infected when u is in the seed set. The algorithm finds an estimate ˆis close to L, they showed that the algorithm can exactly learn the tree.

For non-tree networks, Amin et al. presented an algorithm, called the K-lifts algorithm, which can be viewed as a generalized version of the algorithm for learning trees in the passive model. Given the number of edges K, the algorithm returns K pairs of vertices with highest estimated lift scores as network edges. The experimental results showed that the K-lifts algorithm performs well when learning sparse random networks (under the Erd˝os-R´enyi model [4, 5] and the Small-World model [17]). On the positive side, Amin et al. proved that the K-lifts algorithm can learn cycle networks. However, they showed that the K-lifts algorithm fails to learn a network 1 vertices constructed by joining a star with n vertices rooted at vertex -cycle containing

Our Results. Here we state our results formally. Our first result considers large girth

networks. For a network G = (V, E), the girth g of G is the length of the shortest cycle in the network. We also need another property related to the number of simple paths. Denote by P(u, v) the set of simple paths from vertex set of paths of length be the maximum number of simple paths of length d between any pair of vertices, i.e., . We define the path growth rate . The parameter intuitively represents the growth rate for the number of simple paths in the network. Note that for tree networks, g can be regarded as

We show that if ), for some function f (stated in Theorem 3.7), we can successfully learn the network in the active model with ) queries with probability at least 1 . We note that our algorithm can successfully learn the Amin example H (discussed in Theorem 6 in [2]) since the girth of H is n and its path growth ratio is 2, which is close to 1.

The algorithm requires ) active queries, which is the same bound as the AHK algorithm of Amin et al. under the same model. While the bound itself does not depend on the values of , our proofs of correctness require the network to satisfies certain conditions on ∆, (see Theorem 3.7).

The second result is on the bounded-degree networks. We show that, if the maximum degree of the network is D, in the active model, these networks can be exactly learned by a very simple algorithm that makes at most ) queries with probability at least 1

3 Learning Large Girth Networks

This section describes an algorithm that learns large-girth networks. We start with a crucial lemma on the properties of these networks. As in the AHK algorithm, we would like to filter out non-adjacent pairs of vertices. We focus on pairs of vertices that are close, but not adjacent. Let be the shortest path distance from u to v.

The following lemma is a key observation.

Lemma 3.1. Let G = (V, E) be a network with girth g. For any pair of vertices that 1 , there is a unique shortest path from u to v, and all other paths from u to v have length greater than g/2.

Proof. We prove by contradiction. Let be the shortest path from and let be another path from 2. We can take a union of and obtain a cycle of length at most , contradicting the fact that G has girth g.

Using Lemma 3.1 with appropriate value of g, we can show that it is very unlikely that, when {u} is the seed set, v is infected through paths other than the unique shortest path from u. This implies that in most rounds when v is infected, every intermediate vertex w in the shortest path from u to v must be infected as well.

Recall that ) be the set of rounds that v becomes infected while {u} be the seed set. In a tree network, the rejection criteria of the AHK algorithm works because any intermediate vertex w in the shortest path from u to v. However, in general, since v can be infected through other paths, we need a milder criteria. Instead of requiring that to reject (u, v), we shall reject (u, v) as an edge when there exists a vertex w such that w appears too often with v, i.e., when the set ) is large.

Let m be the number of rounds that we query for a single seed set (to be specified later). Our modification of the AHK algorithm to learn large girth networks is shown in Algorithm 1. Note that although the contagion parameter ∆ is required to make decision in Algorithm 1, it is enough to use its lower bound. The AHK algorithm itself does not require ∆, but the parameter is implicitly needed to make sure that the number of queries is large enough.

We shall show that the Algorithm 1 returns edges E with high probability after querying M = nm rounds in total, if m is large enough.

We would like to point out that our algorithm works only when 1. This bound is essential for preventing issues that may occur when high growth rate compensates the infectious failure based on our analysis techniques. See a discussion at the end of Lemma 3.2.

After each round of query we say that path P is active if every edge in P is active. (Note that each edge must be active in the right direction.) On the other hand, P is inactive if there exists an inactive edge in the path.

The next lemma shows that if 1, the probability that there is an active path of length k from vertex u to vertex v depends on

Lemma 3.2. For any network G = (V, E), if parameter satisfies the condition that , then for any vertex and vertex , the probability that u infects v along any paths of length at least k is at most

Proof. Using the union bound, the probability that u infects v along any paths of length at least k is at most

Note that we use the fact that 1 in the last inequality.

The requirement that 1 is essential to ensure that the sum nicely even when n is large. Note that when the requirement is not true, the contagion process starting at a single seed vertex can reach a vertex very far from the seed. To see this, take a perfect k-any tree with L levels. The contagion process starting at the root resembles the branching process where the offspring distribution is binomial with parameter k and p, where p is the infectious probability. It is known that if the infectious probability is 1, it is very likely that some leaf on the L-th level will be infected. Since our analysis uses a simple union bound that neglects dependencies between paths, it fails to distinguish this situation with the one where a lot of leaves are infected, and finally it fails to show that the probability that a particular node far away from the seed becomes infected is very small.

From Lemma 3.2, we have the next corollary that provides the lower bound on the girth so that the probability of having long active paths is at most ∆4. The ceiling function appears because Lemma 3.2 works only when k is an integer, and as a by-product, that the lower bound on g is even.

Corollary 3.3. For a contagion process with parameter ∆ over a network G = (V, E), if the girth g of G and the path growth rate satisfy the following inequalities:

then for any pair of vertices , the probability that there exists an active path of length at least g/2 between u and v is at most ∆

We shall use the bound of g in the previous corollary as the lower bound of the girth. Note that the lower bound is not extremely large. When ∆ = 125, the algorithm works when 16. When ∆ = 1

Later on in this section, we assume that we work on the network whose parameter girth g satisfy inequalities (1) and (2), respectively. Moreover, for technical reasons (see the proof of Lemma 3.4), we also assume w.l.o.g. that g is even. When the girth g of the network is odd but satisfies condition (2) from Corollary 3.3, we can take 1 as its lower bound on the girth and apply the results.

Our main theorem shows that for any network G = (V, E) in this class, the Algorithm 1 returns the edges E with high probability. To prove the theorem, we need the following 3 lemmas.

The following lemma deals with the case that (u, v) is an edge in the network.

Lemma 3.4. Assume that the network does not have any cycle of length shorter than g, when g is even, and all the network parameters satisfy the conditions in Corollary 3.3. For any pair of adjacent vertices and any vertex , in any round i where the algorithm queries with seed set , the probability that is at least 7∆the expected size of is at least 7∆

Proof. We first analyze the probability. There are two cases.

Case 1: v is in a shortest path from u to w. Let P be the shortest path from u to w containing v. Note that since () is the first edge in the path. Let e be an edge in P adjacent to edge (u, v), i.e., e is the next edge after (u, v) in P.

Let A be the event that (u, v) is active, B be the event that e is active and C be the event that there exists an active path in P(u, w). Note that when event occurs, we know that . Thus, we have

The last inequality holds because A and B are independent, and each occurs with probability at least ∆.

We are left to compute the upper bound of the probability of the event condition implies that P, which is a shortest path, is inactive. There are two possible subcases that w can be infected: (i) from a path ) that uses edge (u, v) or (ii) from a path ) that does not use (u, v).

Let’s consider subcase (i) first. Since P is a shortest path; the postfix starting at v ending at w is also a shortest path from be the postfix of starting at v. We claim that is of length at least g/2. This is true when the shortest path is of length at least g/2. Thus, assume otherwise, i.e., is of length less than g/2; applying Lemma 3.1, we have that all other paths from v to w are of length greater than g/2 as required. Since the paths are long, Corollary 3.3 implies that the probability that we have an active path is at most ∆

Next, consider subcase (ii). Using the same argument as in subcase (i), we have that of length at least g/2; thus, the probability that we have an active path in this case is at most ∆

Combining these two subcases using the union bound, we have that Pr[∆8. Therefore, the probability that is at least 7∆

Case 2: v is not in any shortest paths between u and w. Let P be a shortest path from u to w and e be an edge in P that is adjacent to u (i.e., e is the first edge in P). Provided that (u, v) is active but e is inactive, only when there exists an active path in P(u, w). Again, let A be the event that (u, v) is active, B be the event that e is active, and C be the event that there exists an active path in P(u, w). As in the previous case, we have that Pr[

We focus on the event . If an active path ) uses edge (u, v), it has to be of length greater than g/2 because v is not in any shortest paths from u to w. Since g is even, g/2 is an integer; thus, the length of is at least g/2+ 1. This implies that the postfix of starting at v is of length at least g/2. From Corollary 3.3, the probability that there exists an active path in this case is at most ∆16. On the other hand, as e is inactive, an active path from u to w, not going through (u, v), has length at least g/2. Again, Corollary 3.3 implies that the probability of this case is at most ∆16. With the union bound, we have that Pr[8. Hence, Pr[] is at least

Since for both cases, the probability Pr[] is at least 7∆8 and the algorithm makes m rounds of queries with seed set {u}, the expected size of ) is at least 7∆8, as required.

The next two lemmas consider non-adjacent vertices u and v. When u is close to v, we use Lemma 3.5; otherwise, we use Lemma 3.6, whose proof uses Corollary 3.3 and is omitted to save space.

Lemma 3.5. For any pair of non-adjacent vertices , there exists a vertex such that in any round i where the algorithm queries with the seed set , the probability that is at most ∆. Thus, the expected size of is at most ∆

Proof. From Lemma 3.1, there is only one shortest path from u to v. Let w be the second vertex in the shortest path from u to v. Consider the case that . In this case, the edge (u, w) must be inactive. This implies that the shortest path from u to v is also inactive, thus the seed u infects v along a non-shortest path. From Lemma 3.1, any non-shortest paths is of length greater than g/2. Using Corollary 3.3, we have that Pr[the algorithm makes m rounds of queries with seed set {u}, the expected size of is at most ∆

Lemma 3.6. For any pair of non-adjacent vertices , in any round i where the algorithm queries with the seed set , the probability that is at most ∆. Thus, the expected size of is at most ∆

Lemma 3.6 implies that for any vertex w, the expected size of ) is at most ∆4. The previous 3 lemmas show the expectation gap between 7∆size of ) for some vertex w. Using the Chernoff bound, we can prove the main theorem. Its proof is omitted to save space.

The Algorithm 1 returns the edges E with probability at least 1 after querying at most

4 Learning Bounded Degree Networks

This section shows that if the maximum degree of a network is D, we can recover all edges of the network with probability at least 1 using polynomial queries in term of The key idea is that if the maximum degree of the network is bounded, we could observe a single edge from the results of queries. The algorithm is fairly straight-forward. For each vertex the algorithm repeatedly selects {u} to be the seed set for m rounds, where For any vertex , the algorithm includes (u, v) to the set ), if there exists round i such that . After receiving all results of nm queries, the algorithm returns

Theorem 4.1. Let G = (V, E) be a network with maximum degree D. The algorithm described above can return the edges E with probability at least 1 by querying at most rounds.

Proof. Since the algorithm will not include edges not in E, we consider the probability that the algorithm recovers all edges in E.

Consider edge (. Consider the round i where {u} is the seed set, i.e., The probability that we observe only edge (

where Γ(, is a set of neighbors of u. Since we perform m rounds of queries for u, the probability that we fail to observe edge (u, v), when the seed set is {u}, is at most

If we let ), the failure probability is at most . Using the union bound, the probability that the algorithm fails to observe any edge is at most

Acknowledgements

We would like to thank anonymous reviewers for their very helpful comments. We gratefully acknowledge the Thailand Research Fund (TRF) for financial support through the Royal Golden Jubilee (RGJ) Ph.D. Programme under the Grant No. PHD/0310/2550.

References

[1] Bruno D. Abrahao, Flavio Chierichetti, Robert Kleinberg, and Alessandro Panconesi. Trace complexity of network inference. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 491–499, 2013.

[2] Kareem Amin, Hoda Heidari, and Michael Kearns. Learning from contagion (without timestamps). In Proceedings of the 31th International Conference on Machine Learning, pages 1845–1853, 2014.

[3] R. M. Christley, G. L. Pinchbeck, R. G. Bowers, D. Clancy, N. P. French, R. Bennett, and J. Turner. Infection in social networks: Using network analysis to identify high-risk individuals. American Journal of Epidemiology, 162(10):1024–1031, 2005.

[4] P. Erd˝os and A. R´enyi. On random graphs. I. Publ. Math. Debrecen, 6:290–297, 1959.

[5] E. N. Gilbert. Random graphs. Ann. Math. Statist., 30(4):1141–1144, 12 1959.

[6] Jacob Goldenberg, Barak Libai, and Eitan Muller. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, 12(3):211–223, 2001. ISSN 1573-059X.

[7] Jacob Goldenberg, Barak Libai, and Eitan Muller. Using complex systems analysis to advance marketing theory development. Academy of Marketing Science Review, 2001.

[8] Manuel Gomez-Rodriguez and Bernhard Sch¨olkopf. Submodular inference of diffusion net- works from multiple trees. In Proceedings of the 29th International Conference on Machine Learning, pages 489–496, 2012.

[9] Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of dif- fusion and influence. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1019–1028, 2010.

[10] M.S. Granovetter. The strength of weak ties. The American Journal of Sociology, 78(6): 1360–1380, 1973.

[11] David Kempe, Jon M. Kleinberg, and ´Eva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137–146, 2003.

[12] Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. The dynamics of viral marketing. In Proceedings 7th ACM Conference on Electronic Commerce, pages 228–237, 2006.

[13] Seth A. Myers and Jure Leskovec. On the convexity of latent social network inference. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems, pages 1741–1749, 2010.

[14] Praneeth Netrapalli and Sujay Sanghavi. Learning the graph of epidemic cascades. In ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’12, pages 211–222, 2012.

[15] Jimeng Sun and Jie Tang. A survey of models and algorithms for social influence analysis. In C. Charu Aggarwal, editor, Social Network Data Analytics, pages 177–214. Springer US, 2011.

[16] Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 807–816, 2009.

[17] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, June 4 1998.

designed for accessibility and to further open science