In this paper, we consider the problem of reconstructing a directed graph using path queries. In the query model of learning, a graph is hidden from the learner, and the learner can access information about it with path queries. For a source and destination node, a path query returns whether there is a directed path from the source to the destination node in the hidden graph. We first give bounds for learning graphs on n vertices and k strongly connected components. We then study the case of bounded degree directed trees and give new algorithms for learning “almost-trees” – directed trees to which extra edges have been added. We also give some lower bound constructions justifying our approach.
Keywords: Active learning Graph algorithms Graph learning queries.
Problems in the area of query learning of graphs capture many different contexts. In evolutionary tree reconstruction, an experimenter may measure or query the genetic distance between two species with the goal of placing all the species onto one tree [10,16]. In chemical reaction networks, one may view various chemicals as nodes in a hidden graph, with edges corresponding to reacting pairs – here an experimenter may mix chemicals to test for a reaction, which corresponds to querying subsets of vertices for the presence of an independent set [1,4,5]. Each real-world setting entails its own query learning model, in which the learner typically tries to reconstruct the (possibly weighted) adjacency information of the graph by making as few queries as possible, see e.g. [8,14,15,2,9,13].
The model we study in this paper was introduced by Wang and Honorio  and is a directed variant of other well-studied models [7,10,11]; it involves learning a directed graph by querying ordered pairs of vertices, testing for the presence of a directed path from the first vertex in the pair to the second. This model is meant to capture causality, answering the question “when node u is acted upon, does it create a change in node v?”, but also has other applications, like trying to learn the topology of the internet using ping requests from one IP address to another.
In particular, the model we consider herein is the following: the hidden target is a directed, unweighted graph, and the queries are called path queries. A path query consists of an ordered pair of vertices (u, v) and the result of the query is 1 (or “yes”) if the hidden graph has a directed path from u to v and 0 (or “no”) otherwise.
In their work, Wang and Honorio  prove the following: Given a directed rooted tree with n nodes and maximum degree at most d, there is a randomized algorithm which reconstructs the tree with expected query complexity O(dn log). Their algorithm is recursive – it picks two vertices at random, and with high probability, the path between those two vertices contains an edge which roughly splits the graph. Wang and Honorio show that finding the path and the edge has low query complexity. Then, they split the graph along this edge and recursively apply the technique to each part.
They also show an information theoretic lower bound of log n) and a lower bound of ) (using a parallel chain construction) on the number of queries any algorithm must make. For general graphs, they show that in order to reconstruct a sparse disconnected directed acyclic graph on n nodes, any deterministic algorithm requires at least ) queries. This proof involves differentiating between an empty graph and a single edge. Finally, they show that in order to reconstruct a sparse connected directed acyclic graph on n nodes, any deterministic algorithm requires at least ) queries.
In this work, we extend the understanding of path queries by first considering the problem of learning strongly connected components, as well as the edges between them (see Section 3.1). Then, in our main contribution, in Section 3.2 we extend the results of Wang and Honorio by tackling the problem of learning almost-trees (see Definition 4). Almost-trees are trees with an extra edge. In the case of evolutionary trees, this begins to tackle real-world problems caused by processes like hybridization , where on occasion a species can have two distinct paths to an ancestor, breaking the expected tree-structure of evolution. Our approach matches the bound of Wang and Honorio’s algorithm (up to polylog factors) and is more general.
Let G = (V, E) be a directed graph with vertex set V and edge set E. Let (i, j) denote the directed edge from i to j. We assume |V | = n. Two vertices i, j are said to be strongly connected if there is a directed path from i to j and j to i. This binary relation is an equivalence relation and the induced equivalence classes are called strongly connected components. Let G have k strongly connected components, the collection of strongly connected components forms a partition of V .
A directed graph is called acyclic if it has no cycle. Hence, a directed graph is acyclic if and only if it has no strongly connected subgraphs with more than one vertex.
If we start with an undirected graph G, pick a called root and orient the edges such that there is a path from r to all other , the resulting directed graph is called a rooted directed graph. If the undirected graph we started with was a tree, the resulting directed graph is called a rooted directed tree.
In their work, Wang and Honorio  define path queries as follows:
Definition 1 (path query). Let G = (V, E) be a directed graph. A path query is a function such that if there exists a path in G from i to j, and otherwise.
They give an algorithm for reconstructing bounded-degree directed rooted trees and make observations on what type of edges are not learnable. In particular, they observe transitive edges are not learnable where transitive edges are defined as follows:
Definition 2 (transitive edges). Let G = (V, E) be a directed graph. We say an edge (is transitive if there exists a directed path from i to j of length greater than 1.
We give new algorithms for reconstructing bounded-degree directed graphs using path queries which work for regimes other than bounded degree directed rooted trees. Because it is not possible to learn transitive edges, we will either redefine the notion of learning when transitive edges are present in the graph in Section 3.1 or consider promise instances where such edges are not present in Section 3.2.
We now provide a few useful definitions that are needed for later. We begin with notions of a layered graph and graph height, a useful definition of almost-trees, and the notions of descendants, of ancestors, and of a parent in a tree.
Definition 3 (layered graph, graph height). Given a rooted directed graph G with root r, any tree which contains paths from r to all other is called a layered graph of G. The length of the longest path in G from r to any other is denoted by h and is called the height of G.
Definition 4 (almost-tree). A rooted directed graph G with root r is an almost-tree if G is the union of a rooted directed tree and a single additional directed edge.
Definition 5 (descendants, ancestors, parent). We define the descendant set, ancestor set and parent of a vertex i as follows:
Note that we can find both D(i) and A(i) with 2(1) queries by ) and ) for all .
We begin with some simpler results, which clarify the query complexity of recovering the strongly connected components of a graph.
3.1 Strongly connected components
Suppose G has k strongly connected components, then we have the following upper bound. Note that when we have strongly connected components, there are transitive edges and hence we cannot reconstruct all the edges within each component. Also, note that there could be transitive edges across components. For example, suppose there are three vertices a, b and c which are strongly connected components individually. Suppose there is an edge from a to b and another edge from b to c, then the edge from a to c is transitive and cannot be learnt. Hence, assuming that there are no transitive edges across strongly connected components, the notion of learning here is to find the strongly connected components of G and for each , whether there is an edge between some vertex in to some vertex in .
Theorem 1. Assuming that there are no transitive edges across strongly connected components, query complexity to learn a graph is O(nk).
Proof. It follows from Proposition 2 of the work of Reyzin and Srivastava  that we can recover the partition in O(nk) queries. Then, we can perform ) queries to learn edges that go between two strongly connected components by querying any pair of vertices from each pair of the learned strongly connected components. Finally, we observe that
since it must be that .
3.2 Rooted directed graphs
For rooted directed graphs let us fix the notion of learning to completely reconstruct all the edges. This will be our definition of learnability for the rest of the paper. Note that for almost-trees, the additional edge should follow some natural properties for the problem to be well defined. Firstly, the extra edge cannot be a transitive edge. Also, if the additional edge goes from a node to an ancestor, a strongly connected component is created and it becomes impossible to reconstruct the edges in that component. An almost tree is defined to be path query reconstructable if all the edges can be recovered by path queries.
We start with a lower bound. We give a lower bound of ) for path query reconstructable almost-trees with maximum degree d = O(1).
Theorem 2. There exists a path query reconstructable almost-tree G on 1 vertices with maximum degree d = O(1) such that any randomized algorithm to reconstruct G requires at least ) queries in expectation.
Proof. Let us start by proving the result for a deterministic algorithm. Let n be an even number and consider a caterpillar graph on 1 vertices as shown in figure 1. Assume is the root. Pick uniformly at random such that i < j and add the edge from to . Even if the algorithm knows the caterpillar graph, it still needs to make ) queries to detect the random edge because presence of the edge only changes the single query ().
Now, let us apply Yao’s minimax principle. Consider a uniform distribution over all random edges with i < j. For any fixed deterministic algorithm, the expected query complexity is ). Hence, for any randomized algorithm, there exists a worst case input such that the expected query complexity is ).
Fig. 1. Caterpillar Graph
Our main result is an upper bound on the query complexity of Algorithm 1 which is a clean recursive randomized algorithm for learning an almost-tree. This algorithm can also be used to learn trees and hence generalises the main result in  with the loss of only an extra O(log n) factor. The upper bound on the query complexity of Algorithm 1 stated below asymptotically matches the lower bound in Theorem 4 as a function of n and h ignoring the log factors.
The time complexities of the various subroutines of the algorithm are shown in Figure 2.
Fig. 2. Main algorithm and its time complexities
Theorem 3. Algorithm 1 is a randomized algorithm that learns a path reconstructable almost-tree G with maximum degree d = O(1) using O(n(log +nh) path queries where h is the height of G.
The idea behind Algorithm 1 is to first find a layered graph in G (this is done in line 2 of the algorithm). As a layered graph (say ) is a tree, we get 1 edges of G. This means that we have reconstructed a spanning tree of G and we are left with the task of finding one more edge in G as we know that G is an almost-tree. Let us call this edge a cross edge. In other words, a cross edge is the edge in G that is not in the layered graph produced by line 2 of Algorithm 1. The next task is to find the cross edge. This is done in line 3 of Algorithm 1. To find the layered graph structure, in line 2 of Algorithm 1, we call Algorithm 2. This algorithm works recursively by finding an edge whose descendant set roughly splits the graph into equal parts. Hence, the depth of the recursion tree is O(log n). We show that with high probability, the randomized algorithm which finds such an edge (Algorithm 3) on a subset of vertices V uses O(|V |(log ) queries. This gives an overall query complexity of O(n(log ) for finding a layered graph in G.
We need the following structure theorem rephrased from .
Lemma 1. Let G = (V, E) be a directed rooted graph with root r and maximum degree d. For any , there exists a ) such that
We call w which roughly splits D(v) as a splittable vertex.
We start by analyzing the search sub-routine (Algorithm 4). We claim that if the random vertex picked in line 3 of the subroutine is in D(s) for some splittable vertex s, then the subroutine will find s. This is because if the random vertex picked in line 3 is in D(s) for some splittable vertex s, then we enter the |D(i)| < |V |/3d case corresponding to the if statement in line 9. Inside this if statement, we set P to be the set of potential splittable vertices among the ancestors of i and keep updating P by doing randomized binary search for s. This proves the claim and hence, the number of times the search sub-routine is called is O(log n).
Remark 1. Note that the presence of the additional edge in the almost-tree does not affect the overall structure of the algorithm. Its presence will be felt only in the randomised binary search (in Algorithm 4) as there may be two paths from the root to vertex i. If both paths contain a splittable vertex, then the randomised binary search will find it. The difficult case is when only one of the two paths contains a splittable vertex. Suppose there are two paths P containing the splittable vertex and not containing the splittable vertex. Every time the randomised binary search picks a vertex in , we will end up deleting all the children of i from the potential vertices for the next iteration of the randomised binary search in line 19 of Algorithm 4.
To analyze the query complexity of Algorithm 4, the search sub-routine, we first note that the queries are only made in line 4 and 13. In each call of search, line 3 gets executed once and hence makes O(n) queries. Line 13 gets executed O(log n) times in expectation because it is a randomized binary search and hence makes O(n log n) queries in expectation. So, the expected query complexity of the search sub-routine is O(n log n).
Now we analyze the algorithm to find cross edges (Algorithm 5). We assume that all sub-routines under this algorithm have access to . Algorithm 5 first calls Algorithm 6, which is a recursive procedure for finding a triplet of vertices v, a, b that satisfy the following conditions:
We refer to such a vertex v as a top vertex. This must mean that the extra edge has caused Q(a, b) = 1. Once we know v, a, b, we can find the extra edge exactly by traversing . This is done in the Falgorithm.
Now, we turn to analyzing Algorithm 6. We start from the root r. If r is a top vertex, then the queries in line 5 of Algorithm 6 will find it. If not, we have split the problem into smaller subproblems corresponding to the descendant set of each immediate child of r. We recursively call the same function for each immediate child in line 9 of Algorithm 6.
Falgorithm is given v, a, b as input where v is a top vertex. Hence, the cross edge is of the form () where ) and ). Therefore, this algorithm simply traverses over the descendants of a starting from the immediate children of a and the ancestors of b starting from the immediate parent of b until it finds () exactly.
Hence, the overall query complexity of Algorithm 1 is O(n(log ). Note that when is a complete d-ary trees, this gives a ˜O(n) algorithm where as for caterpillar graphs, we get a ) algorithm which matches with the lower bound proved in Theorem 2. We can also extend Theorem 2 for arbitrary h to get a result of the following form:
Theorem 4. For every d = O(1) and h > (1 + c) logfor c > 0, there exists a path query reconstructable almost-tree G on 1 vertices with maximum degree d and height h such that any deterministic algorithm to reconstruct G requires at least ) queries.
The proof involves modifying Theorem 2 by considering a caterpillar graph on ) vertices and a complete d-ary tree with ) leaves attached to the last level of the caterpillar graph. This construction can be made to work for any h > (1 + c) logfor c > 0. Now, add an edge randomly from one of the leaves of the caterpillar graph to one of the leaves of the complete d-ary tree. Detecting this random edge requires ) queries as there are ) leaves in the caterpillar graph and ) leaves in the complete d-ary tree.
Our algorithm works when there is exactly one extra edge. It would be interesting to see if this approach can be generalised when multiple extra edges are present. It would also be interesting to find an algorithm where the query complexity is a function of the number of extra edges.
This work by supported in part by NSF grants CCF-1848966 and CCF-1934915.
1. Hasan Abasi, Nader H. Bshouty, and Hanna Mazzawi. Non-adaptive learning of a hidden hypergraph. Theor. Comput. Sci., 716:15–27, 2018.
2. Hasan Abasi and Bshouty Nader. On learning graphs with edge-detecting queries. In Aur´elien Garivier and Satyen Kale, editors, Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 3–30, Chicago, Illinois, 22–24 Mar 2019. PMLR. URL: http://proceedings.mlr.press/v98/abasi19a.html.
3. Mikkel Abrahamsen, Greg Bodwin, Eva Rotenberg, and Morten St¨ockel. Graph reconstruction with a betweenness oracle. In 33rd Symposium on Theoretical As- pages 5:1–5:14, 2016.
4. Dana Angluin and Jiang Chen. Learning a hidden graph using o(log n) queries per edge. In Learning Theory, 17th Annual Conference on Learning Theory, COLT 2004, Banff, Canada, July 1-4, 2004, Proceedings, pages 210–223, 2004.
5. Dana Angluin and Jiang Chen. Learning a hidden hypergraph. Journal of Machine Learning Research, 7:2215–2236, 2006.
6. Nicholas H Barton. The role of hybridization in evolution. Molecular ecology, 10(3):551–568, 2001.
7. Zuzana Beerliova, Felix Eberhard, Thomas Erlebach, Alexander Hall, Michael Hoff-mann, Mat´us Mihal´ak, and L. Shankar Ram. Network discovery and verification. IEEE Journal on Selected Areas in Communications, 24(12):2168–2181, 2006.
8. Mathilde Bouvel, Vladimir Grebinski, and Gregory Kucherov. Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Graph-Theoretic Concepts in Computer Science, 31st International Workshop, WG 2005, Metz, France, June 23-25, 2005, Revised Selected Papers, pages 16–27, 2005.
9. Sung-Soon Choi and Jeong Han Kim. Optimal query complexity bounds for finding graphs. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC ’08, pages 749–758, New York, NY, USA, 2008. Association for Computing Machinery.
10. Jotun J Hein. An optimal algorithm to reconstruct trees from additive distance data. Bulletin of mathematical biology, 51(5):597–603, 1989.
11. M. Jagadish and Anindya Sen. Learning a bounded-degree tree using separator queries. In Algorithmic Learning Theory - 24th International Conference, ALT 2013, Singapore, October 6-9, 2013. Proceedings, pages 188–202, 2013.
12. Mano Vikash Janardhanan. Graph verification with a betweenness oracle. In International Conference on Algorithmic Learning Theory, ALT 2017, 15-17 October 2017, Kyoto University, Kyoto, Japan, pages 238–249, 2017.
13. Hanna Mazzawi. Optimally Reconstructing Weighted Graphs Using Queries: (Extended Abstract), pages 608–615.
14. Lev Reyzin. Active Learning of Interaction Networks. PhD thesis, Yale University, 2009.
15. Lev Reyzin and Nikhil Srivastava. Learning and verifying graphs using queries with a focus on edge counting. In International Conference on Algorithmic Learning Theory, pages 285–297. Springer, 2007.
16. Lev Reyzin and Nikhil Srivastava. On the longest path algorithm for reconstructing trees from distance matrices. Information processing letters, 101(3):98–100, 2007.
17. Zhaosen Wang and Jean Honorio. Reconstructing a bounded-degree directed tree using path queries. In 57th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2019, Monticello, IL, USA, September 24-27, 2019, pages 506–513. IEEE, 2019.