DRUM: End-To-End Differentiable Rule Mining On Knowledge Graphs

2019·Arxiv

Abstract

Abstract

In this paper, we study the problem of learning probabilistic logical rules for inductive and interpretable link prediction. Despite the importance of inductive link prediction, most previous works focused on transductive link prediction and cannot manage previously unseen entities. Moreover, they are black-box models that are not easily explainable for humans. We propose DRUM, a scalable and differentiable approach for mining first-order logical rules from knowledge graphs which resolves these problems. We motivate our method by making a connection between learning confidence scores for each rule and low-rank tensor approximation. DRUM uses bidirectional RNNs to share useful information across the tasks of learning rules for different relations. We also empirically demonstrate the efficiency of DRUM over existing rule mining methods for inductive link prediction on a variety of benchmark datasets.

1 Introduction

Knowledge bases store structured information about real-world people, locations, companies and governments, etc. Knowledge base construction has attracted the attention of researchers, foundations, industry, and governments [11, 13, 39, 43]. Nevertheless, even the largest knowledge bases remain incomplete due to the limitations of human knowledge, web corpora, and extraction algorithms.

Numerous projects have been developed to shorten the gap between KBs and human knowledge. A popular approach is to use the existing elements in the knowledge graph to infer the existence of new ones. There are two prominent directions in this line of research: representation learning that obtains distributed vectors for all entities and relations in the knowledge graph [12, 36, 38], and rule mining that uses observed co-occurrences of frequent patterns in the knowledge graph to determine logical rules [5, 15]. An example of knowledge graph completion with logical rules is shown in Figure 1.

One of the main advantages of logic-learning based methods for link prediction is that they can be applied to both transductive and inductive problems while representation learning methods like that of Bordes et al. [4], Sadeghian et al. [35] and Yang et al. [45] cannot be employed in inductive scenarios. Consider the scenario in Figure 1, and suppose that at training time our knowledge base does not contain information about Obama’s family. Representation learning techniques need to be retrained on the whole knowledge base in order to find the answer. In contrast rule mining methods can transfer reasoning to unseen facts.

Additionally, learning logical rules provides us with interpretable reasoning for predictions which is not the case for the embedding based method. This interpretability can keep humans in the loop, facilitate debugging, and increase user trustworthiness. More importantly, rules allow domain knowledge transfer by enabling the addition of extra rules by experts, a strong advantage over representation learning models in scenarios with little or low-quality data.

Mining rules have traditionally relied on pre-defined statistical measures such as support and confidence to assess the quality of rules. These are fixed heuristic measures, and are not optimal for various use cases in which one might want to use the rules. For example, using standard confidence is not necessarily optimal for statistical relational learning. Therefore, finding a method that is able to simultaneously learn rule structures as well as appropriate scores is crucial. However, this is a challenging task because the method needs to find an optimal structure in a large discrete space and simultaneously learn

proper score values in a continuous space. Most previous approaches address parts of this problem [9, 24, 26, 44] but are not able to learn both structure and scores together, with the exception of Yang et al. [46].

In this paper we propose DRUM, a fully differentiable model to learn logical rules and their related confidence scores. DRUM has significant importance because it not only addresses the aforementioned challenges, but also allows gradient based optimization to be employed for inductive logic programming tasks.

Our contributions can be summarized as: 1) An end-to-end differentiable rule mining model that is able to learn rule structures and scores simultaneously; 2) We provide a connection between tensor completion and the estimation of confidence scores; 3) We theoretically show that our formulation is expressive enough at finding the rule structures and their related confidences; 4) Finally, we demonstrate that our method outperforms previous models on benchmark knowledge bases, both on the link prediction task and in terms of rule quality.

2 Problem Statement

Definitions We model a knowledge graph as a collection of facts where E and R represent the set of entities and relations in the knowledge graph, respectively.

A first order logical rule is of the form B is a conjunction of atoms e.g., , called the Body, and H is a specific predicate called the head. A rule is connected if every atom in the rule shares at least one variable with another atom, and a rule is closed if each variable in the rule appears in at least two atoms.

Rule Mining We address the problem of learning first-order logical Horn clauses from a knowledge graph. In particular we are interested in mining closed and connected rules. These assumptions ensure finding meaningful rules that are human understandable. Connectedness also prevents finding rules with unrelated relations.

Formally, we aim to find all and relations as well as a confidence value

where, s are variables that can be substituted with entities. This requires searching a discrete space to find s and searching a continuous space to learn for every particular rule.

3 Related work

Mining Horn clauses has been previously studied in the Inductive Logic Programming (ILP) field, e.g, FOIL [33], MDIE [30] and Inspire [37]. Given a background knowledge base, ILP provides a framework for learning on multi-relational data. However, despite the strong representation powers of ILP, it requires both positive and negative examples and does not scale to large datasets. This is a huge drawback since most knowledge bases are large and contain only positive facts.

Recent rule mining methods such as AMIE+ [15] and Ontological Pathfinding (OP) [5] use predefined metrics such as confidence and support and take advantage of various parallelization and partitioning techniques to speed up the counting process. However, they still suffer from the inherent limitations of relying on predefined confidence and discrete counting.

Most recent knowledge base rule mining approaches fall under the same category as ILP and OP. However, Yang et al. [45] show that one can also use graph embeddings to mine rules. They introduce DistMult, a simple bilinear model for learning entity and relation representations. The relation representations learned via the bilinear model can capture compositional relational semantics via matrix multiplications. For example, if the rule holds, then intuitively so should . To mine rules, they use the Frobenius norm to search for all possible pairs of relations with respect to their compositional relevance to each head. In a more recent approach Omran et al. [32] improve this method by leveraging pruning techniques and computing traditional metrics to scale it up to larger knowledge bases.

In [17] the authors proposed a RESCAL-based model to learn from paths in KGs. More recently, Yang et al. [46] provide the first fully differentiable rule mining method based on TensorLog [6], Neural LP. They estimate the graph structure via constructing TensorLog operators per relation using a portion of the knowledge graph. Similar to us, they chain these operators to compute a score for each triplet, and learn rules by maximizing this score. As we explain in Section 4.1, this formulation is bounded to a fixed length of rules. To overcome this limitation, Neural LP uses an LSTM and attention mechanisms to learn variable rule lengths. However, it can be implied from Theorem 1 that its formulation has theoretical limitations on the rules it can produce.

There are some other interesting works [7, 14, 29, 34] which learn rules in a differentiable manner. However, they need to learn embeddings for each entity in the graph and they do link prediction not only based on the learned rules but also the embeddings. Therefore we exclude them from our experiment section.

We did not cover embedding based methods on graphs or knowledge graphs here, we refer the reader to some of the recent works [18, 19, 21, 16]

4 Methodology

To provide intuition about each part of our algorithm we start with a vanilla solution to the problem. We then explain the drawbacks of the this approach and modify the suggested method step-by-step to makes the challenges of the problem more clear and provides insight into different parts of the suggested algorithm.

We begin by defining a one-to-one correspondence between the elements of E and , where n is the number of entities and is a vector with 1 at position i and 0 otherwise. We also define as the adjacency matrix of the knowledge base with respect to relation ; the elements of equals to 1 when the entities corresponding to and have relation , and 0 otherwise.

4.1 A Compact Differentiable Formulation

To approach this inherently discrete problem in a differentiable manner, we utilize the fact that using the above notations for a pair of entities (x, y) the existence of a chain of atoms such as

is equivalent to being a positive scalar. This scalar is equal to the number of paths of length T connecting x to y which traverse relation at step i. It is straightforward to show that for each head relation H, one can learn logical rules by finding an appropriate

that maximizes

where s indexes over all potential rules with maximum length of T, and is the ordered list of relations related to the rule indexed by s.

However, since the number of learnable parameters in can be exceedingly large, i.e. and the number of observed pairs (x, y) which satisfy the head H are usually small, direct optimization of falls in the regime of over-parameterization and cannot provide useful results. To reduce the number of parameters one can rewrite

This reformulation significantly reduces the number of parameters to TR. However, the new formulation can only learn rules with fixed length T. To overcome this problem, we propose to modify

where we define a new relation with an identity adjacency matrix . With this change, the expansion of includes all possible rule templates of length T or smaller with only T(|R| + 1) free parameters.

Although considers all possible rules lengths, it is still constrained in learning the correct rule confidences. As we will show in the experiments Section 5.3, this formulation (as well as Neural LP [46]) inevitably mines incorrect rules with high confidences. The following theorem provides insight about the restricted expressive power of the rules obtained by

Theorem 1. If are two rules of length T obtained by optimizing the objective related to with confidence values , then there exists rules of length , with confidence values such that:

where d(., .) is a distance between two rules of the same size defined as the number of mismatched atoms in their bodies.

Proof. The proof is provided in the appendix.

To further explain Theorem 1, consider an example knowledge base with only two meaningful logical rules of body length such that they do not share any body atoms. According to Theorem 1, learning these two rules by optimizing leads to learning at least other rules, since , with confidence values greater than . This means we inevitably learn at least 2 additional incorrect rules with substantial confidence values.

Theorem 1 also entails other undesirable issues, for example the resulting list of rules may not have the correct order of importance. More specifically, a rule might have higher confidence value just because it is sharing an atom with another high confidence rule. Thus confidence values are not a direct indicator of rule importance. This reduces the interpretability of the output rules.

We must note that all previous differentiable rule mining methods based on suffer from this limitation. For example Yang et al. [46] has this limitation for rules with maximum length. Section 5.3 illustrates these drawbacks using examples of mined rules.

4.2 DRUM

Recall that the number of confidence values for rules of length T or smaller is . These values can be viewed as entries of a T dimensional tensor where the size of each axis is |R| + 1. To be more specific, we put the confidence value of the rule with body at position in the tensor and we call it the confidence value tensor.

It can be shown that the final confidences obtained by expanding are a rank one estimation of the confidence value tensor. This interpretation makes the limitation of more clear and provides a natural connection to the tensor estimation literature. Since a low-rank approximation (not just rank one) is a popular method for tensor approximation, we use it to generalize . The related to rank L approximation can be formulated as

In the following theorem, we show that is powerful enough to learn any set of logical rules, without including unrelated ones.

Theorem 2. For any set of rules and their associated confidence values there exists an , such that:

Proof. To prove the theorem we will show that one can find a such that the requirements are met. Without loss of generality, assume (for some ) is of length and consists of body atoms . By setting

it is easy to show that satisfies the condition in Theorem 2. Let’s look at

Therefore

Note the number of learnable parameters in . However, this is just the number of free parameters for finding the rules for a single head relation, learning the rules for all relations in knowledge graph requires estimating parameters, which is and can be potentially large. Also, the main problem that we haven’t addressed yet, is that direct optimization of the objective related to learns parameters of rules for different head relations separately, therefore learning one rule can not help in learning others.

Before we explain how RNNs can solve this problem, we would like to draw your attention to the fact that some pairs of relations cannot be followed by each other, or have a very low probability of appearing together. Consider the family knowledge base, where the entities are people and the relations are familial ties like fatherOf, AuntOf, wifeOf, etc. If a node in the knowledge graph is fatherOf, it cannot be wife_of another node because it has to be male. Therefore the relation wife_of never follows the relation father_of. This kind of information can be useful in estimating logical rules for different head relations and can be shared among them.

To incorporate this observation in our model and to alleviate the mentioned problems, we use L bidirectional RNNs to estimate in equation 7:

where h and are the hidden-states of the forward and backward path RNNs, respectively, both of which are zero initialized. The subindexes of the hidden states denote their time step, and their superindexes identify their bidirectional RNN. is the embedding of the head relation H for which we want to learn a probabilistic logic rule, and is a fully connected neural network that generates the coefficients from the hidden states of the RNNs.

We use a bidirectional RNN instead of a normal RNN because it is capable of capturing information about both backward and forward order of which the atoms can appear in the rule. In addition, sharing the same set of recurrent networks for all head predicates (for all ) allows information to be shared from one head predicate to another.

5 Experiments

In this section we evaluate DRUM on statistical relation learning and knowledge base completion. We also empirically assess the quality and interpretability of the learned rules.

We implement our method in TensorFlow [1] and train on Tesla K40 GPUs. We use ADAM [23] with learning rate and batch size of 0.001 and 64, respectively. We set both the hidden state dimension and head relation vector size to 128. We did gradient clipping for training the RNNs and used LSTMs [20] for both directions. is a single layer fully connected. We followed the convention in the existing literature [46] of splitting the data into three categories of facts, train, and test. The code and the datasets for all the experiments will be publicly available.

5.1 Statistical Relation Learning

Table 1: Dataset statistics for statistical relation

Datasets: Our experiments were conducted on three different datasets [24]. The Unified Medical Language System (UMLS) consists of biomedical concepts such as drug and disease names and relations between them such as diagnosis and treatment. Kinship contains kinship relationships among members of a Central Australian native tribe. The Family data set

contains the bloodline relationships between individuals of multiple families. Statistics about each data set are shown in Table 1.

We compared DRUM to its state of the art differentiable rule mining alternative, Neural LP [46]. To show the importance of having a rank greater than one in DRUM, we test two versions, DRUM-1 and DRUM-4, with L = 1 and L = 4 (rank 4), respectively.

To the best of our knowledge, NeuralLP and DRUM are the only scalable 1 and differentiable methods that provide reasoning on KBs without the need to use embeddings of the entities at test time, and provide prediction solely based on the logical rules. Other methods like NTPs [29, 34] and MINERVA [8], rely on some type of learned embeddings at training and test time. Since rules are interpretable and embeddings are not, this puts our method and NeuralLP in fully-interpretable category while others do not have this advantage (therefore its not fair to directly compare them with each other). Moreover, methods that rely on embeddings (fully or partially) are prone to having worse results in inductive tasks, as partially shown in the experiment section. Nonetheless we show the results of the other methods in the appendix.

Table 2 shows link prediction results for each dataset in two scenarios with maximum rule length two and three. The results demonstrate that DRUM empirically outperforms Neural-LP in both cases T = 2, 3. Moreover it illustrates the importance of having a rank higher than one in estimating confidence values. We can see a more than seven percent improvement on some metrics for UMLS, and meaningful improvements in all other datasets. We believe DRUM’s performance over Neural LP is due to its high rank approximation of rule confidences and its use of bidirectional LSTM to capture forward and backward ordering criteria governing the body relations according to the ontology.

Table 2: Experiment results with maximum rule length 2 and 3

5.2 Knowledge Graph Completion

We evaluate our proposed model in inductive and transductive link prediction tasks on two widely used knowledge graphs WordNet [22, 28] and Freebase [3]. WordNet is a knowledge base constructed to produce an intuitively usable dictionary, and Freebase is a growing knowledge base of general facts. In the experiment we use WN18RR [10], a subset of WordNet, and FB15K-237 [41], which both are more challenging versions of WN18 and FB15K [4] respectively. The statistics of these knowledge bases are summarized in Table 3. We also present our results on WN18 [4] in the appendix.

For transductive link prediction we compare DRUM to several state-of-the-art models, including DistMult [45], ComplEx [42], Gaifman [31], TransE [4], ConvE [10], and most importantly Neural-LP. Since NTP(-) [34] are not scalable to WN18 or FB15K, we could not present results on larger datasets. Also dILP [14], unlike our method requires negative examples which is hard to obtain under Open World Assumption (OWA) of modern KGs and dILP is memory-expensive as authors admit, which cannot scale to the size of large KGs, thus we can not compare numerical results here.

In this experiment for DRUM we set the rank of the estimator L = 3 for both datasets. The results are reported without any hyperparamter tuning. To train the model, we split the training file into facts and new training file with the ratio of three to one. Following the evaluation method in Bordes et al. [4], we use filtered ranking; table 4 summarizes our results.

Table 4: Transductive link prediction results. The results are taken from [25, 46] and [40]

Table 5: Inductive link prediction Hits@10 metrics.

Table 6: Human assessment of number of consecutive correct rules

The results clearly show DRUM empirically outperforms Neural-LP for all metrics on both datasets. DRUM also achieves state of the art Hit@1, Hit@3 as well as MRR on WN18RR among all methods (including the embedding based ones).

It is important to note that comparing DRUM with embedding based methods solely on accuracy is not a fair comparison, because unlike DRUM they are black-box models that do not provide interpretability. Also, as we will demonstrate next, embedding based methods are not capable of reasoning on previously unseen entities.

For the inductive link prediction experiment, the set of entities in the test and train file need to be disjoint. To force that condition, after randomly selecting a subset of test tuples to be the new test file, we omit any tuples from the training file with the entity in the new test file. Table 5 summarizes the inductive results for Hits@10.

It is reasonable to expect a significant drop in the performance of the embedding based methods in the inductive setup. The result of Table 5 clearly shows that fact for the TransE method. The table also demonstrates the superiority of DRUM to Neural LP in the inductive regime. Also for Hits@1 and Hits@3, the results of DRUM are about 1 percent better than NeuralLP and for the TransE all the values are very close to zero.

5.3 Quality and Interpretability of the Rules

As stated in Section 1, an important advantage of rules as a reasoning mechanism is their comprehensibility by humans. To evaluate the quality and interpretability of rules mined by DRUM we perform two experiments. Throughout this section we use the family dataset for demonstration purposes as it is more tangible. Other datasets like umls yield similar results.

We use human annotation to quantitatively assess rule quality of DRUM and Neural LP. For each system and each head predicate, we ask two blind annotators2 to examine each system’s sorted list of rules. The annotators were instructed to identify the first rule they perceive as erroneous. Table 6 depicts the number of correct rules before the system generates an erroneous rule.

The results of this experiment demonstrate that rules mined by DRUM appear to be better sorted and are perceived to be more accurate.

We also sort the rules generated by each system based on their assigned confidences and show the three top rules3 in Table 7. Logically incorrect rules are highlighted by italic-red. This experiment shows two of the three top ranked rules generated by Neural LP are incorrect (for both head predicates wife and son).

These errors are inevitable because it can be shown that for rules of maximum length (T), the estimator of Neural LP provides a rank one estimator for the confidence value tensor described in Section 4.2. Thus according to Theorem 1 the second highest confidence rule generated by Neural LP has to share a body atom with the first rule. For example the rule even though incorrect, has a high confidence due to sharing the body atom brother with the highest confidence rule (first rule). Since DRUM does not have this limitation it can be seen that the same does not happen for rules mined by DRUM.

Table 7: Top 3 rules obtained by each system learned on family dataset

6 Conclusion

We present DRUM, a fully differentiable rule mining algorithm which can be used for inductive and interpretable link prediction. We provide intuition about each part of the algorithm and demonstrate its empirical success for a variety of tasks and benchmark datasets.

DRUM’s objective function is based on the Open World Assumption of KBs and is trained using only positive examples. As a possible future work we would like to modify DRUM to take advantage of negative sampling. Negative sampling has shown empirical success in representation learning methods and it may also be useful here. Another direction for future work would be to investigate an adequate way of combining differential rule mining with representation learning techniques.

Acknowledgments

We thank Kazem Shirani for his valuable feedback. We thank Anthony Colas and Sourav Dutta for their help in human assessment of the rules. This work is partially supported by NSF under IIS Award #1526753 and DARPA under Award #FA8750-18-2-0014 (AIDA/GAIA).

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

[2] I. Balaževi´c, C. Allen, and T. M. Hospedales. Tucker: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590, 2019.

[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. AcM, 2008.

[4] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.

[5] Y. Chen, S. Goldberg, D. Z. Wang, and S. S. Johri. Ontological pathfinding: Mining first-order knowledge from large knowledge bases. In Proceedings of the 2016 International Conference on Management of Data, pages 835–846. ACM, 2016.

[6] W. W. Cohen. Tensorlog: A differentiable deductive database. arXiv preprint arXiv:1605.06523, 2016.

[7] R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. arXiv preprint arXiv:1711.05851, 2017.

[8] R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. International Conference on Learning Representations., 2018.

[9] L. De Raedt, A. Dries, I. Thon, G. Van den Broeck, and M. Verbeke. Inducing probabilistic relational rules from probabilistic examples. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[10] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. Convolutional 2d knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[11] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 601–610. ACM, 2014.

[12] T. Ebisu and R. Ichise. Toruse: Knowledge graph embedding on a lie group. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[13] J. Ellis, J. Getman, D. Fore, N. Kuster, Z. Song, A. Bies, and S. M. Strassel. Overview of linguistic resources for the tac kbp 2015 evaluations: Methodologies and results. In TAC, 2015.

[14] R. Evans and E. Grefenstette. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.

[15] L. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek. Fast rule mining in ontological knowledge bases with amieplus. The VLDB Journal—The International Journal on Very Large Data Bases, 24(6):707–730, 2015.

[16] P. Goyal and E. Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.

[17] K. Guu, J. Miller, and P. Liang. Traversing knowledge graphs in vector space. arXiv preprint arXiv:1506.01094, 2015.

[18] E. Hajiramezanali, A. Hasanzadeh, N. Duffield, K. R. Narayanan, M. Zhou, and X. Qian. Variational graph recurrent neural networks. arXiv preprint arXiv:1908.09710, 2019.

[19] A. Hasanzadeh, E. Hajiramezanali, N. Duffield, K. R. Narayanan, M. Zhou, and X. Qian. Semi-implicit graph variational auto-encoders. arXiv preprint arXiv:1908.07078, 2019.

[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997.

[21] S. M. Kazemi, R. Goel, K. Jain, I. Kobyzev, A. Sethi, P. Forsyth, and P. Poupart. Relational representation learning for dynamic (knowledge) graphs: A survey. arXiv preprint arXiv:1905.11485, 2019.

[22] A. Kilgarriff. Wordnet: An electronic lexical database, 2000.

[23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[24] S. Kok and P. Domingos. Statistical predicate invention. In Proceedings of the 24th international conference on Machine learning, pages 433–440. ACM, 2007.

[25] T. Lacroix, N. Usunier, and G. Obozinski. Canonical tensor decomposition for knowledge base completion. arXiv preprint arXiv:1806.07297, 2018.

[26] N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics, 2011.

[27] X. V. Lin, R. Socher, and C. Xiong. Multi-hop knowledge graph reasoning with reward shaping. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3243–3253, 2018.

[28] G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11): 39–41, 1995.

[29] P. Minervini, M. Bosnjak, T. Rocktäschel, and S. Riedel. Towards neural theorem proving at scale. arXiv preprint arXiv:1807.08204, 2018.

[30] S. Muggleton. Inverse entailment and progol. New generation computing, pages 245–286, 1995.

[31] M. Niepert. Discriminative gaifman models. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 3413–3421, USA, 2016. Curran Associates Inc. ISBN 978-1-5108-3881-9. URL http://dl.acm.org/citation.cfm?id= 3157382.3157479.

[32] P. G. Omran, K. Wang, and Z. Wang. Scalable rule learning via learning representation. In IJCAI, pages 2149–2155, 2018.

[33] J. R. Quinlan. Learning logical definitions from relations. Machine learning, pages 239–266, 1990.

[34] T. Rocktäschel and S. Riedel. End-to-end differentiable proving. In Advances in Neural Information Processing Systems, pages 3788–3800, 2017.

[35] A. Sadeghian, M. Rodriguez, D. Z. Wang, and A. Colas. Temporal reasoning over event knowledge graphs. n 1st Workshop on Knowledge Base Construction, Reasoning and Mining, 2065:54–57, 2016.

[36] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.

[37] P. Schüller and M. Benz. Best-effort inductive logic programming via fine-grained cost-based hypothesis generation. Machine Learning, 107(7):1141–1169, 2018.

[38] R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pages 926–934, 2013.

[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM, 2007.

[40] Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang. Rotate: Knowledge graph embedding by relational rotation in complex space. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HkgEQnRqYQ.

[41] K. Toutanova and D. Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66, 2015.

[42] T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2071–2080, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/trouillon16.html.

[43] D. Vrandeˇci´c and M. Krötzsch. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85, 2014.

[44] W. Y. Wang, K. Mazaitis, and W. W. Cohen. Structure learning via parameter learning. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1199–1208. ACM, 2014.

[45] B. Yang, W. tau Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. CoRR, abs/1412.6575, 2015.

[46] F. Yang, Z. Yang, and W. W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems, pages 2319–2328, 2017.

Appendices A Proof of Theorem 1

Theorem 1. If are two rules of length T obtained by optimizing the objective related to with confidence values , then there exists rules of length , with confidence values such that:

where d(., .) is a distance between two rules of the same size defined as the number of mismatched atoms between them.

where , is the objective related to the model. The confidence value of a rule of length T, for instance S, with body

Therefore changing a body atom , does not decrease the confidence value iff be the maximum element of the sequence . By consequently changing ) we obtain a sequence of rules of length T, with non-decreasing confidence values. The distance between any two consecutive elements in that sequence is 1. The last element of the sequence () is the rule with the highest confidence value among length T rules, and the length of the sequence is

To prove the theorem, it is sufficient to substitute to obtain two sequences of rules, with lengths , respectively. by reversing the sequence related to and concatenate it with the other sequence, we have a sequence of length satisfying the conditions required to prove the theorem (after excluding

The confidences values for the rules in the sequence satisfy the condition all the rules in the sequence related to ) and have larger or equal confidence value to ). And since d(., .) is a valid distance function it satisfies the triangle inequality; therefore , which implies

B Extension to the table 2: Comparison with other reasoning methods

Table 8: Statistical Relation Learning comparison with other reasoning methods.

C Extension to the table 4: WN18 dataset results

Table 9: Transductive link prediction results

D Extension to the table 7: top 10 rules obtained by each system

Table 10: Examples of top rules obtained by each system learned on family dataset

designed for accessibility and to further open science