Taxonomies have been fundamental to organizing knowledge for centuries [45]. In today’s Web, taxonomies provide valuable knowledge to support many applications such as query understanding [17], content browsing [54], personalized recommendation [18, 63], and web search [29, 53]. For example, many online retailers (e.g., eBay and Amazon) organize products into categories of different granularities, so that customers can easily search and navigate this category taxonomy to find the items they want to purchase. In addition,
Figure 1: An example of expanding one computer science field-of-study taxonomy to include new concepts such as “
web search engines (e.g., Google and Bing) leverage a taxonomy to better understand user queries and improve the search quality.
Existing taxonomies are mostly constructed by human experts or in a crowdsourcing manner. Such manual curations are timeconsuming, labor-intensive, and rarely complete. To reduce the human efforts, many automatic taxonomy construction methods [31, 41, 60] are proposed. They first identify “is-A” relations (is an “Electronics”) using textual patterns [16, 38] or distributional similarities [3, 43], and then organize extracted concept pairs into a directed acyclic graph (DAG) as the output taxonomy [10, 14, 24]. As the web contents and human knowledge are constantly growing, people need to expand an existing taxonomy to include new emerging concepts. Most of previous methods, however, construct a taxonomy entirely from scratch and thus when we add new concepts, we have to re-run the entire taxonomy construction process. Although being intuitive, this approach has several limitations. First, many taxonomies have a top-level design provided by domain experts and such design shall be preserved. Second, a newly constructed taxonomy may not be consistent with the old one, which can lead to instabilities of its dependent downstream applications. Finally, as targeting the scenario of building taxonomy from scratch, most previous methods are unsupervised and cannot leverage signals from the existing taxonomy to construct a new one.
In this paper, we study the taxonomy expansion task: given an existing taxonomy and a set of new emerging concepts, we aim to automatically expand the taxonomy to incorporate these new concepts (without changing the existing relations in the given taxonomy).1 Figure 1 shows an example where a taxonomy in computer science domain is expanded to include new subfields (Computing”) and new techniques (
Some previous studies [21, 22, 39] attempt this task by using an additional set of labeled concepts with their true insertion positions in the existing taxonomy. However, such labeled data are usually small and thus forbid us from learning a more powerful model that captures the subsumption semantics in the existing taxonomy.
We propose a novel framework named TaxoExpan to tackle the lack-of-supervision challenge. TaxoExpan formulates a taxonomy as a directed acyclic graph (DAG), automatically generates pseudotraining data from the existing taxonomy, and uses them to learn a matching model for expanding a given taxonomy. Specifically, we view each concept in the existing taxonomy as a query and one of its parent concepts as an anchor. This gives us a set of positive query concept, anchor concept
pairs. Then, we generate negative pairs by sampling those concepts that are neither the descendants nor the direct parents of the query concept in the existing taxonomy. In Figure 1, for example, the
”, “
is a positive pair and
”, “
is a negative pair. We refer to these training pairs as self-supervision data, because they are procedurally generated from the existing taxonomy and no human curation is involved.
To make the best use of above self-supervision data, we develop two novel techniques in TaxoExpan. The first one is a position-enhanced graph neural network (GNN) which encodes the local structure of an anchor concept using its ego network (egonet) in the existing taxonomy. If we view this anchor concept as the “parent” of the query concept, this ego network includes the potential “siblings” and “grand parents” of the query concept. We apply graph neural networks (GNNs) to model this ego network. However, regular GNNs fail to distinguish nodes with different relative positions to the query (i.e., some nodes are grand parents of the query while the others are siblings of the query). To address this limitation, we present a simple but effective enhancement to inject such position information into GNNs using position embedding. We show that such embedding can be easily integrated with existing GNN architectures (e.g., GCN [23] and GAT [50]) and significantly boosts the prediction performance. The second technique is a new noise-robust training scheme based on the InfoNCE loss [47]. Instead of predicting whether each individual query concept, anchor concept
is positive or not, we first group all pairs sharing the same query concept into a single training instance and learn a model to select the positive pair among other negative ones from the group. We show that such training scheme is robust to the label noise and leads to performance gains.
We test the effectiveness of TaxoExpan framework on three real-world taxonomies from different domains. Our results show that TaxoExpan can generate high-quality concept taxonomies in scientific domains and achieves state-of-the-art performance on the WordNet taxonomy expansion challenge [22].
Contributions. To summarize, our major contributions include: (1) a self-supervised framework that automatically expands existing taxonomies without manually labeled data; (2) an effective method for enhancing graph neural network by incorporating hierarchical positional information; (3) a new training objective that enables the learned model to be robust to label noises in self-supervision data; and (4) extensive experiments that verify both the effectiveness and the efficiency of TaxoExpan framework on three real-world large-scale taxonomies from different domains.
The rest of the paper is organized as follows. Section 2 discusses the related work. Section 3 formalizes our problem. Then, we present our TaxoExpan framework in Section 4 and conduct experiments in Section 5. Finally, we conclude this paper in Section 6.
We review two lines of related work: taxonomy construction and graph neural network.
Taxonomy Construction and Expansion. Automatic taxonomy construction is a long-standing task in the literature. Most existing approaches focus on building the entire taxonomy by first extracting hypernym-hyponym pairs and then organizing all hypernymy relations into a tree or DAG structure. For the first hypernymy discovery step, methods fall into two categories: (1) pattern-based methods which leverage pre-defined patterns [1, 16, 19, 34] to extract hypernymy relations from a corpus, and (2) distributional methods which calculate pairwise term similarity metrics based on term embeddings [27, 30, 37, 52] and use them to predict whether two terms hold the hypernymy relation. For the second hypernymy organization step, most methods formulate it as a graph optimization problem. They first build a noisy hypernymy graph using hypernymy pairs extracted and then derive the output taxonomy as a particular tree or DAG structure (e.g., maximum spanning tree [5, 35], optimal branching [49], and minimum-cost flow [14]) from the hypernymy graph. Finally, there are some methods that leverage entity set expansion techniques [40, 62] to incrementally construct a taxonomy either from scratch or from a tiny seed taxonomy.
In many real-world applications, some existing taxonomies may have already been laboriously curated by experts [12, 28] or via crowdsourcing [32], and are deployed in online systems. Instead of constructing the entire taxonomy from scratch, these applications demand the feature of expanding an existing taxonomy dynamically. There exist some studies on expanding WordNet with named entities from Wikipedia [46] or domain-specific concepts from different corpora [4, 6, 13, 21]. Task 14 of SemEval 2016 challenge [22] is specifically setup to enrich WordNet with concepts from domains like health, sport, and finance. One limitation of these approaches is that they depend on the synset structure unique to WordNet and thus cannot be easily generalized to other taxonomies.
To address the above limitation, more recent works try to develop methodologies for expanding a generic taxonomy. Wang et al. [51] design a hierarchical Dirichlet model to extend the category taxonomy in search engines using query logs. Plachouras et al. [36] learn paraphrase models on external paraphrase datasets and apply learned models to directly find paraphrases of concepts in the existing taxonomy. Vedula et al. [48] combine multiple features, some of which are retrieved from an external Bing Search API, into a ranking model to score candidate positions in terms of their matching scores with the query concept. Aly et al. [2] first learn term embeddings in a hyperbolic space and then attach each new concept to its most similar node in the existing taxonomy based on the hyperbolic embeddings. Comparing with these methods, our TaxoExpan framework has two advantages. First, it requires no external data resource and makes full use of the existing taxonomy as the self supervision, which leads to a broader application scope. Second, TaxoExpan explicitly models the local structure around each candidate position, which boosts the quality of expanded taxonomy. Graph Neural Network. Our work is also related to Graph Neural Network (GNN) which is a generic method of learning on graphstructure data. Many GNN architectures have been proposed to either learn individual node embeddings [8, 15, 23, 50] for the node classification and the link prediction tasks or learn an entire graph representation [25, 56, 61] for the graph classification task. In this work, we tackle the taxonomy expansion task with a fundamentally different formulation from previous tasks. We leverage some existing GNN architectures and enrich them with additional relative position information. Recently, You et al. [58] propose a method to add position information into GNN. Our methods are different from You et al.. They model the absolute position of a node in a full graph without any particular reference points; while our technique captures the relative position of a node with respect to the query node. Finally, some work on graph generation [20, 26, 57] involves a module to add a new node into a partially generated graph, which shares the similar goal as our model. However, such graph generation model typically requires fully labeled training data to learn from. To the best of our knowledge, this is the first study on how to expand an existing directed acyclic graph (as we model a taxonomy as a DAG) using self-supervised learning.
In this section, we first define a taxonomy, then formulate our problem, and finally discuss the scope of our study.
Taxonomy. A taxonomy T = (N, E) is a directed acyclic graph where each node represents a concept (i.e., a word or a phrase) and each directed edge
indicates a relation expressing that concept
is the most specific concept that is more general than concept
. In other words, we refer to
as the “
Problem Definition. The input of the taxonomy expansion task includes two parts: (1) an existing taxonomy , and (2) a set of new concepts C. This new concept set can be either manually specified by users or automatically extracted from text corpora. Our goal is to expand the existing taxonomy
into a larger taxonomy
is a set of newly discovered relations each including one new concept
Example 1. Figure 1 shows an example of our problem. Given a field-of-study taxonomy in the computer science domain and a set of new concepts
“Meta Learning”, . . . }, we find each new concept’s best position in
, “UDA” under “Semi-supervised Learning” as well as “GPU” under “Integrated Circuit”) and expand
to include those new concepts.
Simplified Problem. A simplified version of the above problem is that we assume the input set of new concepts contains only one element (i.e., |C| = 1), and we aim to find one single parent node of this new concept (i.e., |R| = 1). We discuss the connection between these two problem settings at the end of Section 4.1.
Discussion. In this work, we follow previous studies [2, 22, 48] and assume each concept in has an initial embedding vector learned from this concept’s surface name, or if available, its definition sentences [39] and associated web pages [51]. We also note that our problem formulation assumes those relations in the existing taxonomy are not modified. We acknowledge that such modification is necessary in some cases, but it is much less frequent and requires high cautiousness from human curators. Therefore, we leave it out of the scope of automation in this study.
In this section, we first introduce our taxonomy model and expansion goal. Then, we elaborate how to represent a query concept and an insertion position (i.e., an anchor concept), based on which we present our query-concept matching model. Finally, we discuss how to generate self-supervision data from the existing taxonomy and use them to train the TaxoExpan framework.
4.1 Taxonomy Model and Expansion Goal
A taxonomy T describes a hierarchical organization of concepts. These concepts form the node set N in T. Mathematically, we model each node as a categorical random variable and the entire taxonomy T as a Bayesian network. We define the probability of a taxonomy T as the joint probability of node set N which can be further factorized into a set of conditional probabilities as follows:
where is the set of model parameters and
is the set of
’s parent node(s) in taxonomy T.
Given learned model parameters , an existing taxonomy
, and a set of new concepts C, we can ideally find the best taxonomy
by solving the following optimization problem:
This naïve approach has two limitations. First, the search space of all possible taxonomies over the concept set is prohibitively large. Second, we cannot guarantee the structure of existing taxonomy
remains unchanged, which can be undesirable from the application point of view.
We address the above limitations by restricting the search space of our output taxonomy to be the exact expansion of the existing taxonomy . Specifically, we keep the parents of each existing taxonomy node
unchanged and only try to find a single parent node of each new concept in C. As a result, we divide the above computationally intractable problem into the following set of |C| tractable optimization problems:
where is the parent node of a new concept
and we refer to it as the “
Discussion. The above equation defines |C| independent optimization problems and each problem aims to find one single parent of a new concept . Therefore, we essentially reduce the more generic taxonomy expansion problem into |C| independent simplified problems (c.f. Section 3) and tackle it by inserting new concepts
Figure 2: Two egonets correspond to two anchor concepts.
one-by-one into the existing taxonomy. As a result of the above reduction, possible interactions among new concepts are ignored and we leave it to the future work. In the following sections, we continue to answer two keys questions: (1) how to model the conditional probability , and (2) how to learn model parameters
4.2 Modeling Query-Anchor Matching
We model the matching score between a query concept and an anchor concept
by projecting them into a vector space and calculating matching scores using their vectorized representations. We show the entire model architecture of TaxoExpan in Figure 3.
4.2.1 Representing Query Concept.
In this study, we assume each query concept has an initial feature vector learned based on some text associated with this concept. Such text can be as simple as the concept surface name, or in some prior studies [22, 51], the definition sentences and clicked web pages about the concept. We represent each query concept using its initial feature vector denoted as
. We will discuss how to obtain such initial feature vectors using embedding learning methods in the experiment section.
4.2.2 Representing Anchor Concept.
Each anchor concept corresponds to one node in the existing taxonomy that could be the “parent” of a query concept. One naïve way to represent an anchor concept is to directly use its initial feature vector. A key limitation of this approach is that it captures only the “parent” node information and loses other surrounding nodes’ signals, which could be crucial for determining whether the query concept should be put in this position. We illustrate this limitation below:
Example 2. Suppose we are given a query concept “high dependency unit” to predict whether it should be under the “hospital room” node (i.e., an anchor concept) in an existing taxonomy. As these two concepts have dissimilar embeddings based on their surface names, we may believe this query concept shouldn’t be placed underneath this anchor concept. However, if we know that this anchor concept has two children nodes, i.e., “intensive care unit” and “low dependency unit”, that are closely related to the query concept, we are more likely to put the query concept under this anchor concept, correctly.
The above example demonstrates the importance of capturing local structure information in the anchor concept representation. We model the anchor concept using its ego network. Specifically, we consider the anchor concept to be the “parent” node of a query concept. The ego network of the anchor concept consists of the “sibling” nodes and “grand parent” nodes of the query concept, as
shown in Figure 2. We represent the anchor concept based on its ego network using a graph neural network. Graph Neural Network Architectures. Given an anchor concept with its corresponding ego network
and its initial representation
, we use a graph neural network (GNN) to generate its final representation
. This GNN contains two components: (1) a graph propagation module that transforms and propagates node features over the graph structure to compute individual node embeddings in
, and (2) a graph readout module that combines node embeddings into a vector representing the full ego network
final graph embedding encodes all local structure information centered around the anchor concept and we use it as the final anchor representation
A graph propagation module uses a neighborhood aggregation strategy to iteratively update the representation of a node u by aggregating representations of its neighbors N(u) and itself. We denote iterations, a node’s representation captures the structural information within its K-hop neighborhood. Formally, we define a GNN with K-layers as follows:
where is node u’s feature in the k-th layer;
is node u’s initial feature vector, and AGG
is an aggregation function in the k-th layer. We instantiate AGG
using two popular architectures: Graph Convolutional Network (GCN) [23] and Graph Attention Network (GAT) [50]. GCN defines the AGG function as follows:
where is a normalization constant (same for all layers);
is a non-linear function (e.g., ReLU), and
the learnable weight matrix. If we interpret
of nodev’s feature to nodeu, GCN calculates it using only the graph structure without leveraging the node features. GAT addresses this limitation by defining
as follows:
where both and
are learnable parameters;
is another non-linear function (e.g., LeakyReLU), and “
” represents the concatenation operation. Plugging the above
into Eq. (3) we obtain the aggregation function in a single-head GAT. Finally, We execute M independent transformations of Eq. (3) and concatenate their output features to compose the final output embedding of node u. This defines the aggregation function in a multi-head GAT (with M heads) as follows:
where -th weight matrix in the m-th attention head. After obtaining each node’s final representation
, we generate the ego network’s representation
using a graph readout
Figure 3: Overview of TaxoExpan framework. is an embedding model that provides query concept’s initial feature vector
and the initial feature vector of each node in the egonet. The graph propagation module transforms initial feature vectors into better node representations based on which the graph readout module outputs the egonet embedding as the final anchor representation. Finally, a matching module inputs both query and anchor representations and outputs their matching score.
module as follows:
where READOUT is a permutation invariant function [59] such as element-wise mean or sum.
Position-enhanced Graph Neural Networks. One key limitation of the above GNN model is that they fail to capture each node’s position information relative to the query concept. Take Figure 2 as an example, the “hospital room” node in the left ego network is the anchor node itself while in the right ego network it is the child of the anchor node. Such position information will influence how node feature propagates within the ego network and how the final graph embedding is aggregated.
enhanced graph neural networks. The key idea is to learn a set of “position embeddings” and enrich each node feature with its corresponding position embedding. We denote node u’s position as and its position embedding at k-th layer as
. We re- place each node feature
with its position-enhanced version
in Eqs. (3-5) and adjust the dimensionality of
accordingly. Such position embeddings help us to learn better node representations from two aspects. First, we can capture more neighborhood information. Take
in the right hand side of Eq. (3) as an example, we enhance it to the following:
where is another weight matrix used to transform position embeddings. The above equation shows that a node’s new representation is jointly determined by its neighborhoods’ contents (
) and relative positions in the ego network (
). Second, for GAT architecture, we can better model neighbor importance as the term
in Eq. (3) currently depends on both
Furthermore, we propose two schemes to inject position information in the graph readout module. The first one, called weighted mean readout (WMR), is defined as follows:
Figure 4: Self-supervision generation.
where is the parameter indicating the importance of position
. The second scheme is called concatenation readout (CR) which combines the average embeddings of nodes with the same position as follows:
where P is the set of all positions we are modeling and is an indicator function which returns 1 if its internal statement is true and returns 0 otherwise.
4.2.3 Matching Query Concept and Anchor Concept.
Based on the learned query concept representation and anchor concept representation
, we calculate their match score using a matching module
. We study two architectures. The first one is a multi-layer perceptron with one hidden layer, defined as follows:
where are parameters;
is the sigmoid function, and
is the LeakyReLU activation function. The second architecture is a log-bilinear model defined as follows:
where W is a learnable interaction matrix. We choose these MLP and LBM as they are representative architecures in linear and bilinear interaction models, respectively.
4.3 Model Learning and Inference
The above section discusses how to model query-anchor matching using a parameterized function . In this section, we first introduce how we learn those parameters
using self-supervision from the existing taxonomy. Then, we establish the connection between the matching score with the conditional probability
, and discuss how to conduct model inference.
Self-supervision Generation. Figure 4 shows the generation process of self supervision data. Given one edge in the existing taxonomy
, we first construct a positive
anchor, query
pair by using child node
as the “query” and parent node
as the “anchor”. Then, we construct N negative pairs by fixing the query node
and randomly selecting N nodes
that are neither parents nor descendants of
. These N + 1 pairs (one positive and N negatives) collectively consist of one training instance
By repeating the above process for each edge in
, we obtain the full self-supervision dataset
. Notice that a node with C parents in
will derive C training instances in X.
Model Training. We learn our model on X using the InfoNCE loss [47] as follows:
where the subscript . If j = 1,
is a positive pair, otherwise,
is a negative pair. The above loss is the cross entropy of classifying the positive pair
correctly, with
as the model prediction. Optimizing this loss results in
estimating the following probability density (up to a multiplicative constant):
We prove the above result in Appendix and summarize our selflearning procedure in Algorithm 1. We establish the connection between matching score with the probability
in Eq. 1 as follows:
We elaborate the implication of this equation below. Model Inference. At the inference stage, we are given a new query concept and apply the learned model
to predict its parent node in the existing taxonomy
. Mathematically, we aim to find the anchor position
that maximizes
, which is equivalent to maximizing
because of Eq. (13) and the fact that
is the same across all positions. Therefore, we rank all candidate positions
based on their matching scores with
and select the top ranked one as the predicted parent node of this query concept. Although we currently select only the top one as query’s single parent, we can also choose top-k ones as query’s parents, if needed. Summary. Given an existing taxonomy and a set of new concepts, our TaxoExpan first generates a set of self-supervision data and learns its internal model parameters using Algorithm 1. For each new concept, we run the inference procedure and find its best parent node in the existing taxonomy. Finally, we place these new
concepts underneath their predicted parents one at a time, and output the expanded taxonomy.
Computational Complexity Analysis. At the training stage, our model uses training instances every epoch and thus scales linearly to the number of edges in the existing taxonomy. At the inference stage, for each query concept, we calculate
scores, one for every existing node in
. Although such
cost per query is expensive, we can significantly reduce it using two strategies. First, most computation efforts of TaxoExpan are matrix multiplications and thus we use GPU for acceleration. Second, as the graph propagation and graph readout modules are queryindependent (c.f. Fig. 4), we pre-compute all anchor representations and cache them. When a set of queries are given, we only run the matching module. In practice, it takes less than 30 seconds to calculate all matching scores between 2,450 queries with over 24,000 anchor positions on a single K80 GPU.
In this section, we study the performance of TaxoExpan on three large-scale real-world taxonomies.
5.1 Expanding MAG Field-of-Study Taxonomy
5.1.1 Datasets. We evaluate TaxoExpan on the public Field-of-Study (FoS) Taxonomy2 in Microsoft Academic Graph (MAG) [44]. This FoS taxonomy contains over 660 thousand scientific concepts and more than 700 thousand taxonomic relations. Although being constructed semi-automatically, this taxonomy is of high quality, as shown in the previous study [42]. Thus we treat each concept’s original parent nodes as its correct anchor positions. We remove all concepts that have no relation in the original FoS taxonomy and then randomly mask 20% of leaf concepts (along with their relations) for validation and testing3. The remaining FoS taxonomy is then treated as the input existing taxonomy. We refer to this dataset as MAG-Full. Based on MAG-Full, we construct another dataset focusing on the computer science domain. Specifically, we first select a subgraph consisting of all descendants of “computer science” node and then mask 10% of leaf concepts in this subgraph
Table 1: Dataset Statistics. |N| and |E| are the number of nodes and edges in the existing taxonomy. |D| indicates the taxonomy depth and |C| is the number of new concepts.
for validation and another 10% of leaf nodes for testing. We name this dataset as MAG-CS.
To obtain the initial feature vector, we first construct a corpus that consists of all paper abstracts mentioning at least one concept in the original MAG dataset. Then, we use “ ” to concatenate all tokens in one concept (e.g., “machine learning” “ma-chine_learning”) and learn 250-dimension word embeddings using skipgram model in word2vec4 [33]. Finally, we use these learned embeddings as the initial feature vector. Table 1 lists the statistics of these two datasets. All datasets and our model implementations are available at: https://github.com/mickeystroller/TaxoExpan.
5.1.2 Evaluation Metrics. As our model returns a rank list of all candidate parents for each input query concept, we evaluate its performance using the following three ranking-based metrics.
• Mean Rank (MR) measures the average rank position of a query concept’s true parent among all candidates. For queries with multiple parents, we first calculate the rank position of each individual parent and then take the average of all rank positions. Smaller MR value indicates better model performance.
• Hit@k is the number of query concepts whose parent is ranked in the top k positions, divided by the total number of queries.
• Mean Reciprocal Rank (MRR) calculates the reciprocal rank of a query concept’s true parent. We follow [55] and use a scaled version of MRR in the below equation:
where parent(c) represents the parent node set of the query concept c, and is the rank position of query concept c’s true parent i. We scale the original MRR by a factor 10 in order to amplify the performance gap between different methods.
5.1.3 Compared Methods. We compare the following methods:
(1) Closest-Parent: A rule-based method which first scores each candidate position in the existing taxonomy based on its cosine distance to the query concept between their initial embedding, and then ranks all positions using this score. The position with the smallest distance is chosen to be query concept’s parent.
(2) Closest-Neighbor: Another rule-based method that scores each position based on its distance to the query concept plus the average distance between its children nodes and the query.
(3) dist-XGBoost: A self-supervised boosting method that works directly on 39 manually-designed features generated using initial node embeddings without any embedding transformation. We input these features into XGBoost [9], a tree-based boosting
model, to predict the matching score between a query concept and a candidate position.
(4) ParentMLP: A self-supervised method that first concatenates the query concept embedding with the candidate position embedding and then feeds them into a Multi-Layer Perceptron (MLP) for prediction.
(5) DeepSetMLP: Another self-supervised method that extends ParentMLP by adding information of candidate position’s children nodes. Specifically, we first use DeepSet architecture [59] to generate the representation of the children node set and then concatenate it with query & candidate position representations before the final MLP module.
(6) TaxoExpan: Our proposed framework using position-enhanced GAT (PGAT) as graph propagation module, weighted mean readout (WMR) for graph readout, and log-bilinear model (LBM) for query-anchor matching. We learn this model using our proposed InfoNCE loss.
5.1.4 Implementation Details and Parameter Settings. For a fair comparison, we use the same 250-dimension embeddings across all compared methods. We use Google’s original word2vec implementation5 for learning embeddings and employ gensim6 to load trained embeddings for calculating term distances in Closest-Parent, Closest-Neighbor, and dist-XGBoost methods. For the other three methods, we implement them using PyTorch and DGL framework7. We tune hyper-parameters in all self-supervised methods on the masked validation set. For TaxoExpan, we use a two-layer position-enhanced GAT where the first layer has four attention heads (of size 250) and the second layer has one attention head (of size 500). For both layers, we use 50-dimension position embeddings and apply dropout with rate 0.1 on the input feature vectors. We use Adam optimizer with initial learning rate 0.001 and ReduceLROnPlateau scheduler8 with three patience epochs. We discuss the influence of these hyper-parameters in the next subsection.
5.1.5 Experimental Results. We present the experimental results in the following aspects.
1. Overall Performance. Table 2 presents the results of all compared methods. First, we find that Closest-Neighbor method clearly outperforms Closest-Parent method and DeepSetMLP is much better than ParentMLP. This demonstrates the effectiveness of modeling local structure information. Second, we compare dist-XGBoost method with Closest-Neighbor and show that self-supervision indeed helps us to learn an effective way to combine various neighbor distance information. All four self-supervised methods outperform rule-based methods. Finally, our proposed TaxoExpan has the overall best performance across all the metrics and defeats the second best method by a large margin.
2. Ablation Analysis of Model Architectures. TaxoExpan con-
tains three key components: a graph propagation module, a graph readout module, and a matching model. Here, we study how different choices of these components affect the performance of TaxoExpan. Table 3 lists the results and the first column contains the index of each model invariant.
Table 2: Overall results on MAG-CS and MAG-Full datasets. We run all methods three times and report the averaged result with standard deviation. Note that smaller MR indicates better model performance. For all other metrics, larger values indicate better performance. We highlight the best two models in terms of the average performance under each metric.
Table 3: Ablation analysis of model architectures on MAG- CS dataset. We assign an index to each model variant (shown in the first column). All models are run three times with their averaged scores reported.
First, we analyze graph propagation module by using simple average scheme for graph readout and MLP for matching. By comparing model 1 to model 3 and model 2 to model 4, we can see that graph attention architecture (GAT) is better than graph convolution architecture (GCN). Furthermore, the position-enhanced variants clearly outperform their non-position counterparts (model 3 versus model 1 and model 4 versus model 2). This illustrates the efficacy of the position embeddings in the graph propagation module.
Second, we study graph readout module by fixing the graph propagation module to be the best two variants among models 1-4. We can see both model 5 & 6 outperform model 3 and model 7 & 8 outperform model 4. This signifies that the position information also helps in the graph readout module. However, the best strategy of incorporating position information depends on the graph propagation module. The concatenation readout scheme works better for PGCN while the weighted mean readout is better for PGAT. One possible explanation is that the concatenation readout leads to more parameters in matching model and as PGAT itself has more parameters than PGCN, further introducing more parameters in PGAT may cause the model to be overfitted.
Finally, we examine the effectiveness of different matching models. We replace the MLP in models 5-8 with LBM to create model variants 9-12. We can clearly see that LBM works better than MLP. It could be that LBM better captures the interaction between the query representation and the final anchor representation.
Figure 5: Ablation analysis of training schemes on MAG-CS dataset. We compare models trained using Binary Cross Entropy (BCE) loss with those trained using InfoNCE loss.
3. Ablation Analysis of Training Schemes. In this subsection, we evaluate the effectiveness of our proposed training scheme. In this study, we first group a set of positive and negative pairs into one single training instance (c.f. Sect. 4.3) and learn the model using InfoNCE loss (c.f. Eq. (11)). An alternative is to treat these pairs as different instances and train the model using standard binary cross entropy (BCE) loss. Under this training scheme, we formulate our problem as a binary classification task. We compare these two training schemes for the top 4 best models in Table 3 (i.e., model 7, 8, 11, and 12). Results are shown in Figure 5. Our proposed training scheme with InfoNCE loss is overall much better, it beats the BCE loss scheme on 14 out of total 16 cases. One reason is that BCE loss is very sensitive to the noises in the generated self-supervision data while InfoNCE loss is more robust to such label noise. Furthermore, we find that LBM matching can benefit more from our training scheme with InfoNCE loss – with larger margin on all 8 cases, compared with the simple MLP matching.
4. Hyper-parameter Sensitivity Analysis. We analyze how some hyper-parameters in TaxoExpan affect the performance in Figure 6. First, we find that choosing an approximate position embedding dimension is important. The model performance increases as this dimensionality increases until it reaches about 50. When we further increase position embedding dimension, the model will overfit and the performance decreases. Second, we study the effect of negative sampling ratio N. As shown in Figure 6, the model performance first increases as N increases until it reaches about 30 and then
Figure 6: Hyper-parameter sensitivity analysis on MAG-CS dataset. We use PGAT for graph propagation, WMR for graph readout, and LBM for query-graph matching. Model is trained using InfoNCE loss.
becomes stable. Finally, we examine two hyper-parameters controlling the model complexity: the number of heads in PGAT and the final graph embedding dimension. We observe that the best model performance is reached when the number of attention heads falls in range 3 to 5 and the graph embedding dimension is set to 500. Too many attention heads or too large graph embedding dimension will lead to overfit and performance degradation.
5. Efficiency and Scalability. We further analyze the scalability of TaxoExpan and its efficiency during model inference stage. Figure 7 (left) tests the model scalability by running on MAG-CS dataset sampled using different ratios. The training time (of 20 epochs) are measured on one single K80 GPU. TaxoExpan demonstrates a linear runtime trend, which validates our complexity analysis in Sect. 4.3. Second, Figure 7 (right) shows that TaxoExpan is very efficient during model inference stage. Using GPU, TaxoExpan takes less than 30 seconds to predict the anchor positions for all 2450 new query concepts.
6. Case Study. Figure 8 shows some outputs of TaxoExpan on both MAG-CS and MAG-Full datasets. On MAG-CS dataset, we can see that over 20% of queries have their true parents correctly ranked at the first position and less than 1.5% queries have their “true” parents ranked outside of top 1000 positions. Among these 1.5% significantly wrong queries, we find some of them actually have incorrect existing parents. For example, the concept “boils and carbuncles”, which is a disease entity, is mistakenly put under parent node “dataset”. Similar cases also happen on MAG-Full dataset where we find the concept “blood staining” is currently under “
Besides the above label errors, we also observe two common mistake patterns. The first type of mistakes is caused by term ambiguity. For instance, the term “java” in concept “to an island in Indonesia where fruit apple is produced, rather than a programming language used in Apple company. The second type
Figure 7: (Left) Training time of 20 epochs on GPU with respect to % of sampled nodes in the existing taxonomy. (Right) Inference time of all 2450 queries in MAG-CS dataset. Note here y-axis is in logarithm scale.
of mistakes results from term granularity. For example, TaxoExpan outputs the two most likely parent nodes of concept ““
”. Although these two concepts are certainly relevant to “captcha”, they are too general compared to its true parent node “
Finally, we observe that TaxoExpan can return very sensible anchor positions of query concepts, even though they are not exactly the current “true” parents. For example, the concept “refers to a large online medical library and thus is related to both “world wide web” and “library science”. Also, the concept “email hacking” is clearly relevant to both “
7. TaxoExpan for Taxonomy Self-Cleaning. From the above case studies, we find another interesting application of TaxoExpan is to use it for cleaning the existing taxonomy. Specifically, we partition all leaf nodes of the existing taxonomy into 5 groups and randomly mask one group of nodes. Then, we train a TaxoExpan model on the remaining nodes and predict on the masked leaf nodes. Next, we select those entities whose true parents appear at the bottom of the rank lists returned by TaxoExpan (i.e., the long-tail part of two histograms in Figure 8). The parents of those selected entities are highly questionable and calls for further manual inspections. Our preliminary experiments on the MAG-CS taxonomy shows that about 30% of these entities have existing parent nodes which are less appropriate than the parents inferred by TaxoExpan.
5.2 Evaluation on SemEval Task Benchmark
5.2.1 Datasets. We further evaluate TaxoExpan using SemEval Task 14 Benchmark dataset9 [22] which includes WordNet 3.0 as the existing taxonomy and additional 1,000 domain-specific concepts with manual labels, split into 400 training concepts and 600 testing concepts. Each concept is either a verb or a noun and has a textual definition of a few sentences. The original task goal is to enrich the taxonomy by performing two actions for each new concept: (1) attach, where a new concept is treated as a new synset and is attached as a hyponym of one existing synset in WordNet, and (2) merge, where a new concept is merged into an existing synset. However, previous state-of-the-art methods [22, 39, 48], including the winning solution, are only performing the attach operation. In this work, we also follow this convention and attach each new concept to the top-ranked synset in the WordNet. Finally, we obtain the initial feature vectors (for both new concepts and existing words in the WordNet) using pre-trained subword-aware fasttext embeddings10. For each concept, we generate its definition embedding and name embedding by averaging the embedding of each token in its textual definition and name string, correspondingly. Then, we sum the definition and name embeddings of a concept and use them as the initial embeddings for the TaxoExpan model.
5.2.2 Evaluation Metrics. We use the three official metrics defined in original SemEval Task 14 for evaluation:
(1) Accuracy (Wu&P) is the semantic similarity between a predicted parent node and the true parent
, calculated as Wu&P
, where
is the depth of node x is the WordNet taxonomy and
represents the Least Common Ancestor of
(2) Recall is the percentage of concepts for which an attached parent is predicted11.
(3) F1 is the harmonic mean of Wu&P accuracy and recall.
5.2.3 Baseline Methods. We compare the following methods:
(1) FWFS [22]: The original baseline in Task 14. Given a concept c with its definition , this method picks the first word w in
that has the same part of speech as c and treats this word as the parent node of c.
(2) MSejrKU [39]: The winning solution of Task 14. This method leverages distributional and syntactic features to train a SVM classifier which is then used to predict the goodness of fit for a new concept with an existing synset in WordNet.
(3) ETF [48]: The current state-of-the-art method that learns a LambdaMART model with 15 manually designed features, including topological features from the taxonomy’s graph structure and semantic features from corpus and Bing search results.
(4) ETF-FWFS [48]: The ensemble model of FWFS and ETF, which adds the FWFS property as a binary feature into the LambdaMART model in ETF.
(5) dist-XGBoost: The same tree boosting model described in the previous subsection 5.1.3.
(6) TaxoExpan: Our proposed taxonomy expansion framework.
(7) TaxoExpan-FWFS: Similar to ETF-FWFS, this is the ensemble model of FWFS and TaxoExpan. We treat the FWFS heuristic as a binary feature and add it into the final matching module.
For all previous methods, we directly report their best performance in the literature. For the remaining methods, we tune them following the same procedure described in the Section 5.1.4.
5.2.4 Experimental Results. Table 4 shows the experimental results on SemEval dataset. We can see that both dist-XGBoost and TaxoExpan methods can outperform the previous winning system of this task (i.e., MSejrKU) and the baseline ETF. In addition, we can see the FWFS heuristic is indeed very powerful for this dataset and incorporating it as a strong feature can significantly boost the performance. However, this feature requires human-labeled definition sentences and thus can not be easily generalized to taxonomies other than WordNet. Finally, we show that TaxoExpan-FWFS can achieve the new state-of-the-art performance on this dataset.
Table 4: Model performance on SemEval dataset. TaxoExpan versus all previous state-of-the-art methods. We report the best performance of all existing methods in the literature.
This paper studies taxonomy expansion when no human labeled supervision data are given. We propose a novel TaxoExpan framework which generates self-supervision data from the existing taxonomy and learns a position-enhanced GNN model for expansion. To make the best use of self-supervision data, we design a noise-robust objective for effective model training. Extensive experiments demonstrate the effectiveness and efficiency of TaxoExpan on three taxonomies from different domains. Interesting future work includes modeling inter-dependency among new concepts, leveraging current method to cleaning the input existing taxonomy, and incorporating feedbacks from downstream applications (e.g., search & recommendation) to generate more diverse supervision signals for expanding the taxonomy.
Research was sponsored in part by DARPA under Agreements No. W911NF-17-C-0099 and FA8750-19-2-1004, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, and DTRA HDTRA11810026. Any opinions, findings, and conclusions or recommendations expressed in this document are those of the author(s) and should not be interpreted as the views of any U.S. Government. We thank Yuxiao Dong, Ziniu Hu, Li Ma for insightful discussions on this project and anonymous reviewers for valuable feedbacks.
Here we prove that optimizing the loss function in Eq. (11) will result in estimating the probability density in Eq. (12). By construction, X contains query
’s one positive anchor (i.e., its true parent
) sampled from the true distribution
tive anchors
sampled from a uniform distribution
we merge these N +1 anchors into a small set and consider the task of selecting true anchor
’s position
view Eq. (11) as the cross entropy of position distribution ˆP from model prediction relative to the true distribution
. Specifically, the model predicted position distribution ˆ
one of
is the true anchor and all the others are negative
Figure 8: Example output of TaxoExpan on MAG-CS and MAG-Full datasets. We draw a histogram of the ranks of query concepts’ true parents within the rank list returned by TaxoExpan. In subfigure (a), for example, we have 519 (out of 2450) queries that their parents are exactly ranked in the first position.
anchors. Meanwhile, in the true position distribution:
From above, we can see that the optimal value for is proportional to
[1] Eugene Agichtein and Luis Gravano. 2000. Snowball: extracting relations from large plain-text collections. In ACM DL.
[2] Rami Aly, Shantanu Acharya, Alexander Ossa, Arne Köhn, Christian Biemann, and Alexander Panchenko. 2019. Every Child Should Have Parents: A Taxonomy Refinement Algorithm Based on Hyperbolic Term Embeddings. In ACL.
[3] Luis Espinosa Anke, José Camacho-Collados, Claudio Delli Bovi, and Horacio Saggion. 2016. Supervised Distributional Hypernym Discovery via Domain Adaptation. In EMNLP.
[4] Luis Espinosa Anke, José Camacho-Collados, Sara Rodríguez-Fernández, Horacio Saggion, and Leo Wanner. 2016. Extending WordNet with Fine-Grained Collocational Information via Supervised Distributional Learning. In COLING.
[5] Mohit Bansal, David Burkett, Gerard de Melo, and Dan Klein. 2014. Structured Learning for Taxonomy Induction with Belief Propagation. In ACL.
[6] Luisa Bentivogli, Andrea Alejandra Bocco, and Emanuele Pianta. 2003. ArchiWordNet: Integrating WordNet with Domain-Specific Knowledge.
[7] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606 (2016).
[8] Jian Jhen Chen, Tengfei Ma, and Cao Xiao. 2018. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. In ICLR.
[9] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In KDD.
[10] Anne Cocos, Marianna Apidianaki, and Chris Callison-Burch. 2018. Comparing Constraints for Taxonomic Organization. In NAACL.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
[12] Christiane Fellbaum. 1998. WordNet.
[13] Christiane Fellbaum, Udo Hahn, and Barry D. Smith. 2006. Towards new information resources for public health - From WordNet to MedicalWordNet. Journal of biomedical informatics (2006).
[14] Amit Gupta, Rémi Lebret, Hamza Harkous, and Karl Aberer. 2017. Taxonomy Induction Using Hypernym Subsequences. In CIKM.
[15] William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS.
[16] Marti A. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In COLING.
[17] Wen Hua, Zhongyuan Wang, Haixun Wang, Kai Zheng, and Xiaofang Zhou. 2017. Understand Short Texts by Harvesting and Analyzing Semantic Knowledge. TKDE (2017).
[18] Jun Huang, Zhaochun Ren, Wayne Xin Zhao, Gaole He, Ji-Rong Wen, and Daxiang Dong. 2019. Taxonomy-Aware Multi-Hop Reasoning Networks for Sequential Recommendation. In WSDM.
[19] Meng Jiang, Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance M. Kaplan, Timothy P. Hanratty, and Jiawei Han. 2017. MetaPAD: Meta Pattern Discovery from Massive Text Corpora. In KDD.
[20] Wengong Jin, Regina Barzilay, and Tommi S. Jaakkola. 2018. Junction Tree Variational Autoencoder for Molecular Graph Generation. ICML.
[21] David Jurgens and Mohammad Taher Pilehvar. 2015. Reserating the awesometastic: An automatic extension of the WordNet taxonomy for novel terms. In NAACLHLT.
[22] David Jurgens and Mohammad Taher Pilehvar. 2016. SemEval-2016 Task 14: Semantic Taxonomy Enrichment. In SemEval@NAACL-HLT.
[23] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
[24] Zornitsa Kozareva and Eduard H. Hovy. 2010. A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web. In EMNLP.
[25] John Boaz Lee, Ryan A. Rossi, and Xiangnan Kong. 2018. Graph Classification using Structural Attention. In KDD.
[26] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter W. Battaglia. 2018. Learning Deep Generative Models of Graphs. In ICLR.
[27] Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. In ICML.
[28] Carolyn E. Lipscomb. 2000. Medical Subject Headings (MeSH). Bulletin of the Medical Library Association (2000).
[29] Bang Wu Liu, Weidong Guo, Di Niu, Chaoyue Wang, Shang-Zhong Xu, Jinghong Lin, Kunfeng Lai, and Yu Wei Xu. 2019. A User-Centered Concept Mining System for Query and Document Understanding at Tencent. In KDD.
[30] Anh Tuan Luu, Yi Tay, Siu Cheung Hui, and See-Kiong Ng. 2016. Learning Term Embeddings for Taxonomic Relation Identification Using Dynamic Weighting Neural Network. In EMNLP.
[31] Yuning Mao, Xiang Ren, Jiaming Shen, Xiaotao Gu, and Jiawei Han. 2018. End-to-End Reinforcement Learning for Automatic Taxonomy Induction. In ACL.
[32] Rui Meng, Yongxin Tong, Lei Chen, and Caleb Chen Cao. 2015. CrowdTC: Crowdsourced Taxonomy Construction. In ICDM.
[33] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS.
[34] Ndapandula Nakashole, Gerhard Weikum, and Fabian M. Suchanek. 2012. PATTY: A Taxonomy of Relational Patterns with Semantic Types. In EMNLP-CoNLL.
[35] Roberto Navigli, Paola Velardi, and Stefano Faralli. 2011. A Graph-Based Algorithm for Inducing Lexical Taxonomies from Scratch. In IJCAI.
[36] Vassilis Plachouras, Fabio Petroni, Timothy Nugent, and Jochen L. Leidner. 2018. A Comparison of Two Paraphrase Models for Taxonomy Augmentation. In
NAACL-HLT.
[37] Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet Selective: Supervised Distributional Hypernymy Detection. In COLING.
[38] Stephen Roller, Douwe Kiela, and Maximilian Nickel. 2018. Hearst Patterns Revisited: Automatic Hypernym Detection from Large Text Corpora. In ACL.
[39] Michael Sejr Schlichtkrull and Héctor Martínez Alonso. 2016. MSejrKu at SemEval-2016 Task 14: Taxonomy Enrichment by Evidence Ranking. In SemEval@NAACL-HLT.
[40] Jiaming Shen, Zeqiu Wu, Dongming Lei, Jingbo Shang, Xiang Ren, and Jiawei Han. 2017. SetExpan: Corpus-Based Set Expansion via Context Feature Selection and Rank Ensemble. In ECML/PKDD.
[41] Jiaming Shen, Zeqiu Wu, Dongming Lei, Chao Zhang, Xiang Ren, Michelle T. Vanni, Brian M. Sadler, and Jiawei Han. 2018. HiExpan: Task-Guided Taxonomy Construction by Hierarchical Tree Expansion. In KDD.
[42] Zhihong Shen, Hao Ma, and Kuansan Wang. 2018. A Web-scale system for scientific knowledge exploration. In ACL.
[43] Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. ACL (2016).
[44] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June Paul Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In WWW.
[45] Darin Stewart. 2008. Building Enterprise Taxonomies.
[46] Antonio Toral, Rafael Muñoz, and Monica Monachini. 2008. Named Entity WordNet. In LREC.
[47] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. ArXiv (2018).
[48] Nikhita Vedula, Patrick K. Nicholson, Deepak Ajwani, Sourav Dutta, Alessandra Sala, and Srinivasan Parthasarathy. 2018. Enriching Taxonomies With Functional Domain Knowledge. In SIGIR.
[49] Paola Velardi, Stefano Faralli, and Roberto Navigli. 2013. OntoLearn Reloaded: A Graph-Based Algorithm for Taxonomy Induction. Computational Linguistics (2013).
[50] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In ICLR.
[51] Jingjing Wang, Changsung Kang, Yi Chang, and Jiawei Han. 2014. A hierarchical Dirichlet model for taxonomy expansion for search engines. In WWW.
[52] Julie Weeds, David J. Weir, and Diana McCarthy. 2004. Characterising Measures of Lexical Distributional Similarity. In COLING.
[53] Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu. 2012. Probase: a probabilistic taxonomy for text understanding. In SIGMOD Conference.
[54] Grace Hui Yang. 2012. Constructing Task-Specific Taxonomies for Document Collection Browsing. In EMNLP-CoNLL.
[55] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD.
[56] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec. 2018. Hierarchical Graph Representation Learning with Differentiable Pooling. In NeurIPS.
[57] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay S. Pande, and Jure Leskovec. 2018. Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. In NeurIPS.
[58] Jiaxuan You, Rex Ying, and Jure Leskovec. 2019. Position-aware Graph Neural Networks. In ICML.
[59] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabás Póczos, Ruslan Salakhutdinov, and Alexander J. Smola. 2017. Deep Sets. In NIPS.
[60] Chao Zhang, Fangbo Tao, Xiusi Chen, Jiaming Shen, Meng Jiang, Brian M. Sadler, Michelle T. Vanni, and Jiawei Han. 2018. TaxoGen: Constructing Topical Concept Taxonomy by Adaptive Term Embedding and Clustering. In KDD.
[61] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. 2018. An End-to-End Deep Learning Architecture for Graph Classification. In AAAI.
[62] Xiangling Zhang, Yueguo Chen, Jun Chen, Xiaoyong Du, Ke Wang, and Ji-Rong Wen. 2017. Entity Set Expansion via Knowledge Graphs. In
[63] Yuchen Zhang, Amr Ahmed, Vanja Josifovski, and Alexander J. Smola. 2014. Taxonomy discovery for personalized recommendation. In WSDM.