Heterogeneous Network Representation Learning: Survey, Benchmark, Evaluation, and Beyond

2020·arXiv

Abstract

1 INTRODUCTION

Networks and graphs constitute a canonical and ubiquitous paradigm for the modeling of interactive objects, which has drawn significant research attention from various scientific domains [1], [2], [3], [4], [5], [6]. However, real-world objects and interactions are often multi-modal and multi-typed (e.g., authors, papers, venues and terms in a publication network [7], [8]; users, places, categories and GPS-coordinates in a location-based social network [9], [10], [11]; and genes, proteins, diseases and species in a biomedical network [12], [13]). To capture and exploit such node and link heterogeneity, heterogeneous networks have been proposed and widely used in many real-world network mining scenarios, such as meta-path based similarity search [14], [15], [16], node classification and clustering [17], [18], [19], knowledge base completion [20], [21], [22], and recommendations [23], [24], [25].

In the meantime, current research on graphs has largely focused on representation learning (embedding), especially following the pioneer of neural network based algorithms that demonstrate revealing empirical evidence towards unprecedentedly effective yet efficient graph mining [26], [27], [28]. They aim to convert graph data (e.g., nodes [29], [30], [31], [32], [33], [34], [35], [36], links [37], [38], [39], [40], and subgraphs [41], [42], [43], [44]) into low dimensional distributed vectors in the embedding space where the graph topological information (e.g., higher-order proximity [45], [46], [47], [48] and structure [49], [50], [51], [52]) is preserved. Such embedding vectors are then directly executable by various downstream machine learning algorithms [53], [54], [55].

Right on the intersection of heterogeneous networks and graph embedding, heterogeneous network embedding (HNE) recently has also received significant research attention [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76]. Due to the application favor of HNE, many algorithms have been separately developed in different application domains such as search and recommendations [5], [23], [77], [78]. Moreover, as knowledge bases (KBs) also fall under the general umbrella of heterogeneous networks, many KB embedding algorithms can be compared with the HNE ones [4], [20], [21], [79], [80], [81], [82], [83], [84].

Unfortunately, various HNE algorithms are developed in quite disparate communities across academia and industry. They have never been systematically and comprehensively analyzed either in concepts or through experiments. In fact,

Fig. 1: Different heterogeneous networks constructed from the same real-world application data.

due to the lack of benchmark platforms (with ready-to-use datasets and baselines), researchers often tend to construct their own datasets and re-implement a few most popular (sometimes outdated) algorithms for comparison, which renders fair performance evaluation and clear improvement attribution extremely hard, if not impossible.

Simply consider the toy examples of a publication dataset in Figure 1.1 Earlier HNE algorithms like metap-ath2vec [59] were developed on the heterogeneous network with node types of authors, papers and venues as in (a). However, one can enrich papers with a large number of terms and topics as additional nodes as in (b), which makes the random-walk based shallow embedding algorithms rather inefficient, but favors neighborhood aggregation based deep graph neural networks like R-GCN [67]. Moreover, one can further include node attributes like term embedding and labels like research fields and make them only available to the semi-supervised attributed embedding models, which may introduce even more bias [66], [75], [85], [86]. Eventually, it is often hard to clearly attribute performance gains between technical novelty and data tweaking.

In this work, we first formulate a unified yet flexible mathematical paradigm of HNE algorithms, easing the understanding of the critical merits of each model (Section 2). Particularly, based on a uniform taxonomy that clearly categorizes and summarizes the existing models (and likely future models), we propose a generic objective function of network smoothness, and reformulate all existing models into this uniform paradigm while highlighting their individual novel contributions (Section 3). We envision this paradigm to be helpful in guiding the development of future novel HNE algorithms, and in the meantime facilitate their conceptual contrast towards existing ones.

As the second contribution, we prepare four benchmark heterogeneous network datasets through exhaustive data collection, cleaning, analysis and curation (Section 4).2 The datasets we come up with cover a wide spectrum of application domains (i.e., publication, recommendation, knowledge base, and biomedicine), which have various properties regarding scale, structure, attribute/label availability, etc. This diverse set of data, together with a series of standard network mining tasks and evaluation metrics, constitute a handy and fair benchmark resource for future HNE algo-

rithms.

As the third contribution, many existing HNE algorithms (including some very popular ones) either do not have a flexible implementation (e.g., hard-coded node and edge types, fixed set of meta-paths, etc.), or do not scale to larger networks (e.g., high memory requirement during training), which adds much burden to novel research (i.e., requiring much engineering effort in correct reimplementation). To this end, we focus on 13 popular HNE algorithms, where we carefully refactor and scale up the original implementations and apply additional interfaces for plug-and-run experiments on our prepared datasets (Section 5).2 Based on these ready-to-use and efficient implementations, we then conduct all-around empirical evaluations of the algorithms, and report their benchmark performances. The empirical results, while providing much insight into the merits of different models that are consistent with the conceptual analysis in Section 3, also serve as the example utilization of our benchmark platform that can be followed by future studies on HNE.

Note that, although there have been several attempts to survey or benchmark heterogeneous network models [7], [8], [79], [87] and homogeneous graph embedding [26], [27], [28], [88], [89], [90], none of them has deeply looked into the intersection of the two. We advocate that our unified framework for the research and experiments on HNE is timely and necessary. Firstly, as we will cover in this work, there has been a significant amount of research on the particular problem of HNE especially in the very recent several years, but most of them scatter across different domains, lacking proper connections and comparisons. Secondly, none of the existing surveys has proposed a generic mathematically complete paradigm for conceptual analysis of all HNE models. Thirdly, existing surveys mostly do not provide systematic benchmark evaluation results, nor do they come with benchmark datasets and open-source baselines to facilitate future algorithm development.

The rest of this paper is organized as follows. Section 2 first introduces our proposed generic HNE paradigm. Subsequently, representative models in our survey are conceptually categorized and analyzed in Section 3. We then present in Section 4 our prepared benchmark datasets with detailed analysis. In Section 5, we provide a systematic empirical study over 13 popular HNE algorithms to benchmark the current state-of-the-art of HNE. Section 6 concludes the paper with visions towards future usage of our platform and research on HNE.

2 GENERIC PARADIGM

2.1 Problem Definitions

Definition 2.1. Heterogeneous network. A heterogeneous network is a network with multiple types of nodes and links. Particularly, within H, each node is associated with a node type , and each link is associated with a link type . Each node of type is also potentially associated with attribute , while each link of type with attribute . It is worth noting that the type of a link automatically defines the types of nodes and on its two ends.

Heterogeneous networks have been intensively studied due to its power of accommodating multi-modal multi-typed interconnected data. Besides the classic example of DBLP data used in most existing works as well as Figure 1, consider a different yet illustrative example from NYTimes in Figure 2.3 Nodes in this heterogeneous network include news articles, categories, phrases, locations, and datetimes. To illustrate the power of heterogeneous networks, we introduce the concept of meta-path, which has been leveraged by most existing works on heterogeneous network modeling [7], [8].

Definition 2.2. Meta-path. A meta-path is a path defined on the network schema denoted in the form of

types, respectively.

Each meta-path captures the proximity among the nodes on its two ends from a particular semantic perspective. Continue with our example of heterogeneous network from news data in Figure 2. The meta-path of carries different semantics from arti- . Thus, the leverage of different meta-paths allows heterogeneous network models to compute the multi-modal multi-typed node proximity and relation, which has been shown beneficial to many real-world network mining applications [15], [25], [91].

Next, we introduce the problem of general network embedding (representation learning).

Definition 2.3. Network embedding. For a given network G = {V, E}, where V is the set of nodes (vertices) and E is the set of links (edges), a network embedding is a mapping function , where |V |. This mapping defines the latent representation (a.k.a. embedding) of each node , which captures network topological information in E.

In most cases, network proximity is the major topological information to be captured. For example, DeepWalk [29] captures the random-walk based node proximity and illustrates the 2-dim node representations learned on the famous Zachary’s Karate network of small groups, where a clear correspondence between the node position in the input graph and learned embedding space can be observed. Various follow-up works have improved or extended DeepWalk, while a complete coverage of them is beyond the scope of this work. In this work, we focus on the embedding of heterogeneous networks.

Now we define the main problem of focus in this work, heterogeneous network embedding (HNE), which lies in the intersection between Def 2.1 and Def 2.3.

Definition 2.4. Heterogeneous network embedding. For a given heterogeneous network H, a heterogeneous network embedding is a set of mapping functions , where K is the number of node types, . Each mapping defines the latent representation (a.k.a. embedding) of all nodes of type k, which captures the network topological information regarding the heterogeneous links in E.

Fig. 2: Toy example of a heterogeneous network constructed from the news data.

Compared with homogeneous networks, the definition of topological information in heterogeneous networks is even more diverse. As we will show in Section 3, the major distinctions among different HNE algorithms mostly lie in their different ways of capturing such topological information. Particularly, the leverage of meta-paths as in Def 2.2 often plays an essential role, since many popular HNE algorithms exactly aim to model the different proximity indicated by meta-paths [59], [63], [66], [69], [70], [73], [75], [77].

2.2 Proposed Paradigm

In this work, we stress that one of the most important principles underlying HNE (as well as most other scenarios of network modeling and mining) is homophily [92]. Particularly, in the network embedding setting, homophily can be translated as ‘nodes close on a network should have similar embeddings’, which matches the requirement of Def 2.3. In fact, we further find intrinsic connections between the well-perceived homophily principle and widely-used smoothness enforcement technique on networks [93], [94], [95], which leads to a generic mathematical paradigm covering most existing and likely many future HNE algorithms.

Based on earlier well-established concepts underlying network modeling and embedding learning [93], [94], [95], [96], [97], we introduce the following key objective function of network smoothness enforcement as follows

where and are the node embedding vectors to be learned. is the proximity weight, is the embedding distance function, and denotes possible additional objectives such as regularizers, all three of which can be defined and implemented differently by the particular HNE algorithms.

3 ALGORITHM TAXONOMY

In this section, we find a universal taxonomy for existing HNE algorithms with three categories based on their common objectives, and elaborate in detail how they fit into our paradigm of Eq. (1). The main challenge of instantiating Eq. (1) on heterogeneous networks is the consideration of complex interactions regarding multi-typed links and higher-order meta-paths. In fact, our Eq. (1) also readily generalizes to homogeneous networks, though that is beyond the scope of this work.

3.1 Proximity-Preserving Methods

As mentioned above, one basic goal of network embedding is to capture network topological information. This can be achieved by preserving different types of proximity among nodes. There are two major categories of proximity-preserving methods in HNE: random walk approaches (inspired by DeepWalk [29]) and first/second-order proximity based ones (inspired by LINE [30]). Both types of proximity-preserving methods are considered as shallow network embedding, due to their essential single-layer decomposition of certain affinity matrices [98].

3.1.1 Random Walk Approaches

metapath2vec [59]. Following homogeneous network embedding [29], [31], metapath2vec utilizes the node paths traversed by meta-path guided random walks to model the context of a node regarding heterogeneous semantics. For-

mally, given a meta-path , the transition probability at step i is defined as

p

where denotes the neighbors of v associated with edge type l. Assume is the set of generated random walk sequences. The objective of metapath2vec is

where C(v) is the context (i.e., skip-grams) of v in P. For example, if and the context window size is 2, then . Let be the number of times that , and we can rewrite Eq. (3) as

Calculating the denominator in this objective requires summing over all nodes, which is computationally expensive. In actual computation, it is approximated using negative sampling [30], [99].

HIN2Vec [63]. HIN2Vec considers the probability that there is a meta-path M between nodes u and v. Specifically,

where 1 is an all-ones vector; is the Hadamard product; is a normalization function. Here and can be viewed as the embeddings of u, v and M, respectively. Let . We have

is the sigmoid function, so we have

HIN2Vec generates positive tuples (u, v, M) (i.e., u connects with v via the meta-path M) using homogeneous random walks [29] regardless of node/link types. For each positive tuple (u, v, M), it generates several negative tuples by replacing u with a random node . Its objective is

This is essentially the negative sampling approximation of the following objective

where is the number of path instances between u and v following the meta-path M.

Other random walk approaches are summarized in Table 1. To be specific, MRWNN [57] incorporates content priors into DeepWalk for image retrieval; SHNE [69] incorporates additional node information like categorical attributes, images, etc. by leveraging domain-specific deep encoders; HHNE [73] extends metapath2vec to the hyperbolic space; GHE [19] proposes a semi-supervised meta-path weighting technique; MNE [100] conducts random walks separately for each view in a multi-view network; JUST [65] proposes random walks with Jump and Stay strategies that do not rely on pre-defined meta-paths; HeteSpaceyWalk [101] introduces a scalable embedding framework based on heterogeneous personalized spacey random walks; TapEm [102] proposes a task-guided node pair embedding approach for author identification.

3.1.2 First/Second-Order Proximity Based Approaches

PTE [103]. PTE proposes to decompose a heterogeneous network into multiple bipartite networks, each of which describes one edge type. Its objective is the sum of loglikelihoods over all bipartite networks:

Here is the set of edge types; is the type-l edge weight of (u, v) (if there is no edge between u and v with type l, then is the total edge weight between u and v.

AspEm [60]. AspEm assumes that each heterogeneous network has multiple aspects, and each aspect is defined as a subgraph of the network schema [14]. An incompatibility measure is proposed to select appropriate aspects for embedding learning. Given an aspect a, its objective is

where is the set of edge types in aspect is a normalization factor; is the aspect-specific embedding of u.

HEER [61]. HEER extends PTE by considering typed closeness. Specifically, each edge type l has an embedding , and its objective is

where is the edge embedding of . In [61], has different definitions for directed and undirected edges based on the Hadamard product. To simplify our discussion, we assume . Let . Then we have and

Other first/second-order proximity based approaches are summarized in Table 1. To be specific, Chang et al. [56] introduce a node type-aware content encoder; CMF [58] performs joint matrix factorization over the decomposed bipartite networks; HEBE [62] preserves proximity regarding each meta-graph; Phine [11] combines additional regularization towards semi-supervised training; MVE [104] proposes an attenion based framework to consider multiple views of the same nodes; DRL [68] proposes to learn one type of edges at each step and uses deep reinforcement learning approaches to select the edge type for the next step; GERM [105] adopts a genetic evolutionary approach to preserve the critical edge-types while removing the noisy or incompatible ones given specific tasks; mg2vec [106] jointly embeds nodes and meta-graphs into the same space by exploiting both first-order and second-order proximity.

3.1.3 Unified Objectives

Based on the discussions above, the objective of most proximity-preserving methods can be unified as

Here, is the weight of node pair (u, v) in the objective4; is an algorithm-specific regularization term (summarized in Table 1); s(u, v) is a proximity function between u and v. Negative Sampling. Directly optimizing the objective in Eq. (4) is computationally expensive because it needs to traverse all nodes when computing the softmax function. Therefore, in actual computation, most studies adopt the negative sampling strategy [30], [99], which modifies the objective as follows:

Here, b is the number of negative samples; is known as the noise distribution that generates negative samples.

Negative sampling serves as a generic paradigm to unify network embedding approaches. For example, starting from the negative sampling objective, Qiu et al. [98] unify DeepWalk [29], LINE [30], PTE [103] and node2vec [31] into a matrix factorization framework. In this paper, as mentioned in Section 2.2, we introduce another objective (i.e., Eq. (1)) that is equivalent to Eq. (4) and consider network embedding from the perspective of network smoothness enforcement.

Network Smoothness Enforcement. Note that in most cases, we can write s(u, v) in Eq. (4) as . For example, in metapath2vec, PTE, etc.; in HIN2Vec; in HEER. In these cases,

The last step holds because . Therefore, our goal is equivalent to

Here and are two regularization terms. Without can be minimized by letting ; without can be minimized by letting in Eq. (1) is then the sum of and . Among them, is algorithm-specific, while and are commonly shared across most HNE algorithms in the proximity-preserving group. We summarize the choices of and in Table 1.

Although Eq. (4) can cover most existing and likely many future approaches, we would like to remark that there are also studies adopting other forms of proximity-preserving objectives. For example, SHINE [107] uses reconstruction loss of autoencoders; HINSE [64] adopts spectral embedding based on adjacency matrices with different meta-graphs; MetaDynaMix [108] introduces a low-rank matrix factorization framework for dynamic HIN embedding; HeGAN [72] proposes an adversarial learning approach with a relation type-aware discriminator.

3.2 Message-Passing Methods

Each node in a network can have attribute information represented as a feature vector . Message-passing meth-

node classification,link prediction AspEm [60]

TABLE 1: A summary of proximity-preserving based HNE algorithms. (Additional notations: : a function to encode text/image/attribute information; : distance between two points in the Poincar´e ball; : a function to map two d-dimensional embeddings to one d-dimenional vector.)

ods aim to learn node embeddings based on by aggregating the information from u’s neighbors. In recent studies, Graph Neural Networks (GNNs) [33] are widely adopted to facilitate this aggregation/message-passing process. Compared to the proximity based HNE methods, message-passing methods, especially the GNN based ones are often considered as deep network embedding, due to their multiple layers of learnable projection functions.

R-GCN [67]. R-GCN has K convolutional layers. The initial node representation is just the node feature . In the k-th convolutional layer, each representation vector is updated by accumulating the vectors of neighboring nodes through a normalized sum.

Different from the regular GCN model [33], R-GCN considers edge heterogeneity by learning multiple convolution matrices W ’s, each of which corresponding to one edge type. During message passing, neighbors under the same edge type will be aggregated and normalized first. The node embedding is the output of the K-th layer (i.e., ).

In unsupervised settings, message-passing approaches use link prediction as their downstream task to train GNNs. To be specific, the likelihood of observing edges in the heterogeneous network is maximized. R-GCN optimizes a cross-entropy loss through negative sampling. Essentially, it is the approximation of the following objective:

where .

Recently, CompGCN [109] extends R-GCN by leveraging a variety of entity-relation composition operations so as to jointly embed nodes and relations.

HAN [75]. Instead of considering one-hop neighbors, HAN utilizes meta-paths to model higher-order proximity. Given

TABLE 2: A summary of message-passing based HNE algorithms. (Additional notations: : neighbors of v defined through random walk with restart [113]; : neighbors of u with edge type : neighbors of u with node type o; : nodes connects with u via meta-path : a meta-path instance of M connecting u and : a function to aggregate information from neighbors. We show the transductive version of GATNE.)

a meta-path M, the representation of node u is aggregated from its meta-path based neighbors connects with u via the meta-path M}. HAN proposes an attention mechanism to learn the weights of different neighbors:

where is the node-level attention vector of is the projected feature vector of node u; || is the concatenation operator. Given the meta-path specific embedding , HAN uses a semantic-level attention to weigh different meta-paths:

where q is the semantic-level attention vector.

In the original HAN paper, the authors mainly consider the task of semi-supervised node classification. For unsupervised learning (i.e., without any node labels), according to [112], HAN can use the link prediction loss introduced in GraphSAGE [34], which is the negative sampling approximation of the following objective:

Here, is the edge weight of (u, v).

MAGNN [112]. MAGNN extends HAN by considering both the meta-path based neighborhood connects with u via meta-path M} and the nodes along the meta-path instances. Given a meta-path M, MAGNN first employs an encoder to transform all the node features along an instance of M into a single vector.

where is the projected feature vector of node denotes a meta-path instance of M connecting u and is a relational rotation encoder inspired by [114]. After encoding each meta-path instance, MAGNN proposes an intra-meta-path aggregation to learn the weight of different neighbors:

where is the node-level attention vector of M. This attention mechanism can also be extended to multiple heads. After aggregating the information within each meta-path, MAGNN further combines the semantics revealed by all meta-paths using an inter-meta-path aggregation:

where is the semantic-level attention vector for node type . In comparison with HAN, MAGNN employs an additional projection to get the final representation .

For link prediction, MAGNN adopts the loss introduced in GraphSAGE [34], which is equivalent to the objective in Eq. (6).

HGT [85]. Inspired by the success of Transformer [115], [116] in text representation learning, Hu et al. propose to use

Here, is the output of the k-th HGT layer (); and are node type-aware linear mappings; is the i-th attention head; is a prior tensor representing the weight of each edge type in the attention. Parallel to the calculation of attention, the message passing process can be computed in a similar way by incorporating node and edge types:

where is also a node type-aware linear mapping. To aggregate the messages from v’s neighborhood, the attention vector serves as the weight to get the updated vector:

and following the residual connection [117], the output of the k-th layer is

Here, is a linear function mapping v’s vector back to its node type-specific distribution.

For the unsupervised link prediction task, HGT borrows the objective function from Neural Tensor Network [20], which will be further discussed in Section 3.3.

Some other message-passing approaches are summarized in Table 2. For example, HEP [110] aggregates u’s representation from (i.e., the node type-aware neighborhood) to reconstruct u’s own embedding; HetGNN [71] also adopts a node type-aware neighborhood aggregation, but the neighborhood is defined through random walk with restart [113]; GATNE [70] aggregates u’s representation from (i.e., the edge type-aware neighborhood) and is applicable to both transductive and inductive network embedding settings; MV-ACM [111] aggregates u’s representation from (i.e., the meta-path-aware neighborhood) and utilizes an adversarial learning framework to learn the reciprocity between different edge types.

The objective of message-passing approaches mentioned above can also be written as Eq. (4) (except HGT, whose objective is the same as NTN [20] and will be discussed in Section 3.3), where s(u, v) is a function of and . The only difference is that here is aggregated from using GNNs. Following the derivation of proximity-preserving approaches, if we still write in Eq. (1) as the sum of and , we can get the exactly same and as in Eq. (5). We summarize the choices of , and the aggregation function in Table 2.

Within this group of algorithms, HEP has an additional reconstruction loss , and MV-ACM [111] has an adversarial loss , where G and D are the generator and discriminator of complementary edge types, respectively. All the other algorithms in this group have .

Similar to the case of proximity-preserving approaches, there are also message-passing methods adopting other forms of objectives. For example, GraphInception [66] studies collective classification (where the labels within a group of instances are correlated and should be inferred collectively) using meta-path-aware aggregation; HDGI [86] maximizes local-global mutual information to improve unsupervised training based on HAN; NEP [74] performs semi-supervised node classification using an edge type-aware propagation function to mimic the process of label propagation; NLAH [124] also considers semi-supervised node classification by extending HAN with several precomputed non-local features; HGCN [125] explores the collective classification task and improves GraphInception by considering semantics of different types of edges/nodes and relations among different node types; HetETA [126] studies a specific task of estimating the time of arrival in intelligent transportation using a heterogeneous graph network with fast localized spectral filtering.

3.3 Relation-Learning Methods

As discussed above, knowledge graphs can be regarded as a special case of heterogeneous networks, which are schema-rich [127]. To model network heterogeneity, existing knowledge graph embedding approaches explicitly model the relation types of edges via parametric algebraic operators, which are often shallow network embedding [79], [128]. Compared to the shallow proximity-preserving HNE models, they often focus on the designs of triplet based scoring functions instead of meta-paths or meta-graphs due to the large numbers of entity and relation types.

Each edge in a heterogeneous network can be viewed as a triplet (u, l, v) composed of two nodes and an edge type (i.e., entities and relations, in the terminology of KG). The goal of relation-learning methods is to learn a scoring function which evaluates an arbitrary triplet and outputs a scalar to measure the acceptability of this triplet. This idea is widely adopted in KB embedding. Since there are surveys of KB embedding algorithms already [79], we only cover the most popular approaches here and highlight their connections to HNE.

TransE [4]. TransE assumes that the relation induced by l-labeled edges corresponds to a translation of the embedding (i.e., ) when (u, l, v) holds. Therefore, the scoring function of TransE is defined as

where p = 1 or 2. The objective is to minimize a margin based ranking loss.

TABLE 3: A summary of relation-learning based HNE algorithms. (Additional notations: f: a non-linear activation function; : the Frobenius norm of a matrix; : the fast Fourier transform and its inverse; x: the complex conjugate of : the inverse of relation : the tensor product along the i-th mode; : the set of learned parameters; : 2D reshaping matrices of and [82]; vec: the vectorization operator; : the convolution operator; : a matrix aligning the output vectors from the convolution with all kernels [84].)

where T is the set of positive triplets (i.e., edges); is the set of corrupted triplets, which are constructed by replacing either u or v with an arbitrary node. Formally,

TransE is the most representative model using “a translational distance” to define the scoring function. It has many extensions. For example, TransH [118] projects each entity vector to a relation-specific hyperplane when calculating the distance; TransR [80] further extends relation-specific hyperplanes to relation-specific spaces; RHINE [76] distinguishes affiliation relations from interaction relations and adopts different objectives for the two types of relations. For more extensions of TransE, please refer to [83].

Recently, Sun et al. [114] propose the RotatE model, which defines each relation as a rotation (instead of a translation) from the source entity to the target entity in the complex vector space. Their model is able to describe various relation patterns including symmetry/antisymmetry, inversion, and composition.

DistMult [81]. In contrast to translational distance models [4], [80], [118], DistMult exploits a similarity based scoring function. Each relation is represented by a diagonal matrix , and is defined using a bilinear function:

Note that for any u and v. Therefore, DistMult is mainly designed for symmetric relations.

Using the equation , we have

ComplEx [121]. Instead of considering a real-valued embedding space, ComplEx introduces complex-valued representations for and . Similar to DistMult, it utilizes a similarity based scoring function:

where is the real part of a complex number; e is the complex conjugate of . Here, it is possible that , which allows ComplEx to capture asymmetric relations.

In the complex space, using the equation , we have

Besides ComplEx, there are many other extensions of DistMult. RESCAL [119] uses a bilinear scoring function similar to the one in DistMult, but is no longer restricted to be diagonal; HolE [120] composes the node representations using the circular correlation operation, which combines the expressive power of RESCAL with the efficiency of DistMult; SimplE [122] considers the inverse of relations and calculates the average score of (u, l, v) and ; TuckER [123] proposes to use a three-way Tucker tensor decomposition approach to learning node and relation embeddings.

ConvE [82]. ConvE goes beyond simple distance or similarity functions and proposes deep neural models to score a triplet. The score is defined by a convolution over 2D shaped embeddings. Formally,

where and denote the 2D reshaping matrices of node embedding and relation embedding, respectively; vec is the vectorization operator that maps a m by n matrix to a mn-dimensional vector; “” is the convolution operator.

There are several other models leveraging deep neural scoring functions. For example, NTN [20] proposes to combine the two node embedding vectors by a relation-specific tensor , where is the dimension of ; NKGE [83] develops a deep memory network to encode information from neighbors and employs a gating mechanism to integrate structure representation and neighbor representation ; SACN [84] encodes node representations using a graph neural network and then scores the triplet using a convolutional neural network with the translational property.

Most relation-learning approaches adopt a margin based ranking loss with some regularization terms that generalizes Eq. (8):

In [129], Qiu et al. point out that the margin based loss shares a very similar form with the following negative sampling loss:

Following [129], if we use the negative sampling loss to rewrite Eq. (9), we are approximately maximizing

For translational models [4], [76], [80], [118] whose is described by a distance function, maximizing J is equivalent to

In this case, we can write as . For RotatE [114], the objective is the same except that .

For similarity based approaches [81], [119], [121], [122], we follow the derivation of Eq. (5), and the objective will be

Here, in RESCAL [119] and DistMult [81]; or in SimplE [122]; and in ComplEx [121]. The regularization term is .

For neural triplet scorers [20], [21], [82], [84], the forms of are more complicated than translational distances or bilinear products. In these cases, since distance (or dissimilarity) and proximity can be viewed as reverse metrics, we define as , where C is an constant upper bound of . Then the derivation of the loss function is similar to that of Eq. (10), i.e.,

In this case, .

We summarize the choices of and in Table 3. Note that for relation learning methods, is usually not a distance metric. For example, in most translational distance models and deep neural models. This is intuitive because (u, l, v) and (v, l, u) often express different meanings.

4 BENCHMARK

4.1 Dataset Preparation

Towards off-the-shelf evaluation of HNE algorithms with standard settings, in this work, we collect, process, analyze, and publish four new real-world heterogeneous network datasets from different domains, which we aim to set up as a handy and fair benchmark for existing and future HNE algorithms.2

DBLP. We construct a network of authors, papers, venues, and phrases from DBLP. Phrases are extracted by the popular AutoPhrase [130] algorithm from paper texts and further filtered by human experts. We compute word2vec [99] on all paper texts and aggregate the word embeddings to get 300-dim paper and phrase features. Author and venue features are the aggregations of their corresponding paper features. We further manually label a relatively small portion of authors into 12 research groups from four research areas by crawling the web. Each labeled author has only one label.

Fig. 3: Portion of different node types in four real-world heterogeneous network datasets.

Fig. 4: Degree distribution of different node types in four real-world heterogeneous network datasets.

Fig. 5: Local clustering coefficient of different node types in four real-world heterogeneous network datasets.

Fig. 6: Number of five most frequent 2-hop meta-paths in four real-world heterogeneous network datasets.

Yelp. We construct a network of businesses, users, locations, and reviews from Yelp.5 Nodes do not have features, but a large portion of businesses are labeled into sixteen categories. Each labeled business has one or multiple labels.

Freebase. We construct a network of books, films, music, sports, people, locations, organizations, and businesses from Freebase.6 Nodes are not associated with any features, but a large portion of books are labeled into eight genres of literature. Each labeled book has only one label.

PubMed. We construct a network of genes, diseases, chemicals, and species from PubMed.7 All nodes are extracted by AutoPhrase [130], typed by bioNER [131], and further filtered by human experts. The links are constructed through open relation pattern mining [132] and manual selection. We compute word2vec [99] on all PubMed papers and aggregate the word embeddings to get 200-dim features for all types of nodes. We further label a relatively small portion of diseases into eight categories. Each labeled disease has only one label.

The four datasets we prepare are from four different domains, which have been individually studied in some existing works on HNE. Among them, DBLP has been most commonly used, because all information about authors, papers, etc. is public and there is no privacy issue, and the results are often more interpretable to researchers in computer science related domains. Other real-world networks like Yelp and IMDB have been commonly studied for recommender systems. These networks are naturally heterogeneous, including at least users and items (e.g., businesses and movies), as well as some additional item descriptors (e.g., categories of businesses and genres of movies). Freebase is one of the most popular open-source knowledge graph, which is relatively smaller but cleaner compared with the others (e.g., YAGO [133] and Wikidata [134]), where most entity and relation types are well defined. One major difference between conventional heterogeneous networks and knowledge graphs is the number of types of nodes and links. We further restrict the types of entities and relations inside Freebase, so as to get a heterogeneous network that is closer to a knowledge graph, while in the meantime does not have too many types of nodes and links. Therefore, most conventional HNE algorithms can be applied on this dataset and properly compared against the KB embedding ones. PubMed is a novel biomedical network we directly construct through text mining and manual processing on biomedical literature. This is the first time we make it available to the public, and we hope it to serve both the evaluation of HNE algorithms and novel downstream tasks in biomedical science such as biomedical information retrieval and disease evolution study.

We notice that several other heterogeneous network datasets such as OAG [85], [87], [135] and IMDB [75], [112], [125] have been constructed recently in parallel to this work. Due to their similar nature and organization as some of our datasets (e.g., DBLP, Yelp), our pipeline can be easily adopted on these datasets, so we do not copy them here.

4.2 Structure Analysis

A summary of the statistics on the four datasets is provided in Table 4 and Figure 3. As can be observed, the datasets have different sizes (numbers of nodes and links) and heterogeneity (numbers and distributions of node/link types). Moreover, due to the nature of the data sources, DBLP and PubMed networks are attributed, whereas Yelp and Freebase networks are abundantly labeled. A combination of these four datasets thus allows researchers to flexibly start by testing an HNE algorithm in the most appropriate settings, and eventually complete an all-around evaluation over all settings.

We also provide detailed analysis regarding several most widely concerned properties of heterogeneous networks, i.e., degree distribution (Figure 4), clustering coefficient (Figure 5), and number of frequent meta-paths (Figure 6). In particular, degree distribution is known to significantly influence the performance of HNE algorithms due to the widely used node sampling process, whereas clustering coefficient impacts HNE algorithms that utilize latent community structures. Moreover, since many HNE algorithms rely on meta-paths, the skewer distribution of meta-paths can bias towards algorithms using fewer meta-paths.

As we can see, the properties we concern are rather different across the four datasets we prepare. For example, there are tighter links and more labels in Yelp, while there are more types of nodes and links in Freebase; compared with nodes in Freebase and PubMed which clearly follow the long-tail degree distribution, certain types of nodes in DBLP and Yelp are always well connected (e.g., phrases in DBLP and businesses in Yelp), forming more star-shaped subgraphs; the type-wise clustering coefficients and meta-path distributions are the most skewed in DBLP and most balanced in PubMed. The set of four datasets together provide a comprehensive benchmark towards the robustness and generalizability of various HNE algorithms (as we will also see in Section 5.2).

4.3 Settings, Tasks, and Metrics

We mainly compare all 13 algorithms under the setting of unsupervised unattributed HNE over all datasets, where the essential goal is to preserve different types of edges in the heterogeneous networks. Moreover, for message-passing algorithms that are particularly designed for attributed and semi-supervised HNE, we also conduct additional experiments for them in the corresponding settings. Particularly, due to the nature of the datasets, we evaluate attributed HNE on DBLP and PubMed datasets where node attributes are available, and semi-supervised HNE on Yelp and Freebase where node labels are abundant. We always test the computed network embeddings on the two standard network mining tasks of node classification and link prediction. Note that, while most existing studies on novel HNE algorithms have been focusing on these two standard tasks [59], [61], [63], [75], [85], [103], [112], we notice that there are also various other tasks due to the wide usage of heterogeneous networks in real-world applications [11], [126], [136], [137], [138], [139], [140]. While the performance of HNE there can be rather task-dependant and data-dependant, to firstly simplifying the task into either a standard node classification or link prediction problem can often serve to provide more insights into the task and dataset, which can help the further development of novel HNE algorithms.

For the standard unattributed unsupervised HNE setting, we first randomly hide 20% links and train all HNE algorithms with the remaining 80% links. For node classifi-cation, we then train a separate linear Support Vector Machine (LinearSVC) [141] based on the learned embeddings on 80% of the labeled nodes and predict on the remaining 20%. We repeat the process for five times and compute the average scores regarding macro-F1 (across all labels) and micro-F1 (across all nodes). For link prediction, we use the Hadamard function to construct feature vectors for node pairs, train a two-class LinearSVC on the 80% training links and evaluate towards the 20% held out links. We also repeat the process for five times and compute the two metrics of AUC (area under the ROC curve) and MRR (mean reciprocal rank). AUC is a standard measure for classification, where we regard link prediction as a binary classification problem, and MRR is a standard measure for ranking, where we regard link prediction as a link retrieval problem. Since exhaustive computation over all node pairs is too heavy, we always use the two-hop neighbors as the candidates for all nodes. For attributed HNE, node features are used during the training of HNE algorithms, whereas for semi-supervised HNE, certain amounts of node labels are used (80% by default).

5 EXPERIMENTAL EVALUATIONS

5.1 Algorithms and Modifications

We amend the implementations of 13 popular HNE algorithms for seamless and efficient experimental evaluations on our prepared datasets. The algorithms we choose and the modifications we make are as follows.

• metapath2vec [59]: Since the original implementation contains a large amount of hard-coded data-specific settings such as node types and meta-paths, and the optimization is unstable and limited as it only examines one type of meta-path based context, we completely reimplement the algorithm. In particular, we first run random walks to learn the weights of different meta-paths based on the number of sampled instances, and then train the model using the unified loss function, which is a weighted sum over the loss functions of individual meta-paths. Both the random walk and meta-path-based embedding optimization are implemented with multithreads in parallel.

• PTE [103]: Instead of accepting labeled texts as input and working on text networks with the specific three types of nodes (word, document, and label) and three types of links (word-word, document-word, and label-word), we revise the original implementation and allow the model to consume heterogeneous networks directly with arbitrary types of nodes and links by adding more type-specific objective functions.

• HIN2Vec [63]: We remove unnecessary data preprocessing codes and modify the original implementation so that the program first generates random walks, then trains the model, and finally outputs node embeddings only.

• AspEm [60]: We clean up the hard-coded data-specific settings in the original implementation and write a script to connect the different components of automatically selecting the aspects with the least incompatibilities, as well as learning, matching, and concatenating the embeddings based on different aspects.

• HEER [61]: We remove the hard-coded data-specific settings and largely simplify the data preprocessing step in the original implementation by skipping the knockout step and disentangling the graph building step.

• R-GCN [67]: The existing implementation from DGL [142] is only scalable to heterogeneous networks with thousands of nodes, due to the requirement of putting the whole graphs into memory during graph convolutions. To scale up R-GCN, we perform fixed-sized node and link sampling for batch-wise training following the framework of GraphSAGE [34].

• HAN [75]: Since the original implementation of HAN contains a large amount of hard-coded data-specific settings such as node types and meta-paths, and is unfeasible for large-scale datasets due to the same reason as R-GCN, we completely reimplement the HAN algorithm based on our implementation of R-GCN. In particular, we first automatically construct meta-path based adjacency lists for the chosen node type, and then sample the neighborhood for the seed nodes during batch-wise training.

• MAGNN [112]: The original DGL-based [142] unsupervised implementation of MAGNN solely looks at predicting the links between users and artists in a music website dataset with the use of a single neural layer. Hence, we remove all the hard-coded data-specific settings and refactor the entire pipeline so that the model can support multi-layer mini-batch training over arbitrary link types.

• HGT [85]: The original PyG-based [143] implementation of HGT targets on the specific task of author disambiguation among the associated papers in a dynamic academic graph [135]. Therefore, we refactor it by removing hard-coded data-specific settings, assigning the same timestamp to all the nodes, and conducting training over all types of links.

• TransE [4]: We modify the OpenKE [144] implementation so that the model outputs node embeddings only.

• DistMult [81]: We remove the hard-coded data-specific settings and largely simplify the data preprocessing step in the original implementation.

• ComplEx [121]: Same as for TransE.

• ConvE [82]: Same as for DistMult.

We set the embedding size of all algorithms to 50 by default, and tune other hyperparameters following the original papers through standard five-fold cross validation on all datasets. We have put the implementation of all compared algorithms in a python package and released them together with the datasets to constitute an open-source ready-to-use HNE benchmark.2

5.2 Performance Benchmarks

We provide systematic experimental comparisons of the 13 popular state-of-the-art HNE algorithms across our four datasets, on the scenarios of unsupervised unattributed HNE, attributed HNE, and semi-supervised HNE.

TABLE 5: Performance comparison (%) under the standard setting of unattributed unsupervised HNE.

Fig. 7: Performance comparison under controlled experiments with varying emb. sizes and link removals (PubMed).

Fig. 8: Performance comparison under controlled experiments with varying label amount and attr. noise (PubMed).

Table 5 shows the performance of compared algorithms on unsupervised unattributed HNE, evaluated towards node classification and link prediction. We have the following observations.

From the perspective of compared algorithms:

(1) Proximity-preserving algorithms often perform well on both tasks under the unsupervised unattributed HNE setting, which explains why proximity-preserving is the most widely used HNE or even general network embedding framework when node attributes and labels are unavailable. Among the proximity-preserving methods, HIN2Vec and HEER show reasonable results on link prediction but perform not so well on node classification (especially on DBLP and Freebase). In fact, these two methods focus on modeling link representations in their objectives (in HIN2Vec and in HEER), thus are more suitable for link prediction.

(2) Under the unsupervised unattributed HNE setting, message-passing methods perform poorly except for HGT, especially on node classification. As we discuss before, message-passing methods are known to excel due to their integration of node attributes, link structures, and training labels. When neither of node attributes and labels are avail-

able, we use random vectors as node features and adopt a link prediction loss, which largely limits the performance of R-GCN, HAN, and MAGNN. We will focus our evaluation on the message-passing algorithms in the attributed and semi-supervised HNE settings later. On the contrary, HGT exhibits competitive results on both node classification and link prediction. This is attributed to the usage of nodetype and link-type dependent parameters which maintains dedicated representations for different types of nodes and links. In addition, the heterogeneous mini-batch graph sampling algorithm designed by HGT further reduces the loss of structural information due to sampling and boosts the performance to a greater extent. Finally, the link prediction result of HAN on the Yelp dataset is not available. This is because HAN can only embed one type of nodes at a time (we embed Business in Yelp) and thus predict the link between two nodes with the same type (i.e., BusinessBusiness). However, all links in Yelp connect distinct types of nodes (e.g., Business-Location, Business-User), and HAN cannot predict such links (thus marked as N/A).

(3) Relation-learning methods such as TransE and ComplEx perform better on Freebase and PubMed on both tasks, especially on link prediction. In fact, in Table 4 and Figure

3 we can observe that both datasets (especially Freebase) have more link types. Relation-learning approaches, which are mainly designed to embed knowledge graphs (e.g., Freebase), can better capture the semantics of numerous types of direct links. From the perspective of datasets: (1) All approaches have relatively low F1 scores on Yelp

and PubMed (especially Yelp) on node classification. This is

because both datasets have larger numbers of classes (i.e., 16 in Yelp and 8 in PubMed) as shown in Table 4. Moreover, unlike the cases of the other datasets, a node in Yelp can have multiple labels, which makes the classification task more challenging. (2) In Figure 4, we can observe that the degree distri-

bution of Freebase is more skewed. Therefore, when we

conduct link sampling or random walks on Freebase during representation learning, nodes with lower degrees will be sampled less frequently and their representations may not be learned accurately. This observation may explain why the link prediction metrics on Freebase are in general lower than those on Yelp and PubMed. (3) As we can see in Figure 3-6, most studied network

properties are more balanced on Freebase and PubMed

(especially PubMed) across different types of nodes and links. This in general makes both the node classification and link prediction tasks harder for all algorithms, and also makes the gaps among different algorithms smaller.

5.3 Ablation Studies

To provide an in-depth performance comparison among various HNE algorithms, we further conduct controlled experiments by varying the embedding sizes and randomly removing links from the training set.

In Figure 7, we show the micro-F1 scores for node classi-fication and AUC scores for link prediction computed on the PubMed dataset. We omit the other results here, which can be easily computed in our provided benchmark package. As we can observe, some algorithms are more robust to varying settings while some others are more sensitive. In general, varying embedding size and link removal can significantly impact the performance of most algorithms on both tasks, and sometimes can even lead to different ordering of certain algorithms. This again emphasizes the importance of setting up standard benchmark including datasets and evaluation protocols for systematic HNE algorithm evaluation. In particular, on PubMed, larger embedding sizes like over 50 can harm the performance of most algorithms especially on node classification, probably due to overfitting with the limited labeled data. Interestingly, the random removal of links does have a negative impact on link prediction, but it does not necessarily harm node classification. This means that node classes and link structures may not always be tightly correlated, and even parts of the links already provide the necessary information useful enough for node classification.

Towards the evaluation of the message-passing HNE algorithms designed to integrate node attributes and labels into representation learning like R-GCN, HAN, MAGNN, and HGT, we also conduct controlled experiments by adding random Gaussian noises to the node attributes and masking different amounts of training labels.

In Figure 8, we show the results on the PubMed dataset. As we can observe, the scores in most subfigures are sig-nificantly higher than the scores in Table 5, indicating the effectiveness of R-GCN, HAN, MAGNN, and HGT in integrating node attributes and labels for HNE. In particular, the incorporation of node attributes boosts the node classifica-tion results of R-GCN, HAN, and MAGNN significantly (almost tripling the F1 scores and significantly higher than all algorithms that do not use node attributes), but it offers very little help to HGT. This suggests that R-GCN, HAN, and MAGNN can effectively leverage the semantic information associated with attributes, whereas HGT relies more on the network structures and type information of nodes and links. Moreover, MAGNN achieves the highest node classification results with large margins as it successfully incorporates the attributes of intermediate nodes along the meta-paths (which are ignored by HAN). In addition, when random noises with larger variances are added to node attributes, the performance of node classification significantly drops, while the performance of link prediction is less affected. As more training labels become available, without a surprise, the node classification results of all four algorithms increase, but surprisingly the link prediction results are almost not affected. These observations again reveal the different natures of the two tasks, where node classes are more related to node contents, whereas links should be typically inferred from structural information.

6 FUTURE

In this work, we present a comprehensive survey on various existing HNE algorithms, and provide benchmark datasets and baseline implementations to ease future research in this direction. While HNE has already demonstrated strong performance across a variety of downstream tasks, it is still in its infancy with many open challenges. To conclude this work and inspire future research, we now briefly discuss the limitation of current HNE and several specific directions potentially worth pursuing.

Beyond homophily. As we formulate in Eq. (1), current HNE algorithms focus on the leverage of network homophily. Due to recent research on homogeneous networks that study the combination of positional and structural embedding [49], [145], it would be interesting to explore how to generalize such design principles and paradigms to HNE. Particularly, in heterogeneous networks, relative positions and structural roles of nodes can both be measured under different meta-paths or meta-graphs, which are naturally more informative and diverse. However, such considerations also introduce harder computational challenges.

Beyond accuracy. Most, if not all, existing research on HNE has primarily focused on the accuracy towards different downstream tasks. It would be interesting to further study the scalability and efficiency (for large-scale networks) [34], [146], temporal adaptability (for dynamic evolving networks) [147], [148], robustness (towards adversarial attacks) [32], [149], interpretability [150], uncertainty [151], fairness [152], [153] of HNE, and so on.

Beyond node embedding. Graph- and subgraph-level embeddings have been intensively studied on homogeneous networks to enable graph-level classification [43], [154] and unsupervised message-passing model training [36], [155], but they are hardly studied on heterogeneous networks [156]. Although existing works like HIN2Vec [63] study the embedding of meta-paths to improve the embedding of nodes, direct applications of graph- and subgraph-level embeddings in the context of heterogeneous networks largely remain nascent.

Revisiting KB embedding. The difference between KB embedding and other types of HNE is mainly due to the numbers of node and link types. Direct application of KB embedding to heterogeneous networks fails to consider meta-paths with rich semantics, whereas directly applying HNE to KB is unrealistic due to the exponential number of meta-paths. However, it would still be interesting to study the intersection between these two groups of methods (as well as two types of data). For example, how can we combine the ideas of meta-paths on heterogeneous networks and embedding transformation on KB for HNE with more semantic-aware transformations? How can we devise truncated random walk based methods for KB embedding to include higher-order relations?

Modeling heterogeneous contexts. Heterogeneous networks mainly model different types of nodes and links. However, networks nowadays are often associated with rich contents, which provide contexts of the nodes, links, and subnetworks [157], [158], [159]. There have been studies exploiting text-rich heterogeneous information networks for taxonomy construction [160], [161] and text classification [162], [163]. Meanwhile, how to model heterogeneous interactions under multi-faceted contexts through the integration of multi-modal content and structure could be a challenging but rewarding research area.

Understanding the limitation. While HNE (as well as many neural representation learning models) has demonstrated strong performance in various domains, it is worthwhile to understand its potential limits. For example, when do modern HNE algorithms work better compared with traditional network mining approaches (e.g., path counting, subgraph matching, non-neural or linear propagation)? How can we join the advantages of both worlds? Moreover, while there has been intensive research on the mathematical mechanisms behind neural networks for homogeneous network data (e.g., smoothing, low-pass filtering, invariant and equivariant transformations), by unifying existing models on HNE, this work also aims to stimulate further theoretical studies on the power and limitation of HNE.

ACKNOWLEDGEMENT

Special thanks to Dr. Hanghang Tong for his generous help in the polish of this work. Research was sponsored in part by the U.S. Army Research Laboratory under Cooperative Agreement No. W911NF-09-2-0053 and W911NF-13-1-0193, DARPA No. W911NF-17-C-0099 and FA8750-19-2-1004, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA HDTRA11810026, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative.