b

DiscoverSearch
About
My stuff
DSSLP: A Distributed Framework for Semi-supervised Link Prediction
2020·arXiv
Abstract
Abstract

Link prediction is widely used in a variety of industrial applications, such as merchant recommendation, fraudulent transaction detection, and so on. However, it’s a great challenge to train and deploy a link prediction model on industrial-scale graphs with billions of nodes and edges. In this work, we present a scalable and distributed framework for semi-supervised link prediction problem (named DSSLP), which is able to handle industrial-scale graphs. Instead of training model on the whole graph, DSSLP is proposed to train on the k-hops neighborhood of nodes in a mini-batch setting, which helps reduce the scale of the input graph and distribute the training procedure. In order to generate negative examples effectively, DSSLP contains a distributed batched runtime sampling module. It implements uniform and dynamic sampling approaches, and is able to adaptively construct positive and negative examples to guide the training process. Moreover, DSSLP proposes a model-split strategy to accelerate the speed of inference process of the link prediction task. Experimental results demonstrate that the effectiveness and efficiency of DSSLP in serval public datasets as well as real-world datasets of industrial-scale graphs.

Index Terms—link prediction, graph neural networks, graph embedding

In recent years, graph neural network (GNN), which apply deep neural network to learning representation in graph data, drew lots of attention in the machine learning and data mining community. Plenty of literatures demonstrate the effectiveness of applying GNN in solving link prediction problem [43], which means to predict whether two nodes in a graph are likely to have a link. As graph data exists ubiquitously, the link prediction problem has many applications, such as recommendation, fraud detection, knowledge graph completion, etc.

However, most of GNN-based link prediction models are performed within graphs with millions of nodes and tens of millions of edges, in a single machine. There is few literature which focuses on applying GNN-based link prediction models to industral-scale graphs with hundreds of millions of nodes

image

as well as billions of edges, which exist in lots of scenarios among industrial community. Take Ant Financial 1 as an example, there are hundreds of millions customers who pay with Alipay 2 everyday. Billions of transactions are made each month, which form a very-large-scale transaction graph, containing hundreds of millions of nodes (i.e., customers and merchants) and billions of edges (i.e., transactions).

Many applications, such as personalized merchant recommendation, fraudulent transaction detection, etc., can be formulated as the link prediction problems. Take personalized merchant recommendation as an example, transaction between users and merchants can naturally form a user-merchant bipartite graph (i.e., users/merchants as nodes and transactions as edges). Hence, personalized merchant recommendation can be formulated as a link prediction problem, which aims to predict whether a certain user would have transactions with a certain merchant in the future, based on the user-merchant transaction graph.

In this paper, we focus on how to build a scalable and distributed GNN-based link prediction model which can be applied to the industrial-scale graphs. This topic is challenged from three aspects.

First, scalability. The proposal need to handle industrial-scale graph data, which may contain hundreds of millions of nodes and billions of edges, as well as the abundant attribute information of nodes and edges. In the other hand, to our best knowledge, there is few literature which mentions about link prediction on industrial settings.

Second, distributed runtime sampling. To handle industrial-scale graph data, the proposal should perform distributed model training. As most conventional GNN-based linkprediction models apply various sampling techniques to generate negative samples [2], [21], our proposal should implement effective runtime sampling in the distributed environment.

Last, effectiveness of inference. Different from node prediction (i.e., classification or regression of nodes), the inference procedure of link prediction usually need to perform in much more larger datasets, since the upper bound of the number of links in a graph with N nodes is  O(N 2), while O(N) for nodes. Therefore, the effectiveness of inference becomes a challenge when facing an industrial-scale graph.

In this paper, we present a GNN-based Distributed framework for Semi-Supervised Link Prediction problem, named DSSLP, which is able to perform model training and inference effectively in an industrial-scale graph with billions of nodes and edges. First, by avoiding training in the whole graph, we proposed the k-hops neighborhood training setting. As the representation of a certain node is learned by aggregating the information of its k-hops neighborhood, our proposal perform model training in a distributed, mini-batch setting, while each batch contains the subgraph formed by k-hops neighborhoods of all nodes in the batch. Second, we implement batched runtime sampling, which can perform uniform sampling and dynamic sampling, to generate negative samples. The proposed framework also contains a high-performance inference modules. It reduce the upper bound of computation complexity of node representation inference from  O(N 2)to O(V ) by model splitting technique.

The rest of this paper is organizing as follow:

Link prediction is a long-standing problem that attracts attention from both physical and computer science communities [8]–[13], [47]. Various of techniques, such as statistic models, matrix factorization, graph neural networks, and so on, have been applied to this problem. In this section, we will address the most related works that motivate us a lot.

One kind of popular approaches is to evaluate node similarity based on structural information of the graph. The basic idea behind is that the more commonality two nodes share in terms of topology, the more likely a link will exist between them. As a result, early heuristic methods are designed based on different statistics about network structure like degree [14], [18], common neighbors [15]–[17], connected path [19]. Although works well in practice, the performance of these heuristic methods may vary a lot as they usually have strong assumptions on when links exist [2], [25]. Furthermore, researchers also develop algorithms, called network embedding, to encode each node of a graph into a low-dimensional vector, and use a certain distance to measure the existence probability of links between node pairs. A variety of network embedding techniques, such as matrix factorization [20], [27] , random walk based methods [11], [12], [22] are engaged into this filed and lead to a superior performance in a number of applications [23], [24]. However, evaluating node similarity only based on structural information may be limited, especially for nodes only sparsely connected [21], or for graphs with abundant attribute information. Therefore, they meet their bottleneck when applied to real-world problems [21].

It has been a trend that the link prediction problem refers to not only the structural information, but also node/edge attributes, as the attribute information may also play an important role in deciding whether a node will interact with others in some real-world graphs [26]. [21] proposes a model linearly combining structural information and attribute information, while [2], [44] directly encodes structural and attribute information into a low-dimensional feature vector based on graph neural networks [3]–[6]. [28] and [29] state that combining the structural information and attribute information in a graph helps improve performance. The main drawback of those methods is scalability, which prevents them from being widely applied to real-world problems.

With the widely use in the industrial applications, these link prediction methods are challenged for their scalability and efficiency. Several works aims to alleviate this problem for network embedding methods. [46] incorporate several modifications to traditional multi-relation embedding systems and present PBG, which is able to handle graphs containing over 100 million nodes and 2 billion edges. However, few of work is proposed to leverage both structural and attribute information for link prediction atop the industrial-scale graphs with billions of nodes and edges.

In this section, we will define some notations which will be used in the following sections, as well as describe some algorithmic details of a link prediction model.

A. Notations

Generally, a graph can be defined as G = (V, E) with associated adjacency matrix A, where V is the set of nodes viand E is the set of edges  (vi, vj), while  Ai,j = 1means that there is an edge between node  viand node  vj. Moreover, many real-world graphs contain node attribute information, which results in the concept of attributed graph (or network) [45]. A real-valued matrix  X ∈ R|V |×mis defined to represent m-dimensional node attributes [1]. In attributed graphs, a subgraph not only contains the structural information, but also contains the attributed information.

Here we will introduce a concept named k-hops neighborhood. The k-hops neighborhood of node  vi(denoted as  N (k)vi) in an attributed graph G is defined as a subgraph induced from G by a set of nodes  {vj|d(vi, vj) ≤ k, vj ∈ V }, where d(vi, vj)denotes the length of shortest path from node  vito node  vj.

B. Node Embedding

Node embedding (or called graph embedding) methods aim to encode nodes into low-dimensional vectors, which is able to preserve some kinds of graph properties. One kind of popular methods to learn a deep encoder is to update parameters by aggregating information from the local neighborhood of a node in an iterative fashion [3]–[6]. Hence, the embedding of node viin (k + 1)-th hidden layer is denoted as:

image

where  φ(·)is an aggregator function (i.e., mean, sum and max pooling operator, and so on) defined on the set of k-th hidden layer embeddings of its 1-hop neighborhood,  W(k)denotes the trainable parameters of k-th hidden layer, and  h(0)i = Xi. By stacking the aggregating operator K times, each node embedding is able to aggregate information from its K-hops neighborhood.

C. GNN-based Semi-supervised Link Prediction Model

Given a snapshot of a graph, the link prediction problem aims to accurately predict the links (edges) that will be added to the graph in the future. It naturally follows a semi-supervised setting, in which existing links in the graph are treated as positive examples, while the negative examples can be generated by different sampling strategies.

Formally, from an encoder-decoder perspective [1], the existence score of a link between node  viand node  vjis defined as:

image

where  fenc : V → Rdis a function that maps node set V to d-dimensional embedding space  Rd, while  fdec : Rd × Rd → Ris a pairwise decoder that maps pairs of node embeddings to a real-valued edge score measurement.

Usually, the encoder  fencand decoder  fdeccan be learned by optimizing an emprical reconstruction loss:

image

where  G(vi, vj)is a user-specified pairwise proximity measure. Specially,  G(vi, vj) = 1represents that there is a link between node  viand  vj, while  G(vi, vj) = 0indicates no link exists. Meanwhile,  ℓ(·)represents a user-specified loss function (e.g.  L2distance) measuring the discrepancy between estimated value  S(vi, vj)and the ground truth  G(vi, vj), and D is a dataset for model training.

In the GNN-based link prediction model, a deep encoder with K layers (i.e., Equation 1) is applied to generate node embeddings. The decoder  fdec(·)here can be dot product or euclidean distance of two node embeddings, or a classifier (i.e., logistic regression and multilayer perceptron) defined on the concatenation of two node embeddings. That is

image

S(vi, vj)is used to measure the existence score of an edge between nodes  viand  vj. The larger the score  S(vi, vj)is, the more likely that an edge will appear between node  viand vj. To that end, we penalize a margin-based ranking loss [31], [32] to guide the learning process:

image

where  (vi, vj) ∈ Erepresents a positive edge while  ( ´vi, ´vj)is a negative example sampled from a certain distribution P. In our model, positive and negative edges are sampled according to a set of sampling strategies, which will be detailed in the following section.  λis a margin hyperparameter to separate positive and negative edges. The loss function in (5) ensures that negative edges get lower scores than positive edges by at least a margin  λ. In addition, we also provide a classification loss as an alternative.

By minimizing the loss function introduced above via mini-batch stochastic gradient descent (SGD), DSSLP jointly learning node embedding  hiand link prediction score  S(vi, vj). Therefore, DSSLP can be not only used in link prediction task, but also transferred to other tasks like node classification task with few extra labeled data.

A. Overview

DSSLP is designed to solve industrial-scale link prediction problems. As illustrated in Fig. 1, DSSLP provides a complete solution for link prediction applications: (1) Data preprocessing. Generate k-hops neighborhood for each node and store them into a shared file system (Maxcompute 3); (2) Training. Train the link prediction model described above in a distributed and asynchronous manner based on the parametersever architecture [30]; (3) Inference. Compute link prediction score for industrial-scale graph data with carefully designed model-split strategy.

DSSLP receive a special format of k-hops neighborhood as input rather than the full graph, which helps reduce the time and memory cost. During training procedure, every worker pulls parameters from a parameter sever and update them independently. For each worker, we present the work flow in Fig. 2, which mainly consists of three parts: (1) Encode node embeddings based on subgraphs with a certain of embedding representation algorithms like GCN [3], GAT [5], Geniepath [6]; (2) Sample positive and negative links according to observed links within a batch; (3) Optimize a certain loss to guide the learning process. After that, we can apply a welltrained model to predicting the existence of links in a graph. Specially, for industrial-scale graph data, we split the welltrained model into two parts to avoid some repeated computing and accelerate the inference procedure.

When applied to industrial-scale link prediction applications, a framework should not only care about performance, but also hold the properties of high quality in scalability and efficiency. In the following, we will detail our framework and explain how this framework is possible to efficiently and accurately predict links between arbitrary two nodes on huge graphs.

B. Operating on k-hops neighborhood

The first challenge for a link prediction model is how to operate on huge graphs with billions of nodes and edges. The cost of memory and time will arrive at an unacceptable level when loading a full huge graph into memory, as a typical graph

image

Fig. 1. Overview of DSSLP Framework

that contains one billion nodes with each associated with 64-dimensional float features would require 250G memory to just store these features, which is beyond the capacity of typical commodity machines.

Rethinking the total work flow of the link prediction model, we find that, although embedding representation is based on structure and attribute information of a graph, it’s unnecessary to load the full graph into memory at once. As stated in section III-B, embedding a node is to aggregate information from its local neighborhood and a k-hops embedding of node x can be accurately calculated from its k-hops neighborhood N (k)vi[2]. Therefore, we format our inputs based on k-hops neighborhood instead of the full huge graph.

Our DSSLP framework takes pairs of nodes together with their k-hops neighbor as input. As shown in Fig. 2, the input of our framework is encoded into an “edge” format, which mainly consists of four components: (1) Id of node  v1; (2) Id of  v2; (3)  N (k)v1, k-hops neighborhood of node  v1; (4)  N (k)v2, k-hops neighborhood of node  v2. Ids of  v1and  v2are used to label the root nodes of subgraphs and the node pair  (v1, v2)forms a link to be predicted, while  N (k)vicontains information of nodes and edges of the k-hops neighborhood rooted in node vi.

For a certain node  vi, the k-hops neighborhood  N (k)vimainly consists of two parts: nodes associated with node features and edges together with their features in k-hops neighborhood of vi, which is constructed by the following steps:

1) Insert the root node  viinto the node list  Tntogether with its features.

2) For each node in  Tn, insert edges directly connected to them into the edge list  Te

3) Update  Tnwith nodes associated with edges in  Te. 4) Repeat the second and third steps for k times.

By alternately update the node list  Tnand the edge list  Te, we finally generate the k-hops neighborhood for node  vi.

To prepare the input of our model, we first extract k-hops neighborhood of  v1and  v2respectively, and then construct input with the “edge” format stated above. Since the k-hops neighborhood is information complete for computing the k-hops embedding of its root node, this kind of input provide sufficient structural and attribute information for our model to predict the existence between nodes  v1and  v2.

With input in such format, our DSSLP model is born with the ability of running parallelly and distributedly, as the model is able to learn embeddings of root nodes directly from a piece or pieces of data records without the need of referring to the full graph. Furthermore, our model is memory friendly, as the scale of input is decreasing from a full graph to a k-hops subgraph. For example, given a graph with 0.6 billion nodes and 10.4 billion edges, the average node degree is about 17, and then a 3-hop subgraph  N (3)vionly contains about  173 =4913 nodes, which is far less than the amount of nodes in the full graph.

C. Batched Runtime Sampling

The second challenge is how to efficiently sampling negative edges to guide the learning process. A straightforward way is sampling negative links in an offline manner: randomly select a certain number of observed edges as positive links together with about the same number of unobserved edges as negative links [2], [21]. However, when applied into industrial-scale applications, offline sampling may only cover a small part of the whole possible negative links and leads to an “imbalance” problem [21]. What’s worse, the imbalance data may break the basic assumption that training and test set should obey the same distribution. To solve this problem, we propose a batched runtime sampling method, which is flexible for different scale of graphs and is able to approximate the distribution of positive and negative edges of the whole observed graph G.

The runtime sampling method operates on a batch of records during the training procedure. Given a batch of records, a pair of root nodes  (vi,1, vi,2)in the i-th record is sampled from the whole observed edge set E, which naturally forms a positive link. Meanwhile, a pair of nodes  (vi,j, vm,n)is sampled as a negative link if  vi,jand  vm,nare root nodes within the batch and the edge  (vi,j, vm,n)can not be observed in the edge set  Eb(all edges within the batch). As all nodes directly connected to the root node  vi,jare collected in  vi,j’s k-hops neighborhood, sampling negative links according to  Ebis equal to sampling according to the whole edge set E.

A set of strategies are developed to make our runtime sampling method flexible and powerful: a) Shuffling: One key point for batched runtime sampling is that negative examples should be sampled according to the whole dataset rather than only within a batch. While candidates of negative links are combinations of nodes from the whole graph, batched sampling makes candidates of negative samples be limited to the combinations of root nodes within a training batch, which may not reflect the real distribution of the whole graph.

image

Fig. 2. Work flow of DSSLP framework durning training time.

A simple but efficient shuffling strategy is used to solve this problem when we design our batched runtime sampling method. The basic idea behind is that we should provide enough combinations of nodes as candidates of negative samples to sampling methods, which, is equal to provide enough combinations of data records. To that end, we perform a runtime shuffling strategy: First, we pick m records in order from the training dataset and put them into a prefetch buffer; Then, we randomly pick n records one by one to fill a training batch. Note that, once a record was pulled from the buffer, a new record will be picked from train dataset and replace the former record in the buffer.

With such shuffling strategy, a training batch is filled with randomly selected records, and a certain data record will meet any other records with enough training steps. Therefore, the candidates of negative links generated by batched runtime sampling approximate those sampled over the whole training dataset. b) Uniform sampling: We provide a basic uniform negative sampling method in our sampling suite. The uniform negative sampling method is designed based on the basic assumption that every node plays an equal role in the graph. Therefore, the uniform negative sampling method treats every node equally without any bias, and samples about the same number of negative neighbors from each node to form negative links.

The uniform negative sampling method takes a batch of records as input and output a set of negative links according to root nodes and observed edges within the training batch. Given a root node x from a certain record, the uniform sampling method works as follows:

1) Randomly select a root node y from the training batch.

2) Pick the link (x, y) as a negative link if it can not be observed in the edge set  Ebof the batch, and drop otherwise.

3) Repeat first two steps until getting neg num negative links or reaching pre-defined max trail times.

It’s worth noting that, even without the shuffling strategy, the uniform negative sampling method is also able to provide a diversity of negative examples for a certain node, as negative links are randomly constructed. The uniform negative sampling method is just a basic solution for runtime sampling and it can not approximate the real distribution in some real-world problems, as the importance of a node may vary due to different roles it played in the real-world applications. It is a good choice to apply this basic sampling method to applications where we don’t have any prior knowledge, and even statistic informations like in-degree, out-degree, are not available or beyond belief. c) Dynamic sampling: Moreover, we also propose a dynamic sampling method based on the degree of each node to approximate the distribution of the whole observed graph G. It is a common sense that the larger the degree of a node is, the more likely we will observe a link connected to it. Therefore, we form an inverse relationship between node degree and the number of negative links associated with the node  vi:

image

where  nviis the number of negative edges to be sampled,  λvirepresent the degree of node  vi, and  ηis a ratio hyperparameter to adjust the value of  nvi. In case that  λviis extremely large or small, we add two extra parameters  t, αto (6):

image

where  αis the max negative number to be sampled for a node. t is a threshold to adjust the real sampling number. (7) ensures that the number of negative links for a certain node should be in the range of  [1, α].

Different from the uniform sampling method described above, our dynamic sampling method adjusts the number of negative edges associated with a certain node dynamically according to (7). It worth noting that, from a statistic perspective, node degree is equal to the frequency of positive links connected to the node, which naturally reflects the real distribution of observed links. Therefore, examples sampled by our dynamic sampling method will approximately obey the same distribution of links in the observed graph G, as the sampling procedure takes the node degree into account.

D. Offline Inference

The third challenge is that, as the total number of links to be predicted may reach a quite large scale, as many as billions or even trillions, it may take tens or hundreds of hours to perform the inference over an industrial-scale dataset, which is usually beyond tolerance for industry applications. To solve this problem, we propose a model-split strategy to accelerate the inference procedure.

The basic idea of the model-split strategy is to reduce the computational complexity of the inference procedure. Given a graph G with N nodes, the upper bound of the number of links is  O(N 2), while O(N) for nodes, which means that we should perform the forward propagation procedure of our model as many as  N 2times. However, there are many unnecessary computations, as the same node may appear in different edges and its embedding is computed repeatedly in different input records, which leads to a high time cost. Based on this observation, we split the inference procedure into two parts: node embedding representation procedure and prediction procedure. As shown in Fig. 3, The model-split strategy mainly consists of four steps:

1) Split parameters of well trained model into two parts: parameters related to embedding representation procedure and prediction-related parameters.

2) Compute node embeddings with parameters related to embedding representation procedure.

3) Pack node embeddings into the edge format with reference to node ids from edge candidates.

4) Compute prediction scores with prediction-related parameters.

Note that if we use ranking loss during training procedure, there’s no prediction-related parameters, as we directly use the result of dot product as the prediction score. If we use the alternative classification loss, the prediction score is computed by a neural network and all the four steps should work in the inference procedure.

Compared with the original inference procedure, the model split strategy reduce the upper bound of the computational complexity in embedding representation procedure from O(N 2)to O(N). As embedding representation is the most time consuming part of our model, the model-split strategy is efficient to speed up the inference procedure.

In this section, we present the experimental results of our approaches on some public datasets together with some real-world applications at Alipay.

A. Datasets

The datasets we used mainly consist of two parts: public datasets and some real-world industrial-scale data at Alipay. a) Public Datasets: Following [2], [27], we evaluate our approaches on seven different datasets :

USAir [33]: a network of US air transportation system with 332 nodes and 2126 edges. Each edge represents an airline from source city to destination.

NS [34]: a co-authorship network of researchers in network science field with 1589 nodes and 2742 edges.

PB [35]: a network of the US political blogs with 1222 nodes and 16714 edges.

Yeast [36]: a network of molecular interactions in cells of Yeast with 2375 nodes and 11693 edges.

C.ele [37]: a neural network of C. elegans with 297 nodes and 2,148 edges.

Power [37]: an electrical power grid of the western US with 4941 nodes and 6594 edges.

Router [38]: a router-level Internet collected by Rocketfuel Project with 5022 nodes and 6253 edges. Details about those dataset are presented in Table I. node degree means the average node degree of a dataset and implies the sparsity of related network.

TABLE I SUMMARY OF PUBLIC DATASETS

image

b) Real-world Dataset At Alipay: We apply our approaches to real-world problems at Alipay, the world’s largest mobile payment platform. Datasets we used can be concluded as follows:

Trade Network. Trade network describes transactions between different users (i.e., customers and merchants). In the trade network, each user is treated as a node and we add an edge to the network if a deal happens between two users. We preprocess the data by deleting isolated nodes which are useless in propagating information through the topology [6]. The preprocess data contains about 376 million nodes and 4.48 billion edges. Moreover, a set of attributes is used to describe users and edges. Each node is associated with 3001-dimensional features while each edge contains 2-dimensional features. A few nodes

image

Fig. 3. Original inference process VS. Inference with model-split strategy.

in trade network are labeled to identify a special role in transaction.

Friendship Network. Friendship network describes the friendship between different individuals on the Internet. After preprocessing the data as stated above, the friendship network contains about 638 million nodes each with 446-dimensional features and 10.4 billion edges with 28-dimensional features. As time goes on, two individuals may build a friendship and the dataset records the start time of a friendship with a timestamp, which is used to separate the whole dataset into train and test sets.

TABLE II SUMMARY OF ALIPAY DATASETS

image

To prove the effectiveness of our approaches, which jointly learns link prediction and node embedding, we preform a node classification task on Trade Network and a link prediction task on Friendship Network.

For the node classification task on Trade Network, we first train a link prediction model and then feed node embeddings learned by the model to downstream node classification task with provided node labels. We present details about train and test set in Table II. About  1.14×108training records are used to train the link prediction model while  6.96 × 106labeled node are used to train the classifier. It’s worth noting that, extra node labels are only used to train the classifier and play no role in node embedding representation procedure.

On the other hand, for the link prediction task on Friendship Network, we first use a timestamp to split the graph into two part: links built before the timestamp as observed links while links after the timestamp as the positive test set. As for the negative links in test set, we randomly sampled unobserved links according to the graph. The total training set contains about  1.02×108records, while the test set contains  6.2×106records with about equal number of positive and negative links. It’s worth noting that all the training records are built based on observed links, and we use the runtime negative sampling method to generate negative samples during training time. Therefore, there is no need to prepare negative samples for training set.

B. Experimental Settings

We evaluate the performance of our model mainly on two tasks: the link prediction task and the node classification task. Experimental Settings are designed as follows:

Link prediction task. We first compare our model with some popular methods in academic, including : Common Neighbors (CA) [15] , Jaccard index (Jaccard) [39], Adamic-Adar (AA) [16], resource allocation (RA) [17], node2vec [12], Weisfeiler-Lehman Neural Machine (WLNM) [27], SEAL [2]. The first four methods are heuristic methods while the last three represent latest learn-based methods. We report results of our methods together with those methods on seven public dataset introduced in section V-A. Furthermore, we conduct a set of experiments on Friendship dataset with node2vec [12] as the baseline, which, to our best knowledge, is a public available method that is able handle such large scale graph.

Node classification task. As our model jointly learn link prediction and node embedding, we apply the learned embedding to node classification task on Trade dataset.

TABLE III RESULTS ON PUBLIC DATASET

image

Also, we use one of the most famous unsupervised embedding representation method node2vec [12] as the baseline. Gradient Boosting Decision Tree [40] (GBDT) is adopted as the classifier which is widely used in industry.

We report the AUC measure on all those experiments, because it is not influenced by the distribution of the classes compared with 0-1 accuracy. Moreover, we use Geniepath [6] as the default embedding representation method and embedding size of our model is set to 64 in all of those experiments. When performing the link prediction task, inner-product is used as the metric function of our model.

C. Results

Following the evaluation protocol stated above, we details results on public datasets and read-world industrial-scale datasets in terms of performance in this section. a) Link prediction task: We report results in terms of link prediction task in Table III and Table IV. In table III, we compare our model with some popular methods in academic on seven public dataset. Underline is used to indicate the top-2 results and Bold represents the best results. Meanwhile, we illustrate results on real-world industrial-scale dataset in Table IV.

As shown in Table III, our methods consistently outperforms stated heuristic methods like Common Neighbors (CA) [15], Jaccard index (Jaccard) [39], Adamic-Adar (AA) [16], resource allocation (RA) [17] along all seven public dataset. On Router dataset, our method even achieve an AUC of 0.970 while those heuristic methods only get 0.561. And the mean AUC across the seven dataset of our method is 0.950 while the best mean AUC of those four heuristic methods is 0.818. In all, our method achieves a significant improvement in terms of AUC compared with those heuristic methods.

Our methods also show advantages over some latest learning-based methods such as node2vec (N2V) [12], WLMN [27], and SEAL [2]. Compared with the state of the art method, SEAL, our model win four out of seven dataset in terms of AUC and the performance on the left three dataset is also about the same level with that of SEAL. Results shown in Table III demonstrate that our model achieves competitive performance compared with those latest methods in academic on small public datasets.

Moreover, results on large-scale dataset also demonstrate the effectiveness of our model. Note that, when applied to large-scale dataset, prior methods may fail to work due to the high cost of memory. As a result, we mainly take node2vec as the baseline, which, to our best knowledge, is a public available method to handle such large graph and also widely used in industrial applications. As presented in Table IV, our model outperforms node2vec by about 15% (0.842 VS. 0.738). As node2vec only embed the structure information without referring to attribute information, we also show the result based on node2vec embedding with raw features for fairness. Our model also achieves significant improvement compared with node2vec + raw features(0.842 VS. 0.769). b) Node classification task: Results in terms of node classification task on Trade dataset are presented in Table IV. As our model learns node embeddings from such large scale dataset without referring to any node labels, we also use node2vec as the baseline.

The node embedding of our method is more distinctive compared with that of node2vec as shown in Table IV. As the node labels in trade dataset represent a special role in transactions, which may mainly be influenced by attributes of a node, embeddings with structural information only may not works well. Therefore, node2vec achieves only 0.569 in terms of AUC on Trade dataset, which is far less than that of our model. Furthermore, our model outperforms node2vec significantly (0.757 VS. 0.658), even node2vec is combined with raw features.

TABLE IV RESULT ON REAL-WORLD DATASET AT ALIPAY

image

D. Analysis

In this section, we try to analysis and explain the effect of different strategies used in our model by varying the variable we will analysis while fix other variables which is described as follows:

a) effectiveness: We analysis the influence of shuffling strategy and sampling strategy to show the effectiveness of those strategies. Results of all four combinations across those two strategies are illustrated in Fig. 4, in which “Uniform” and “Dynamic” refer to model with uniform sampling and dynamic sampling respectively, while suffix “without shuf-fling” indicates model trained without shuffling strategy. The horizontal axis means the training epoches while vertical axis means losses.

image

Fig. 4. Loss curve varying according to different strategies

As illustrated in Fig. 4, both shuffling strategy and dynamic strategy are proved to be effective to help minimize the loss to a better place. By fixing the sampling strategy, we can conclude the effect of the shuffling strategy: when enabling the shuffling strategy, both loss curves of models with uniform sampling and dynamic sampling decrease to a lower place compared with models without shuffling strategy. On the other hand, by fixing shuffling shuffling strategy, models with dynamic sampling strategy achieve better result than models with uniform sampling. Moreover, model with dynamic negative sampling and shuffling strategies achieve best result across the four models.

The result described above may be due to the reason that, by shuffling, a certain record is able to meet any other records across the whole training set, which is equal to train the model on the whole dataset rather than within a min-batch. Moreover, by adopting dynamic negative sampling, the distribution of positive and negative links in training set approximates that over the whole observed graph, which helps construct more reasonable number of positive and negative samples. b) Scalability: For link prediction tasks on industrial-scale dataset, scalability is an import dimension that we should take into account. To prove the scalability of our framework, we design a set of link prediction experiments by varying worker numbers during training process.

As presented in Fig. 5 and Fig. 6, we find that, with more worker numbers, the training process will speed up and converge to about the same level or just slightly worse than model with less workers. It’s reasonable that with more workers, the performance decrease slightly as we adopt asynchronous optimization strategy during training process. We think it’s acceptable when applied our model to real-world tasks with large scale data by trading off time cost and performance.

image

Fig. 5. Loss curve varying according to worker num

image

Fig. 6. Time according to worker num

c) Offline inference efficiency: By adopting model-split strategy, the offline inference process is quite efficient compared with original inference process with Tensorflow [41]. It’s takes about 15-20 hours for inferring the whole Friendship dataset with original inference process, while with model-split strategy, it only takes about 5 hours. It’s worth noting that, the more dense the graph is, the better efficiency we will get with model-split strategy.

In this work, we propose a scalable framework for link prediction at Alipay. Based on subgraphs, our model is potential to jointly learn link prediction and node embedding on graphs with billions of nodes and edges without using any extra label information on nodes or edges. Benefit from the shuffling and sampling strategies, our model is able to approximate distributions of positive and negative links over the whole training dataset. Moreover, the proposed model-split strategy helps accelerate the speed of inference process significantly. Experiments show that our approach works consistently well on various real-world problems. In future, we are interested in applying our framework on heterogeneous networks based on meta-path negative sampling methods.

We would like to thank our colleagues from Alipay, MYbank and the infrastructure team of Ant Financial, who have offered great help in the devolopment and applicaiton of DSSLP. Furthermore, we like to thank Yang Shuang and Yuan Qi for their unwavering support of this project.

[1] Hamilton W L, Ying R, Leskovec J. Representation learning on graphs: Methods and applications[J]. arXiv preprint arXiv:1709.05584, 2017.

[2] Zhang M, Chen Y. Link prediction based on graph neural net- works[C]//Advances in Neural Information Processing Systems. 2018: 5165-5175.

[3] Kipf T N, Welling M. Semi-supervised classification with graph convo- lutional networks[J]. arXiv preprint arXiv:1609.02907, 2016.

[4] Hamilton W, Ying Z, Leskovec J. Inductive representation learning on large graphs[C]//Advances in Neural Information Processing Systems. 2017: 1024-1034.

[5] Velikovi P, Cucurull G, Casanova A, et al. Graph attention networks[J]. arXiv preprint arXiv:1710.10903, 2017.

[6] Liu Z, Chen C, Li L, et al. Geniepath: Graph neural networks with adaptive receptive paths[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 4424-4431.

[7] Kipf T N, Welling M. Variational graph auto-encoders[J]. arXiv preprint arXiv:1611.07308, 2016.

[8] Leicht E A, Holme P, Newman M E J. Vertex similarity in networks[J]. Physical Review E, 2006, 73(2): 026120.

[9] Sun D, Zhou T, Liu J G, et al. Information filtering based on transferring similarity[J]. Physical Review E, 2009, 80(1): 017101.

[10] Clauset A, Moore C, Newman M E J. Hierarchical structure and the prediction of missing links in networks[J]. Nature, 2008, 453(7191): 98.

[11] Pirotte A, Renders J M, Saerens M. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation[J]. IEEE Transactions on Knowledge and Data Engineering, 2007 (3): 355-369.

[12] Grover A, Leskovec J. node2vec: Scalable feature learning for net- works[C]//Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016: 855-864.

[13] LibenNowell D, Kleinberg J. The linkprediction problem for social networks[J]. Journal of the American society for information science and technology, 2007, 58(7): 1019-1031.

[14] Liu W, L L. Link prediction based on local random walk[J]. EPL (Europhysics Letters), 2010, 89(5): 58007.

[15] Newman M E J. Clustering and preferential attachment in growing networks[J]. Physical review E, 2001, 64(2): 025102.

[16] Adamic L A, Adar E. Friends and neighbors on the web[J]. Social networks, 2003, 25(3): 211-230.

[17] Zhou T, L L, Zhang Y C. Predicting missing links via local informa- tion[J]. The European Physical Journal B, 2009, 71(4): 623-630.

[18] Zhang G Q, Wang D, Li G J. Enhancing the transmission efficiency by edge deletion in scale-free networks[J]. Physical Review E, 2007, 76(1): 017101.

[19] Katz L. A new status index derived from sociometric analysis[J]. Psychometrika, 1953, 18(1): 39-43.

[20] Miller K, Jordan M I, Griffiths T L. Nonparametric latent feature mod- els for link prediction[C]//Advances in neural information processing systems. 2009: 1276-1284.

[21] Menon A K, Elkan C. Link prediction via matrix factorization[C]//Joint european conference on machine learning and knowledge discovery in databases. Springer, Berlin, Heidelberg, 2011: 437-452.

[22] Perozzi B, Al-Rfou R, Skiena S. Deepwalk: Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014: 701-710.

[23] Cai H, Zheng V W, Chang K C C. A comprehensive survey of graph em- bedding: Problems, techniques, and applications[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(9): 1616-1637.

[24] Goyal P, Ferrara E. Graph embedding techniques, applications, and performance: A survey[J]. Knowledge-Based Systems, 2018, 151: 78-94.

[25] Kovcs I A, Luck K, Spirohn K, et al. Network-based prediction of protein interactions[J]. Nature communications, 2019, 10(1): 1240.

[26] Barbieri N, Bonchi F, Manco G. Who to follow and why: link prediction with explanations[C]//Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014: 1266-1275.

[27] Zhang M, Chen Y. Weisfeiler-Lehman neural machine for link predic- tion[C]//Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017: 575-583.

[28] Zhao H, Du L, Buntine W. Leveraging node attributes for incomplete relational data[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017: 4072-4081.

[29] Nickel M, Jiang X, Tresp V. Reducing the rank in relational factoriza- tion models by including observable patterns[C]//Advances in Neural Information Processing Systems. 2014: 1179-1187.

[30] Li M, Zhou L, Yang Z, et al. Parameter server for distributed machine learning[C]//Big Learning NIPS Workshop. 2013, 6: 2.

[31] Cheng D, Gong Y, Zhou S, et al. Person re-identification by multi-channel parts-based cnn with improved triplet loss function[C]//Proceedings of the iEEE conference on computer vision and pattern recognition. 2016: 1335-1344.

[32] Chen W, Chen X, Zhang J, et al. Beyond triplet loss: a deep quadruplet network for person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 403-412.

[33] Vladimir Batagelj and Andrej Mrvar. http://vlado.fmf.unilj.si/pub/networks/data/, 2006.

[34] Newman M E J. Finding community structure in networks using the eigenvectors of matrices[J]. Physical review E, 2006, 74(3): 036104.

[35] Ackland R. Mapping the US political blogosphere: Are conservative bloggers more prominent?[C]//BlogTalk Downunder 2005 Conference, Sydney. BlogTalk Downunder 2005 Conference, Sydney, 2005.

[36] Von Mering C, Krause R, Snel B, et al. Comparative assessment of large- scale data sets of proteinprotein interactions[J]. Nature, 2002, 417(6887): 399.

[37] Watts D J, Strogatz S H. Collective dynamics of small-worldnetworks[J]. nature, 1998, 393(6684): 440.

[38] Spring N, Mahajan R, Wetherall D. Measuring ISP topologies with Rocketfuel[C]//ACM SIGCOMM Computer Communication Review. ACM, 2002, 32(4): 133-145.

[39] Chowdhury G G. Introduction to modern information retrieval[M]. Facet publishing, 2010.

[40] Chen T, Guestrin C. Xgboost: A scalable tree boosting system[C]//Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM, 2016: 785-794.

[41] Abadi M, Agarwal A, Barham P, et al. Tensorflow: Large-scale ma- chine learning on heterogeneous distributed systems[J]. arXiv preprint arXiv:1603.04467, 2016.

[42] Koller D, Friedman N. Probabilistic graphical models: principles and techniques[M]. MIT press, 2009.

[43] LibenNowell D, Kleinberg J. The linkprediction problem for social networks[J]. Journal of the American society for information science and technology, 2007, 58(7): 1019-1031.

[44] Kipf T N, Welling M. Variational graph auto-encoders[J]. arXiv preprint arXiv:1611.07308, 2016.

[45] Zhou Y, Cheng H, Yu J X. Graph clustering based on structural/attribute similarities[J]. Proceedings of the VLDB Endowment, 2009, 2(1): 718-729.

[46] Lerer A, Wu L, Shen J, Lacroix T, Wehrstedt L, Bose A, Peysakhovich A. PyTorch-BigGraph: A Large-scale Graph Embedding System[C]// SysML. 2019

[47] Hu, B., Zhang, Z., Shi, C., Zhou, J., Li, X., & Qi, Y. (2019, July). Cash-out user detection based on attributed heterogeneous information network with a hierarchical attention mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, pp. 946-953).


Designed for Accessibility and to further Open Science