Get Rid of Suspended Animation Problem: Deep Diffusive Neural Network on Graph Semi-Supervised Classification

2020·Arxiv

Abstract

Abstract

Existing graph neural networks may suffer from the “suspended animation problem” when the model architecture goes deep. Meanwhile, for some graph learning scenarios, e.g., nodes with text/image attributes or graphs with long-distance node correlations, deep graph neural networks will be necessary for effective graph representation learning. In this paper, we propose a new graph neural network, namely DIFNET (Graph Diffusive Neural Network), for graph representation learning and node classification. DIFNET utilizes both neural gates and graph residual learning for node hidden state modeling, and includes an attention mechanism for node neighborhood information diffusion. Extensive experiments will be done in this paper to compare DIFNET against several state-of-the-art graph neural network models. The experimental results can illustrate both the learning performance advantages and effectiveness of DIFNET, especially in addressing the “suspended animation problem”.

1 Introduction

Graph neural networks (GNNs) [Wu et al., 2019; Zhang, 2019] have been one of the most important research topic in recent years, and many different types of GNN models have been proposed already [Kipf and Welling, 2016]. Generally, via the aggregations of information from the neighbors [Hammond et al., 2011; Defferrard et al., 2016], GNNs can learn the representations of nodes in graph data effectively. The existing GNNs, e.g., GCN [Kipf and Welling, 2016], GAT [Veliˇckovi´c et al., 2018] and their derived variants [Klicpera et al., 2018; Li et al., 2018b; Gao et al., 2019], are mostly based on the approximated graph convolutional operator [Hammond et al., 2011; Defferrard et al., 2016]. As introduced in [Zhang and Meng, 2019], such an approximated aggregation operator actually reduces the neighboring information integration as an one-hop random walk layer followed by a fully-connected layer. Generally, in many learning scenarios, e.g., nodes with text/image attributes or graphs with close correlations among nodes that are far away, a deep GNN architecture will be necessary to either achieve the high-order

Figure 1: The learning performance of GCN (bias disabled) with 2-layer, . . . , 7-layer, and 10-layer, , 50-layer on the Cora dataset. The x axis denotes the iterations over the whole training set. The y axes denote the training and testing accuracy, respectively.

latent attribute representations or integrate information from those long-distance nodes for graph representation learning.

Also as pointed out in [Zhang and Meng, 2019], most of the existing GNN models based on approximated graph convolutional operator will suffer from the “suspended animation problem” when the model architecture goes deep. Especially, when the GNNs’ model depth reaches certain limit (namely, the GNNs’ suspended animation limit), the model will not respond to the training data anymore and become not learnable. As illustrate in Figure 1, we show a case study of the vanilla GCN model (with bias term disabled) on the Cora dataset (a well-known benchmark graph dataset). According to the results, the model can achieve the best performance with 2 layers, and its performance gets degraded as the depth increases and further get choked as the depth reaches 5 and more. Therefore, 5-layer is also called the “suspended animation limit” of GCN. Similar phenomena have been observed for the GCN model (with bias term enabled), GAT and their other derived models as well.

In this paper, we will introduce a novel graph neural network model, namely DIFNET (Graph Diffusive Neural Net), for graph representation learning. DIFNET is not based on the approximated graph convolutional operator and can work perfectly with deep model architectures. To be more specific, DIFNET introduces a new neuron unit, i.e., GDU (gated diffusive unit), for modeling and updating the graph node hidden states at each layer. Equipped with both neural gates and graph residual learning techniques, GDU can help address the aforementioned “suspended animation problem” ef-

Figure 2: The learning performance of DIFNET with 2-layer, 3-layer, . . . , 50-layer on the Cora dataset.

fectively. Furthermore, DIFNET also involves the attention mechanism for the neighboring node representation diffusion prior to feeding them into GDU. All the variables involved in GDU can be learned automatically with the backpropagation algorithm efficiently. To demonstrate the effectiveness of DIFNET, in Figure 2, we also illustrate the learning performance of DIFNET on the Cora dataset (all the settings are identical to those of Figure 1). From the plots, we observe that DIFNET proposed in this paper can converge very fast and also outperforms GCN for the testing results. What’s more, DIFNET generalizes very well to the deep architectures, and its learning performance can even get improved as the model depth increases. For instance, among all these layers, the highest testing accuracy score is achieved by DIFNET (10-layer). To be more specific, the layers for DIFNET denotes the number of GDU based layers. More detailed information and experimental results about DIFNET will be provided in the following sections of this paper. We summarize the contributions of this paper as follows: • We introduce a novel graph neural network model DIFNET in this paper, which can help address the “suspended animation problem” effectively on graph representation learning.

• We propose a new neuron model, i.e., GDU, which involves both neural gates and residual learning for node representation modeling and updating.

• DIFNET introduces an attention mechanism for the node representation diffusion and integration, which can work both effectively and efficiently.

• We test the effectiveness of DIFNET on several commonly used graph benchmark datasets and compare against both classic and state-of-the-art GNN models. The remaining parts of this paper will be organized as follows. We will introduce the related work in Section 2. Defini-tions of important terminologies and formulation of the studied problem will be provided in Section 3. Detailed information about the DIFNET model will be introduced in Section 4. The effectiveness of DIFNET will be tested in Section 5. Finally, we will conclude this paper in Section 6.

2 Related Work

Several research topics are closely correlated with this paper, which include graph neural network, GNN learning prob-

lems, neural gates and residual learning, respectiely.

Graph Neural Network: Due to the inter-connected structure of graph data, traditional deep models (assuming the training instances are i.i.d.) cannot be directly applied to the graph data. In recent years, graph neural networks [Monti et al., 2016; Atwood and Towsley, 2015; Masci et al., 2015; Kipf and Welling, 2016; Scarselli et al., 2009; Zhou et al., 2018; Niepert et al., 2016] have become one of the most popular research topic for effective graph representation learning. GCN proposed in [Kipf and Welling, 2016] feeds the generalized spectral features into the convolutional layer for representation learning. Similar to GCN, deep loopy graph neural network [Zhang, 2018] proposes to update the node states in a synchronous manner, and it introduces a spanning tree based learning algorithm for training the model. LOOPYNET accepts nodes’ raw features into each layer of the model, and it doesn’t suffer from the suspended animation problem according to [Zhang and Meng, 2019]. GAT [Veliˇckovi´c et al., 2018] leverages masked self-attentional layers to address the shortcomings of GCN. In recent years, we have also observed several derived variants of GCN and GAT models, e.g., [Klicpera et al., 2018; Li et al., 2018b; Gao et al., 2019]. Due to the limited space, we can only name a few number of the representative graph neural network here. The readers are also suggested to refer to page1, which provides a summary of the latest graph neural network research papers with code on the node classification problem. GNNs’ Performance Problems: For the existing graph neural networks, several performance problems have been observed. In [Zhang and Meng, 2019], the unique “suspended animation problem” with graph neural networks is identi-fied and formally defined. A theoretic analysis about the causes of such a problem is also provided in [Zhang and Meng, 2019], which also introduce the graph residual learning as the potential solution. Several other problems have also been observed in learning deep graph neural networks, e.g., the “over-smoothing problem” [Li et al., 2018a]. To resolve such a problem, many research efforts have been witnessed [Li et al., 2019; G¨urel et al., 2019; Sun et al., 2019; Huang and Carley, 2019]. To enable deep GCNs, [Li et al., 2019] proposes to adopt residual/dense connections and dilated convolutions into the GCN architecture. [G¨urel et al., 2019] tries to provide an explanation about the cause why GCN cannot benefit from deep architectures. In [Sun et al., 2019], the authors propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of network. Similar methods is also observed in [Huang and Carley, 2019], which also extends the recurrent network for deep graph representation learning.

Neural Gates and Residual Learning: Besides these two main topics, this paper is also closely correlated with neural gates learning [Hochreiter and Schmidhuber, 1997; Chung et al., 2014] and residual learning [Srivastava et al., 2015; He et al., 2015; Zhang and Meng, 2019], respectively. Neural gate learning initially proposed in LSTM [Hochreiter and Schmidhuber, 1997] proposes a way to handle the gradient vanishing/exploding problem with deep model learning. Later on, such concepts have been widely used in deep model learning [Chung et al., 2014; Gao and Glowacka, 2016]. Residual network [He et al., 2015] has been demonstrated to be effective for deep CNN learning, which extends the highway network [Srivastava et al., 2015] to a more general case. In [Zhang and Meng, 2019], the authors also observe that conventional naive residual learning cannot work for graph neural networks any more, and they introduce several new graph residual terms for representation learning instead.

3 Problem Formulation

Here, we will provide the formulation of the problem studied in this paper. Prior to that, we will also introduce the notations and terminologies to be used in the problem formulation

3.1 Notations

In the sequel of this paper, we will use the lower case letters (e.g., x) to represent scalars, lower case bold letters (e.g., x) to denote column vectors, bold-face upper case letters (e.g., X) to denote matrices, and upper case calligraphic letters (e.g., X) to denote sets or high-order tensors. Given a matrix X, we denote X(i, :) and X(:, j) as its row and column, respectively. The () entry of matrix X can be denoted as either X(i, j). We use and to represent the transpose of matrix X and vector x. For vector x, we represent its -norm as . The Frobenius-norm of matrix X is represented as . The element-wise product of vectors x and y of the same dimension is represented as , whose concatenation is represented as .

3.2 Terminology Definition

In this paper, we focus on studying the node representation learning problem in graph data. DEFINITION 1. (Graph): Formally, we can denote the graph data studied in this paper as G = (V, E, w, x, y), where V and E denote the sets of nodes and links in the graph. Mapping projects the links to their weights; whereas mappings and project the nodes to their raw features and labels. Here, X and Y denote the raw feature space and label space, respectively.

According to the above definition, for presentation simplicity, given a node , we can also denote its raw feature vector and label vector as and , respectively. Meanwhile, given a link between nodes and , we can also denote its corresponding link weight as (or for simplicity). In the case that studied graph is unweighted, we have and . DEFINITION 2. (Neighbor Set): Given the input graph G, for any node in the graph, we can denote its neighbor set as the group of nodes adjacent to based on the links in the graph, i.e., .

In the case when the studied graph is directed, the neighborhood set can be furthermore divided into the the inneighbor and out-neighbor, which will not be considered in this paper.

Figure 3: Detailed architecture of the GDU neuron (for node

3.3 Problem Statement

In this paper, we will study the semi-supervised node classi-fication problem via graph representation learning. Based on the notations and terminologies defined above, we can provide studied problem formulation in this paper as follows.

DEFINITION 3. (Semi-Supervised Node Classification): Given the input graph G = (V, E, w, x, y) with a subset of partially labeled nodes , the node classification problem studied in this paper aims at learning a mapping to project the nodes to their desired labels. To be more specific, in building the mapping f, we aim to to learn the representations of nodes based on both nodes’ selfraw features and the diffused information from their nearby neighbors. What’s more, the mapping f should also be generalizable to deep architectures without suffering from the suspended animation problem.

4 Method

To address the semi-supervised node classification problem, we will introduce the DIFNET (deep diffusive neural network) model in this part. We will first talk about the GDU neuron used in DIFNET, and also provide necessary information analyzing its effectiveness in addressing the suspended animation problem. For the neighborhood information diffusion, we will propose an attention mechanism for information propagation in DIFNET. Finally, detailed information about learning the DIFNET model will be introduced at the end.

4.1 Gated Diffusive Unit (GDU)

GDU is a new neuron model proposed for graph representation learning specifically. Formally, given the input graph data G involving the node set V and link set E, according to Section 3, we can denote the raw feature vector and neighbor set of node as and , respectively. The DIFNET model to be build involves multiple layers, and we can denote the hidden state of node at the layer of DIFNET as . For the nearby neighbors of , we can denote their hidden states as . DIFNET introduce an attention based diffusion operator to propagate and aggregate the information among such neighbor nodes, which can be denoted as

The above operator will be defined in detail in Section 4.3.

As illustrated in Figure 3, the GDU (e.g., for node ) has multiple input portals. Besides the neighborhood aggregated information , GDU will also accept ’s lower-layer representation and graph residual term as the other two inputs. For the aggregated input from the neighbors, certain information in the vector can be useless for learning ’s new state. Therefore, GDU defines an adjustment gate to update as follows:

Here, is the sigmoid function parameterized by .

Meanwhile, for the representation input from the lower-layer of , i.e., , GDU also introduces a similar evolving gate for adjusting the hidden state vector:

In addition, to involve the graph residual learning [Zhang and Meng, 2019] in DIFNET to overcome the suspended animation problem, GDU can also accept the graph residual terms as the third input, which can be denoted as

Formally, can represent any graph residual terms defined in [Zhang and Meng, 2019]. For instance, if the naive residual term is adopted here, we will have ; whereas for the raw residual term, we have . For the graph-naive residual term and graph-naive residual term, can be defined accordingly based on the graph adjacency matrix. In this paper, we will use the graph-raw residual term, as it can achieve the best performance as shown in [Zhang and Meng, 2019].

Based on these three inputs, and , GDU can effectively integrate the useful information from them to update the hidden state of node . Instead of simple summation, integration of such information is achieved in GDU via two selection gates and denoted as follows:

In the above equation, term 1 denotes a vector filled with value 1 of the same dimensions as the selection gate vectors and . Operator denotes the hyperbolic tangent activation function and denotes the entry-wise product as introduced in Section 3.1.

4.2 More Discussions on GDU

Here, we would like to provide more discussions about the GDU neuron, and also introduce a simplified version of GDU to lower down the learning cost of DIFNET.

Design of GDU on Suspended Animation

The design of the GDU architecture can help DIFNET solve the “suspended animation problem” when the model goes deep in three perspectives:

• Anti-Suspended Animation: Instead of integrating the neighborhood information with the approximated graph convolutional operator via either equal importance (as GCN) or with weighted summation (as GAT), GDU introduces (a) an attentive neighborhood information diffusion and integration mechanism (to be introduced in the following subsection), and (b) a node state adjustment mechanism via four neural gates defined above. GDU is not based on the approximated graph convolutional operator at all. Both the attentive diffusion mechanism and neural gates based adjustment mechanism are dynamic, which can compute the parameters automatically depending on the specific learning scenarios and can overcome the suspended animation side-effects on the model learning performance.

• Residual Learning: As introduced in [Zhang and Meng, 2019], the graph residual terms (especially the raw residual term and graph-raw residual term) can help the GNN models address the “suspended animation problem”. Both theoretic proofs and empirical experiments have demonstrated its effectiveness. The architecture of GDU also effectively integrate the graph residual learning mechanism into the learning process, where the input portal can accept various graph residual terms.

• Gradient Exploding/Vanishing: In addition, as introduced in [Hochreiter and Schmidhuber, 1997], for the neural network models in a deep architecture, the learning process may also suffer from the gradient exploding/vanishing problem. Similar problem has been observed for the GNN models, which is different from the “suspended animation problem” that we mention above. Meanwhile, similar to LSTM [Hochreiter and Schmid- huber, 1997] and GRU [Chung et al., 2014], the neural gates introduced in GDU can help ease even solve the “gradient exploding/vanishing” problem effectively.

Viewed in such perspectives, GDU is very different from the neurons used in the existing GNN models. In the fol- lowing part, we will focus on applying GDU to construct the DIFNET model for the graph data representation learning.

Simplified GDU Meanwhile, we also observed that the GDU neuron involves a complex architecture involving multiple variables, which may lead to a relatively high time cost for learning the model. To lower down the learning time cost, we also introduce a simplified version of GDU in this paper here. The main difference of the simplified GDU versus the original one introduced before lie in Equations 5-6, which can be denoted as

Figure 4: Framework architecture of the DIFNET model.

follows:

where the selection gates

According to the equations, one of the selection gate de-fined in Equation 6 is removed, and the definition of gate is also simplified, which is determined by the adjusted representations and only. Also the architecture of GDU is also updated, where the node state updating equation will balance between and (controlled by the gate ). The residual term is added to the representation at the end only, where is the variable for vector dimension adjustment, which can be shared across layers with the same hidden dimensions.

4.3 Attention based Neighborhood Diffusion

In this part, we will define the Diffusionoperator used in Equation (1) for node neighborhood information integration. The DIFNET model defines such an operator based on an attention mechanism. Formally, given the nodes hidden states , DIFNET defines the diffusion operator together with the nodes integrated representations as follows:

where covers the hidden states of all the nodes in the input graph. Matrices Q and K are the identical copies of V. The output matrix denotes the aggregated neighbor representations of all the nodes. Term denotes a mask matrix, where entry M(i, j) = 1 iff . Matrix M can effectively help prune the influences among nodes which are not connected in the original input graph.

According to the above equation, DIFNET quantifies the influence of node on in the layer based on their hidden states. If we expand the above equation, given and its adjacent neighbor set , we can also rewrite the learned ’s aggregated neighborhood representation vector as

where the influence weight is defined as

4.4 Graph Diffusive Neural Network Learning

Based on both the GDU neurons and the attentive diffusion operator, we can construct the DIFNET model, whose detailed architecture is provided in Figure 4. Based on the input graph, DIFNET involves multiple layers of GDU neurons, which accept inputs from the previous layer, diffusion operators and the graph residual terms. In DIFNET, at each layer of the model, one GDU neuron is constructed to represent the state of each node. Based on the learned representations of each node at the last layer, one fully-connected layer is adopted to project the node representations to their labels. Formally, , according to Figure 4, its learning process of DIFNET can be denoted as the following equations:

In the above equation, notation Embeddingwill embed the nodes’ raw input features (usually in a large dimension) into a latent representation. Depending on the raw features, different models can be used to define the Embeddingfunction. For instance, CNN can be used if is an image; LSTM/RNN can be adopted if is text; and fully connected layers can be used if is just a long feature vector. The nodes’ hidden states is also initialized based on the embedded raw feature vectors with a linear layer parameterized by . The learning process involves several diffusion and GDU layers, which will update nodes’ representations iteratively. The final outputted latent state vectors will be projected to the nodes’

Table 1: Learning result by models with deeper architectures.

labels with a fully-connected layer (also normalized by the softmax function).

Based on the set of labeled nodes, i.e., the training set T , DIFNET computes the introduced loss by comparing the learned label against the ground-truth, which can be denoted as the following equation:

By minimizing the above loss term, DIFNET can be effectively learned with the error back-propagation algorithm.

5 Experiments

To test the effectiveness of the proposed DIFNET model, extensive experiments have been done on several graph benchmark datasets in this paper, including Cora, Citeseer, and Pubmed. Cora, Citeseer and Pubmed are the benchmark datasets used for semi-supervised node classification problem used in almost all the state-of-the-art GNN papers [Kipf and Welling, 2016; Veliˇckovi´c et al., 2018; Zhang and Meng, 2019]. In the experiments, for fair comparison, we will follow the identical experimental settings (e.g., train/validation/test partition) as [Kipf and Welling, 2016].

Reproducibility: Both the datasets and source code used can be accessed via link2. Detailed information about the server used to run the model can be found at the footnote3.

Experimental Settings: By default, the experimental settings of the DIFNET model is as follows on all these datasets: learning rate: 0.01 (0.005 on pubmed), max-epoch: 1000, dropout rate: 0.5, weight-decay: 5e-4, hidden size: 16, graph residual term: graph-raw.

5.1 Model Depth vs. Suspended Animation

As illustrated by Figure 2 provided in the introduction section, we have shown that DIFNET model can converge very fast even with less epochs than GCN. What’s more, as the model depth increases, we didn’t observe the suspended animation problem with DIFNET at all, which can perform well even with very depth architectures, e.g., 50 layers.

Furthermore, in Table 1, we also provide the learning accuracy of DIFNET (with different depths) on Cora, Citeseer and

Table 2: Learning result accuracy of node classification methods. In the table, ‘-’ denotes the results of the methods on these datasets are not reported in the existing works. We have run DIFNET with depth values from . Results reported of DIFNET in this paper denotes the best one among all these depths.

Pubmed, respectively. To illustrate its performance advantages, we also add the results of GCN (with different depths) on these three datasets in the table as well. The results demonstrate that DIFNET can also work well in handling the suspended animation problem for the other datasets from different sources as well.

5.2 Comparison with State-of-the-Art

Besides the results showing that DIFNET can perform well in addressing the suspended animation problem, to make the experimental results more complete, we also compare DIFNET with both classic and state-of-the-art baseline models, whose results are provided in Table 2. According to the results, compared against these baseline methods, DIFNET can outperform them with great advantages. Also for DIFNET with the simplified GDU, it can achieve comparable performance as DIFNET with the original GDU with the same learning epochs. However, the learning time cost with simplified GDU is greatly reduced according to the experiment results.

6 Conclusion

In this paper, we focus on studying the graph representation learning problem and apply the learned graph representations for node semi-supervised classification. To address the common “suspended animation problem” with the existing graph neural networks, we introduce a new graph neural network model, namely DIFNET, in this paper. DIFNET is constructed based on both the GDU neurons and the attentive diffusion operator. To handle the suspended animation problem, GDU includes both neural gate learning and graph residual learning into its architecture, which can also handle the conventional gradient vanishing/exploding problem as well. Extensive experiments have been done on several benchmark graph datasets for node classification, whose results also demonstrate the effectiveness of both DIFNET and GDU in deep graph representation learning.

References

[Atwood and Towsley, 2015] James Atwood and Don Towsley. Search-convolutional neural networks. CoRR, abs/1511.02136, 2015.

[Belkin et al., 2006] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399–2434, December 2006.

[Chung et al., 2014] Junyoung Chung, C¸ aglar G¨ulc¸ehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.

[Defferrard et al., 2016] Micha¨el Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. CoRR, abs/1606.09375, 2016.

[Gao and Glowacka, 2016] Yuan Gao and Dorota Glowacka. Deep gate recurrent neural network. In ACML, 2016.

[Gao et al., 2019] Yang Gao, Hong Yang, Peng Zhang, Chuan Zhou, and Yue Hu. Graphnas: Graph neural architecture search with reinforcement learning. CoRR, abs/1904.09981, 2019.

[G¨urel et al., 2019] Nezihe Merve G¨urel, Hansheng Ren, Yujing Wang, Hui Xue, Yaming Yang, and Ce Zhang. An anatomy of graph neural networks going deep via the lens of mutual information: Exponential decay vs. full preservation. ArXiv, abs/1910.04499, 2019.

[Hammond et al., 2011] David K. Hammond, Pierre Vandergheynst, and Remi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, Mar 2011.

[He et al., 2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and J¨urgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[Huang and Carley, 2019] Binxuan Huang and Kathleen M. Carley. Inductive graph representation learning with recurrent graph neural networks. CoRR, abs/1904.08035, 2019.

[Kipf and Welling, 2016] Thomas N. Kipf and Max Welling. Semi- supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.

[Klicpera et al., 2018] Johannes Klicpera, Aleksandar Bojchevski, and Stephan G¨unnemann. Personalized embedding propagation: Combining neural networks on graphs with personalized pagerank. CoRR, abs/1810.05997, 2018.

[Li et al., 2018a] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. CoRR, abs/1801.07606, 2018.

[Li et al., 2018b] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. CoRR, abs/1810.10659, 2018.

[Li et al., 2019] Guohao Li, Matthias M¨uller, Ali K. Thabet, and Bernard Ghanem. Can gcns go as deep as cnns? CoRR, abs/1904.03751, 2019.

[Lin et al., 2019] Guangfeng Lin, Jing Wang, Kaiyang Liao, Fan Zhao, and Wanjun Chen. Structure fusion based on graph convolutional networks for semi-supervised classification. CoRR, abs/1907.02586, 2019.

[Lu and Getoor, 2003] Qing Lu and Lise Getoor. Link-based classification. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML’03, pages 496–503. AAAI Press, 2003.

[Masci et al., 2015] Jonathan Masci, Davide Boscaini, Michael M. Bronstein, and Pierre Vandergheynst. Shapenet: Convolutional neural networks on non-euclidean manifolds. CoRR, abs/1501.06297, 2015.

[Monti et al., 2016] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodol`a, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. CoRR, abs/1611.08402, 2016.

[Niepert et al., 2016] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. CoRR, abs/1605.05273, 2016.

[Perozzi et al., 2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. CoRR, abs/1403.6652, 2014.

[Scarselli et al., 2009] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, Jan 2009.

[Srivastava et al., 2015] Rupesh Kumar Srivastava, Klaus Greff, and J¨urgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.

[Sun et al., 2019] Ke Sun, Zhouchen Lin, and Zhanxing Zhu. Adagcn: Adaboosting graph convolutional networks into deep models. ArXiv, abs/1908.05081, 2019.

[Veliˇckovi´c et al., 2018] Petar Veliˇckovi´c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li`o, and Yoshua Bengio. Graph Attention Networks. International Conference on Learning Representations, 2018.

[Weston et al., 2008] Jason Weston, Fr´ed´eric Ratle, and Ronan Collobert. Deep learning via semi-supervised embedding. In Proceedings of the 25th International Conference on Machine Learning, ICML’08, pages 1168–1175, New York, NY, USA, 2008. Association for Computing Machinery.

[Wu et al., 2019] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. 2019.

[Yang et al., 2016] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. CoRR, abs/1603.08861, 2016.

[Zhang and Meng, 2019] Jiawei Zhang and Lin Meng. Gresnet: Graph residual network for reviving deep gnns from suspended animation. CoRR, abs/1909.05729, 2019.

[Zhang, 2018] Jiawei Zhang. Deep loopy neural network model for graph structured data representation learning. CoRR, abs/1805.07504, 2018.

[Zhang, 2019] Jiawei Zhang. Graph neural networks for small graph and giant network representation learning: An overview. CoRR, abs/1908.00187, 2019.

[Zhou et al., 2018] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. CoRR, abs/1812.08434, 2018.