Bridging the Gap between Spatial and Spectral Domains: A Survey on Graph Neural Networks

2020·Arxiv

Abstract

Abstract

Deep learning’s success has been widely recognized in a variety of machine learning tasks, including image classification, audio recognition, and natural language processing. As an extension of deep learning beyond these domains, graph neural networks (GNNs) are designed to handle the nonEuclidean graph-structure which is intractable to previous deep learning techniques. Existing GNNs are presented using various techniques, making direct comparison and cross-reference more complex. Although existing studies categorize GNNs into spatial-based and spectral-based techniques, there hasn’t been a thorough examination of their relationship. To close this gap, this study presents a single framework that systematically incorporates most GNNs. We organize existing GNNs into spatial and spectral domains, as well as expose the connections within each domain. A review of spectral graph theory and approximation theory builds a strong relationship across the spatial and spectral domains in further investigation.

1 Introduction

The effectiveness of deep learning [LeCun et al., 2015] has been widely recognized in various machine learning tasks [Redmon et al., 2016; Hinton et al., 2012; Luong et al., 2015], especially on the Euclidean data. Recent decades have witnessed a great number of emerging applications where effective information analysis generally boils down to the non-Euclidean geometry of the data represented by a graph, including social networks, transportation networks, disease contact networks, brain’s neuronal networks, gene data on biological regulatory networks, telecommunication networks, and knowledge graph [Zhang et al., 2018; Zhou et al., 2018; Wu et al., 2019b; Hamilton et al., 2017b]. General deep learning techniques can hardly handle such non-Euclidean problems in graph-structured data. Representing a graph is challenging because it is irregular, i.e., each graph has a variable size of nodes, and each node in a graph has a different number of neighbors, rendering some operations such as convolutions not compatible with the graph structure. Deep learning-based approaches for graph data have recently piqued people’s curiosity. This growing trend has attracted the attention of the machine learning community, and a huge number of GNN models based on various theories have been constructed. [Bruna et al., 2014; Kipf and Welling, 2017; Defferrard et al., 2016; Hamilton et al., 2017a; Atwood and Towsley, 2016; Veliˇckovi´c et al., 2017].

However, there is still a lack of comprehension of GNN’s physical meaning and representational capacity. This constraint not only makes it difficult to compare and enhance state-of-the-art procedures, but it also makes it an unmanageable black-box that can pose serious concerns in areas such as business intelligence and drug research. As a result, there is a compelling need to de-mystify GNNs, prompting researchers to explain GNNs in a broader context [Xu et al., 2019; Gilmer et al., 2017; Ying et al., 2019; Yuan et al., 2020]. However, these studies can only explain a small number of GNNs, leaving the majority of them unsolved.

Spatial-based methods (e.g., random walk, Page Rank, attention model, low-pass filter, message passing) and spectral-based methods (e.g., random walk, Page Rank, attention model, low-pass filter, message passing) are the two types of GNNs currently in use (e.g., ARMA filter, auto-regressive filter). Existing surveys [Bronstein et al., 2017; Zhang et al., 2018; Zhou et al., 2018; Wu et al., 2019b] have done a good job of separating and summarizing the works in their various subcategories. However, under current taxonomies, comprehension of the linkages between spatial and spectral-based techniques has not been fully explored. GNN research is still focused on revealing the underlying mechanisms, which is difficult due to the fact that most GNN mechanisms are not intrinsically consistent [Yuan et al., 2020; Xu et al., 2019; Li et al., 2019b; Li et al., 2018].

The goal of this research is to propose a generic framework for unifying GNNs under various processes by uncovering hidden relationships between GNNs from a theoretical standpoint. Our study is unusual in that it connects current GNN research in a white-box, interpretable framework. To begin, we provide a quick overview of the suggested framework for connecting the spatial and spectral domains, as well as their linkages. Then, for each subdivision of the spatial and spectral domains, many sample works are presented. Finally, we show how current work on enhancing GNNs in terms of over-smoothing and how a large scale graph can be broken down into one of these subcategories. Detailed contributions of this paper are summarized as follows:

1. Proposing taxonomies for the spatial- and spectral-based methods, respectively. This paper unifies GNNs in the spatial domain by formulating node aggregation, and categorizes GNNs by the types of frequency response functions in the spectral domain.

2. Unifying spatial and spectral models into a generic framework. The proposed framework links the spectral and spatial domain by comparing the analytical forms of node aggregation function and frequency response function.

3. Analyzing the proposed framework with trendy topics. Time complexity and expressive power are studied with approximation theory. Also, over-smoothing and scaling issue, two severe problems in frontier research, are studied, showing that they are special cases of our proposed framework.

2 Problem Setup and Framework Overview

A graph is defined as G = (V, E, A), where V is a set of n nodes and E represents edges. An entry denotes a node, indicates an edge between nodes i and j. The adjacency matrix is defined by if there is a link between node i and j, and else 0. Node features is a matrix with each entry representing the feature vector on node i. Another popular graph matrix is the graph Laplacian which is defined as where D is the degree matrix. Due to its generalization ability [Bollob´as, 2004] , the symmetric normalized Laplacian is often used, which is defined as . Another option is random walk normalization: . Note that normalization could also be applied to the adjacency matrix. Their relationship can be derived as . Since this survey focuses on spectraland spatial-based methods, two related definitions are listed below. Definition 2.1 (Spatial Method) By integrating graph connectivity G and node features X, the updated node representations (Z) are defined as:

where G is often implemented with A or L in existing works. Therefore, spatial methods focus on finding a node aggregation function that learns how to aggregate node features to obtain a updated node embedding Z.

sis (i.e., graph Fourier transform) [Shuman et al., 2013; Zhu and Rabbat, 2012]: , where is the diagonal matrix whose diagonal elements are the corresponding eigenvalues (i.e., ), and U is also called eigenvectors. The graph Fourier transform of a signal X is defined as and its inverse as . Definition 2.2 (Spectral Method) A graph convolution operation is defined in the Fourier domain such that

where is the element-wise product, and are two signals defined on nodes. It follows that a node signal is filtered by spectral signal as:

where g is known as frequency response function. Therefore, the objective of spectral methods is to learn a function .

Framework Overview: As shown in Fig. 1, the proposed framework categorizes GNNs into the spatial (A-0) and spectral (B-0) domains, either of which is further divided into three subcategories (A-1/A-2/A-3 and B-1/B-2/B-3), respectively. Based on the types of node aggregation (i.e., f in Def. 2.1), A-0 is further divided into linear (A-1), polynomial (A-2) and rational (A-3) propagation. Linear propagation means only the first order neighbors are involved, while polynomial propagation considers high-order neighbors. Beyond them, rational propagation adds reverse propagation. B-0 is split into linear (B-1), polynomial (B-2) and rational (B-3) approximation based on the types of frequency filtering (i.e., g in Def. 2.2), since they explicitly belong to the approximation techniques. Each category and subcategory will be elaborated with examples in Section 3 and 4. Two major relationships are inside our framework: (1) Inter-Relationship between A-0 and B-0 By definition 2.1 and 2.2, there is a correspondence between node aggregation function (f) and frequency response function (g):

where f = g and three pairs of equivalence relationships exist across A-0 and B-0:

• Linear Propagation and Approximation: Through graph and matrix theories, A-1 adjusts weights on a set of neighbor nodes, which corresponds to adjusting weights on frequency components in B-1.

• Polynomial Propagation and Approximation: Aggregating different orders of neighbors in A-2 can be rewritten as the sum of different orders of frequency components, which is exactly the analytical form of B-2.

• Rational Propagation and Approximation: A-3 de-fines a label propagation with reverse propagation, which is equivalent to B-3 that approximates frequency response with rational function. (2) Intra-Relationship inside A-0 and B-0 Three spatial subcategories (A-1/A-2/A-3) are strongly connected under (1) Generalization: A-1 can be extended to A-2 by adding more neighbors of higher order. A-2 can be upgraded to A-3 by adding reverse propagation; (2) Specialization: A-1 is a special case of A-2 when setting the order to 1. A-2 is a special case of A-3 if the reverse propagation is removed. Similarly, three spectral subcategories (B-1/B-2/B-3) are strongly connected with (1) Generalization: B-1 can be extended to B-2 by adding more higher order of eigenvalues. B-2 can be upgraded to B-3 if the denominator of frequency response function is not 1; (2) Specialization: B-1 is a special case of B-2 by setting the highest order to 1. B-2 is a special case of B-3 by setting the denominator of the frequency response function to 1.

Figure 1: Illustration of major graph neural operations and their relationship. Spatial and spectral methods are divided into three groups, respectively. Group A-1, A-2, and A-3 are strongly related by generalization and specialization, so are group B-1, B-2, and B-3. The equivalence relationships between A-1 and B-1 are marked with blue, red for A-2/B-2, and green for A-3/B-3.

3 Spatial-based GNN (A-0)

According to the node aggregation, we categorize spatial-based GNNs into three subcategories below, which have different strategies on the neighbor selection and propagation flow.

3.1 Linear Propagation (A-1)

Many works [Perozzi et al., 2014; Xu et al., 2019; Gilmer et al., 2017; Hamilton et al., 2017a; Veliˇckovi´c et al., 2017] can be treated as learning the aggregation scheme among first-order neighbors (i.e., direct neighbors). This aspect focuses on learning the weights for the node and its direct neighbors. Formally, updated node embeddings, Z(v), can be written as:

where denotes a neighbor of node is their representations, and indicate the weight functions. First item on the right hand side denotes the representation of node , while the second term represents the update from its neighbors. Applying random walk normalization, Eqn. 3.1 can be written as:

or with symmetric normalization:

where represents the degree of node . Normalization has better generalization capacity, which is supported by a theoretical proof on performance improvement [Johnson and Zhang, 2007]. In a simplified configuration, weights for the neighbors () are the same. Therefore, they can be rewritten in matrix form as:

or

where and are the weights. All the above can be generalized as the same form:

where denotes the normalized A, which could be implemented by random walk or symmetric normalization. Several state-of-the-art models are selected as examples in this category:

(1) Graph Convolutional Network (GCN) [Kipf and Welling, 2017] adds a self-loop to the current node , and applies a renormalization, changing to . GCN can be formulated as:

where . Therefore, GCN is equivalent to Eqn. 5 when setting and with the renormalization trick, and the result of GCN is exactly the sum of the current node and average of its neighbors.

(2) GraphSAGE [Hamilton et al., 2017a] with mean aggregator averages a node with its neighbors by:

where h indicates the representation, and N denotes the neighbor nodes. Eqn. 6 can be written in matrix form as:

which is equivalent to Eqn. 5 with and . Note that the key difference between GCN and GraphSAGE is the normalization type. The former is symmetric normalization and the latter is random walk normalization.

(3) Graph Isomorphism Network (GIN) [Xu et al., 2019] updates node representations as:

which is equivalent to Eqn. 5 with and . Note that GIN dose not perform normalization.

3.2 Polynomial Propagation (A-2)

To collect richer local structure, numerous studies involve high-order neighbor nodes (i.e., neighbors of neigbhors) [At- wood and Towsley, 2016; Defferrard et al., 2016; Wu et al., 2019a; Tang et al., 2015; Grover and Leskovec, 2016], since direct neighbors (i.e., first-order neighbors) are not always sufficient for representing the node. On the other hand, if the order number is large, GNNs may average all node representations, causing an over-smoothing issue and losing its focus on the local neighborhood [Li et al., 2018]. This motivates many models to tune the aggregation scheme on different orders of neighbors. Therefore, proper constraint and flexibil-ity of orders are critical for node representations. High-order neighbors have been proved to characterize complex signal such as Gabor-like filters [Abu-El-Haija et al., 2019]. Formally, this type of work can be formulated as:

where indicates a k-th order neighbors of node v and N(v) denotes direct neighbors of v. Eqn. 9 can be rewritten in matrix form as:

where is a polynomial function. Applying normalization, Eqn. 10 can be rewritten as:

where , and A could also be normalized by random walk normalization. Several existing works are analyzed below, showing that they are variants of Eqn. 10 or 11:

(1) ChebNet [Hammond et al., 2011] introduced truncated Chebyshev polynomial for estimating wavelet in graph signal processing. Based on this polynomial approximation, Defferrard et al. [Defferrard et al., 2016] designed ChebNet which embeds a novel neural network layer for the convolution operator. Specifically, ChebNet is written as:

where denotes the Chebyshev polynomial and is the Chebyshev coefficient. is the coefficient after expansion and reorganization. Since , we have:

which can be reorganized as:

which is exactly Eqn. 11.

(2) Diffusion Convolutional Neural Network (DCNN) [At- wood and Towsley, 2016] consider using a degree-normalized transition matrix, i.e., renormalized adjacency matrix :

where denotes a tensor containing the power series of , and the operator represents element-wise multiplication. It can be transformed as:

which equals to Eqn. 11.

(3) Simple Graph Convolution (SGC) [Wu et al., 2019a] removes non-linear function between neighboring graph convolution layers, and combines graph propagation in one single layer:

where is renormalized adjacency matrix, i.e., , and is degree matrix with self loop (same as in GCN). Therefore, it can be rewritten as:

3.3 Rational Propagation (A-3)

Most works merely consider label propagation to neighbors, ignoring propagation in the reverse direction. Reverse propagation on labels or features alleviates the over-smoothing issue by propagating back to the current node or restarting propagation with a certain probability. Note that A-2 can also mitigate over-smoothing issue by manually adjusting the order number, while A-3 can do it automatically. Several works explicitly or implicitly implement reverse propagation by applying rational function on the adjacency matrix [Klicpera et al., 2018; Wu et al., 2012; Li et al., 2019b; Loukas et al., 2015; Isufi et al., 2017; Levie et al., 2018; Bianchi et al., 2019; Chen et al., 2018b]. Formally, this subcategory can be represented as:

where P and Q are two different polynomial functions, and the bias of Q is often set to 1.

(1) Auto-Regressive Label Propagation [Zhu et al., 2003; Zhou et al., 2004; Bengio et al., 2006] is a widely used methodology for graph-based learning. The objective of Label Propagation (LP) is two-fold: one is to extract embeddings that match with the label, the other is to become similar to neighboring nodes. The label can be treated as part of node features, so we have:

which is equivalent to the form of Eqn. 18, i.e., P = I and .

Table 1: Frequency response functions grouped by approximation theory. Parameter notations have been defined in section 3.

(2) Personalized PageRank (PPNP) [Klicpera et al., 2018] obtains node’s representation via a teleport or restart probability which is the ratio of keeping the original representation X, i.e., reverse or no propagation. (1-) is the ratio of performing the normal label propagation:

where is random walk normalized adjacency matrix with self loop.

(3) ARMA filter [Bianchi et al., 2019] utilizes ARMA filter for approximating the desired filter response function, which can be written in the spatial domain as:

Note that ARMA filter is an unnormalized version of PPNP. When setting a + b = 1, ARMA equals to PPNP.

4 Spectral-based GNN (B-0)

Spectral-based GNN models are built on spectral graph theory which applies eigen-decomposition and analyzes the weight-adjusting function (i.e., frequency response function) on eigenvalues of graph matrices. Based on spectral operation, we propose a new taxonomy of GNNs, categorizing spectral-based GNN into three subcategories below, and use the same GNN examples in Section 3.

4.1 Linear Approximation (B-1)

Many existing works are proved to be low-pass filters [Li et al., 2019b], which means that only low-frequency components are emphasized. All the works that can be classified into A-1 can be understood as adjusting weights of frequency component during aggregation. Specifically, a linear function of g is defined so that:

where is the i-th eigenvector, is the frequency filter function controlled by parameters , and l lowest frequency components are aggregated. The goal of g is to change the weights of eigenvalues to fit the target output. The same examples in A-1 can be rewritten in the spectral domain. For example, GCN can be written as

where is renormalization of . Therefore, the frequency response function is . All the rewritten equations in B-0 are listed in Table 1.

4.2 Polynomial of Approximation (B-2)

Considering higher order on eigenvalues, frequency response function is a polynomial function and thereby can approximate more complex signal compared with linear approximation. Formally, B-2 can be written as:

where is a polynomial function of eigenvalues.

4.3 Rational Approximation (B-3)

The rational approximation is theoretically proved to be more powerful than polynomial approximation [Achieser, 2013; Trefethen, 2013; Petrushev and Popov, 2011; Cohen, 2011; Pachon, 2010], and it can handle the non-smooth signal. Rational kernel based methods are written as:

where is a rational function, and P, Q are two independent polynomial functions.

Remark: Note that the examples above in B-3 all have constraints on parameters. For instance, the numerator () in PPNP is equal to its bias of denominator which may render the rational function fail to obtain the optimal parameters. Therefore, one option is to learn all the separate parameters in Eqn. 22 [Chen et al., 2018b].

5 Performance and Relationship Analysis

As analyzed above, there is a close correspondence between the spatial and spectral methods. Also, there is a trade-off relationship between them. In quantum mechanics, Heisenberg’s uncertainty principle [Folland and Sitaram, 1997] describe a limit to the accuracy for certain pairs of complementary variables or canonically conjugate variables of a particle, implying that predicting the value of a quantity with arbitrary certainty is impossible. Specifically,

where and denote time spread and frequency spread respectively, and there is a trade-off between time and frequency concentration for a signal. Inspired by quantum

Table 2: Comparison on Time Complexity and Expressive Power

uncertainty principle, spectral graph analogy is developed [Agaskar and Lu, 2013], showing a trade-off between the spatial (A-0) and spectral (B-0) domain. Time Complexity and Expressive Power: A-1/B-1 have a time complexity of due to the matrix multiplication of A X. Accordingly, polynomial and rational method are analyzed in Table 2 where K is the order number. To compare their expressive powers, the convergence rate on challenging jump signal is employed as a benchmark [Chen et al., 2018b] (the simple signal cannot distinguish them). As shown in Table 2, A-3/B-3 converge exponentially faster than A-1/B-1, and A-2/B-2 converge linearly faster than A-1/B-1. Therefore, there is a trade-off between expressive power and computational efficiency. A-1/B-1 have the best efficiency but only capture the linear relationship. A-3/B-3 consume the most considerable overhead but could tackle more challenging signals.

Many recent works improve GNNs by tackling over-smoothing and large scale issue. In subsections 5.1 and 5.2, we will show that all the improvements are still inside our framework.

5.1 Sampling Point of View

To handle large graph, the sampling mechanism is introduced as a spatial-based method. Popular methodologies include random walk and subgraph sampling.

DeepWalk [Perozzi et al., 2014] first draws a group of random paths from the graph and applies a skip-gram algorithm to extract node features. Assuming the number of samples is large enough, then the transfer probability of random walk is, i.e., . If the training is sufficient and samples are adequate, the node will converge to its neighbors. Let the window size of skip-gram be 2t + 1 and the index of current node is t + 1. Therefore, the updated representation is as follows:

which is exactly polynomial propagation A-2. Node2Vec [Grover and Leskovec, 2016] defines a 2nd order random walk to control the balance between Breath First Search (BFS) and Depth First Search (DFS). Assuming the random walk is sufficiently sampled, Node2Vec can be rewritten in matrix form:

where transition probabilities is random walk normalized adjacency matrix, and Node2Vec also belongs to A-2. FastGCN [Chen et al., 2018a] and GraphSAGE [Hamil- ton et al., 2017a] sample subgraph randomly, which is equivalent to employing important sampling for each batch. The transition probability follows random walk normalization ( ), making it belong to A-1.

Remark: No sampling method belongs to A-3 or B-3. This is reasonable since A-3/B-3 has dramatically higher computational complexity.

5.2 Over-smoothing Point of View

Most GNNs perform poorly when stacking many layers, which is called the over-smoothing issue. Many related works aiming to solve the over-smoothing issue can be reduced to one category of the proposed framework.

[Zhu et al., 2020] proposed a method that combines direct neighbors and higher-order, which equivalent to polynomial propagation (A-2) or approximation (B-2). Deep GCN [Li et al., 2019a] developed a model with a residue module, dense connection, and dilated aggregation, which learns the weights of all different orders of neighbors. This is equivalent to polynomial propagation (A-2) or approximation (B-2). JKNet [Xu et al., 2018] also follows the same residue methodology as Deep GCN. PairNorm [Zhao and Akoglu, 2019] presents a two-step method that includes centering and re-scaling, which can neutralize the aggregation from graph convolution. PairNorm is similar to rational propagation (A-3) or approximation (B-3) since re-scaling is similar to teleport or restart operation. [Bo et al., ] design an adaptive method to dynamically adjust the weights between low-frequency and high-frequency components, resulting in two peaks in the spectral domain. This could also be modeled by rational propagation (A-3) or approximation (B-3) with its accuracy in jump signals. DropEdge [Rong et al., 2019] randomly drops a certain number of edges to avoid over-smoothing, which can be categorized as rational propagation (A-3) or approximation (B-3) since dropping edge provides a probability of keeping the original values of nodes. Remark: No stat-of-the-art method belongs to A-1 or B-1, which implies that A-1 and B-1 are fragile to the over-smoothing issue.

6 Conclusion

This paper proposes a unified framework that summarizes the state-of-the-art GNNs, providing a new perspective for understanding GNNs with different mechanisms. By analytically categorizing current GNNs into the spatial and spectral domains and further dividing them into subcategories, our analysis reveals that the subcategories are strongly connected by generalization, specialization, and equivalence relation. Even for currently trending topics such as over-smoothing studies, our proposed method can generalize them as one of those subcategories.

References

[Abu-El-Haija et al., 2019] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Hrayr Harutyunyan, Nazanin Alipourfard, Kristina Lerman, Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolution architectures via sparsified neighborhood mixing. International Conference on Machine Learning, 2019.

[Achieser, 2013] Naum I Achieser. Theory of approximation. Courier Corporation, 2013.

[Agaskar and Lu, 2013] Ameya Agaskar and Yue M Lu. A spectral graph uncertainty principle. IEEE Transactions on Information Theory, 59(7):4338–4356, 2013.

[Atwood and Towsley, 2016] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In NeurlPS, pages 1993–2001, 2016.

[Bengio et al., 2006] Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. 11 label propagation and quadratic criterion. 2006.

[Bianchi et al., 2019] Filippo Maria Bianchi, Daniele Grattarola, Lorenzo Livi, and Cesare Alippi. Graph neural networks with convolutional arma filters. CoRR, 2019.

[Bo et al., ] Deyu Bo, Xiao Wang, Chuan Shi, and Huawei Shen. Beyond low-frequency information in graph convolutional networks. arXiv preprint arXiv:2101.00797.

[Bollob´as, 2004] B´ela Bollob´as. Extremal graph theory. Courier Corporation, 2004.

[Bronstein et al., 2017] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[Bruna et al., 2014] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR2014), CBLS, April 2014, pages http–openreview, 2014.

[Chen et al., 2018a] Jie Chen, Tengfei Ma, and Cao Xiao. Fastgcn: fast learning with graph convolutional networks via importance sampling. International Conference on Learning Representations, 2018.

[Chen et al., 2018b] Zhiqian Chen, Feng Chen, Rongjie Lai, Xuchao Zhang, and Chang-Tien Lu. Rational neural networks for approximating jump discontinuities of graph convolution operator. ICDM, 2018.

[Cohen, 2011] Harold Cohen. Numerical approximation methods. Springer, 2011.

[Defferrard et al., 2016] Micha¨el Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurlPS, pages 3844–3852, 2016.

[Folland and Sitaram, 1997] Gerald B Folland and Alladi Sitaram. The uncertainty principle: a mathematical survey. Journal of Fourier analysis and applications, 3(3):207– 238, 1997.

[Gilmer et al., 2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of

the 34th International Conference on Machine LearningVolume 70, pages 1263–1272. JMLR. org, 2017.

[Grover and Leskovec, 2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD, pages 855–864. ACM, 2016.

[Hamilton et al., 2017a] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurlPS, pages 1024–1034, 2017.

[Hamilton et al., 2017b] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.

[Hammond et al., 2011] David K Hammond, Pierre Vandergheynst, and R´emi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.

[Hinton et al., 2012] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.

[Isufi et al., 2017] E. Isufi, A. Loukas, A. Simonetto, and G. Leus. Autoregressive moving average graph filtering. IEEE Transactions on Signal Processing, 65(2):274–288, Jan 2017.

[Johnson and Zhang, 2007] Rie Johnson and Tong Zhang. On the effectiveness of laplacian normalization for graph semi-supervised learning. Journal of Machine Learning Research, 8(Jul):1489–1517, 2007.

[Kipf and Welling, 2017] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017.

[Klicpera et al., 2018] Johannes Klicpera, Aleksandar Bojchevski, and Stephan G¨unnemann. Predict then propagate: Graph neural networks meet personalized pagerank. 2018.

[LeCun et al., 2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.

[Levie et al., 2018] Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing, 67(1):97– 109, 2018.

[Li et al., 2018] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[Li et al., 2019a] Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9267–9276, 2019.

[Li et al., 2019b] Qimai Li, Xiao-Ming Wu, Han Liu, Xiaotong Zhang, and Zhichao Guan. Label efficient semi-supervised learning via graph filtering. In The IEEE

Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[Loukas et al., 2015] A. Loukas, A. Simonetto, and G. Leus. Distributed autoregressive moving average graph filters. IEEE Signal Processing Letters, 22(11):1931–1935, Nov 2015.

[Luong et al., 2015] Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, 2015.

[Pachon, 2010] Ricardo Pachon. Algorithms for polynomial and rational approximation. PhD thesis, University of Oxford, 2010.

[Perozzi et al., 2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.

[Petrushev and Popov, 2011] Penco Petrov Petrushev and Vasil Atanasov Popov. Rational approximation of real functions, volume 28. Cambridge University Press, 2011.

[Redmon et al., 2016] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.

[Rong et al., 2019] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, 2019.

[Shuman et al., 2013] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.

[Tang et al., 2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Largescale information network embedding. In Proceedings of the 24th international conference on world wide web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.

[Trefethen, 2013] Lloyd N Trefethen. Approximation theory and approximation practice, volume 128. Siam, 2013.

[Veliˇckovi´c et al., 2017] Petar Veliˇckovi´c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[Wu et al., 2012] Xiao-Ming Wu, Zhenguo Li, Anthony M So, John Wright, and Shih-Fu Chang. Learning with partially absorbing random walks. In NeurlPS, pages 3077– 3085, 2012.

[Wu et al., 2019a] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International Conference on Machine Learning, pages 6861–6871, 2019.

[Wu et al., 2019b] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.

[Xu et al., 2018] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pages 5453–5462. PMLR, 2018.

[Xu et al., 2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.

[Ying et al., 2019] Rex Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. Gnn explainer: A tool for post-hoc explanation of graph neural networks. 2019.

[Yuan et al., 2020] Hao Yuan, Haiyang Yu, Shurui Gui, and Shuiwang Ji. Explainability in graph neural networks: A taxonomic survey. arXiv preprint arXiv:2012.15445, 2020.

[Zhang et al., 2018] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. arXiv preprint arXiv:1812.04202, 2018.

[Zhao and Akoglu, 2019] Lingxiao Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. In International Conference on Learning Representations, 2019.

[Zhou et al., 2004] Dengyong Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Sch¨olkopf. Learning with local and global consistency. In NeurlPS, pages 321–328, 2004.

[Zhou et al., 2018] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.

[Zhu and Rabbat, 2012] Xiaofan Zhu and Michael Rabbat. Approximating signals supported on graphs. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 3921–3924. IEEE, 2012.

[Zhu et al., 2003] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), pages 912–919, 2003.

[Zhu et al., 2020] Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. NeurlPS, 33, 2020.