Node Masking: Making Graph Neural Networks Generalize and Scale Better

2020Β·arXiv

ABSTRACT

1 INTRODUCTION

Graphs are the most effective way of representing different types of entities and relationships amongst them. Several constructs inherently involve the notion of graphs, such as social networks, molecular structures, knowledge bases, recommendation systems, etc. Over the past few year, learning on graphs has become increasingly popular, applications of which can be found in domains ranging

from detection of abuse online [17] and document classification [13] to knowledge graph alignment [31] and relation extraction in texts [20]. Learning on graphs is essentially about leveraging the inductive bias imposed by their relational structures, i.e., relational inductive bias, so as to achieve better performance on tasks that can benefit from relation reasoning [2]. The ability to exploit relationships amongst entities in the data is a crucial one for advancing the state of artificial intelligence [14, 27].

A graph is defined by its set of nodes (i.e., vertices) and its set of edges. There exist two different paradigms for learning on graphs, transductive and inductive. In transductive learning, the node and edge sets remain constant across the training and prediction phases. In other words, at the training time, the learning algorithm has access to all the nodes and edges, including those for which predictions are to be made. Note that transductive learning does not support generalization to unseen nodes and edges. Figure 1 depicts node classification performed in a transductive setting.

Figure 1: Node classification in transductive settings. At the training time, the learning algorithm has access to all the nodes and edges, including those nodes for which labels are to be predicted (denoted by question marks).

In inductive learning, first a model H is learned over the training graph consisting of some nodes and edges. The learned model is then used to predict on unseen nodes and edges that may or may not be disconnected from the nodes and edges in the training graph [6]. Note that some works [11, 29] have instead interpreted inductive learning to mean that the model is first trained on a set of graphs and then applied to a separate set of graphs. But the former interpretation subsumes the latter in that a set of graphs can be treated as a single graph with multiple disconnected components. Figure 2 depicts node classification in inductive settings.

Figure 2: Node classification in inductive settings. Once learned, the model can be applied to new unseen nodes (denoted by question marks). There may or may not exist edges between such new nodes and the nodes used for training.

Deep learning has brought advancements to several areas within AI. That said, deep learning on graphs is a rather challenging task to perform with traditional architectures like Convolutional Neural Networks or Recurrent Neural Networks. In the recent years, a lot of research has been dedicated to generalizing the convolution operation to graphs [3, 11, 13, 33], which has formed the basis of all modern graph neural networks (GNNs). From the early spectral architectures that could only operate on undirected graphs in transductive settings to the current state of the art spatial ones that can apply inductively to arbitrary graphs, GNNs have undergone significant developments. This paper makes contributions towards further enhancing the capabilities of state of the art spatial GNNs.

Our contributions. We first discuss some theoretical tools to better visualize the operations performed by spatial GNNs. Using these tools, we analyze the inner workings of the state of the art spatial architectures, specifically aggregation-based GNNs, to uncover the sources of over-smoothing and over-fitting. We then propose a simple technique called Node Masking to help these GNNs generalize and scale better. We empirically uncover how this technique impacts over-smoothing and over-fitting in GNNs and validate the merits of the technique by performing several experiments on three widely-used benchmark datasets for node classification in both the transductive and inductive settings.

2 A BRIEF HISTORY OF GNNS

The concept of GNNs was first formalized in the work of Gori et al. [10]. The authors presented GNNs as an extension of recursive neural networks whereby they treat nodes as objects denoted by state vectors and edges as the relationships amongst those nodes. Their design consists of two main steps: i) iterative update of nodesβ state vectors based on the labels and state vectors of their neighbors up to a stable fixed point, and ii) back-propagation for adjustment of parameters used in the update step. This approach was further refined in the work of Scarselli et al. [21].

2.1 Spectral GNNs

Bruna et al. [3] laid the foundation for generalization of the convolution operation from regular grids to graphs. Leveraging spectral graph theory, they proposed an architecture for performing spectral convolutions on graphs. Given a graph , their architecture considers the feature vectors on nodes as multi-channel graph signals. It learns spectral filters that act on these signals in the Fourier domain defined by the eigenvectors of the Laplacian matrix of . This architecture displays limited scalability since the filters learned are not localized, i.e., they act on the whole graph, and computation of the Laplacianβs eigenvectors is itself an expensive operation.

To overcome these issues, Defferrard et al. [7] and Levie et al. [15] proposed spectral architectures, ChebNet and CayleyNet respectively, comprising localized filters approximated by Chebyshev and Cayley polynomials. Kipf and Welling [13] further simplified ChebNet by making the filters localized to 1-hop neighborhoods. By stacking multiple such filters in layers, they showed that any number of hops could be covered in the convolution operation. They called this new architecture Graph Convolutional Network (GCN). Note that all spectral architectures are inherently transductive in that the filters learned on a graph are specific to the eigenbasis of its Laplacian. This not only limits the ability of these architectures to generalize to new nodes and edges. Although GCN itself is spectral, the idea of stacking multiple layers to cover higher-order neighborhoods led to the concept of spatial GNNs.

2.2 Spatial GNNs

Spatial GNNs define the convolution operation directly on the structure of the graph. In other words, they work by learning functions to compute representations for nodes or edges that capture the features and structures of their surrounding neighborhoods. Once learned, these functions can then be inductively applied to new nodes and edges. Spatial GNNs are preferred over their spectral counterparts due to their scalability, inductive capabilities, and ability to handle myriad types of graphs [32].

2.2.1 Sampling-based. GraphSAGE [11] was one of the first spatial GNNs. For a node randomly samples a fixed number of nodes from its -hop neighborhood and learns to computes a representation for based on its own features plus the features of its sampled neighbors. Note that this design does not exhibit structural invariance, i.e., the GNN cannot accommodate varying neighborhood structures but rather constrains nodes to a fixed number of neighbors only. A clear drawback of such a sampling-based design is that a lot of neighborhood information is wasted.

2.2.2 Aggregation-based. Aggregation-based spatial GNNs eliminate the need for sampling fixed number of neighbors. They work by iteratively computing representations for nodes based on those of their respective neighbors. A -layer aggregation-based GNN sequentially performs such updates times, consequently computing a representation for every node that captures its -hop neighborhood. The update operation performed by the layer for a node can be stated as:

where denotes the set of neighbors of , and is the input feature vector of . The aggregate function aggregates representations of neighbors, and the combine function combines the aggregated representation with that of itself. This formulation is general enough to cover the various aggregation-based GNNs that exist. All of them mainly differ in their choice of and . Note that the compute required to capture -hop neighborhood grows linearly with in aggregation-based GNNs as opposed to exponentially in sampling-based GNNs.

Velickovic et al. [29] proposed Graph Attention Networks (GAT) wherein a nodeβs representation is iteratively updated by aggregating the representations of its neighbors combining them with that of the nodeβs as per coefficients allocated by a self-attention mechanism. The defines the update operation in the

where is the attention coefficient of node with respect to node from the attention head, is the weight matrix for the attention head, and denotes concatenation over the heads.

Xu et al. [33] recently introduced the Graph Isomorphism Network (GIN) whose theoretical foundations allows it to be maximally powerful amongst the various spatial GNNs. The GIN-architecture defines the update operation as:

where represents a multi-layer perceptron.

In this paper, we focus on aggregation-based GNNs given that they yield state of the art [32] performance. We work with the GAT and GIN architectures only but our techniques can be applied to other aggregation-based GNNs.

3 THEORETICAL FRAMEWORK

Hereon, we assume that the graphs we consider are undirected, implying that an edge can be traversed from either endpoints. That said, the work presented in this paper is trivially applicable to directed graphs too. We also assume that all nodes within a given graph are uniquely identifiable.

3.1 Aggregation trees

We discuss the concept of aggregation trees as the theoretical tool for visualizing aggregation-based GNNs. Aggregation trees have been explored in the past as graph kernels [33] under names like tree-walks [1] or subtree patterns [25].

Definition 3.1. Given a graph with a node in it, let the set of all possible hops, starting at Walks are paths with possibly repeated nodes. The is the smallest arborescence [8] rooted at that is a path from the root of to a leaf in it if and only if . Here, smallness is by the number of nodes. Note that is of height with all the leaves at the same depth. We refer to itself as the

Figure 3 shows an example of a graph along with the various aggregation trees of the node in it. Figure 4 shows some trees that have the same root node and the same set of paths from the root to leaves but are not valid 2-aggregation trees of node . We would like to point out to the reader that the two assumptions we made for graphs are not upheld in the case of aggregation trees:

β’ Aggregation trees can contain multiple replicas of the same node [25]. For example, the 2-aggregation tree figure 3 has node as the root and also as two of the leaves. The replicas are treated as distinct nodes so that aggregation trees remain acyclic, but all of them correspond to the same node in the base graph.

β’ Aggregation trees are directed, making the notion of neighbor set different in their case than in the case of undirected graphs. In an aggregation tree, the neighbor set only contains those nodes that have incoming edges from , but not those that have outgoing edges to . Therefore, in the context of aggregation trees, we refer to the neighbor set of a node as its set of children or child set for clarity.

The two lemmas given below highlight some core properties of aggregation trees.

Lemma 3.2. Given a -aggregation tree , let a subtree of defined as any node along with all its descendants up to some depth (). Then every subtree of is also an aggregation tree.

Proof. Let us assume that the subtree at some node an aggregation tree. Then there exists a path in from the root to some leaf in it via that is not a -length walk starting at node in the base graph, or vice versa. But this contradicts the definition of aggregation trees. Hence, such a subtree cannot exist.

Lemma 3.3. Given a graph , let be any node in it without loss of generality. The child set of any non-leaf node in is equal to the neighbor set of the corresponding node in

Proof. We prove this also by contradiction. A non-leaf node inmust be reachable in at most 1 hops from . Consequently, if there exists a non-leaf node insuch that its child set is not equal to its corresponding neighbor set in , then there exists a path from the root of to a leaf in it that is not a -length walk in versa. However, this contradicts the definition of -aggregation trees. Hence, such a non-leaf node cannot exist.

3.2 Connection to aggregation-based GNNs

We now establish the connection between aggregation trees and aggregation-based GNNs.

Figure 3: Sub-figure (a) shows an undirected graph, i.e., edges can be walked in either directions. Sub-figures (b-d) show some aggregation trees . Every path from the root of a to some leaf in it is a valid -hop walk in the graph starting at node , and vice versa.

Figure 4: Trees having the required paths from the root to leaves, but not the valid 2-aggregation tree of . Sub-figure (a) is not a valid arborescence since there should only be one path from the root to any node in an arborescence [9]. Subfigure (b) is a valid arborescence but not the smallest.

Theorem 3.4. Given a graph , let be any node in it without loss of generality. The -aggregation tree of denotes the structure captured by the representation computed using a -layer aggregation-based GNN for . Alternatively, a -layer aggregation-based GNN, when applied to , computes the same representation as it does for node when applied to

Proof. We prove this by induction on . As per equation 1, when 1, the computation of by a 1-layer aggregation-based GNN is given by:

By lemma 3.3, we have that the child set of root in is identical to the neighbor set of in . Furthermore, are the input feature vectors for nodes; any node in has the same as its corresponding node in . So, the theorem trivially holds for 1. Assume that the theorem also holds for some computed for root of by a -layer aggregation-based GNN is identical to computed for node . Now, the computation of 1-layer aggregation-based GNN is given by:

As before, we have that the child set of root in is identical to the neighbor set of . Additionally, the representations for root and its children capture the respective subtrees of depth under them. Since, these subtrees are actually -aggregation trees (lemma 3.2), following our assumption, must be the same for root and its children as for the corresponding nodes in . So, the theorem holds for 1 when it holds for because inputs to the computation of are the same in the case of and . Hence, the theorem holds for

Essentially, the -aggregation tree is a visual depiction of the representation computed by a -layer aggregation-based GNN. Every subtree in is the depiction of some representation () computed intermediately.

4 ANALYSIS OF AGGREGATION-BASED GNNS

Aggregation trees surface two important issues stemming from the way that aggregation-based GNNs operate.

The first one concerns the generalization ability of these architectures. When computing the representation , a -layer aggregation-based GNN with 1 aggregates from the same set of nodes multiple times, especially from itself. An example of this can be seen in figure 3 where the node appears multiple times in its own 2-aggregation tree . Such repetitions allow the GNN to amplify the interactions (i.e., the message passing) between nodes and their neighbors, which has two known downsides [19]: 1) it facilitates over-smoothing whereby representations of nodes within a neighborhood become indistinguishable despite belonging to different classes, and 2) it results in over-fitting whereby the GNN cannot generalize well beyond training. In fact, Wang et al. [30] even present empirical observations of over-smoothing in GAT.

The second issue pertains to -layer aggregation-based GNN where 1. In many real-world settings, there is a large graph that keeps growing with time. A prime example is social networks where new users join from time to time. In such settings, learning follows the inductive paradigm whereby we train the GNN on a snapshot of the graph in time, and then predict on new nodes that enter the graph after that. Figure 5(a) depicts such a setting where is a node entering the graph after training. Figure 5(b) shows the -aggregation tree , i.e., the structure that will be captured by the representation . Essentially, the GNN requires input vectors for all the nodes within hops of to compute . This can be up to representations in total, where is the average degree of a node.

Figure 5: Depiction of the scenario where a node graph at prediction time after training has finished. The aggregation tree of node is also shown.

Storing representations from layers of a GNN has been explored as an optimization before [5]. In the example above, the representations and can be cached at the end of training since they capture the relevant aggregation trees ofrespectively. Then the representation Λcan be approximated at prediction time as:

where Λdenotes the cached representations, and Λsignifies that the computed representation is an approximation of . The -aggregation tree for equation 4 is shown in figure 6. Note that, unlike 5(b), now only appears as the root, making and Λdifferent. This is because was not present during training, and consequently, not covered by the cached representations.

Essentially, with caching, the GNN only requires the tations Λ, . . . , Λfor every 1-hop neighbor of to approximate , i.e., up to representations in total. While this is a substantial gain in efficiency, we show empirically in section 6 that the performance at prediction suffers given that the structure of aggregation trees differs between training and prediction times.

5 NODE MASKING

We propose node masking as a novel yet simple training phase technique to address the issues highlighted in the previous section.

5.1 Description

We begin by formalizing the notion of a masking function for generic countable sets.

Figure 6: The -aggregation tree of node as per equation 4. The dotted boxes encapsulate the aggregation trees captured by cached representations Λ

Definition 5.1. Let S be a set of elements; we assume S is countable. Let be the set of outputs of Bernoulli trials (probability of success. We define to be a bijective mapping from , and refer to it as the Bernoulli select function. There can be ! different can be defined for

Here, all the elements with 0 are said to be masked. Next, we demonstrate how we inculcate this masking function M in the computations of aggregation-based GNNs to tackle the issues we discussed in the previous section. Let be the Bernoulli select function over the set of nodes of a given graph . We propose node masking as the following modification to equation 1 that defines the update operation in the layer of an aggregation-based GNN:

equation 6 resembles equation 1, i.e., node masking has no effect. Thus, we say node masking is

The actual implementation of node masking may vary across the different aggregation-based GNNs. Equations 7 and 8 lay out the implementations of node masking in GAT and GIN-0 [33] architectures respectively:

In both cases, any node with 0 is effectively maskedsince its contribution to the sum is nullified.

Essentially, given a graph and a -layer aggregation-based GNN, if a node in is masked, then the representations () are discarded during the computations performed by the GNN. The equivalent effect in aggregation trees is that is is not included in the child set of other nodes in any aggregation tree since it cannot contribute to representations of other nodes. Going back to the example graph in figure 3, figure 7 shows some possible 2-aggregation trees of the node depending on the nodes that are masked. Note, even if a node is masked, it can still appear as the root of its own aggregation tree since root is not in any child set.

Figure 7: Some possible 2-aggregation trees the base graph shown by figure 3(a). The aggregation trees vary based on the nodes that are masked. If a node is masked, it is excluded from the child set of other nodes but can still appear as the root.

When a -layer aggregation-based GNN with active node masking is trained on some graph , a Bernoulli select function is randomly sampled in every training epoch, allowing the GNN to see many different for every node . This has two advantages. First, the GNN is discouraged from simply associating a node and its neighbors together. Second, if 0 in an epoch, then has the same structure as the aggregation tree in figure 6, i.e., no repetition of in . This reduces the tendency of the GNN to focus heavily on βs own features and also sensitizes the GNN to that structure of the aggregation trees that it may encounter at prediction time if caching is used.

In summary, node masking can be seen a training phase technique for GNNs that stresses the relational inductive biases in the data, while also having a regularizing effect that prevents aggregation-based GNNs from easily amplifying the interactions between nodes and their neighbors, hence alleviating over-fitting and over-smoothing. Once training has finished, node masking can easily be inactivated by setting the node masking rate

5.2 Node masking vs. Dropout techniques

Recently, quite a few works have explored ideas around stochastically dropping nodes and edges within GNN layers using Dropout [26]. Rong et al. [19] proposed the concept of DropEdge in spectral GNNs to alleviate the problem of over-fitting and over-smoothing. DropEdge randomly drops a subset of edges from the graph in every training epoch. In the GAT paper [29], the authors employed a similar technique by applying dropout to the attention coefficients within each GAT layer, which made the learning process more robust to over-fitting. However, neither of these techniques systematically addresses the problem of repetition in aggregation trees and neither of them facilitates caching. In fact, we show in our experiments that the GAT architecture performs much better when node masking is used instead of dropout on attention coefficients.

6 EXPERIMENTS

To empirically verify the theory we have discussed up till now, we conduct over 200 experimental runs, covering both the aspects we highlighted, i.e., generalization and scalability.

6.1 Datasets

We work with three widely-used benchmark datasets for node classification: the Cora and PubMed citation networks [22] and the social network graph of Reddit posts [11]. We pick these datasets since they are the ones used across several works that explore over-fitting and over-smoothing in GNNs [4, 16, 19, 30].

Cora. The Cora citation network dataset consists of 2, 708 nodes and 5, 429 edges. Nodes denote scientific publications and edges denote the citation relationships amongst them. Note that the edges are undirected for the purpose of the dataset even though citations are not symmetric [13, 29]. The publications are represented as binary bag-of-words vectors consisting of 1433 features each. Every node belongs to one of the seven classes, indicating the area of publication, e.g., Genetic Algorithms or Reinforcement Learning.

PubMed. The PubMed diabetes dataset consists of 19, 717 nodes and 44, 338 edges. Nodes denote scientific publications on diabetes and edges denote the citation relationships amongst them. As above, the edges are undirected for the purpose of the dataset. The publications are represented as TF-IDF weighted bag-of-words feature vectors with 500 features each. Every node belongs to one of the three classes, indicating the type of diabetes that the publication is about, e.g., Diabetes Mellitus Type 1 or Diabetes Mellitus Type 2.

Reddit. The Reddit posts graph consists of 232, 965 nodes denoting posts from 41 different sub-reddits. An undirected edge is present between two posts if the same user(s) commented on the two posts. Each post is represented by a 602-dimensional feature vector formed by concatenating distributional semantic features and count-based features for title and comments.

6.2 Models and configurations

We experiment with both of the aggregation-based GNNs that we have discussed up till now, i.e., GAT and GIN. For the GIN architecture, we utilize the formulation specified by equation 7. If node masking is inactive, this formulation behaves exactly like the GIN-0 architecture from the original paper [33] that was shown to have state of the art performance. For the GAT architecture, we use the original formulation [29] specified by equation 2. Additionally, we define a variant of GAT that we refer to as simple GAT or SGAT:

When node masking is inactive, this variant behaves exactly like the original GAT architecture except that the attention coefficients are now simply the inverse of a nodeβs degree.

6.3 Experimental settings

We use PyTorch [18] for modeling. For the GAT and SGAT architectures, we set the exact same hyper-parameters as in the original paper [29] for the Cora dataset. We do not experiment with GAT and SGAT on the PubMed and Reddit datasets because that requires support for operations like sparse softmax [29], which are not stable in PyTorch yet. Akin to GAT, we found a 2-layer GIN to be optimal for all the datasets. We set the maximum number of epochs to 1, 000 with an early stopping patience of 50 epochs. We use the Adam optimizer [12] to update the parameters. If node masking is active, then in every training epoch, we randomly sample a Bernoulli select function over the nodes in the train set. Node masking is always inactive outside training. We submit our code for further reference.

We experiment with both transductive and inductive learning paradigms. In the transductive setting, we make the entire graph from the dataset available at training time, i.e., all the nodes and edges. That said, the the loss for back-propagation is calculated using labels on the nodes in the train set only. In the inductive setting, we only make available at training time the graph G formed by nodes in the train set plus the edges amongst them. At validation and test times, we introduce the relevant nodes from the dataset into G along with the corresponding edges.

Unlike previous works, which test their hypotheses only on a single split of the datasets, we experiment with multiple splits of the three datasets in both the transductive and inductive settings. Specifically, for each dataset, we experiment with 10%, 20%, 25%, 33%, 50%, 75%, and 90% of the nodes as the train set. In all the cases, the remaining data forms our test set except for a small part that we designate as the validation set for evaluating the early stopping criterion in the training phase. For every split, we perform stratified partitioning of the data to ensure similar class distribution.

Note that the metrics we present from our experiments are all in fact mean metrics over 10 trials with random initializations of the parameters. Unlike previous works that use micro F1 and accuracy, we report macro F1 since we strongly consider it to be a more appropriate metric as it provides an equal evaluation of effectiveness on all classes [28]. That said, we noted that all the trends are consistent with micro F1 and accuracy too.

6.4 Generalization

To show that node masking helps aggregation-based GNNs generalize better, we compare the performances yielded by the models when node masking is active versus when it is inactive. We denote the configurations where node masking is active by β+NMβ and use a consistent rate of 5. That said, is a hyper-parameter that can be adjusted for further gains. Appendix A explores the change in performance as and number of layers in the GNN vary.

Table 1 shows the comparisons between GAT and GAT+NM and between SGAT and SGAT+NM on the Cora dataset. GAT+NM and SGAT+NM outperform their counterparts across several splits in both the transductive and inductive settings. As expected, the gains are more pronounced in the inductive setting and amongst

Table 1: Macro F1 scores on the Cora dataset in inductive and transductive settings for various sizes s of the train set (in percentage). Numbers in bold are significantly better than their counterparts with under paired t-test.

smaller sizes of the train set. Note that in GAT+NM and SGAT+NM, we do not use any dropout on the attention coefficients. So, the wins over GAT and SGAT clearly indicate that node masking is more effective than stochastically dropping edges because the latter does not systematically address the issue of repetition of nodes in aggregation trees. More specifically, dropout-based techniques simply make the aggregation trees sparser but do not strongly impede amplified interactions between nodes and their neighbors.

We further validate our reasoning by calculating the Mean Average Distance (MAD) values for the GAT and SGAT models with and without node masking. MAD is a metric recently proposed by Chen et al. [4] that quantifies the smoothing caused by a GNN by computing a cosine distance based scalar measure over the representations generated for all the nodes in a graph. The lower the MAD value, the higher the smoothing caused by the GNN. Table 2 presents the MAD values on the Cora dataset. We see that node masking is significantly better at alleviating over-smoothing than the stochastic exclusion of edges. These results are in line with the observations of Wang et al. [30] who found the dropout on attention coefficients to not be a very effective tool against over-smoothing.

Table 2: MAD values on the Cora dataset in inductive and transductive settings with a train set of size 10%. Numbers in bold are significantly better than their counterparts with under paired t-test.

Tables 3(a), 3(b) and 3(c) compare GIN and GIN+NM on the Cora, PubMed and Reddit datasets respectively. Again, node masking helps boost performance across almost all splits in both the transductive and inductive settings. Note that, due to the size of the Reddit dataset, we could not run the experiments in transductive setting or inductive setting with train set > 50% on our machines with NVIDIA P100 GPUs. Other works [19] have noted the same.

Table 3: Macro F1 scores on the Cora, PubMed and Reddit datasets in inductive and transductive settings for various sizes s of the train set (in percentage). Numbers in bold are significantly better than their counterparts with < 0.05 under paired t-test.

We found the MAD values for GIN and GIN+NM to be almost identical, e.g., 0.65 for Cora with a train set of size 10%. This is intuitive given that GIN is a maximally powerful GNN that, in theory, can achieve injective mapping from nodes to representations [33]. However, injectivity can lead to over-fitting [23], which node masking counters by stochastically augmenting the training graph to stress the relational inductive biases. We validate this by presenting some loss curves in appendix B on the Cora and PubMed datasets with and without node masking under transductive and inductive settings. They highlight the regularizing effect of node masking.

6.5 Scalability

Tables 4(a) and 4(b) compare the performances of GIN on the Cora and PubMed datasets respectively with and without caching. For both the settings, each table also shows the number of unique nodes involved in the computations done by the model to generate predictions for the nodes introduced into the graph at the test time. Caching consistently reduces the number of unique nodes involved because only those nodes from the train set need to be involved that are 1-hop neighbors of the nodes introduced at the test time. The gains in efficiency from caching are more pronounced when the number of nodes present at the training time is significantly more than the number of nodes entering at the test time, a common scenario in the real-world. However, as conjectured before, the model performs worse with caching because the structure of aggregation trees it captures at the test time differs from the structure it learned to capture at the training time. That said, when node masking is used, the performance in fact exceeds that of GIN without caching.

Table 4: Macro F1 scores on the Cora and PubMed datasets in inductive setting with and without caching for various sizes s of the train set (in percentage). F1 scores in bold are significantly better than their counterparts with 0.05 under paired t-test.

Here, one might seek a comparison with GraphSAGE [11] which reduces the number of unique nodes involved in the computation of for a node by sub-sampling its -hop neighborhood. Such sub-sampling, however, makes GraphSAGE lose a lot of information, rendering it worse than aggregation-based GNNs on these benchmark datasets [24]. We instead reduce the number of unique nodes involved in the computation carried out by aggregation-based GNNs by approximating using the cached representations of

1-hop neighbors of . This approximation does not degrade the performance when node masking is employed.

7 CONCLUSION

In this paper, we introduced node masking, a novel technique that significantly improves the performance of state of the art graph neural networks (GNNs). We first discussed some theoretical tools to better visualize the operations performed by spatial aggregation-based GNNs. Using these tools, we highlighted the issues that limit the ability of such GNNs to generalize and scale. Finally, we empirically demonstrated the effectiveness of node masking in enhancing the performance of aggregation-based GNNs on three widely-used benchmark datasets for node classification, the Cora and PubMed citation network and the social network graph of Reddit posts. The observed trends also hold for the CiteSeer dataset but we omitted it for brevity. Previous works utilizing these datasets had tested their hypotheses only on a single split of the datasets. We instead showed the efficacy of node masking on a range of splits under both the transductive setting as well as the inductive setting, hence laying down strong benchmarks for future research.

REFERENCES

[1] Francis R. Bach. 2008. Graph Kernels between Point Clouds. In Proceedings of the 25th International Conference on Machine Learning (Helsinki, Finland) . Association for Computing Machinery, New York, NY, USA, 25β32. https://doi.org/10.1145/1390156.1390160

[2] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, VinΓcius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Γaglar GΓΌlΓ§ehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. 2018. Relational inductive biases, deep learning, and graph networks. CoRR abs/1806.01261 (2018). arXiv:1806.01261 http://arxiv.org/abs/1806.01261

[3] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2014. Spectral Networks and Locally Connected Networks on Graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1312.6203

[4] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2020. Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View. Proceedings of the AAAI Conference on Artificial Intelligence 34, 04 (Apr. 2020), 3438β3445. https://doi.org/10.1609/aaai.v34i04.5747

[5] Jianfei Chen, Jun Zhu, and Le Song. 2018. Stochastic Training of Graph Convolutional Networks with Variance Reduction. In International Conference on Machine Learning. 941β949.

[6] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). https://arxiv.org/pdf/1905.07953.pdf

[7] MichaΓ«l Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS. 3844β3852.

[8] Jean-Claude Fournier. 2013. Graphs theory and applications: with exercises and problems. John Wiley & Sons.

[9] Gary Gordon and Elizabeth McMahon. 1989. A greedoid polynomial which distinguishes rooted arborescences. Proc. Amer. Math. Soc. 107, 2 (1989), 287β298.

[10] M. Gori, G. Monfardini, and F. Scarselli. 2005. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 2. 729β734 vol. 2. https://doi.org/10.1109/IJCNN.2005.1555942

[11] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024β1034.

[12] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (San Diego, California)

[13] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference . https://arxiv.org/pdf/1609.02907.pdf

[14] Brenden M. Lake and Marco Baroni. 2017. Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 3rd International Conference on Learning Representations. https://openreview.net/pdf?id=H18WqugAb

[15] Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. 2017. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing 67, 1 (2017), 97β109.

[16] Qimai Li, Zhichao Han, and Xiao ming Wu. 2018. Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Learning. https://www.aaai.org/ ocs/index.php/AAAI/AAAI18/paper/view/16098

[17] Pushkar Mishra, Marco Del Tredici, Helen Yannakoudakis, and Ekaterina Shutova. 2019. Abusive Language Detection with Graph Convolutional Networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Minneapolis, Minnesota). Association for Computational Linguistics. https://arxiv.org/abs/1904. 04073

[18] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. dβ AlchΓ©-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024β8035. https://arxiv.org/abs/1912.01703

[19] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. 2020. DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. In International Conference on Learning Representations. https://openreview.net/forum? id=Hkx1qkrKPr

[20] Sunil Kumar Sahu, Fenia Christopoulou, Makoto Miwa, and Sophia Ananiadou. 2019. Inter-sentence Relation Extraction with Document-level Graph Convolutional Neural Network. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4309β4316. https://doi.org/10.18653/v1/P19-1423

[21] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. 2009. The Graph Neural Network Model. IEEE Transactions on Neural Networks 20, 1 (Jan 2009), 61β80. https://doi.org/10.1109/TNN.2008.2005605

[22] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective Classification in Network Data. AI Magazine 29, 3 (Sep. 2008), 93. https://doi.org/10.1609/aimag.v29i3.2157

[23] Younjoo Seo, Andreas Loukas, and NathanaΓ«l Perraudin. 2019. Discriminative structural graph classification. CoRR abs/1905.13422 (2019). arXiv:1905.13422 http://arxiv.org/abs/1905.13422

[24] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan GΓΌnnemann. 2018. Pitfalls of Graph Neural Network Evaluation. Relational Representation Learning Workshop, NeurIPS 2018 (2018).

[25] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. 2011. Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res. 12, null (Nov. 2011), 2539β2561.

[26] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 56 (2014), 1929β1958. http://jmlr.org/papers/v15/srivastava14a.html

[27] Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. 2011. How to grow a mind: Statistics, structure, and abstraction. science 331, 6022 (2011), 1279β1285.

[28] Vincent Van Asch. 2013. Macro-and micro-averaged evaluation measures [[basic draft]]. Computational Linguistics & Psycholinguistics, University of Antwerp, Belgium (2013). https://www.clips.uantwerpen.be/~vincent/pdf/microaverage. pdf

[29] Petar VeliΔkoviΔ, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro LiΓ², and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations. https://openreview.net/forum?id=rJXMpikCZ accepted as poster.

[30] Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. 2019. Improving Graph Attention Networks with Large Margin-based Constraints. arXiv:1910.11945 [cs.LG]

[31] Zhichun Wang, Qingsong Lv, Xiaohan Lan, and Yu Zhang. 2018. Cross-lingual Knowledge Graph Alignment via Graph Convolutional Networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 349β357. https: //doi.org/10.18653/v1/D18-1032

[32] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2019. A Comprehensive Survey on Graph Neural Networks. CoRR abs/1901.00596 (2019). arXiv:1901.00596 http://arxiv.org/abs/1901.00596

[33] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In International Conference on Learning Representations. https://openreview.net/forum?id=ryGs6iA5Km

A VARYING THE NODE MASKING RATE

The analyses presented in this section are with the Cora dataset but the observed trends hold across all the datasets.

Figures 8 and 9 explore the change in performance of GIN+NM, GAT+NM, and SGAT+NM as the node masking rate increases from 0.1 up to 0.9. We conduct the experiments on the Cora dataset with a train set of size 10%.

As seen in figure 8, GIN+NM achieves the best performance under both the transductive setting as well as the inductive setting at 5. This suggests that fine-tuning the node masking rate can yield even higher gains than we report in the paper. On the hand, figure 9 suggests that GAT+NM and SGAT+NM achieve the best performance at

Figure 8: Results on the Cora dataset with GIN+NM under both the inductive and transductive settings for various node masking rates . The train set size is 10%. Y-axis denotes macro F1. We see that the best performance is not necessarily with

Figure 10 further explores how the performance of GIN+NM changes with varying node masking rates when the number of GNN layers, i.e., the depth of the model, is increased from 2 to 4, 8 or 16. We note that as the depth increases, the at which the optimal performance is achieved decreases. This is intuitive given that more layers means more parameters, which in turn means denser aggregation trees are required for optimal training.

Figure 9: Results on the Cora dataset with GAT+NM and SGAT+NM under inductive setting for various node masking rates . The train set size is 10%. Y-axis denotes macro F1. We see that the best performance is with

Figure 10: Results on the Cora dataset with GIN+NM under the transductive setting for various node masking rates [0.1, 0.9] and depths of the model. The train set size is 10%. Y-axis denotes macro F1. We see that the at which the optimal performance is achieved decreases with increasing depth.

B ANALYSIS OF LOSS CURVES

Figures 11 and 12 present some loss curves from the training and validation phases for the GIN model with (green) and without (red) node masking under both the transductive and inductive settings.

As can be noted, node masking leads to higher training losses in both the transductive and inductive settings. This is typical of a regularization effect. Additionally, we also observe that the validation losses go up sharply after a point when node masking is not used, clearly indicating that the model has over-fit. With node masking, such an over-fitting phenomenon is neither observed in the transductive setting nor in the inductive setting. Therefore, the model is able to achieve a lower validation loss meaning that it generalizes better. This is typical of a higher inductive bias effect.

Figure 11: Loss curves on the Cora dataset for the GIN model with (w/ NM) and without (w/o NM) node masking.

Figure 12: Loss curves on the PubMed dataset for the GIN model with (w/ NM) and without (w/o NM) node masking.

Designed for Accessibility and to further Open Science