An End-to-End Graph Convolutional Kernel Support Vector Machine

2020·Arxiv

Abstract

Abstract

A novel kernel-based support vector machine (SVM) for graph clas-sification is proposed. The SVM feature space mapping consists of a sequence of graph convolutional layers, which generates a vector space representation for each vertex, followed by a pooling layer which generates a reproducing kernel Hilbert space (RKHS) representation for the graph. The use of a RKHS offers the ability to implicitly operate in this space using a kernel function without the computational complexity of explicitly mapping into it. The proposed model is trained in a supervised end-to-end manner whereby the convolutional layers, the kernel function and SVM parameters are jointly optimized with respect to a regularized classification loss. This approach is distinct from existing kernel-based graph classification models which instead either use feature engineering or unsupervised learning to define the kernel function. Experimental results demonstrate that the proposed model outperforms existing deep learning baseline models on a number of datasets.

1 Introduction

The world contains much implicit structure which can be modelled using a graph. For example, an image can be modelled as a graph where objects (e.g. person, chair) are modelled as vertices and their pairwise relationships (e.g. sitting) are modelled as edges [21]. This representation has led to useful solutions for many vision problems including image captioning and visual question answering [2]. Similarly, a street network can be modelled as a graph where locations are modelled as vertices and street segments are modelled as edges. This representation has led to useful solutions for many transportation problems including the placement of electrical vehicle charging stations [8].

Given the ubiquity of problems which can be modelled in terms of graphs, performing machine learning on graphs represents an area of great research interest. Advances in the application of deep learning or neural networks to sequence spaces in the context of natural language processing and fixed dimensional vector spaces in the context of computer vision has led to much interest in applying deep learning to graphs. There exist many types of machine learning tasks one may wish to perform on graphs. These include vertex classification, graph classification, graph generation [48] and learning implicit/hidden structures [7]. In this work we focus on the task of graph classification. Examples of graph classification tasks include human activity recognition where human pose is modelled using a skeleton graph [44], visual scene understanding where the scene is modelled using a scene graph [41] and semantic segmentation of three dimensional point clouds where the point cloud is modelled as a graph of geometrically homogeneous elements [22].

Graph convolutional is the most commonly used deep learning architecture applied to graphs. This architecture consists of a sequence of convolutional layers where each layer iteratively updates a vector space representation of each vertex. In their seminal work, Gilmer et al. [9] demonstrated that many differ-ent convolutional layers can be formulated in terms of a framework containing two steps. In the first step, message passing is performed where each vertex receives messages from adjacent vertices regarding their current representation. In the second step, each vertex performs an update of its representation which is a function of its current representation and the messages it received in the previous step. In order to perform graph classification given a sequence of convolutional layers, the set of vertex representations output from this sequence must be integrated to form a graph representation. This graph representation can subsequently be used to predict a corresponding class label. We refer to this task of integrating vertex representations as vertex pooling and it represents the focus of this article. Note that, Gilmer et al. [9] refers to this task as readout.

Performing vertex pooling is made challenging by the fact that different sets of vertex representations corresponding to different graphs may contain different numbers of elements. Furthermore, the elements in a given set are unordered. Therefore one cannot directly apply a feed-forward or recurrent architecture because these require an input lying in a vector space or sequence space respectively. To overcome this challenge most solutions involve mapping the sets of vertex representations to either a vector or sequence space which can then form the input to a feed-forward or recurrent architecture respectively. There exists a wide array of such solutions ranging from computing simple summary statistics such as mean vertex representation to more complex clustering based methods [46].

In this article we propose a novel binary graph classification model which performs vertex pooling by mapping a set of vertex representations to an element in a reproducing kernel Hilbert space (RKHS). A RKHS is a function space for which there exists a corresponding kernel function equalling the dot product in this space. Being a function space where the domain of functions in this space is a Euclidean Space, the RKHS in question is of infinite dimension and in turn has high model capacity. However, the infinite nature of this space makes it challenging to work directly in this space. To overcome this challenge, we use the corresponding kernel function which allows us to implicitly compute the dot product in this space without explicitly mapping to the space in question. This is a commonly used strategy known as the kernel trick. More specifically, the kernel corresponding to the RKHS is used within a support vector machine (SVM) to perform binary graph classification. A useful feature of the proposed pooling method is that the mapping to a RKHS is parameterized by a scale parameter which controls the degree to which different sets of vertex representations can be discriminated.

The proposed graph classification model is trained in a supervised end-to-end manner where the convolutional layers, the kernel function and SVM parameters are jointly optimized with respect to a regularized classification loss. This approach is distinct from existing kernel-based models which instead use feature engineering or unsupervised learning to define the kernel function and only optimize the parameters of the classification method in a supervised manner [45]. Using feature engineering can result in diagonal dominance whereby a graph is determined to only be similar to itself, but not to any other graph [45]. Although unsupervised learning can overcome this problem and improve performance, the kernel may not be optimal for the task at hand given it was learned in an unsupervised as opposed to supervised manner [14]. The proposed solution of optimizing in an end-to-end manner overcomes these limitations.

The remainder of this paper is structured as follows. Section 2 reviews related work on graph kernels and vertex pooling methods. Section 3 describes the proposed graph classification model. Section 4 presents an evaluation of this model through comparison to 12 baseline models on 4 datasets. Finally, section 5 draws some conclusions from this work and discusses possible future research directions.

2 Related Work

In this work we propose a novel vertex pooling method which performs vertex pooling by mapping to a RKHS. In the following two sections we review related work on vertex pooling methods and graph kernels.

2.1 Vertex Pooling

As discussed in the introduction to this article, existing vertex pooling methods generally map the set of vertex representations to a fixed dimensional vector space or sequence space. The simplest methods for performing vertex pooling compute a summary statistic of the set of vertex representations. Commonly used summary statistics include mean, max and sum [4]. Despite the simple nature of these methods, a recent study by Luzhnica et al. [24] demonstrated that in some cases they can outperform more complex methods. Zhang et al. [49] proposed a vertex pooling method which first performs a sorting of vertex representations based on the Weisfeiler-Lehman graph isomorphism algorithm. A subset of these vertex representations are then selected based on this ranking, where the size of this subset is a user specified parameter. Tarlow et al. [23] proposed a vertex pooling method which outputs an element in sequence space. Gilmer et al. [9] proposed to perform vertex pooling by applying the set2set model from Vinyals et al. [38]. The set2set model maps the set of vertex representations to fixed dimensional vector space representation which is invariant to the order of elements in the set. Ying et al. [46] proposed a vertex pooling method which uses clustering to iteratively integrate vertex representations and outputs an element in a fixed dimensional vector space. Kearnes et al. [15] proposed a vertex pooling method which creates a fuzzy histogram of the vertex representations and outputs an element in a fixed dimensional vector space.

2.2 Graph Kernels

As described in the introduction to this article, existing kernel-based graph classification methods use either feature engineering or unsupervised learning to define the kernel. We now review each of these approaches in turn.

The most common approach for feature engineering kernels is the R-convolution framework where the kernel function of two graphs is defined in terms of the similarity of their respective substructures [12]. This framework is similar to the bag-of-words framework used in natural language processing. Substructures used in the R-convolution framework to define kernels include graphlets [34], shortest path properties [1] and random walk properties [36].

The Weisfeiler-Lehman framework is a framework for feature engineering kernels which is inspired by the Weisfeiler-Lehman test of graph isomorphism. In this framework the vertex representations of a given graph are iteratively updated in a similar manner to graph convolution to give a sequence of graphs. A kernel is then defined with respect to this sequence by summing the application of a given kernel, known as the base kernel, to each graph in the sequence. Shervashidze et al. [33] proposed a family of kernels using this framework by considering a set of base kernels including one which measures the similarity of shortest path properties. Rieck et al. [30] proposed a kernel using this framework by considering a base kernel which measures the similarity of topological properties.

Kriege et al. [20] proposed another framework for feature engineering kernels known as assignment kernels which computes an optimal assignment between graph substructures and sums over a kernel applied to each correspondence in the assignment. The authors proposed a number of kernels using this framework including one based on the Weisfeiler-Lehman graph isomorphism algorithm. Kondor et al. [19] proposed a multiscale kernel which considers vertex features plus topological information through the graph Laplacian. Zhang et al. [50] proposed a kernel-based on the return probabilities of random walks. The authors used an approximation of the kernel function so that the method can be applied to large datasets [29].

To overcome the limitations of feature engineering and improve performance, recent works in the field of graph kernels have considered unsupervised learning techniques. These methods generally learn a graph representation in an unsupervised manner and subsequently use this representation to define a kernel. Yanardag et al. [45] proposed a kernel which uses the R-convolution framework to define a set of substructures and subsequently learns an embedding of these substructures in an unsupervised manner using a word2vec type model. Ivanov et al. [14] proposed a kernel which determines two graphs to be similar if their vertices have similar neighbourhoods measured in terms of anonymous walks which are a generalization of random walks. Learning is performed in an unsupervised manner using a word2vec type model. Nikolentzos et al. [26] proposed a graph kernel which first computes sets of vertex representations corresponding to the graphs in question in an unsupervised manner. The similarity of these sets are then computed using the earth mover’s distance. The authors noted that these similarities do not yield a positive semidefinite kernel matrix preventing it from being used in some kernel-based classification methods. To overcome this issue the authors use a version of the support vector machine for indefinite kernel matrices. Similar to Nikolentzos et al. [26], Wu et al. [39] proposed a graph kernel which first computes sets of vertex representations corresponding to the graphs in questions in an unsupervised manner. The resulting set of embeddings are in turn used to embed the graph in question by measuring the disturbance distance to sets of embeddings corresponding to random graphs. Finally, this graph representation is used to define a kernel. Nikolentzos et al. [25] proposed a method that performs an unsupervised clustering of the input graph into components and subsequently learns a kernel function which takes as input these components.

3 Methodology

The proposed graph classification model consists of the following three steps. In the first step, a sequence of graph convolutional layers are applied to the graph in question to generate a corresponding set of vertex representations. In the second step, this set of vertex representations is mapped to a RKHS. In the final step, graph classification is performed using a SVM. Each of these three steps are described in turn in the first three subsections of this section. In the final subsection we describe how the parameters of each step are optimized jointly in an end-to-end manner. Before that, we first introduce some notation and formally define the problem of graph classification.

A graph is a tuple (V, E) where V is a set of vertices and ) is a set of edges. Let G denote the space of graphs. Let Σ denote a vertex labelling function. In this work we assume that Σ is a finite set. Let denote a set of n graphs and denote a corresponding set of graph labels. In this work we assume that graph labels take elements in the set {0, 1}. In this work we consider the problem of binary graph classification where given G and Y we wish to learn a map .

3.1 Graph Convolution Layers

A large number of different graph convolutional layers have been proposed. Broadly speaking a graph convolutional layer will update the representation of each vertex in a given graph where this update is a function of the current representation of that vertex plus the representations of its adjacent neighbours. In this section we only briefly review existing graph convolutional layers but the interested reader can find a more indepth analysis in the following review papers [51, 40].

Gilmer et al. [9] showed that many different convolutional layers may be reformulated in terms of a framework called Message Passing Neural Networks defined in terms of a message function M and an update function U. In this framework vertex representations are updated according to Equation 1 where denotes the representation of vertex v output from the t-th convolutional layer and N(v) denotes the set of vertices adjacent to v. Each vertex representation is an element of where the dimension m may vary from layer to layer. For the input layer, that is t = 1, vertex representations equal a one-hot encoding of the vertex labelling function l and therefore the corresponding dimension is . For all subsequent layers the corresponding dimension is a model hyper-parameter.

In the proposed graph classification model we use the functions M and U originally proposed by Hamilton et al. [11] and defined in Equation 2. Here CONCAT is the horizontal vector concatenation operation, and are the weights and biases respectively for the t-th convolutional layer, and ReLU is the real valued rectified linear unit non-linearity.

A sequence of two convolutional layers were used in the proposed model. A number of studies have found that the use of two layers empirically gives the best performance [18]. This sequence of layers will map a graph ) to a set of |V | points in where m is the dimension of the final convolutional layer. Since the number of vertices in a graph may vary the number of points in may in turn vary. Let us denote by Set the space of sets of points in . Given this, the sequence of convolutions layers defines a map Set.

3.2 Mapping to RKHS

The output from the sequence of convolutional layers defined in the previous subsection is an element in the space Set. In this section we propose a method for mapping elements in this space to a reproducing kernel Hilbert space (RKHS). We in turn define a kernel between elements in this space.

A Hilbert space is a vector space with an inner product such that the induced norm turns the space into a complete metric space. A positive-semidefinite kernel on a set X is a function such that there exists a feature space H and a map such that where and denotes the dot product in H. Equivalently, a function is a kernel if and only if for every subset , the matrix K with entries ) is positive semi-definite [31]. Given a kernel k, one can define a map as Equation 3 where codomain of this map is the space of real valued functions on X. Such a space is called a function space. Given this, it can be proven that . By virtue of this property, is called a reproducing kernel Hilbert space (RKHS) corresponding to the kernel k [31].

Let : be the Gaussian kernel function defined in Equation 4 which is parameterized by .

Given , we define a map F : Set in Equation 5 where is the space of real valued functions on . To illustrate this map consider the element of Set displayed in Figure 1(a) where the dimension m equals 2. Recall that elements in the space Set correspond to sets of points in . Figures 1(b) and 1(c) display the elements of resulting from applying the map F to this element of Set with parameter values of 0.001 and 0.0005 respectively.

Figure 1: An element of the space Set is displayed in (a) where the dimension m equals 2 and each point in is represented by a red dot. The elements of , which are themselves functions , resulting from applying the map F to this element with parameter values of 0.001 and 0.0005 are displayed in (b) and (c) respectively.

The parameter of the map F is a scale parameter and may be interpreted as follows. As the value of approaches 0, ) becomes a sum of a set indicator functions applied to x. In this case distinct elements of the space Set map to distinct elements of where the distance between these functions measured by the norm is greater than zero. On the other hand, as approaches , differences between the functions are gradually smoothed out and in turn the distance between the functions gradually reduces. Therefore, one can view the parameter as controlling the discrimination power of the method.

Given the map F defined in Equation 5, we define the kernel : in Equation 6. Note that, the final equality in this equation follows from the reproducing property of the RKHS related to and the bilinearity of the inner product [28]. By examination of Equation 6, we see that the kernel equals the dot product between elements in the codomain of the map F which is an infinite dimensional function space. That is, the kernel allows us to operate in this codomain without the computational complexity of explicitly mapping into it. In Theorem 1 we prove that is a valid positive-semidefinite kernel.

Theorem 1. The kernel is a positive-semidefinite kernel.

Proof. The kernel is a positive-semidefinite kernel because it is defined in Equation 6 to equal the dot product in the space .

The kernel has a specific scale which is specified by . In order to adopt a multi-scale approach we consider a set of to define a corresponding set of kernels . We combine these kernels using a linear combination defined in Equation 7 where . In Theorem 2 we prove that is a valid positive-semidefinite kernel.

Theorem 2. The kernel is a positive-semidefinite kernel.

Proof. The kernel is a positive-semidefinite kernel because it is the sum of positive-semidefinite kernels and the coefficients are all positive (see proposition 13.1 in [31]).

3.3 SVM

Recall that we consider the problem of graph classification whereby given G = and we wish to learn a map .

Let f : Set be a map from which we obtain a decision function by sgn(f). That is, if f returns a positive value we classify the graph in question as 1 and otherwise we classify it as 0. We determine a suitable map f lying in the RKHS H corresponding to the kernel by Equation 8. Note that, the first term in this sum corresponds to the soft margin loss [31] and the second term is a regularization term.

By the representer theorem any solution to Equation 8 can be written in the form of Equation 9 where [28].

Substituting this into Equation 8 we obtain Equation 10 where optimization of the function f is performed with respect to . Here is the elementwise multiplication operator (Hadamard product), 0 is a vector of zeros of size n and 1 is a vector of ones of size n.

3.4 End-to-End Optimization

As described in the previous subsections, the proposed classification model contains three steps with each having corresponding parameters which require optimization with respect to the objective function defined in Equation 10. The parameters in question are the sets of convolutional layer parameters and defined in Equation 2, the sets of kernel parameters and defined in Equation 7, and the set of SVM parameters defined in Equation 9. All of these parameters are unconstrained real values apart from the sets of kernel parameters and which are constrained to be positive real values. As such, the optimization problem in question is a constrained optimization problem. In this work we wish to optimize all the above model parameters jointly in an end-to-end manner. We refer to this as the end-to-end optimization problem. Note that, if only the SVM parameters were optimized and all other parameters were fixed, the optimization problem could be formulated as a quadratic program by taking the dual and solved in closed-form [31]. This is the most commonly used method for optimizing the parameters of an SVM.

In order to solve the end-to-end optimization problem we use a gradient based optimization method. Such methods are the most commonly used methods for optimizing neural network parameters [10]. There are two main approaches that can be used to apply a gradient based optimization method to a constrained optimization problem. The first approach is to project the result of each gradient step back into the feasible region. The second approach is to transform the constrained optimization problem into an unconstrained optimization problem and solve this problem. Such a transformation can be achieved using the Karush-Kuhn-Tucker (KKT) method [27]. In this work we use the former approach. In practice this reduces to passing the parameters and through the function max(0) after each gradient step. The above optimization can be used in conjunction with any gradient based optimization method such as stochastic gradient descent. In this work the Adam method was used [17].

4 Evaluation

In this section we present an evaluation of the proposed end-to-end graph clas-sification model with respect to current state-of-the-art models. This section is structured as follows. Section 4.1 provides implementation details for the proposed model. Section 4.2 describes the baseline models used to compare the proposed model against. Finally, section 4.3 describes the datasets used in this evaluation and compares the performance of all models on these datasets.

4.1 Implementation Details

The parameters of the proposed model were initialized as follows. The convolutional layer weights and biases in Equation 2 were initialized using Kaiming initialization [13] and to a value of 0 respectively. The kernel parameters and and s in Equation 7 were all initialized to a value of 1.

The model hyper-parameters were set as follows. The dimension of the convolutional hidden layers was set equal to 25. The Adam optimizer learning rate was set to its default value of 0.001 and training was performed for 300 epochs. The hyper-parameters in Equation 10 and s in Equation 7 were selected from the sets {0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0} and {1, 2} respectively by considering classification accuracy on a validation set. Larger values for the hyper-parameter s were not considered to ensure scalability of the model to medium sized datasets.

The time and space complexity of classifying a given graph is O(n) where n is the number of graphs in the training dataset. This is a consequence of the summation in Equation 9 over all training examples. The time and space complexity of performing an update of the method parameters using backprop is O() because this step computes the complete kernel matrix K in Equation 10. Note that, the above time complexity analysis assumes that each element in the kernel matrix K can be computed in constant time. In reality, if we assume that each graph contains m vertices the time complexity of computing each element in K is (see Equation 6). In this case the time complexity of classifying a given graph is O() while the time complexity of performing an update of the method parameters using backprop is O().

4.2 Baseline Methods

As described in the related work section of this paper, existing models for graph classification belong to two main categories of feature engineered kernel and end-to-end deep learning models. For the purposes of this evaluation, we compared the model proposed in this work to baseline models in each of these categories.

A set of 17 baseline models were considered where this set contains 5 feature engineered kernel models and 12 end-to-end deep learning models. We considered so many baseline models to ensure we were comparing to state of the art; many existing models claim to outperform each other so it is difficult to determine which models are in fact state of the art. The end-to-end deep learning baseline models considered in the evaluation are end-to-end models but not are kernel-based models. The proposed model is the first end-to-end kernel-based model for graphs.

In order to cover the breadth of different feature engineered kernel models, we considered three kernel functions in the R-convolution framework, one kernel function in the Weisfeiler-Lehman framework and one kernel function which uses unsupervised learning. The kernel functions in question are entitled Graphlet by Shervashidze et al. [34], Shortest Path by Borgwardt et al. [1], Vertex Histogram by Sugiyama et al. [36], Weisfeiler Lehman by Shervashidze et al. [33] and Pyramid Match by Nikolentzos et al. [26]. These kernel functions are described in the Appendix section of this article. For each kernel function, classification was performed using a Support Vector Machine (SVM) with a kernel matrix precomputed using the kernel function in question. Implementations for the kernel functions were obtained from the GraKeL Python library [35].

The end-to-end deep learning baseline models considered are entitled GCN, GCNWithJK, GIN, GIN0, GINWithJK, GIN0WithJK, GraphSAGE, GraphSAGEWithJK, DiffPool, GlobalAttentionNet, Set2SetNet and SortPool. The architectures of these models are described in the Appendix section of this article. Implementations for these models were obtained from the PyTorch Geometric Python library [6]; these can be downloaded directly from the benchmark section of the PyTorch Geometric website . For each end-to-end deep learning baseline model the corresponding model parameters were optimized using the Adam optimizer with the default learning rate of 0.001 and run for 300 epochs. In all cases a negative log likelihood loss function was used. Model hyper-parameters corresponding to the number and dimension of hidden layers were selected from the sets {1, 2, 3, 4, 5} and {16, 32, 64, 128} respectively by considering the loss on a validation set.

4.3 Datasets and Results

To evaluate the proposed graph classification model we considered five commonly used graph classification datasets obtained from the TU Dortmund University graph dataset repository [16] . Summary statistics for each of these datasets are displayed in Table 1. The MUTAG dataset contains graphs corresponding to chemical compounds and the binary classification problem concerns predicting a particular characteristic of the chemical [3]. The PTC MR dataset contains graphs corresponding to chemical compounds and the binary classifi-cation problem concerns predicting a carcinogenicity property. The BZR MD dataset contains graphs corresponding to chemical compounds and the binary classification problem concerns predicting a particular characteristic of the chemical [37]. The PTC FM dataset contains graphs corresponding to chemical compounds and the binary classification problem and concerns predicting a carcinogenicity property. The COX2 dataset contains graphs corresponding to molecules and the binary classification problem concerns predicting if a given molecule is active or inactive [37].

Stratified k-folds cross-validation with a k value of 10 was used to split the data into training, validation and testing sets. During each of the k training steps, one of the 1 folds in the training set was randomly selected to be a validation set and classification accuracy on this set was used to select model hyper-parameters. It has been shown that different training, validation and

Table 1: Summary statistics for each dataset used in the evaluation.

testing set splits of the data can lead to quite different rankings of graph clas-sification models [32]. However averaging the performance of k different splits, as done in this work, helps to reduce this instability. In our analysis the same training, testing and validation splits were used for all graph classification models considered. This is an important point because the performance of a given model may vary as a function of the split used. For each dataset we computed the mean accuracy on the test sets for each method. The results of this analysis are displayed in Table 2. For two of the five datasets, the proposed graph clas-sification model achieved the best mean performance and outperformed most other models by a significant margin. For the remaining three datasets, the proposed method achieved a better mean performance than many but not all baseline methods. These positive results demonstrate the utility of the proposed model. In most cases the proposed model achieved best performance on the validation set with the hyper-parameter s having a value of 2. Recall, from Equation 7, that this hyper-parameter equals the number of individual kernels integrated by the model. This demonstrates the utility of integrating multiple kernels.

Table 2: For each dataset, the mean classification accuracy plus standard devia- tion of 10-fold cross validation for each graph classification model are displayed.

It is important to note that the proposed method was compared against a large number of benchmark methods (17). This makes it challenging for any single method to perform best on all datasets. It is difficult to interpret exactly why one deep learning architecture performs better or worse than another on a particular dataset. However, one limitation of the proposed method that may limit its ability to accurately discriminate is that it only models the distribution of node embeddings and not the position of these nodes in the graph. The recent work by You et al. [47] suggests position information is important. The DiffPool method which performed best on the BZR MD dataset actually uses node position information when performing clustering in the pooling step (this is illustrated in Figure 1 of the original paper by Ying et al. [46]). We hypothesize that position information may not be important for some graph classification tasks while being important for others. This may explain why the proposed method does not uniformly outperform all others. It is also worth noting that the proposed method achieved similar performance to the GIN method on the BZR MD dataset. In a recent paper by Errica et al. [5], the authors found the GIN method to achieve best results on a number of datasets. Finally, it is interesting to note that the end-to-end deep learning models did not uniformly outperform the feature engineered kernel models. In fact, the best mean performance on the BZR MD dataset was achieved by a feature engineered kernel model.

5 Conclusions and Future Work

This article proposes a novel kernel-based support vector machine (SVM) for graph classification. Unlike existing kernel-based models, the proposed model is trained in a supervised end-to-end manner whereby the convolutional layers, the kernel function and SVM parameters are jointly optimized. The proposed model outperforms existing deep learning models on a number of datasets which demonstrates the utility of the model.

Despite these positive results, the proposed model is not a suitable candidate solution for all graph classification problems. Like all kernel-based models, the proposed model does not natively scale to large datasets. This is a consequence of the fact that training the model requires computation and storing of the kernel matrix whose size is quadratic in the number of training examples. This limitation may potentially be overcome by performing an approximation of the kennel function [29]. The authors plan to investigate this research direction in future work.

Acknowledgements

The author would like to acknowledge the many useful discussions he had with Bertrand Gauthier concerning kernel methods.

References

[1] Karsten M Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In IEEE international conference on data mining, pages 8–pp,

2005.

[2] Vincent Chen, Paroma Varma, Ranjay Krishna, Michael Bernstein, Christopher Re, and Li Fei-Fei. Scene graph prediction with limited labels. In International Conference on Computer Vision, 2019.

[3] Asim Kumar Debnath, Rosa L Lopez de Compadre, Gargi Debnath, Alan J Shusterman, and Corwin Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. Journal of medicinal chemistry, 34(2):786–797, 1991.

[4] David Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.

[5] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. arXiv preprint arXiv:1912.09893, 2019.

[6] Matthias Fey, , and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

[7] Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. Learning discrete structures for graph neural networks. In International Conference on Machine Learning, pages 1972–1982, 2019.

[8] Andrei Gagarin and Padraig Corcoran. Multiple domination models for placement of electric vehicle charging stations in road networks. Computers & Operations Research, 96:69–79, 2018.

[9] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272, 2017.

[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

[11] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

[12] David Haussler. Convolution kernels on discrete structures. Technical re- port, 1999.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classifi-cation. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.

[14] Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. In International Conference on Machine Learning, volume 80, pages 2191– 2200, Stockholmsmassan, Stockholm Sweden, 10–15 Jul 2018.

[15] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.

[16] Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016.

[17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic opti- mization. arXiv preprint arXiv:1412.6980, 2014.

[18] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.

[19] Risi Kondor and Horace Pan. The multiscale laplacian graph kernel. In Advances in Neural Information Processing Systems, pages 2990–2998, 2016.

[20] Nils M Kriege, Pierre-Louis Giscard, and Richard Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pages 1623–1631, 2016.

[21] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.

[22] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In The IEEE Conference on Computer Vision and Pattern Recognition, June 2018.

[23] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations, 2016.

[24] Enxhell Luzhnica, Ben Day, and Pietro Li`o. On graph classification net- works, datasets and baselines. In ICML Workshop on Learning and Reasoning with Graph-Structured Representations, California, United States, 2019.

[25] Giannis Nikolentzos, Polykarpos Meladianos, Antoine Jean-Pierre Tixier, Konstantinos Skianis, and Michalis Vazirgiannis. Kernel graph convolutional neural networks. In International Conference on Artificial Neural Networks, pages 22–32. Springer, 2018.

[26] Giannis Nikolentzos, Polykarpos Meladianos, and Michalis Vazirgiannis. Matching node embeddings for graph similarity. In AAAI Conference on Artificial Intelligence, 2017.

[27] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.

[28] Vern I Paulsen and Mrinal Raghupathi. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge University Press, 2016.

[29] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.

[30] Bastian Rieck, Christian Bock, and Karsten Borgwardt. A persistent weisfeiler-lehman procedure for graph classification. In International Conference on Machine Learning, pages 5448–5458, 2019.

[31] Bernhard Sch¨olkopf, Alexander J Smola, Francis Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.

[32] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan G¨unnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.

[33] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.

[34] Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pages 488–495, 2009.

[35] Giannis Siglidis, Giannis Nikolentzos, Stratis Limnios, Christos Giatsidis, Konstantinos Skianis, and Michalis Vazirgiannis. Grakel: A graph kernel library in python. Journal of Machine Learning Research, 21(54):1–5, 2020.

[36] Mahito Sugiyama and Karsten Borgwardt. Halting in random walk kernels. In Advances in neural information processing systems, pages 1639–1647, 2015.

[37] Jeffrey J Sutherland, Lee A O’brien, and Donald F Weaver. Spline-fitting with a genetic algorithm: A method for developing classification structure-activity relationships. Journal of chemical information and computer sciences, 43(6):1906–1915, 2003.

[38] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Se- quence to sequence for sets. In International Conference on Learning Representations, San Juan, Puerto Rico, 2016.

[39] Lingfei Wu, Ian En-Hsu Yen, Zhen Zhang, Kun Xu, Liang Zhao, Xi Peng, Yinglong Xia, and Charu Aggarwal. Scalable global alignment graph kernel using random features: From node embedding to graph embedding. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1418–1428. ACM, 2019.

[40] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.

[41] Danfei Xu, Yuke Zhu, Christopher Choy, and Li Fei-Fei. Scene graph gener- ation by iterative message passing. In The IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[42] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How power- ful are graph neural networks? In International Conference on Learning Representations, 2019.

[43] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, volume 80, pages 5453–5462, Stockholmsmassan, Stockholm Sweden, 10–15 Jul 2018.

[44] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolu- tional networks for skeleton-based action recognition. In AAAI Conference on Artificial Intelligence, 2018.

[45] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374, 2015.

[46] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differ-entiable pooling. In Advances in Neural Information Processing Systems, pages 4800–4810, 2018.

[47] Jiaxuan You, Rex Ying, and Jure Leskovec. Position-aware graph neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7134–7143, Long Beach, California, USA, June 2019.

[48] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. Graphrnn: Generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning, pages 5694–5703, 2018.

[49] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end- to-end deep learning architecture for graph classification. In AAAI Conference on Artificial Intelligence, 2018.

[50] Zhen Zhang, Mianzhi Wang, Yijian Xiang, Yan Huang, and Arye Neho- rai. Retgk: Graph kernels based on return probabilities of random walks. In Advances in Neural Information Processing Systems, pages 3964–3974, 2018.

[51] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. arXiv preprint arXiv:1812.04202, 2018.

Appendix

We briefly describe the kernel functions corresponding to the feature engineered kernel baseline models considered in the work.

Graphlet - This is a kernel in the R-convolution framework which uses substructures based on Graphlets and was proposed by Shervashidze et al. [34].

Shortest Path - This is a kernel in the R-convolution framework which uses substructures based on shortest paths and was proposed by Borgwardt et al. [1].

Vertex Histogram - This is a kernel in the R-convolution which uses substructures based on random walks and was proposed by Sugiyama et al. [36].

Weisfeiler Lehman - This is a kernel in the Weisfeiler-Lehman framework which uses the Vertex Histogram Kernel as the base kernel [36] and was proposed by Shervashidze et al. [33].

Pyramid Match - This kernel uses unsupervised learning and was proposed by Nikolentzos et al. [26].

We briefly describe the architectures corresponding to the end-to-end deep learning baseline models considered in the work. More specific implementation details can be found at the benchmark section of the PyTorch Geometric website. GCN - This model consists of graph convolutional layers proposed by Kipf et al. [18], followed by mean pooling, followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer.

GCNWithJK - This model is equal to GCN but with the addition of jump or skip connections before mean pooling as proposed by Xu et al. [43].

GIN - This model consists of the graph convolutional layers proposed by Xu et al. [42], followed by mean pooling, followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer. The convolution layer in question has a parameter which is learned.

GIN0 - This model is equal to GIN with the exception that the parameter is not learned and instead is set to a value of 0.

GINWithJK - This model is equal to GIN but with the addition of jump or skip connections before mean pooling as proposed by Xu et al. [43].

GIN0WithJK - This model is equal to GIN0 but with the addition of jump or skip connections before mean pooling as proposed by Xu et al. [43].

GraphSAGE - This model consists of the graph convolutional layers proposed by Hamilton et al. [11], followed by a mean pooling layer, followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer.

GraphSAGEWithJK - This model is equal to GraphSAGE but with the addition of jump or skip connections before mean pooling as proposed by Xu et al. [43].

DiffPool - This model consists of the graph convolutional layers proposed by Hamilton et al. [11], followed by the pooling method proposed by Ying et al. [46], followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer.

GlobalAttentionNet - This model consists of the graph convolutional layers proposed by Hamilton et al. [11], followed by the pooling layer proposed by Li et al. [23], followed by a dropout layer, followed by a non-linear layer, followed by a linear layer, followed by a softmax layer.

Set2SetNet - This model consists of the graph convolutional layers proposed by Hamilton et al. [11], followed by the pooling layer proposed by Vinyals et al. [38], followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer.

SortPool - This model consists of the graph convolutional layers proposed by Hamilton et al. [11], followed by the pooling layer proposed by Zhang et al. [49], followed by a non-linear layer, followed by a dropout layer, followed by linear layer, followed by a softmax layer.