Embeddings and Representation Learning for Structured Data

2019·Arxiv

Abstract

Abstract

Performing machine learning on structured data is complicated by the fact that such data does not have vectorial form. Therefore, multiple approaches have emerged to construct vectorial representations of structured data, from kernel and distance approaches to recurrent, recursive, and convolutional neural networks. Recent years have seen heightened attention in this demanding field of research and several new approaches have emerged, such as metric learning on structured data, graph convolutional neural networks, and recurrent decoder networks for structured data. In this contribution, we provide an high-level overview of the state-of-the-art in representation learning and embeddings for structured data across a wide range of machine learning fields.

Traditional machine learning has mostly focused on the question of how to solve problems like classification or regression for fixed, manually engineered data representations [Bengio et al., 2013]. By contrast, representation learning focuses on the challenge of obtaining a vectorial representation in the first place, such that subsequent problems become easy to solve [Bengio et al., 2013]. Such an alternative view is particularly helpful for processing structured data, i.e. sequences, trees, and graphs, where vectorial representations are not immediately available [Hamilton et al., 2017b].

A wide range of machine learning fields has attempted to construct such vectorial representations for structured data. In this contribution, we provide a high-level overview of these approaches, highlighting shared foundations and properties. Thus, we hope to provide readers with a rich toolbox to handle structured data and sufficient context knowledge to select the fitting method for any given situation.

We begin by introducing key concepts of structured data and vectorial representations before we dive into the various approaches which have been proposed to achieve such representations. This paper concludes with an overview of the contributions in this special session.

1 Background

In this section, we introduce terms that are shared among all methods for representation learning and embeddings for structured data. We begin by defining structured data itself. In particular, we define a graph as a triple is a finite set of nodes, is a finite set of edges, and is a mapping which assigns some vectorial label to each node

We call a node v a parent of another node . Conversely, we call u a child of v. We denote the set of all parents of , the set of all children of , and we define the

We call a graph a tree if exactly one node exists which has no parents (which we call the root) and every other node has exactly one parent. We call all nodes without children leaves. As a notational shorthand, we also denote trees in a recursive fashion. In particular, if is the root of a tree and are its children, then the recursive tree notation is is the recursive tree notation for the subtree rooted at

graph . As a notational shorthand, we denote a sequence as

call a mapping -dimensional representation of x.

2 Approaches for Embeddings of Structured Data

Approaches for embeddings of structured data can be distinguished along multiple axes. For example, we can distinguish according to the kind of structured data that a method can process - sequences, trees, or full graphs -, whether it generates an implicit or an explicit representation, whether it generates fixed or learned representations, whether nodes or entire structures are embedded, and whether decoding is possible or not.

We begin our list with kernels and distances, which compute pairwise measures of proximity between structured data based on a pre-defined and fixed algorithm that implicitly corresponds to a vectorial representation. By contrast, neural networks learn explicit vectorial representations which are learned from data. In particular, recurrent neural networks are designed to process sequential data, recursive neural networks for tree-structured data, and graph convolutional neural networks for nodes of general graphs. However, there also exist extensions to process nodes of graphs via recurrent neural networks or entire graphs via recursive neural networks.

Note that all of these methods are initially limited to encoding a structured datum into a vectorial representation and can not decode a vector back into structured data. Our final section covers recent approaches to perform such decodings. A quick overview of methods surveyed in this paper is in Table 1.

Kernels: We call a mapping over some set X a kernel iff an embedding exists (for possibly infinite n), such that for all . Therefore, every kernel on structured data relies - implicitly or explicitly - on an embedding for structured data. In the past decade, a diverse range of structure kernels have emerged, but the conceptual basis is typically the same. A structure kernel defines a class of m (possibly infinite) characteristic substructures and defines the embedding as the sum of the embeddings for all these substructures. More precisely, if is a mapping of structured data to histograms over the selected substructures and f : is an embedding for the pre-defined set of substructures, then the overall embedding is given as . Examples of such substructures include string n-grams, substrings, random walks, shortest paths, or subtrees [Lodhi et al., 2002, Aiolli et al., 2015, Borgwardt et al., 2007, Shervashidze et al., 2011]. Most recently, kernels

Table 1: An overview of the approaches surveyed in this paper. Methods are sorted into columns according to the kind of structured data they process - either sequences, trees, graphs, or nodes within graphs. Each block in the table marks a different class of method, either kernels, distances, or neural networks. In kernels and distances, three rows distinguish between fixed representations, learned representations, and meta-representations built on pre-existing representations. For neural networks, we distinguish between networks that focus on encoding and networks which decode. Contributions of this special session are highlighted via bold print.

have also been constructed based on reservoir activations for nodes [Bacciu et al., 2016] or learned substructures, such as Markov Model hidden states for nodes [Bacciu et al., 2018b]. The embedding f for the substructures can be as simple as mapping the ith substructure to the ith unit vector, i.e. just counts the substructures. Multiple works are specifically devoted to constructing node-specific kernels based on the graph Laplacian or comparing node neighborhoods [Kondor and Pan, 2016, Navarin and Sperduti, 2017].

Note that, in most cases, it is infeasible to explicitly compute histograms over the substructures due to large or infinite m. Therefore, most kernels are computed directly as , e.g. via some dynamic programming scheme [Shervashidze et al., 2011]. As such, the embedding remains implicit and can not be directly exploited for subsequent learning.

Instead of summing up embeddings of base kernels, one can also concatenate such embeddings, which is the basis for multiple kernel learning (MKL). Given a set of base embeddings for graphs , MKL learns factors for these embeddings and defines the overall , such that the resulting learned kernel is given as [Gönen and Alpaydın, 2011, Aiolli and Donini, 2015].

Distances: Distance measures on structured data typically quantify distance in terms of effort that is needed to transform one datum into another by means of discrete edit operations such as node deletions, insertions, or replacements. This framework includes measures like the string edit distance, dynamic time warping, alignment distances, or the tree edit distance [Levenshtein, 1965, Vintsyuk, 1968, Gotoh, 1982, Zhang and Shasha, 1989]. These measures are all non-negative, self-equal, and symmetric, but do not necessarily conform to the triangular inequality and are thus not necessarily proper metrics [Nebel et al., 2017]. As with kernels, distances are related to embeddings but are typically computed directly via dynamic programming. In particular, it can be shown that for any self-equal and symmetric function d there exist two embeddings , such that for all [Pekalska and Duin, 2005]. We can make this embedding explicit by dimensionality reduction methods such as multi-dimensional scaling or t-SNE [Sammon, 1969, van der Maaten and Hinton, 2008].

It is worth noting that edit distances can be learned in a supervised fashion by manipulating the costs of single edits to facilitate classification [Bellet et al., 2014, Paaßen et al., 2018].

Recurrent Neural Networks: A recurrent neural network (RNN) maps sequential data to a representation by means of the recursive equation

where f is some mapping is typically defined as the zero vector (also refer to Figure 1, left).

In recurrent neural networks, the function f is a neural network layer, e.g. a classic sigmoid layer of the form are weight matrices and is a sigmoid function such as the tanh function. A challenge in learning such networks are vanishing or exploding gradients over time, which can be addressed, for instance, by the following strategies. First, one can decide to adapt neither either U nor W but to initialize them in a randomized or deterministic fashion [Jaeger and Haas, 2004, Rodan and Tiňo, 2012], also in a deep setting [Gallicchio et al., 2017]. Second, one can replace a standard sigmoid layer with a gated architecture that can ignore irrelevant

Figure 1: An illustration of the three encoder networks presented in the paper, namely a recurrent network which encodes the sequence a, b, c via Equation 1 (left), a recursive network that encodes the tree c(a, b) via Equation 2 (center), and a two-layer graph convolutional neural network that encodes the nodes of the graph G = ({a, b, c}, {(a, b), (a, c), (b, c)}) via Equation 3 (right). Function applications/neurons are indicated by circles. The computational flow is indicated by arrows.

parts of the sequence and thus maintain memory over longer time without being unstable [Hochreiter and Schmidhuber, 1997, Cho et al., 2014, Greff et al., 2017]. For example, gated recurrent units [Cho et al., 2014] define the recurrent function denotes the element-wise product and where are so-called gates, computed via standard sigmoid layers as above.

The objective function of recurrent neural networks is typically to map the input sequence to an output sequence by means of an output sigmoid layer such that . However, recurrent neural networks can also be trained to auto-encode sequences or to decode to other sequences [Sutskever et al., 2014].

Note that RNNs can also be applied to embed nodes in a graph by considering the embedding as part of a state vector in a RNN [Scarselli et al., 2009, Gallicchio and Micheli, 2010, Li et al., 2016]. More precisely, let denote the embedding of node v at time t. Then, we can construct the node embedding at time t + 1 via the equation or, similarly, via the equation for some mapping In both cases, this collapses to a recurrent neural network according to Equation 1 if we consider the concatenation of all node labels as the input for each time step t and the concatenation of all embeddings as state vector at time t. The training of this network is typically guided by some supervised objective as in regular RNNs [Scarselli et al., 2009, Gallicchio and Micheli, 2010, Li et al., 2016].

As [Scarselli et al., 2009] have shown, letting this network run is guaranteed to converge to a fix point if is a contractive map. In other words, one can let the network run for a sufficiently large time and then use the resulting embedding at that time as an approximation of the fix point and thus as an embedding of the nodes [Li et al., 2016].

Recursive Neural Networks: Recursive neural networks are an extension of recurrent neural networks for tree structured data. Given a tree , a recursive neural

network is defined by the recursive equation

where is typically a sigmoid layer (also refer to Figure 1, center) [Sperduti and Starita, 1997, Hammer, 2002, Gallicchio and Micheli, 2013]. In other words, the encoding starts with the leaves of the tree and then processes the tree bottom-up until an embedding for the entire tree is obtained at the root. Extensions of recursive neural networks to the treatment of directed postional acyclic graphs (DPAGs) have been introduced in [Micheli et al., 2004, Hammer et al., 2005].

Note that the construction of by be challenging of the children have no clear positional order or if the number of children is not consistent among nodes. Such problems can be addressed by using order-invariant operators like sum or product to aggregate child embeddings, to fill missing children with special tokens like zero vectors, to normalize the trees to binary structure, or to learn specific functions for different kinds of nodes with different number of children.

Graph Convolutional Neural Networks: Graph Convolutional Neural Networks (GCNs) generate embeddings of nodes in graphs similarly to RNNs but via a layered feedforward architecture. In particular, let denote the embedding of node v in layer l of the network, where . Then, the embedding in layer l + 1 is obtained via the equation

where is a sigmoid layer and is some connectivity factor depending on the graph structure (also refer to Figure 1, right) [Kipf and Welling, 2017]. Note that this equation differs from the RNN equation of [Scarselli et al., 2009, Gallicchio and Micheli, 2010, Li et al., 2016] in that we use different parameters for each layer. Also note that the embedding in the lth layer integrates information from nodes up to distance l in the graph.

The idea to treat the mutual dependencies (graph cycles) through different neural network layers and to extend the nodes embedding by composition of the information of previous layers was originally introduced (and formally proved) in the context of constructive approaches [Micheli, 2009]. Therein, the concept/terminology of visiting (the nodes) of the graphs corresponds to the terminology of convolution over (the nodes) of the graphs used in GCN. Indeed, the main differences between the model in [Micheli, 2009] (NN4G) and GCN are related to the use of an incremental construction of the deep NN for NN4G instead of the end-to-end training of GCN (with advantage for NN4G in terms of divide et impera automatic design and layer by layer learning). A recent model proposal using the costruction of NN4G in the context of generative models is in [Bacciu et al., 2018a].

As with RNNs, GCNs are trained in a supervised fashion where the last layer of a GCN is considered as the output of the network. Further, we can train GCNs semi-supervised by augmenting the supervised loss with a term that forces neighboring nodes to have similar encodings [Kipf and Welling, 2017]. While vanilla GCNs are limited to graphs for which the structure is known a priori, multiple authors have recently extended GCNs to unknown structure, either by normalizing the neighborhood or by using attention mechanisms [Hamilton et al., 2017a, Veličković et al., 2018].

Importantly, the embeddings of GCNs can also be aggregated to achieve an embedding for the entire graph by iteratively clustering nodes to coarser structures and aggregating the embeddings inside structures by a pooling network [Ying et al., 2018].

Decoder Networks: Decoding vectorial representations back into structured data poses a significant challenge as decoding trees or full graphs is provably harder compared to encoding them [Hammer, 2002]. Therefore, present decoding approaches focus on decoding sequential data via recurrent neural networks.

Most prominently, sequence-to-sequence (seq2seq) learning first encodes a sequence as a vector via a recurrent neural and then applies a second recurrent neural network which decodes the sequence step by step until it returns a special end-of-sentence token [Sutskever et al., 2014].

Given the success of this scheme for hard machine learning tasks such as machine translation or caption generation [Sutskever et al., 2014, Xu et al., 2015], researchers have also attempted to apply it to trees or graphs by encoding these structures as sequences. In particular, we can re-write graphs as a sequences of nodes and edges if we impose a order on the graph’s nodes, e.g. via breadth-first-search [Liu et al., 2018, You et al., 2018]. After this re-representation, we can train a recurrent neural network to generate one edge at a time and thus reconstruct the entire graph until an end-of-sentence token is generated [Liu et al., 2018, You et al., 2018].

Alternatively, we can exploit grammatical knowledge about the domain. If our data can be described by a context-free grammar (as in the case of chemical molecules or computer programs), generating a structured datum reduces to a sequence of grammar rule applications. Therefore, we can train a recurrent neural network which outputs the current sequence of grammar rules to decode a given datum, which will then also be guaranteed to be syntactically correct [Alvarez-Melis and Jaakkola, 2017, Jin et al., 2018, Kusner et al., 2017]. Indeed, grammatical structures can even be used to impose semantic constraints, such as chemical bond properties [Dai et al., 2018].

3 Special Session Contributions

The contributions in this special session cover a wide range of approaches for representation learning and embeddings for structured data, namely two kernel approaches [Kriege, 2019, Navarin et al., 2019], one distance approach [Bakker and Bunte, 2019], one sequence encoding approach via recurrent neural networks [Mirus et al., 2019], one graph decoding approach via recurrent neural networks [Bacciu et al., 2019], and an extension of non-negative matrix factorization to uncover structure in high-dimensional vectorial data [Hautecoeur and Glineur, 2019] (also refer to Table 1).

In more detail, [Kriege, 2019] presents a variation of the Weisfeiler-Lehman graph kernel [Shervashidze et al., 2011] which combines the concept of optimal assignments with multiple kernel learning [Aiolli and Donini, 2015]. In particular, they define the kernel between two graphs is the set of all possible bijections between the nodes of is a weighted Weisfeiler-Lehman base kernel over nodes. Recall that the Weisfeiler-Lehman kernel counts subtree patterns in the neighborhood of a node. The kernel variation employed by [Kriege, 2019] applies weights to these subtree patterns and then optimizes these weights via multiple kernel learning.

[Navarin et al., 2019] propose a scheme which generates a more expressive feature space from a base node kernel by means of a sum of outer products. In particular, the encoding is defined as is the embedding of the base node kernel and -hop neighborhood of node v. They also consider a version of this kernel where features are selected based on their discriminative value in a linear classifier.

[Bakker and Bunte, 2019] perform metric learning to weigh text features obtained via tfidf and latent semantic analysis and obtain an explicit, low-dimensional embedding via t-SNE

[van der Maaten and Hinton, 2008].

[Mirus et al., 2019] encode a snapshot of a driving scene incorporating a variable number of vehicles via the semantic pointer architecture [Eliasmith, 2013] and encode a sequence of such snapshots via long short-term memory networks [Hochreiter and Schmidhuber, 1997, Greff et al., 2017] in order to predict the future movement of a single vehicle.

With the the aim is to provide an adpative approach to graph generation from arbitrary distributions, [Bacciu et al., 2019] first represent graphs as sequences of edges, where the edges are ordered according to their starting node, and auto-encode graphs as vectors by means of a sequence-to-sequence [Sutskever et al., 2014] network consisting of two gated recurrent units [Cho et al., 2014], where the former encodes an edge sequence as a vector and the second decodes the edge sequence from the code vector.

Finally, [Hautecoeur and Glineur, 2019] proposes a variation of hierarchical alternating least squares (HALS) to infer non-negative polynomial signals which best explain a given data set of sequences in the sense that the Euclidean distance between the observed sequences and the sequences produced by a linear combination of non-negative polynomials on the same interval is as small as possible.

Overall, the contributions in this special session push the boundaries of embeddings for structured data forward across a wide range of approaches. This reflects the more intense recent interest in such embeddings in the research community overall and gives hope for further progress in the future.

References

Fabio Aiolli and Michele Donini. EasyMKL: a scalable multiple kernel learning algorithm. Neurocomputing, 169:215–224, 2015. doi:10.1016/j.neucom.2014.11.078.

Fabio Aiolli, Giovanni Da San Martino, and Alessandro Sperduti. An efficient topological distance-based tree kernel. IEEE Transactions on Neural Networks and Learning Systems, 26(5):1115–1120, 2015. doi:10.1109/TNNLS.2014.2329331.

David Alvarez-Melis and Tommi Jaakkola. Tree-structured decoding with doubly-recurrent neural networks. In Proc. ICLR, 2017. URL https://openreview.net/forum?id=HkYhZDqxg.

Davide Bacciu, Claudio Gallicchio, and Alessio Micheli. A reservoir activation kernel for trees. In Proc. ESANN, pages 29–34, 2016. URL http://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2016-172.pdf.

Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep and generative approach to graph processing. In Proc. ICML, pages 294–303, 2018a. URL http://proceedings.mlr.press/v80/bacciu18a.html.

Davide Bacciu, Alessio Micheli, and Alessandro Sperduti. Generative kernels for tree-structured data. IEEE Transactions on Neural Networks and Learning Systems, 29(10): 4932–4946, 2018b. doi:10.1109/TNNLS.2017.2785292.

Davide Bacciu, Alessio Micheli, and Marco Podda. Graph generation by sequential edge prediction. In Proc. ESANN, 2019. Special Session Contribution.

Jelle Bakker and Kerstin Bunte. Efficient learning of email similarities for customer support. In Proc. ESANN, 2019. Special Session Contribution.

Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. arXiv, abs/1306.6709, 2014.

Yoshia Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798–1828, 2013. doi:10.1109/TPAMI.2013.50.

Karsten M. Borgwardt, Nicol N. Schraudolph, and S.v.n. Vishwanathan. Fast computation of graph kernels. In Proc. NIPS, pages 1449–1456, 2007. URL http://papers.nips.cc/paper/2973-fast-computation-of-graph-kernels.pdf.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. EMNLP, pages 1724–1734, 2014. URL http://arxiv.org/abs/1406.1078.

Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntaxdirected variational autoencoder for structured data. In Proc. ICLR, 2018. URL https://openreview.net/forum?id=SyqShMZRb.

Chris Eliasmith. How to build a brain: A neural architecture for biological cognition. Oxford University Press, 2013.

Claudio Gallicchio and Alessio Micheli. Graph echo state networks. In Proc. IJCNN, pages 1–8, 2010. doi:10.1109/IJCNN.2010.5596796.

Claudio Gallicchio and Alessio Micheli. Tree echo state networks. Neurocomputing, 101:319 – 337, 2013. doi:10.1016/j.neucom.2012.08.017.

Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. Deep reservoir computing: A critical experimental analysis. Neurocomputing, 268:87–99, 2017.

Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211–2268, 2011. URL http://www.jmlr.org/papers/v12/gonen11a.html.

Osamu Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3):705–708, 1982. doi:10.1016/0022-2836(82)90398-9.

K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28 (10):2222–2232, 2017. doi:10.1109/TNNLS.2016.2582924.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proc. NIPS 2017, pages 1024–1034, 2017a. URL http://papers.nips.cc/paper/6703-inductive-representation-learning-on-large-graphs.

William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Meth- ods and applications. Bulletin of the Technical Committee on Data Engineering, 40(3): 52–74, 2017b. URL http://sites.computer.org/debull/A17sept/p52.pdf.

Barbara Hammer. Recurrent networks for structured data - A unifying approach and its properties. Cognitive Systems Research, 3(2):145 – 165, 2002. doi:10.1016/S1389-0417(01)00056-0.

Barbara Hammer, Alessio Micheli, and Alessandro Sperduti. Universal approximation capa- bility of cascade correlation for structures. Neural Computation, 17(5):1109–1159, 2005. doi:10.1162/0899766053491878.

Cécile Hautecoeur and François Glineur. Nonnegative matrix factorization with polynomial signals via hierarchical alternating least squares. In Proc. ESANN, 2019. Special Session Contribution.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9 (8):1735–1780, 1997. doi:10.1162/neco.1997.9.8.1735.

Babak Hosseini and Barbara Hammer. Efficient metric learning for the analysis of motion data. In Proc. DSAA, pages 1–10, 2015. doi:10.1109/DSAA.2015.7344819.

Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004. doi:10.1126/science.1091277.

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In Proc. ICML, pages 2323–2332, 2018. URL http://proceedings.mlr.press/v80/jin18a.html.

Thomas Kipf and Max Welling. Semi-supervised classification with graph convolutional net- works. In Proc. ICLR, 2017. URL https://openreview.net/forum?id=SJU4ayYgl.

Risi Kondor and Horace Pan. The multiscale laplacian graph kernel. In Proc. NIPS, pages 2990–2998, 2016. URL http://papers.nips.cc/paper/6135-the-multiscale-laplacian-graph-kernel.

Nils Kriege. Deep weisfeiler-lehman assignment kernels via multiple kernel learning. In Proc. ESANN, 2019. Special Session Contribution.

Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In Proc. ICML, pages 1945–1954, 2017. URL http://proceedings.mlr.press/v70/kusner17a.html.

Vladimir Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710, 1965.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. In Proc. ICLR, 2016. URL http://arxiv.org/abs/1511.05493.

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2(Feb):419–444, 2002. URL http://www.jmlr.org/papers/v2/lodhi02a.html.

A. Micheli, D. Sona, and Alessandro Sperduti. Contextual processing of structured data by recursive cascade correlation. IEEE Transactions on Neural Networks, 15(6):1396–1410, Nov 2004. doi:10.1109/TNN.2004.837783.

Alessio Micheli. Neural network for graphs: A contextual constructive approach. IEEE Transactions on Neural Networks, 20(3):498–511, 2009. doi:10.1109/TNN.2008.2010350.

Florian Mirus, Peter Blouw, Terrence Stewart, and Jörg Conradt. Predicting vehicle behaviour using lstms and a vector power representation for spatial positions. In Proc. ESANN, 2019. Special Session Contribution.

Nicolò Navarin and Alessandro Sperduti. Approximated neighbours MinHash graph node kernel. In Michel Verleysen, editor, Proc. ESANN, pages 281–286, 2017. URL http://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2017-134.pdf.

Nicolò Navarin, Dinh van Tran, and Alessandro Sperduti. On the definition of complex structured feature spaces. In Proc. ESANN, 2019. Special Session Contribution.

David Nebel, Marika Kaden, Andrea Villmann, and Thomas Villmann. Types of (dis-)similarities and adaptive mixtures thereof for improved classification learning. Neurocomputing, 268:42–54, 2017. doi:10.1016/j.neucom.2016.12.091.

Benjamin Paaßen, Claudio Gallicchio, Alessio Micheli, and Barbara Hammer. Tree Edit Distance Learning via Adaptive Symbol Embeddings. In Proc. ICML, pages 3973–3982, 2018. URL http://proceedings.mlr.press/v80/paassen18a.html.

Benjamin Paaßen, Claudio Gallicchio, Alessio Micheli, and Alessandro Sperduti. Embeddings and representation learning for structured data. In Michel Verleysen, editor, Proceedings of the 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2019), pages 85–94, 2019.

Elzbieta Pekalska and Robert Duin. The Dissimilarity Representation for Pattern Recognition. World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2005. ISBN 9812565302.

Ali Rodan and Peter Tiňo. Simple deterministically constructed cycle reservoirs with regular jumps. Neural Computation, 24(7):1822–1852, 2012. doi:10.1162/NECO_a_00297.

John W. Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 18(5):401–409, 1969. doi:10.1109/T-C.1969.222678.

F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009. doi:10.1109/TNN.2008.2005605.

Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12 (Sep):2539–2561, 2011. URL http://www.jmlr.org/papers/v12/shervashidze11a.html.

Martin Simonovsky and Nikos Komodakis. Towards variational generation of small graphs. In Proc. ICLR, 2018. URL https://openreview.net/forum?id=rJY7hoi8z.

Alessandro Sperduti and A. Starita. Supervised neural networks for the classification of struc- tures. IEEE Transactions on Neural Networks, 8(3):714–735, 1997. doi:10.1109/72.572108.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Proc. NIPS, pages 3104–3112, 2014. URL http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using tSNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In Proc. ICLR, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.

T. K. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4(1):52–57, 1968. doi:10.1007/BF01074755.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudi- nov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proc. ICML, pages 2048–2057, 2015. URL http://proceedings.mlr.press/v37/xuc15.html.

Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In Proc. ICML, pages 5708– 5717, 2018. URL http://proceedings.mlr.press/v80/you18a.html.

Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989. doi:10.1137/0218082.

designed for accessibility and to further open science