Molecule Property Prediction and Classification with Graph Hypernetworks

2020·Arxiv

Abstract

Graph neural networks are currently leading the performance charts in learning-based molecule property prediction and classification. Computational chemistry has, therefore, become a prominent testbed for generic graph neural networks, as well as for specialized message passing methods. In this work, we demonstrate that replacing the underlying networks with hypernetworks leads to a boost in performance, obtaining state of the art results on various benchmarks.

A major difficulty in the application of hypernetworks is their lack of stability. We tackle this by combining the current message and the first message. A recent work has tackled the training instability of hypernetworks in the context of error correcting codes, by replacing the activation function of the message passing network with a low-order Taylor approximation of it. We demonstrate that our generic solution can replace this domain-specific solution.

I. INTRODUCTION

The field of learning-based prediction of molecule properties holds the promise of delivering accurate predictions at a fraction of the complexity that is required by the Density Functional Theory (DFT) models, while not being tied to the assumptions and approximations of this theory. This order of magnitude reduction in runtime supports not only the rapid screening of molecule banks, with important applications in medicine, manufacturing, and environmental science, but also the automatic design of new materials.

Molecules are often represented as graphs. Similarly to other application fields, such as computer vision, computational chemistry has benefited both from the development of powerful generic (graph) neural networks and from specialized methods designed for specific prediction tasks. For example, the state of the art NMP-Edge method of [1] is a sophisticated domain-specific method, with many significant algorithmic choices, which generalizes the method of [2] and incorporates ideas from the work of [3] and [4].

In this work, we propose a generic way to improve graph neural networks and demonstrate that it is able to improve molecule property prediction and classification in both specialized networks and in more generic methods that are applied to computational chemistry datasets. The value of our scheme stems from its ability to improve upon a diverse set of already optimized state of the art methods.

Our method employs hypernetworks, also known as dynamic networks. In such neural networks, the weights of at least some of the layers vary dynamically based on the input. A hypernetwork can be seen as a composite network in which one network predicts the weights of another network. In our case, both networks receive the messages that are passed in the graph neural network as inputs.
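To make the composite structure concrete, the following is a minimal NumPy sketch of a hypernetwork in which a network f predicts the weights of a second network g, and both receive the same input. The layer sizes, the random initialization, and the helper names (`mlp`, `unflatten`, `hyper_forward`) are illustrative assumptions and are not taken from our implementation.

```python
import numpy as np


def mlp(x, weights):
    """Bias-free MLP: tanh after every layer except the last."""
    for w in weights[:-1]:
        x = np.tanh(x @ w)
    return x @ weights[-1]


def unflatten(theta, shapes):
    """Cut a flat parameter vector into weight matrices of the given shapes."""
    mats, start = [], 0
    for rows, cols in shapes:
        mats.append(theta[start:start + rows * cols].reshape(rows, cols))
        start += rows * cols
    return mats


def hyper_forward(x, f_weights, g_shapes):
    """f predicts the weights of g from x; g is then applied to the same x."""
    theta = mlp(x, f_weights)               # output of the weight-generating network f
    g_weights = unflatten(theta, g_shapes)  # input-dependent weights of g
    return mlp(x, g_weights)


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
dim, hidden = 4, 8
g_shapes = [(dim, hidden), (hidden, dim)]
n_theta = sum(r * c for r, c in g_shapes)
f_weights = [rng.normal(0.0, 0.1, (dim, 16)), rng.normal(0.0, 0.1, (16, n_theta))]
out = hyper_forward(rng.normal(size=dim), f_weights, g_shapes)
```

Only the weights of f are trainable parameters in this sketch; the weights of g are recomputed for every input, which is what makes the weights "dynamic".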

Since the weights of the message generating network in the hypernetworks change dynamically during inference, training it is a challenge. We tackle this with a specific way of incorporating the incoming messages into the hypernetwork. Instead of passing the current message, we pass a linear combination of the current message and the first message. This simple modification is enough to ensure an improvement in performance. Without it, the hypernetwork would not typically outperform the original network.

Our experiments show that the same scheme is able to improve the predictions provided by three state of the art methods: the NMP-Edge network, the Invariant Graph Network of [5], and the Graph Isomorphism Network of [6]. In addition, we evaluate our method in the domain of error correcting codes, in which a very recent contribution by [7] employed hypernetworks to improve the accuracy of a message passing scheme. We are able to show that our modification of the input messages is able to replace the method used there to stabilize the network. Taking both methods together, the results further improve.

II. RELATED WORK

Graph Networks. The topic of graph neural networks has drawn considerable attention, both in the context of specific applications, such as text analysis [8] and computer vision [9], and as a generic tool. Earlier work employed recursive neural networks [10]–[13], where the information flows once in a network that is generated based on the graph. Most current methods, including graph spectra methods [14]–[16], can be cast as message passing algorithms [3], [4], [6], [13], [17]–[19].

The generic message passing methods employ three networks: one that pools the hidden states from the neighborhood graph vertices, another that updates the hidden states based on the aggregated representation of the neighbouring vertices, and one that reads the information from the entire graph in order to generate the final classification. In such a network, the messages are the hidden states of the nodes. In molecule prediction, conditioning the messages also on the receiving graph node and storing a hidden state for the linking edge improves performance [1], [3], [4].
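The three roles described above (pooling, update, readout) can be sketched in a few lines of NumPy. This is a generic illustration, not any of the specific architectures discussed in this paper: the sum aggregation, the single-layer tanh update, and all function and parameter names are assumptions made for the example.

```python
import numpy as np


def aggregate(h, adj):
    """Pool hidden states: row v is the sum of h[w] over the neighbours w of v."""
    return adj @ h


def update(h, m, w_self, w_msg):
    """New hidden state from the old state and the aggregated message."""
    return np.tanh(h @ w_self + m @ w_msg)


def readout(h, w_out):
    """Graph-level output: sum over all nodes, then a linear layer."""
    return h.sum(axis=0) @ w_out


def run(h, adj, w_self, w_msg, w_out, steps):
    """Apply `steps` rounds of message passing, then read out the graph."""
    for _ in range(steps):
        h = update(h, aggregate(h, adj), w_self, w_msg)
    return readout(h, w_out)
```

Because both the aggregation and the readout are sums, relabelling the nodes of the graph leaves the output unchanged.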

An alternative to message passing techniques is presented by permutation equivariant operators on the tensors that represent k-order interactions between graph nodes [5], [20]–[22]. We are able to demonstrate that applying our method to both message passing methods and equivariant operator methods improves performance.

Molecule property prediction. While in the past, feature engineering was the main route to applying machine learning in chemistry [23]–[27], neural networks have become increasingly popular. The NMP-Edge model of [1], described in Sec. III-A, is an example of a specialized model. It follows the basic architecture of [3], in which the messages are conditioned on the nodes across both sides of the edge and on the hidden representation of the edge. In the NMP-Edge model, however, similar to SchNet [2], the message is an elementwise product of a network that encodes the sending node and a network that encodes the edge (not encoding the receiving node directly). Also similar to SchNet, an RBF initialization and a soft-plus activation are used. Unlike SchNet, the edge embedding is updated over time, following the Weave network proposed by [4].

An example of a generic graph network solution that also excels on the popular QM9 benchmark [28], [29] is the Invariant Graph Network (IGN) of [5], presented in Sec. III-B.

Hypernetworks. Dynamic layers, also known as gating layers, are layers in which the weights are determined by a separate neural network. Such networks were introduced by [30], [31] for visual tasks that require an adaptation of the input image. More recently, the term hypernetworks was coined to refer to a composite neural network in which a network f is trained to predict the weights of another network g. The shift from specific layers to entire networks was presented by [32], who employed hypernetworks for video frames and stereo views prediction. The usage of hypernetworks for recurrent neural networks was presented by [33]. [34] have presented a Bayesian formulation of hypernetworks, and such networks have become prominent in meta-learning following [35], who studied transfer learning between multiple few-shot learning tasks.

Since the weights of network g are generated instantaneously by network f, [36] have used hypernetworks for searching over the space of possible network architectures. In this case, a lengthy backpropagation optimization is replaced by the feed forward prediction of network f. More related to our work is that of [37], who use hypernetworks on graphs, also in the domain of network architecture search. In this work, the weight generating network f is a graph network that operates on the graph that captures the generated architecture.

Another recent application of hypernetworks to graphs is the work of [7], where an MLP generates the weights of a message passing network that decodes error correcting codes. This generalizes earlier attempts in the domain of network decoders, including [38]–[43], and is shown to improve performance. The input to both the weight generating network f and the message generation network g is the incoming message, where for the first network, the absolute value is used. It is shown that training hypernetworks suffers from severe initialization challenges and would often lead to the explosion of the weights. [7], therefore, present a new activation function that is more stable than the arctanh activation typically used in message passing decoders. In our work, we employ conventional activations, and do not employ the absolute value for molecule prediction. We demonstrate that a combination of the initial message (from the first iteration) with the last message is an effective way to stabilize the training of the graph hypernetwork, and we do not employ dedicated activation functions.

III. GRAPH HYPERNETWORKS

We extend three leading architectures for graph neural networks that were either designed for the molecule inference task or shown to excel on it. In each case, we add a hypernetwork scheme in which the input is a linear combination of the first message passed in the network and the current message. As our experiments show, in all three cases, sizable gains in performance are obtained, in comparison to the underlying method. In addition, in order to compare ourselves with a recent hypernetwork message passing scheme, we modify the decoding method of [7].

A. Extending the NMP-Edge network by [1]

In order to describe how hypernetworks are applied to the NMP-Edge network, we rely on the original notation of [1]. Let $h_v^t$ be the hidden state of a node associated with a specific atom at iteration t, and $e_{vw}^t$ be the hidden state representation of an edge, which denotes either a chemical link between atoms or spatial proximity. The hidden states of the atoms are initialized using a look-up table and the hidden states of the edges are initialized using an RBF function with multiple scales, following [1], [2].

The message passing scheme of the original NMP-Edge network takes the form:

where $m_v^{t+1}$ are the messages aggregated at node v at time t, and $M_t$ and $S_t$ are the message and transition networks for iteration t. These networks are dynamic (vary between iterations) but are independent of the inputs. The earlier work by [19] uses a similar set of networks which do not change between the iterations.
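For reference, the two updates described here can be written as follows. This is a reconstruction in the notation of [1] and [3], not a verbatim copy of Eq. 1 and Eq. 2. Here $h_v^t$ denotes the hidden state of node v at iteration t, $e_{vw}^t$ the hidden state of the edge between v and w, $N(v)$ the neighbours of v, $M_t$ the message network and $S_t$ the state transition network. The argument list of $M_t$ is an assumption based on the description of NMP-Edge in Sec. II (sending node and edge only).

```latex
\begin{align}
m_v^{t+1} &= \sum_{w \in N(v)} M_t\left(h_w^t,\, e_{vw}^t\right) \\
h_v^{t+1} &= S_t\left(h_v^t,\, m_v^{t+1}\right)
\end{align}
```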

In our modified network, we replace the state transition function with the hypernetwork f and g as follows:

where c is a learned damping factor, which is clipped to be in the range [0,1] and is initialized with a uniform distribution, and the weights of network g are given by . For t = 0 , Eq. 3 becomes . Note that f is a fixed function. However, vary in time, since the set of weights change as the input to f changes.
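The damping mechanism can be illustrated with a small pure-Python sketch. It shows only the clipping of c to [0, 1], its uniform initialization, and the linear combination of the first and the current hidden state that is fed to the hypernetwork. Which of the two terms is weighted by c and which by 1 - c is an assumption of this sketch, and the function names are hypothetical.

```python
import random


def init_damping(seed=None):
    """Draw the initial damping factor from the uniform distribution on [0, 1]."""
    return random.Random(seed).uniform(0.0, 1.0)


def clip_unit(c):
    """Clip the damping factor to the range [0, 1]."""
    return min(1.0, max(0.0, c))


def damped_input(h_first, h_current, c):
    """Linear combination of the first and the current hidden state.

    Assumption: c weights the first state and (1 - c) the current one.
    """
    c = clip_unit(c)
    return [c * a + (1.0 - c) * b for a, b in zip(h_first, h_current)]
```

At the first iteration the current state equals the first state, so the combination reduces to the first state regardless of the value of c.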

The readout function is the same as in [1], which is a two-layer neural network that pools from all of the network atoms. The network , the readout network, and network of the original architecture employ a shifted-soft-plus activation, following [2], while f, g employ tanh. Bias terms are not used. The number of layers is two in both g and has four layers (we believe that the architecture of is locally optimal and adding layers to it did not improve the accuracy in our experiments).

B. Extending the Invariant Graph Network of [5]

We follow the original notations of IGN, in order to enable a quick reference to the original IGN work. The original model with d blocks has the form where h is an invariant layer [44], m is an MLP, and the blocks are defined as:

where denote the input tensor to the block, denotes element-wise multiplication, the square brackets denote concatenation along the last tensor dimension, and the three MLPs and are applied to each of the elements of the input tensor individually along the third tensor dimension. The dimension of the block’s output is, therefore, .

The modified IGN network has the form:

where are hyper blocks. Let be the input tensor to the hyper block . The hyper block performs the following computation:

where the damping factor c is a learned parameter, initialized from the uniform [0,1] distribution and clipped to remain in this range. The input tensors for f and g are an aggregation of and with the damping factor c. Each layer in g is applied to each feature of the input tensor independently along the third dimension. As before, f and g are neural networks with the tanh activation with f having four layers and g two.

We use the same suffix networks per benchmark as [5]. For QM9, h of Eq. 6 is an invariant max pooling, which is followed by an MLP m with three layers. For the classification datasets, h is an invariant max pooling layer (from every block output in the original work) followed by a single layer. These outputs are then summed to produce the network output.

C. Extending the Graph Isomorphism Network of [6]

We now turn to the notation used in [6] to introduce the modified GIN model. G(V, E) is the graph with node feature vector for vertices and edges E. In the graph classification problem, one is given a set of graphs with matching labels . The GNN model calculates representation vectors for each node v in an iterative manner. After convergence, a readout function calculates the global graph embedding , from which the label is predicted by an MLP M. In GIN, the readout takes the form of a summation followed by concatenation:

where K is the number of message passing iterations. The update function of the hidden node representation for iteration k is given by an MLP that is specific to this iteration:

where is either a learned parameter or a fixed scalar, depending on the experiment.
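As an illustration of the original (unmodified) GIN computation of [6], the sketch below implements the update, in which an MLP is applied to (1 + eps) times the node's own state plus the sum of its neighbours' states, and the readout, which sums the node states of each iteration and concatenates the sums. The two-layer ReLU MLP, the dense adjacency matrix, and the names `gin_layer` and `gin_readout` are choices made for this example only.

```python
import numpy as np


def gin_layer(h, adj, eps, w1, w2):
    """One GIN update: MLP((1 + eps) * h_v + sum of the neighbours' states)."""
    pooled = (1.0 + eps) * h + adj @ h
    return np.maximum(pooled @ w1, 0.0) @ w2  # two-layer MLP with ReLU


def gin_readout(layer_states):
    """Sum the node states of every iteration, then concatenate the sums."""
    return np.concatenate([h.sum(axis=0) for h in layer_states])


# Triangle graph with constant features and identity weights (toy values).
adj = np.ones((3, 3)) - np.eye(3)
h0 = np.ones((3, 2))
h1 = gin_layer(h0, adj, eps=0.0, w1=np.eye(2), w2=np.eye(2))
graph_embedding = gin_readout([h0, h1])
```

In this toy example every node has two neighbours, so each entry of `h1` is 1 + 2 = 3, and the graph embedding has one block per iteration.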

The modified GIN model we propose modifies the update step, without changing the final readout:

where is calculated from Eq. 10 with k = 0 as , where is the vector of input features of node v. c is a damping factor that is learned during training. f and g are neural networks with three and two hidden layers respectively with the tanh activation. Note that the network g changes between iterations and across nodes, depending on the input to f. However, the entire hypernetwork is fixed between the iterations.

D. Extending the Decoding Hypernetwork of [7]

Since [7] have proposed to extend an existing network using a hypernetwork, we modify their work in order to compare our way of converting a graph network to a hypernetwork with theirs. Specifically, it is reported that hypernetworks cannot train for the task of decoding error correcting codes, unless a dedicated activation function is used, since any other activation function attempted in their experiments leads to a divergence of the weights. [7] modify the belief propagation algorithm of [38], which is given by:

where is the log likelihood ratios of the input bits, is the computed edge message for the edge e = (c, v) in a Tanner graph, which is a bipartite graph that has variable nodes v on one side and check nodes c on the other. Let H be the parity check matrix. Each variable node is indexed by an edge e = (c, v) on the Tanner graph and N(v) = {(c, v)|H(c, v) = 1}, i.e., the set of all edges in which v participates. is the message from the previous iteration.
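The edge-based message passing on the Tanner graph can be sketched in pure Python as follows. This is the standard, unweighted sum-product belief propagation on which the neural decoder of [38] is built. It is not the learned decoder itself, it does not reproduce the exact form of Eq. 12 and Eq. 13, and the function names as well as the clamping constant are assumptions of this sketch.

```python
import math


def tanner_edges(H):
    """Edges (c, v) of the Tanner graph: one per nonzero entry H[c][v]."""
    return [(c, v) for c, row in enumerate(H) for v, bit in enumerate(row) if bit]


def variable_step(H, llr, x_prev):
    """Odd iteration: each edge (c, v) sums the LLR of v and the previous
    messages on all other edges that touch v."""
    edges = tanner_edges(H)
    return {
        (c, v): llr[v] + sum(x_prev[(c2, v2)] for c2, v2 in edges if v2 == v and c2 != c)
        for c, v in edges
    }


def check_step(H, x_prev):
    """Even iteration: 2 * arctanh of the product of tanh(x / 2) over all
    other edges that touch the same check node c."""
    edges = tanner_edges(H)
    out = {}
    for c, v in edges:
        prod = 1.0
        for c2, v2 in edges:
            if c2 == c and v2 != v:
                prod *= math.tanh(x_prev[(c2, v2)] / 2.0)
        # Keep the argument strictly inside (-1, 1), where arctanh is finite.
        prod = max(-1.0 + 1e-12, min(1.0 - 1e-12, prod))
        out[(c, v)] = 2.0 * math.atanh(prod)
    return out
```

The clamp before `math.atanh` is needed because arctanh is unbounded near ±1, which is the same property that makes this activation hard to train through.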

The hypernetwork model of [7] has the following update equations for odd j:

where N(v, c) is a vector that contains the elements of that correspond to the indices N(v) \ {(c, v)}. For even j the update equation takes the form:

where q is the degree of the Taylor approximation of arctanh.
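For intuition, a degree-q Taylor approximation of arctanh around zero keeps the odd powers x + x^3/3 + x^5/5 + ... up to degree q. The sketch below implements this truncated series. It illustrates the general idea only and makes no claim about the exact form used in Eq. 16 of [7]; the function name is hypothetical.

```python
import math


def taylor_arctanh(x, q):
    """Truncated Maclaurin series of arctanh, keeping odd powers up to degree q."""
    return sum(x ** k / k for k in range(1, q + 1, 2))


# Close to arctanh for small |x| ...
small_error = abs(taylor_arctanh(0.1, 5) - math.atanh(0.1))
# ... and finite at x = 1, where arctanh itself diverges.
value_at_one = taylor_arctanh(1.0, 3)
```

Unlike arctanh, the polynomial stays bounded on [-1, 1], which is consistent with the stabilizing role that the approximation plays in [7].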

We modified the model of [7] with the following update equations. For odd j:

where is the output of one iteration from Eq. 13, and c is the damping factor which is learned during training.

For an even j we either use Eq. 16 (Taylor approximated arctanh), or consider the conventional arctanh activation, as in Eq. 13. The readout function is not modified.

IV. EXPERIMENTS

We evaluate our model on regression and classification for predicting molecule properties and for decoding linear block codes. For regression we use the Quantum Machines 9 (QM9) dataset [28], [29] and the Open Quantum Materials Database (OQMD) [45], [46]. The QM9 dataset has 133,885 molecules. Each molecule has 12 properties to predict. When comparing our results to the NMP-Edge and IGN methods, we use the same train-validation-test split as [1] or [5], respectively. While based on the same dataset, these two benchmarks cannot be directly compared for many of the properties. The OQMD dataset contains 435,582 inorganic structures. We use the same train-validation-test split as [1].

For classification we employ the four bioinformatics datasets of [47], which contain protein structures or chemical compounds: MUTAG, PROTEINS, PTC and NCI1. We use the original train folds, which are also used by [5] and [6].

For decoding error correcting codes, we use the parity check matrices of [48]. We use three classes of linear block codes: Low Density Parity Check (LDPC) codes [49], Polar codes [50] and Bose-Chaudhuri-Hocquenghem (BCH) codes [51].

In addition to the methods that we modify, we compare with various baseline methods. For the QM9 and OQMD datasets we compare with V-RF [52], SchNet [2], enn-s2s [3], Cormorant [53], Incidence [54], 123-gnn [55] and the two methods by [56]: DTNN and MPNN. The baseline methods for the classification datasets are WL subtree [57], DCNN [58], PATCHY-SAN [59], DGCNN [60], AWL [61], GCN [16], and GraphSAGE [62].

Implementation details. The various hyperparameters were selected based on the validation set (where they vary between experiments) or set arbitrarily based on the underlying architecture (where fixed).

NMP-Edge network. For QM9, we trained the models with the ADAM optimizer, with a learning rate set to . The number of iterations was 4. The number of neurons in network f was 64 and the number of neurons in network g was 128. The learning rate decreases by a factor of 0.96 every 100,000 gradient steps. The minibatch size was 32. The node embedding size was 256 for all the parameters. For OQMD, we use an ADAM optimizer, with a learning rate set to . The learning rate decreases by a factor of 0.96 every 400,000 gradient steps. We use a minibatch of 32 examples. The number of iterations was 3. The number of neurons per layer in network f was 64 and that number in network g was 128.

Invariant graph network. For QM9, we use the same configuration as [5], except for the following hyperparameters. When training one model to predict all molecule parameters, f has four layers with 128 neurons, whereas g has two layers with 128 neurons. When we trained a separate model for each molecule parameter, which calls for a smaller capacity, f has four layers with 64 neurons and g has two layers with 64 neurons. For the classification datasets we trained the models with the following hyperparameters: the learning rate was for MUTAG, PTC and NCI1, and was for PROTEINS. The number of channels in blocks was 400 for all datasets except for PROTEINS, which has 128 channels. The number of layers in the MLP m was 2 for all datasets. For all datasets the number of neurons in network f was 64 and the number of neurons in network g was 64, except for PROTEINS, which has 32 neurons for f and g.

Graph isomorphism network. For the classification datasets, we use the same training procedure as [6], who train the model for 10 folds and choose the number of epochs based on the cross validation accuracy over the folds. We train the models with the following hyperparameters. Learning rate was , the models run for 5 iterations and the number of epochs was 180 for all datasets. The minibatch sizes were 512, 256, 32 and 16 for NCI1, PTC, MUTAG and PROTEINS, respectively. In all datasets, f has three layers and 64 neurons, g has two layers with 64 neurons each.

Decoding hyper-network. We use the same hyperparameters as in [7].
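The step-wise learning rate decay used for the NMP-Edge experiments (a factor of 0.96 every 100,000 or 400,000 gradient steps) can be written as a one-line schedule. The sketch assumes a staircase schedule, i.e., the rate is constant between decay points, which is one possible reading of the description above. The base learning rate is a placeholder argument, and the function name is hypothetical.

```python
def staircase_lr(base_lr, step, decay=0.96, every=100_000):
    """Learning rate after `step` gradient steps with staircase decay."""
    return base_lr * decay ** (step // every)


# With a placeholder base rate of 1.0, two decays have been applied by step 200,000.
lr_at_200k = staircase_lr(1.0, 200_000)
```

For the OQMD setting, the same function would be called with `every=400_000`.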

A. Results

a) Graph regression: The results for the QM9 dataset are reported in Tab. I for the NMP-Edge compatible splits and units and in Tab. III for the benchmark version used by IGN. Our model based on the NMP-Edge architecture achieves state of the art performance on 9 out of 12 parameters. It is outperformed by the original NMP-Edge model on only one parameter, and ties with it on another.

The result for formation energy prediction on OQMD, based on the NMP-Edge architecture, is provided in Tab. II. We obtain state of the art performance in this benchmark as well.

The QM9 model that is based on IGN obtains state of the art performance on 7 out of 12 parameters when training a model for each parameter. Furthermore, we improve the results of 9 out of 12 parameters when comparing to the IGN model that is trained for each parameter separately. When training one model to predict all the parameters, we improve 12 out of 12 parameters, compared to the IGN model.

b) Graph classification: The results for the classification datasets are provided in Tab. IV. As can be observed, our modified versions of IGN and GIN improve the baseline IGN and GIN models in almost all cases (in one case we tie). Note that the GIN model has many variants and our modification is based on the GIN-model. There is no Graph Neural Network model that outperforms our results, and for the PTC and PROTEINS datasets, our method outperforms all literature baselines.

c) Error Correcting Codes results: In Fig. 1 we provide the BER-SNR results for multiple linear block codes. Our method improves on [7] across codes. The improvement ranges between 0.09 dB and 0.12 dB for large SNR. Moreover, in all three cases, we are able to improve the baseline results, even without the Taylor approximation of [7]. Since [7] fail to train without the approximation (using arctanh their runs always diverge), this shows that our method stabilizes the training process for error correcting codes. We can also observe that our results with and without this approximation are almost identical.

B. Ablation Analysis

In Tab. V we provide an ablation analysis on the QM9 benchmark. For the NMP-Edge model, we can observe degradation of 11 out of 12 parameters when training without in Eq. 3 and the associated damping factor. Moreover, when training without the hypernetwork, but with (Eq. 2 becomes ) we get a degradation of 7 out of 12 parameters.

For the IGN model, we can observe degradation in 9 out of the 12 parameters when training without the damping factor and in Eq. 7, 8. Moreover, when training without the hypernetwork but with the added first message (become in Eq. 5), we get a degradation of 8 out of the 12 parameters.

V. CONCLUSIONS

Graph neural networks are becoming the dominant tool in molecule prediction and classification tasks. Here we show that by employing hypernetworks with a stabilization mechanism, significant performance gains are obtained. In order to demonstrate the advantage of our stabilizing mechanism over a recently proposed hypernetwork scheme, we also show improved performance in the field of decoding linear block codes.

REFERENCES

[1] P. B. Jørgensen, K. W. Jacobsen, and M. N. Schmidt, “Neural message passing with edge updates for predicting properties of molecules and materials,” arXiv preprint arXiv:1806.03146, 2018.

[2] K. Schütt, P.-J. Kindermans, H. E. S. Felix, S. Chmiela, A. Tkatchenko, and K.-R. Müller, “SchNet: A continuous-filter convolutional neural network for modeling quantum interactions,” in Advances in Neural Information Processing Systems, 2017, pp. 991–1001.

[3] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 1263–1272.

[4] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley, “Molecular graph convolutions: moving beyond fingerprints,” Journal of computer-aided molecular design, vol. 30, no. 8, pp. 595–608, 2016.

[5] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman, “Provably powerful graph networks,” arXiv preprint arXiv:1905.11136, 2019.

[6] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=ryGs6iA5Km

[7] E. Nachmani and L. Wolf, “Hyper-graph-network decoders for block codes,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.

[8] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1631–1642.

[9] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1219–1228.

[10] C. Goller and A. Kuchler, “Learning task-dependent distributed representations by backpropagation through structure,” in Proceedings of International Conference on Neural Networks (ICNN’96), vol. 1. IEEE, 1996, pp. 347–352.

[11] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., vol. 2. IEEE, 2005, pp. 729–734.

[12] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.

[13] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” in ICLR, 2016.

[14] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.

[15] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in neural information processing systems, 2016, pp. 3844–3852.

[16] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.

[17] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in neural information processing systems, 2015, pp. 2224–2232.

[18] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende et al., “Interaction networks for learning about objects, relations and physics,” in Advances in neural information processing systems, 2016, pp. 4502–4510.

[19] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,” Nature communications, vol. 8, p. 13890, 2017.

[20] R. Kondor, H. T. Son, H. Pan, B. Anderson, and S. Trivedi, “Covariant compositional networks for learning graphs,” arXiv preprint arXiv:1801.02144, 2018.

[21] H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman, “Invariant and equivariant graph networks,” arXiv preprint arXiv:1812.09902, 2018.

[22] R. L. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro, “Relational pooling for graph representations,” arXiv preprint arXiv:1903.02541, 2019.

[23] D. Rogers and M. Hahn, “Extended-connectivity fingerprints,” Journal of chemical information and modeling, vol. 50, no. 5, pp. 742–754, 2010.

[24] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. Von Lilienfeld, “Fast and accurate modeling of molecular atomization energies with machine learning,” Physical review letters, vol. 108, no. 5, p. 058301, 2012.

[25] G. Montavon, K. Hansen et al., “Learning invariant representations of molecules for atomization energy prediction,” in Advances in Neural Information Processing Systems, 2012, pp. 440–448.

[26] K. Hansen, F. Biegler et al., “Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space,” The journal of physical chemistry letters, vol. 6, no. 12, pp. 2326–2331, 2015.

[27] B. Huang and O. A. Von Lilienfeld, “Communication: Understanding molecular representations in machine learning: The role of uniqueness and target similarity,” 2016.

[28] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld, “Quantum chemistry structures and properties of 134 kilo molecules,” Scientific data, vol. 1, p. 140022, 2014.

[29] L. Ruddigkeit, R. Van Deursen, L. C. Blum, and J.-L. Reymond, “Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17,” Journal of chemical information and modeling, vol. 52, no. 11, pp. 2864–2875, 2012.

[30] B. Klein, L. Wolf, and Y. Afek, “A dynamic convolutional layer for short range weather prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4840–4848.

[31] G. Riegler, S. Schulter, M. Rüther, and H. Bischof, “Conditioned regression models for non-blind single image super-resolution,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 522–530.

[32] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, “Dynamic filter networks,” in Advances in Neural Information Processing Systems, 2016, pp. 667–675.

[33] D. Ha, A. Dai, and Q. V. Le, “Hypernetworks,” arXiv preprint arXiv:1609.09106, 2016.

[34] D. Krueger, C.-W. Huang, R. Islam, R. Turner, A. Lacoste, and A. Courville, “Bayesian hypernetworks,” arXiv preprint arXiv:1710.04759, 2017.

[35] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi, “Learning feed-forward one-shot learners,” in Advances in Neural Information Processing Systems, 2016, pp. 523–531.

[36] A. Brock, T. Lim, J. Ritchie, and N. Weston, “SMASH: One-shot model architecture search through hypernetworks,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=rydeCEhs-

[37] C. Zhang, M. Ren, and R. Urtasun, “Graph hypernetworks for neural architecture search,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=rkgW0oA9FX

[38] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using deep learning,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2016, pp. 341–346.

[39] H. Kim, Y. Jiang, S. Kannan, S. Oh, and P. Viswanath, “Deepcode: Feedback codes via deep learning,” in Advances in Neural Information Processing Systems (NIPS), 2018, pp. 9436–9446.

[40] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learning-based channel decoding,” in 2017 51st Annual Conference on Information Sciences and Systems (CISS). IEEE, 2017, pp. 1–6.

[41] C.-F. Teng, C.-C. Liao, C.-H. Chen, and A.-Y. A. Wu, “Polar feature based deep architectures for automatic modulation classification considering channel fading,” in 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2018, pp. 554–558.

[42] S. Cammerer, T. Gruber, J. Hoydis, and S. ten Brink, “Scaling deep learning-based decoding of polar codes via partitioning,” in GLOBECOM 2017-2017 IEEE Global Communications Conference. IEEE, 2017, pp. 1–6.

[43] B. Vasić, X. Xiao, and S. Lin, “Learning to decode ldpc codes with finite-alphabet message passing,” in 2018 Information Theory and Applications Workshop (ITA). IEEE, 2018, pp. 1–9.

[44] H. Maron, E. Fetaya, N. Segol, and Y. Lipman, “On the universality of invariant networks,” in International Conference on Machine Learning, 2019, pp. 4363–4371.

[45] J. E. Saal, S. Kirklin, M. Aykol, B. Meredig, and C. Wolverton, “Materials design and discovery with high-throughput density functional theory: the open quantum materials database (oqmd),” Jom, vol. 65, no. 11, pp. 1501–1509, 2013.

[46] S. Kirklin, J. E. Saal, B. Meredig, A. Thompson, J. W. Doak, M. Aykol, S. Rühl, and C. Wolverton, “The open quantum materials database (oqmd): assessing the accuracy of dft formation energies,” npj Computational Materials, vol. 1, p. 15010, 2015.

[47] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 1365–1374.

[48] M. Helmling, S. Scholl, F. Gensheimer, T. Dietz, K. Kraft, S. Ruzika, and N. Wehn, “Database of Channel Codes and ML Simulation Results,” www.uni-kl.de/channel-codes, 2019.

[49] R. Gallager, “Low-density parity-check codes,” IRE Transactions on information theory, vol. 8, no. 1, pp. 21–28, 1962.

[50] E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes,” in 2008 IEEE International Symposium on Information Theory. IEEE, 2008, pp. 1173–1177.

[51] R. C. Bose and D. K. Ray-Chaudhuri, “On a class of error correcting binary group codes,” Information and control, vol. 3, no. 1, pp. 68–79, 1960.

[52] L. Ward, R. Liu, A. Krishna, V. I. Hegde, A. Agrawal, A. Choudhary, and C. Wolverton, “Including crystal structure attributes in machine learning models of formation energies via voronoi tessellations,” Physical Review B, vol. 96, no. 2, p. 024104, 2017.

[53] B. Anderson, T.-S. Hy, and R. Kondor, “Cormorant: Covariant molecular neural networks,” arXiv preprint arXiv:1906.04015, 2019.

[54] M. Albooyeh, D. Bertolini, and S. Ravanbakhsh, “Incidence networks for geometric deep learning,” arXiv preprint arXiv:1905.11460, 2019.

[55] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe, “Weisfeiler and leman go neural: Higher-order graph neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4602–4609.

[56] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, “Moleculenet: a benchmark for molecular machine learning,” Chemical science, vol. 9, no. 2, pp. 513–530, 2018.

[57] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-lehman graph kernels,” Journal of Machine Learning Research, vol. 12, no. Sep, pp. 2539–2561, 2011.

[58] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.

[59] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in International conference on machine learning, 2016, pp. 2014–2023.

[60] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[61] S. Ivanov and E. Burnaev, “Anonymous walk embeddings,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10–15 Jul 2018, pp. 2186–2195. [Online]. Available: http://proceedings.mlr.press/v80/ivanov18a.html

[62] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.

TABLE I MEAN ABSOLUTE ERROR FOR QM9 MOLECULE PARAMETERS PREDICTION, USING THE NMP-EDGE SPLITS, UNITS AND OUR MODIFIED ARCHITECTURE. THE LOWEST ERROR IS IN BOLD. THE RESULTS OF THE INCIDENCE AND CORMORANT NETWORKS ARE FROM [54] AND [53], RESPECTIVELY. THE REST OF THE RESULTS ARE FROM [1].

TABLE II OQMD - NMP-EDGE. MEAN ABSOLUTE ERROR FOR FORMATION ENERGY PREDICTIONS. THE RESULTS OF THE VARIOUS BASELINES ARE FROM [1]

TABLE III MEAN ABSOLUTE ERROR FOR QM9 MOLECULE PARAMETERS PREDICTION, USING THE IGN SPLITS AND UNITS, AND OUR MODIFICATION OF THE IGN ARCHITECTURE. THE RESULTS OF THE VARIOUS DATASETS ARE TAKEN FROM [5]. FOR IGN AND OUR MODIFICATIONS, WE SHOW RESULTS OF A SINGLE NETWORK PREDICTING ALL VALUES AND OF DEDICATED NETWORKS.

TABLE IV TEST SET CLASSIFICATION ACCURACY (%) FOR THE MUTAG, PROTEINS, PTC, AND NCI1 DATASETS. THE HIGHEST RESULTS ARE IN BOLD. THE RESULTS OF IGN ARE FROM [5]; THE REST OF THE RESULTS ARE FROM [6].

TABLE V ABLATION ANALYSIS ON THE TWO QM9 DATASET VIEWS. MEAN ABSOLUTE ERROR IS REPORTED IN BOTH.

Fig. 1. BER for various values of SNR for various codes. (a) BCH (63,51), (b) POLAR(64,48), (c) LDPC ARRAY (121,80).