CoulGAT: An Experiment on Interpretability of Graph Attention Networks

2019·Arxiv

Abstract

We present an attention mechanism inspired by the definition of the screened Coulomb potential. This attention mechanism was used to interpret the Graph Attention (GAT) model layers and training dataset by using a flexible and scalable framework (CoulGAT) developed for this purpose. Using CoulGAT, a forest of plain and resnet models was trained and characterized using this attention mechanism on the CHAMPS dataset. The learnable variables of the attention mechanism are used to extract node–node and node–feature interactions to define an empirical standard model for the graph structure and hidden layer. This representation of the graph and hidden layers can be used as a tool to compare different models, optimize hidden layers and extract a compact definition of the graph structure of the dataset.

1 Introduction

Highly irregular data structured as graphs are often encountered in social networks, e-commerce, natural language processing, knowledge databases (e.g., citation networks), quantum chemistry and molecular biology. As the relationships between nodes in a graph become more complex and irregular in size, machine learning methods must adapt to handle these relationships more efficiently, because conventional definitions of convolutional and recurrent networks do not carry over directly from grid-based images or simple sequences.

Early neural networks reported to tackle graph data applied a recurrent network approach to a graph node’s neighbors. These recurrent networks exchange information to update nodes’ states in a neighborhood until the states reach equilibrium, that is, until a convergence condition is satisfied [1, 2, 3]. This approach was extended in [4] to use Gated Recurrent Units, allowing unrolling in time for a fixed number of steps and the use of backpropagation. In [5], a recurrent model scalable to large graphs was explored by employing a stochastic learning algorithm using embedding representations to achieve steady state.

Convolutional networks were also successfully applied to learn from graph data. These networks learn by taking advantage of local translational invariance of data [6] as compared to recurrent networks which learn causal relationships among nodes. In [6], it was shown that convolution operator for graph data can be defined by using spectral properties through graph Laplacian or by exploiting the locality spatially in a graph’s neighborhood. Spectral models are domain dependent and linear transformations to derive the graph Laplacian are costly. Complexity of spectral models was reduced by using localized filters and pooling in [7]. The spectral model was improved in efficiency by using recursive Chebyshev polynomials as localized filters in [8]. This was followed by a simpler and scalable model using first order approximations of spectral convolutions in a 1-node neighborhood [9]. Spatial models, on the other hand, learn by performing convolutions on their local neighborhood and do not rely on a well defined convolution operator derived from a graph Laplacian. An early implementation of efficient spatial graph convolution was presented in [10]. In [11], a diffusion-convolution operation was defined to pass information among nodes in a probabilistic manner. Message Passing Neural Network defined a general framework for spatial convolution models in the context of quantum chemistry [12]. GraphSAGE [13] used an aggregator to generate node embeddings in an inductive manner for large graphs. Graph Attention Networks [14] showed that attention is a computationally efficient mechanism that can also improve the capacity and interpretability of the model on both transductive and inductive learning tasks.

Remarkable developments in graph neural networks outlined above made these models an efficient alternative to simulations in quantum chemistry using Density Functional Theory (DFT) [12, 15, 16]. DFT is very successful in predicting chemical properties of materials, and its results are highly interpretable because it solves the Kohn-Sham equations using energy functionals for the system. However, its computational cost grows steeply with system size (roughly cubically in the number of electrons for standard Kohn-Sham solvers), making it challenging to apply to larger systems, and the choice of a good exchange-correlation functional can be difficult. Graph neural networks can reduce this complexity and skip solving the Kohn-Sham equations completely. Graph Attention Networks developed in [14] offer an opportunity to maintain interpretability because of their flexibility in the choice of attention mechanism.

In this study, we present a scalable graph attention model framework to characterize an attention mechanism based on a screened exponential Coulomb potential. The framework defines building blocks for vanilla plain and resnet graph convolutional layers. Our main contribution is an attention model whose learned parameters can be used to interpret the coupling strength and interaction range for any two nodes present in a graph structure. This interpretation is motivated by the appearance of the Coulomb potential in various contexts, from describing fundamental forces of nature, as in the Yukawa potential [17], to electron-electron and electron-nuclei interactions in the Kohn-Sham equations [18]. We use plain and resnet variations of this model on the CHAMPS dataset for predicting scalar coupling constants [19]. We show that our framework can scale up to a very large number of graph layers. The scalability and simplicity of the models presented here are comparable to the depth of Convolutional Neural Networks using skip connections for image classification [20, 21].

In the following section, we first go over the attention network structure implemented in the model. We define how attention is handled within a single layer and how propagation to the next layer is managed. In section 3, we go over data preparation. In section 4, we use the basic building blocks of this network to define shallow graph models and their variations. We finish that section by describing deeper graph models developed using residual connections. Evaluation of these models is presented in section 5.

2 Model Architecture

Our model is a graph attention network that takes as input a weighted graph G = (X, A), where X is the feature matrix for F features and N nodes and A is a non-negative weighted adjacency matrix. The general formalism closely follows the graph attention network described in [14]. The main differences are in the way attention is obtained and in the adjacency matrix playing a more crucial role than being only a masking mechanism.

2.1 Attentional Hidden Layer

The attentional hidden layer takes as input the hidden representation from layer l, where l is the layer number. First, a linear transformation is applied to the input of this layer using a weight matrix and a bias term b.

This weight matrix operates on all nodes in the input to the layer. The other input for the graph, the weighted adjacency matrix A, is used to calculate attention in all layers. The weighting in A is determined by the inverse squared distance between nodes i and j. The entries for a self edge and for no edge are set to very high and very low values, respectively. For this study, the following formulation for A was defined:

The adjacency matrix defines the inverse distance dependence of the Coulomb attention model and is integral to the definition of attention mechanism. Instead of relying only on attention to define the weights between neighboring nodes, we provide a priori edge weights for all possible edge connections to the attention mechanism.
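The construction of the weighted adjacency matrix described above can be illustrated with a minimal sketch. The constants `SELF_EDGE` and `NO_EDGE`, the function name `weighted_adjacency` and the padding to `n_max` nodes are assumptions made for illustration; the text states only that self edges receive a very high value and missing edges a very low one, so the numbers below are placeholders and not the values used in the study.

```python
# Hypothetical placeholder constants: the text specifies only "very high" and
# "very low" values for self edges and missing edges.
SELF_EDGE = 1.0e3
NO_EDGE = 1.0e-3


def weighted_adjacency(coords, n_max):
    """Return an n_max x n_max matrix of inverse squared distances.

    coords is a list of (x, y, z) tuples for the atoms present in a molecule.
    Rows and columns beyond len(coords) belong to absent nodes and keep the
    NO_EDGE value everywhere (an assumption of this sketch).
    """
    n = len(coords)
    adj = [[NO_EDGE] * n_max for _ in range(n_max)]
    for i in range(n):
        for j in range(n):
            if i == j:
                adj[i][j] = SELF_EDGE
            else:
                # Squared Euclidean distance; assumes distinct atom positions.
                d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
                adj[i][j] = 1.0 / d2
    return adj


coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
A = weighted_adjacency(coords, n_max=3)
```

Because the distance is symmetric, the resulting matrix is symmetric as well; any asymmetry in the model therefore has to come from learnable parameters rather than from A itself.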

All layers see the same adjacency matrix by definition for each single input graph. We make it layer-specific by raising A elementwise to a power matrix P that is learnable for each layer:

This modified adjacency matrix defines the exponential inverse distance part of the Coulomb potential, and it is applied to the linearly transformed input as:

The result is the intermediate hidden layer matrix weighted by the modified adjacency matrix. This term reflects the effect of each node in the input on every other node as a sum weighted by the adjacency matrix elements. The matrix multiplication in eq. 4 replaces the concatenation operation typically used for defining the attention coefficients.
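As a rough illustration of these two steps, the sketch below raises each adjacency entry to its own exponent and then multiplies the result with a toy transformed input. The helper names, the toy values and the multiplication order (modified adjacency on the left, node features on the right) are assumptions; the framework itself is implemented in TensorFlow and may differ in detail.

```python
def elementwise_power(adj, power):
    """Raise each adjacency entry to its own (learnable) exponent."""
    return [[a ** p for a, p in zip(row_a, row_p)]
            for row_a, row_p in zip(adj, power)]


def matmul(left, right):
    """Plain matrix product of two nested-list matrices."""
    cols = list(zip(*right))
    return [[sum(l * r for l, r in zip(row, col)) for col in cols]
            for row in left]


A = [[1.0, 0.25], [0.25, 1.0]]   # toy inverse-square-distance weights
P = [[1.0, 0.5], [2.0, 1.0]]     # toy per-edge exponents, not symmetric
H = [[1.0, 2.0], [3.0, 4.0]]     # toy transformed input: 2 nodes, 2 features

A_P = elementwise_power(A, P)    # modified adjacency matrix
H_weighted = matmul(A_P, H)      # each node becomes a weighted sum over nodes
```

Note that a symmetric A combined with a non-symmetric P yields a non-symmetric modified adjacency matrix, so the influence of node i on node j need not equal the reverse.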

The screening contribution to the Coulomb potential is achieved by using a learnable attention matrix for finding the attention coefficients. The attention matrix elements define the screening on the potential. Similar to inverse distance contribution, it also considers weighted contributions from all nodes:

Finally, next hidden layer is formed by multiplying attention coefficients with the input and applying the activation layer.

The stability and capacity of each layer are improved by using K heads for each node: the above operation is performed separately K times with K sets of weight and attention factors, and the K output matrices are concatenated to form the output matrix passed to the next layer:
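A minimal sketch of the head concatenation step, assuming each head produces a matrix with one row per node; the function name `concat_heads` and the toy values are illustrative only.

```python
def concat_heads(head_outputs):
    """Concatenate K per-head matrices along the feature axis.

    Each head output has one row per node, so the result has as many rows as
    there are nodes and K times as many columns as a single head.
    """
    n_nodes = len(head_outputs[0])
    return [
        [value for head in head_outputs for value in head[node]]
        for node in range(n_nodes)
    ]


head_1 = [[1.0, 2.0], [3.0, 4.0]]
head_2 = [[5.0, 6.0], [7.0, 8.0]]
H_next = concat_heads([head_1, head_2])   # 2 nodes, 2 heads x 2 features
```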

After the output of the last attention layer is flattened to a 1-D vector, it is fed into a 3-layer fully connected block consisting of two dense layers followed by a final layer whose size equals the number of output labels.

2.2 Interpretability of graph attention function

Attention between nodes is modeled after screened Coulomb potential which is a potential with an exponential damping term. The damping term defines the range of the interaction between two particles. It is of the general form:

Here, C is the magnitude scaling constant and M scales the range of interaction. For M = 0, it reduces to the form of the electromagnetic Coulomb potential. In our model, after expanding the inner term in eq. 5, this potential is approximated as:

The attention matrix a and the adjacency matrix approximate the screening coefficients and the inverse distance effect of the Coulomb potential. Different from eq. 8, the exponential in the proposed attention function is defined on the inverse distance instead of on the distance itself. The attention mechanism takes into account a linear combination of all neighboring nodes, weighted by a fully learnable attention matrix a and an a priori provided weighted adjacency matrix. The adjacency matrix was defined to include self interactions (and non-existing interactions) such that the exchange-correlation contribution of the DFT calculations can be part of the learning process.
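For reference, the textbook screened Coulomb (Yukawa-type) form that matches the description of C and M above can be evaluated with a short sketch. The sign convention, the function names and the use of `r` for distance are assumptions here, and this is the standard physical form rather than the attention function of the model.

```python
import math


def screened_coulomb(r, c=1.0, m=0.0):
    """Standard screened Coulomb form: C * exp(-M * r) / r."""
    return c * math.exp(-m * r) / r


def plain_coulomb(r, c=1.0):
    """Unscreened Coulomb form: C / r."""
    return c / r


# With M = 0 the damping term is 1 and the two forms coincide; a positive M
# suppresses the potential more strongly as the distance grows.
for r in (0.5, 1.0, 2.0, 4.0):
    print(r, plain_coulomb(r), screened_coulomb(r, m=1.0))
```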

This attention model learns the range, magnitude and distance power of interaction forces between each pair of nodes in a graph. Since we did not put any limitations on the attention mechanism, such as immutable constants or specific charge carriers, this model may in principle learn any short- and long-range interactions between two nodes regardless of the nature of the actual (or artificially defined) forces between them. This makes the model very flexible and applicable to a wide range of data that can be represented as a graph in which a weighted distance between (many) nodes can be defined. Consequently, interactions between graph nodes can be classified based on their coupling strength and interaction range by reviewing the attention coefficient values.

3 Dataset

We use a quantum chemistry dataset from CHAMPS institute made available in [19]. This dataset lists scalar coupling constants between pairs of atoms in a molecule based on their interaction type. Scalar coupling constants are magnetic interactions between atoms that can be used in NMR to understand a molecule’s structure. The dataset also provides the (x, y, z) coordinates for each atom in a molecule for train and test data.

In this dataset each scalar coupling value is provided as a single entry. The data needs preprocessing to define a graph for each molecule within the dataset. Each molecule is a graph that can be a feature input X to the graph attention model, accompanied by its weighted adjacency matrix A. In the dataset, the atomindex0 and scalar coupling type are combined to form a 2–tuple. These 2–tuples are then converted to a one-hot vector for each entry. This creates a sparse vector for each node. This representation is not the best for accuracy, and an embedding could improve model performance. We use this sparse representation for its convenience in demonstrating the interpretability of the attention mechanism. The entries which have the same atomindex1 within the same molecule are summed row-wise to accumulate the feature vector for each atom in a molecule. The feature number F after one-hot vector expansion was 211 and the maximum number of atoms N in a molecule was found to be 29 in this dataset. For each molecule, data is processed to extract a sparse feature matrix X of size N × F and a flattened scalar coupling matrix as y labels of size FN. The adjacency matrix A is also computed from the distance between each pair of atoms in a molecule as defined in eq. 2. Although X is sparse, A is not, and its values are non-negative real numbers. The preprocessing was done for both train and test data available in [19]. The models were trained on the train data after splitting it into train/validation/test sets in a 70/15/15 ratio for the 85003 molecules processed.
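The one-hot encoding and row-wise accumulation described above can be sketched as follows. The record layout, the function name `build_feature_matrix` and the per-molecule vocabulary are assumptions for illustration; in practice the vocabulary of (atomindex0, type) tuples would be built once over the whole dataset so that every molecule shares the same F columns.

```python
def build_feature_matrix(records, n_max):
    """Build a sparse node-by-feature matrix for one molecule.

    records is a list of (atomindex0, atomindex1, coupling_type) entries.
    Each (atomindex0, coupling_type) tuple gets one one-hot column, and the
    one-hot vectors of entries sharing an atomindex1 are summed into that row.
    """
    # Assumption: vocabulary derived from this molecule only, for brevity.
    vocab = sorted({(a0, ctype) for a0, _, ctype in records})
    column = {key: idx for idx, key in enumerate(vocab)}
    features = [[0] * len(vocab) for _ in range(n_max)]
    for a0, a1, ctype in records:
        features[a1][column[(a0, ctype)]] += 1
    return features, vocab


# Toy entries; the coupling type labels follow the naming used in the dataset.
records = [(0, 1, "1JHC"), (0, 2, "2JHH"), (3, 1, "1JHC")]
X, vocab = build_feature_matrix(records, n_max=4)
```

Rows for atoms that never appear as atomindex1 stay all zero, which is one source of the sparsity mentioned in the text.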

4 Experimental Setup

By using the proposed Coulomb attention model, deep learning models with varying architectures, filter sizes and hyperparameter settings are built and characterized. We first discuss the shallower plain models, which use only attention layers and fully connected layers with large weight matrices for prediction. These models establish a stable baseline for deeper models. We then show how these shallow model components can be used to scale to much deeper models with smaller weight matrices for each layer and by using residual blocks. All models were built using the CoulGAT framework, which was implemented with TensorFlow [22].

In the next section, we present plain graph attention models and resnet graph attention models with identity connections.

4.1 Plain Models

The structure of a plain model is shown in Fig. 1(a,b). The model accepts a batch of input graphs G = (X, A) for each epoch. Batch size was kept at 128 and the train/validation set size was 59500/12750 (molecules, or graphs). The input set (X, A) first goes through at least two graph attention layers. The output of the graph attention layers was flattened and fed into a 3–stage dense layer which was scaled by the number of K heads and the input feature count F. The last dense layer outputs NF real regression values for each (X, A), predicting the scalar coupling values for each atom pair and coupling type within that molecule. Since not every molecule has all N atoms in its graph, the output is a sparse vector.

The model was regularized by applying dropout [23] with a keep probability of 0.8 to the input matrix of each hidden graph attention layer and to the resulting attention coefficients in E within each hidden layer. The dense layers were also regularized with L2, and the regularization scale was kept at 0.0001 for all training sessions. The Adam optimizer was used with a learning rate of 0.001. The above hyperparameters were used when MSE was the loss function. All elements of the sparse output vector were used in the MSE loss calculation, so this version also estimated the zeros.

For the log(MAE) based loss, the hyperparameters changed as follows: the dropout keep probability was set to 1.0 (no dropout), and the L2 regularization scale was raised to 0.0005. The learning rate was kept the same. The SCCLMAE measure used in this study takes the log MAE of only the non-zero elements in the output vector, and it is more pessimistic than the LMAE defined in [19] since it averages over 211 features. The zero elements of the sparse output vector were kept as free variables for this loss measure.
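The difference between the two loss measures can be illustrated with a small sketch. The function names are hypothetical, and the masked variant below simply takes the log of the mean absolute error over entries whose label is non-zero; how the actual SCCLMAE groups or averages over features is not reproduced here.

```python
import math


def mse_all(pred, target):
    """Mean squared error over every entry, zeros included."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def log_mae_nonzero(pred, target):
    """Log of the mean absolute error over entries with a non-zero label.

    Predictions at zero-label positions do not contribute, so they remain
    free. Assumes at least one non-zero label and a non-zero total error.
    """
    errors = [abs(p - t) for p, t in zip(pred, target) if t != 0.0]
    return math.log(sum(errors) / len(errors))


target = [0.0, 2.0, 0.0, -1.0]
pred = [0.5, 1.0, 0.0, -1.0]
print(mse_all(pred, target), log_mae_nonzero(pred, target))
```

In this toy case the error at the first position (a zero label) raises the MSE but leaves the masked log MAE unchanged.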

Unless otherwise noted, ReLU was used for layer activations and LeakyReLU was used for the attention function. Weight, attention and power variables were initialized using Glorot initialization, and biases were initialized to zeros.

Table 1 shows the variations of graph attention models trained. These models are trained without batch normalization or identity connections and are therefore called plain in this paper. Hidden layer feature sizes were varied from 2F to F, for F = 211. The plain models had 2 or 3 graph attention layers. The number of heads was kept at K = 5 for most models, and K = 2 was also trained. A high dropout rate of 0.5 and optimization without learnable power variables were also tried.

In some models last graph attention layer employed a pooling approach where concatenation operation in eq. 7 was replaced with sum divided by number of heads, K:

These averaging layers were also coupled with concatenating layer to form blocks repeated as defined in Table 1.
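A minimal sketch of the head-averaging (pooling) variant, in which the K head outputs are summed elementwise and divided by K instead of being concatenated; the function name `average_heads` and the toy values are illustrative.

```python
def average_heads(head_outputs):
    """Average K per-head matrices elementwise (sum divided by K)."""
    k = len(head_outputs)
    n_nodes = len(head_outputs[0])
    n_feat = len(head_outputs[0][0])
    return [
        [sum(head[i][j] for head in head_outputs) / k for j in range(n_feat)]
        for i in range(n_nodes)
    ]


head_1 = [[1.0, 2.0], [3.0, 4.0]]
head_2 = [[5.0, 6.0], [7.0, 8.0]]
H_pooled = average_heads([head_1, head_2])
```

Unlike concatenation, averaging keeps the feature dimension of a single head, so the layer output does not grow with the number of heads.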

Figure 1: Basic building blocks of graph attention layers described in this study. (a) 2–layer plain attention model. (b) 3–layer plain attention model with pooled attention as last layer. (c) Resnet block for 2–unit attention layer with preactivation. (d) Resnet block for 3–unit attention layer with preactivation and pooling in last layer.

4.2 Resnet Models

Resnet models used the same hyperparameters for batch size, optimization and regularization as the plain models, depending on the loss measure. Batch normalization momentum was set to 0.9 and 0.99 for MSE and SCCLMAE losses, respectively. To form deeper models, plain models are modified to include batch normalization before layer activation (preactivation) and identity connections [20, 21]. Residual blocks are composed of graph attention layers, terminated with or without a final average-pooling graph layer, as in Fig. 1(c,d).

2–unit and 3–unit resnet blocks are used to form deeper learning models as summarized in Table 3. The feature sizes of the hidden layers are changed to 50, 105 and 211 to reach deep learning models having 140, 30 and 12 hidden graph attention layers, respectively. For each resnet model, the first layer is a single graph attention layer or a block attention layer (with averaging at the end) without an identity connection originating from X. This first layer is terminated with batch normalization + ReLU, and its output serves as the first identity connection for the model. With the resnet structure presented here, models that trade off the number of K heads against layer depth are also implemented, featuring a 40 hidden layer model with 10 heads for a hidden feature size of 50.
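The preactivation residual structure can be sketched in a simplified form. The sketch below treats each sample as a flat feature vector, uses batch statistics only (no moving averages, momentum, or learned scale and shift), and takes the attention layers as arbitrary callables; all names are hypothetical and the real blocks operate on node-by-feature matrices.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature column over the batch (training-mode only)."""
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [
        [(v - m) / math.sqrt(var + eps) for v, m, var in zip(row, means, variances)]
        for row in batch
    ]


def relu(batch):
    return [[max(0.0, v) for v in row] for row in batch]


def residual_block(batch, layers):
    """Preactivation block: batch norm -> ReLU -> layer for each unit,
    followed by adding the block input back (identity connection)."""
    hidden = batch
    for layer in layers:
        hidden = layer(relu(batch_norm(hidden)))
    return [[x + h for x, h in zip(row_x, row_h)]
            for row_x, row_h in zip(batch, hidden)]


def zero_layer(batch):
    """Stand-in layer that outputs zeros, so the block reduces to identity."""
    return [[0.0] * len(row) for row in batch]


x = [[1.0, -2.0], [3.0, 4.0]]
y = residual_block(x, [zero_layer, zero_layer])   # a 2-unit block
```

The identity path requires the block output to have the same shape as its input, which is why the first layer of each resnet model has no identity connection from X.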

5 Results

In this section, we present results from a variety of plain and resnet models implemented and run for 100 epochs and 120 epochs for MSE, and 200 epochs for SCCLMAE loss with early stopping unless noted otherwise. Here, we evaluated these models up to the number of epochs trained to show the capacity, and extensibility of this graph model with screened coulomb potential attention. The hyperparameters we chose could be optimized for each model defined here to get a better performance. We compared all models with similar hyperparameters for this study as scanning all optimum hyperparameter values for each model is costly. We believe these results will present a useful baseline to create optimized and scalable model architectures for other datasets that can utilize the proposed approach.

5.1 Plain Model Evaluation

The minimum training and validation losses are summarized for each model in Table 2 for plain models. Layer parameters are shown in Table 1.

MSE Loss. The MSE loss tries to predict all entries in the sparse output vector and is therefore averaged over the full vector length. We start with a 2-layer graph attention model in plain #1 and replace the last attention layer with pooling in plain #2. The MSE loss did not change significantly with pooling. For the 3-layer models, plain #3 and plain #4, we see slightly improved loss with pooling. Using plain #3 as a base, reducing capacity in #5 and #6 by halving the filter size and reducing the head number by 3 increased the MSE loss overall. We also tried higher dropout in plain #7 (150 epochs) and no learnable power coefficients in plain #8. Higher dropout increased the MSE loss. The model without power coefficients has improved loss, which may indicate that Glorot initialization is not the optimum choice for initialization.

Table 2: Plain Model Loss Values

SCCLMAE Loss. When loss measure is changed to SCCLMAE, the dropout was disabled and L2 regularization was increased to 0.0005 for all models to reduce overfitting. Dropout was disabled to set a baseline for resnet models with SCCLMAE loss due to detrimental effects of dropout on batch normalization [24]. SCCLMAE loss averages only on observed values of the dataset in the sparse output vector. The pooling layer implementations in plain #2 and #4 had consistently higher losses. Reducing feature size made the loss worse but reducing K by 3 units did not prove as detrimental in plain #6. For all models, the train and validation curves showed a larger gap for this loss measure.

The SCCLMAE and MSE loss curves for Plain #3 and Plain #4 are shown in Fig. 2. Loss curves for other models can be found in Appendix. For all models, test data split loss was in good agreement with minimum validation loss when evaluated using model obtained by early stopping.

Figure 2: Loss curves for plain #3 and plain #4 models detailed in Table 1.

5.2 Resnet Model Evaluation

The resnet models trained for this study are summarized in Table 3 and losses are shown in Table 4.

MSE Loss. The MSE train and validation loss curves converge smoothly for resnet models that ran for 120 epochs. This is attributed to the fact that the loss is averaged over all elements of the sparse vector, but this smoothing effect in the loss definition also limits the expressive power of the model. Pooling within a res–block or as a final layer is reflected as higher validation losses in resnet #1, #2 and #4. In particular, the pooling layer as a final block increases the baseline for minimum loss. For the MSE loss, the deeper models in resnet #6 and #7 do not improve the loss, and the loss even degrades in resnet #6. As future work, the hyperparameters could be optimized to improve the performance of deeper residual models.

SCCLMAE Loss. SCCLMAE loss curves fluctuate more, especially for the validation loss. This is attributed to averaging of the SCCLMAE loss only over observed values, which are few in number. There is a gap between training and validation curves, which seem to level in a band around 8. Pooling effects are clearer in resnet #1, #2 and #4 with the SCCLMAE loss. Pooling increases the minimum loss for train and validation. This is more evident in resnet #4, which also has a narrower gap between train and validation losses. Resnet #6 is the deepest model, with up to 140 layers, and scaling up the head number while sacrificing depth also works well in resnet #7. The deepest models (#5, #6, #7) also show a bigger gap between train and validation curves, which could benefit from further hyperparameter optimization.

Table 3: Resnet Model Architectures

Table 4: Resnet Model Loss Values

Fig. 3 shows MSE loss curve for Resnet #6, #7 and SCCLMAE loss curves for Resnet #4, #6. Other loss curves can be found in Appendix.

Figure 3: Loss curves for resnet #6, #7 (MSE) and resnet #4, #6 (SCCLMAE) as detailed in Table 3.

For evaluating the Kaggle score with test data from [19], resnet #3 and resnet #7 were trained for 200 epochs with 84k molecules from the training set, and 1k molecules were used for validation (99/1 split ratio). The SCCLMAE validation losses were 0.727 and 0.715 for resnet #3 and resnet #7, respectively. The Kaggle LMAE test scores of these models were 0.586 (resnet #3) and 0.596 (resnet #7). Loss curves for these models are shown in Fig. 4.

We also evaluated a dataset preprocessed using the (atomindex1, type) combination as a one-hot vector representation of input features. The resulting input feature size for this representation was 131. We trained resnet #3 on this 131-feature dataset using a 70/15/15 split ratio. The loss curve is shown in Fig. 5. The minimum validation loss was 0.754 and the minimum train loss was 0.560, outperforming the resnet #3 results in Table 4.

Figure 4: Loss curves for Resnet #3 (top) and #7 (bottom) trained with 99/1 dataset split ratio.

5.3 Evaluation of Learnable Attention Parameters

Figure 5: Loss curve for Resnet #3 trained with 131 feature dataset for 200 epochs.

The learnable attention coefficients a and P of the attention function in each layer contain valuable information for generalizing node–to–node and node–to–feature interactions. As stated earlier, a defines a set of coupling strengths between each atom (node) and the (atomindex0, type) features, and P defines a set of interaction ranges between each pair of atoms (nodes) in a molecule (graph); P is a power matrix applied to the weighted adjacency matrix of inverse squared distance values. These two learnable parameters provide an easy way to understand how a hidden layer interprets the graph structure it was trained on. P learns the node–node interactions within a graph structure. Fig. 6 shows a heatmap of P from the last attentional layer and first head of resnet #3 trained using the 99/1 data split of the CHAMPS dataset. Although the adjacency matrix was symmetric for this dataset, P is not a symmetric matrix. The node–node relationships in this map generalize the data between graph nodes based on their position in the adjacency matrix. The power values (scaled by 2, see eq. 2) vary in a range of [345, 1.440] in the heatmap and the distribution of these values is Gaussian-like. The range shows that for some nodes the interaction scales linearly with distance rather than with its inverse. This may seem unphysical and counterintuitive; however, such linear scaling of the interaction force with distance is observed in nature as quark confinement and is accounted for in the analysis of quarkonium using the potential expression [25, 26]:
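In the quarkonium literature the Coulomb-plus-linear (Cornell) potential is commonly written in the form below. The symbols \kappa and \sigma are notation chosen here for illustration and may differ from the symbols used in eq. 11 of this paper.

```latex
V(r) = -\frac{\kappa}{r} + \sigma r
```

The first term is Coulomb-like and dominates at short distance, while the second grows linearly with distance and produces confinement.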

The graph attention model learns potential terms similar to eq. 11 for the CHAMPS dataset. In the heatmap, graph node pairs can be classified as having a confinement interaction (2P < 0), Coulomb interaction (2P > 0) or distance-independent interaction (2P = 0). The distribution of 2P values in Fig. 6 has a nearly bell-curve shape for resnet #3 with a mean value of 0.007 and standard deviation of 0.456.
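The classification by sign can be made concrete with a short sketch. Since the adjacency entries are inverse squared distances, raising an entry to the power P gives a term proportional to the distance raised to -2P, so a positive scaled exponent decays with distance and a negative one grows with it. The function names and the tolerance `tol` for treating an exponent as zero are assumptions for illustration.

```python
def classify_interaction(two_p, tol=0.05):
    """Label a node pair by the sign of its scaled exponent 2P.

    tol is a hypothetical threshold below which the exponent is treated
    as zero, i.e. the interaction is taken as distance independent.
    """
    if abs(two_p) <= tol:
        return "distance-independent"
    return "coulomb-like" if two_p > 0 else "confinement-like"


def effective_weight(distance, two_p):
    """Adjacency entry 1 / d^2 raised to P, which equals d ** (-2P)."""
    return (1.0 / distance ** 2) ** (two_p / 2.0)


for two_p in (1.0, 0.0, -1.0):
    print(two_p, classify_interaction(two_p), effective_weight(4.0, two_p))
```

For a scaled exponent of 1 the weight falls off as the inverse distance, and for -1 it grows linearly with distance, matching the two limiting cases discussed in the text.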

Figure 6: The heatmap (top) and distribution (bottom) of P values (scaled by 2).

Fig. 7 shows a barplot of a single column from a for node 0 and a set of 211 features for CHAMPS data. Although the columns of input and output are sparse, the column of a is not, indicating complex relationships between a single node and features. For CHAMPS data, this a column shows a range of [549, 0.858] for interaction strength for each feature. The distribution of a also has close to a bell-curve shape with a mean value of 0.025 and standard deviation of 0.277.

When we evaluated deeper models, we reduced the hidden layer feature size. This classification of hidden layers from learnable attention variables makes it possible to compare two models trained on the same data on a layer-by-layer basis. Figs. 8 and 9 show the heatmap, barplot and distributions for resnet #7 trained with the 99/1 data split. The range, mean and standard deviation of a and P are summarized in Table 5 for the last layer and first head. The mean and standard deviation of P are comparable with resnet #3, whereas the a distribution has a larger standard deviation for 50 hidden features as compared to 211 hidden features in the last layer for the same node. This suggests that deeper models with smaller hidden feature sizes have a more uniform distribution of node–feature interaction.

Table 5: P and a distribution of single attentional layer.

The ability to compare single attention layers among very different models provides another tool for designing hidden layers with optimum node–node and node–feature interaction characteristics for a given dataset. Therefore, a screened coulomb potential attention model can be used to extract an empirical standard model of graph structure of a dataset from the model or to find the optimum representations for nodes and hidden features of a model from the dataset. These two processes can be applied iteratively to optimize the model representation and to interpret the graph structure more accurately.
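A layer-by-layer comparison of this kind reduces to computing summary statistics over the learned values. The sketch below assumes the learned matrices have been exported as nested lists; the function names, the toy values and the choice of population standard deviation are assumptions, and no numbers from the study are reproduced.

```python
import statistics


def flatten(matrix):
    """Flatten a nested-list matrix into a single list of values."""
    return [value for row in matrix for value in row]


def summarize(values):
    """Range, mean and standard deviation of a set of learned values."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),   # population std (an assumption)
    }


# Toy stand-in for a learned power matrix from one layer and head.
toy_P = [[1.0, 2.0], [3.0, 4.0]]
summary = summarize(flatten(toy_P))
print(summary)
```

Applying the same summary to the corresponding layer of two different models gives a compact basis for comparing their node–node and node–feature interaction distributions.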

Figure 7: The barplot (top) and distribution (bottom) of a values.

Figure 8: The heatmap (top) and distribution (bottom) of P values for resnet #7 (scaled by 2).

Figure 9: The barplot (top) and distribution (bottom) of a values for resnet #7.

6 Conclusions

In this paper, we presented an efficient graph attention model that interprets the relationship between a node and its neighbors by learning the coupling strength and range of interaction between nodes using an attention form inspired by the definition of the screened Coulomb potential. The new attention mechanism employs a weighted adjacency matrix and learnable variables a and P that learn the interaction strength and range from the dataset. The CHAMPS dataset was used to characterize the capabilities of a variety of implementations of the CoulGAT framework after preprocessing the dataset to have each molecule represented as a graph. Stable plain and resnet graph models having up to 140 layers and 10 heads were demonstrated and characterized with this attention mechanism.

The learnable parameters (a, P) present a way to quantify the node–node and node–feature interactions from the hidden layer and to interpret the graph structure of the training dataset through a simple, empirical standard model. The learnable parameters also provide valuable information about the representation power of the hidden attentional layer, which can be compared among different models and optimized using the statistical distribution of node and feature interactions.

As future work, evaluation of this attention mechanism with different transductive and inductive graph datasets, better input embeddings and other high capacity models such as transformers can provide more insights on the capabilities of the screened Coulomb potential as a tool for interpretability of both hidden layers of a graph model and graph structure of a dataset.

Acknowledgments

I would like to acknowledge numerous contributions of machine learning research community on graph networks in recent years. I thank my parents for their support and patience. This research was conducted independently without support from a grant or corporation.

References

[1] A. Sperduti and A. Starita, “Supervised neural networks for the classification of structures,” IEEE Transactions on Neural Networks, vol. 8, pp. 714–735, May 1997.

[2] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, pp. 61–80, Jan 2009.

[3] A. Micheli, D. Sona, and A. Sperduti, “Contextual processing of structured data by recursive cascade correlation,” IEEE Transactions on Neural Networks, vol. 15, pp. 1396–1410, Nov 2004.

[4] Y. Li, R. Zemel, M. Brockschmidt, and D. Tarlow, “Gated graph sequence neural networks,” in International Conference on Learning Representations (ICLR), April 2016.

[5] H. Dai, Z. Kozareva, B. Dai, A. J. Smola, and L. Song, “Learning steady-states of iterative algorithms over graphs,” in ICML, 2018.

[6] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[7] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” 2015.

[8] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, (USA), pp. 3844–3852, Curran Associates Inc., 2016.

[9] T. N. Kipf and M. Welling, “Semi-Supervised Classification with Graph Convolutional Networks,” in Proceedings of the 5th International Conference on Learning Representations, ICLR ’17, 2017.

[10] A. Micheli, “Neural network for graphs: A contextual constructive approach,” IEEE Transactions on Neural Networks, vol. 20, pp. 498–511, March 2009.

[11] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, (USA), pp. 2001–2009, Curran Associates Inc., 2016.

[12] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1263–1272, JMLR.org, 2017.

[13] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, (USA), pp. 1025–1035, Curran Associates Inc., 2017.

[14] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” International Conference on Learning Representations, 2018.

[15] K. Ryczko, D. A. Strubbe, and I. Tamblyn, “Deep learning and density-functional theory,” Phys. Rev. A, vol. 100, p. 022512, Aug 2019.

[16] K. Mills, M. Spanner, and I. Tamblyn, “Deep learning and the Schrödinger equation,” Phys. Rev. A, vol. 96, p. 042113, Oct 2017.

[17] H. Yukawa, “On the interaction of elementary particles,” Proc. Phys. Math. Soc. Jap., vol. 17, pp. 48–57, 1935.

[18] W. Kohn and L. J. Sham, “Self-consistent equations including exchange and correlation effects,” Phys. Rev., vol. 140, pp. A1133–A1138, Nov. 1965.

[19] https://www.kaggle.com/c/champs-scalar-coupling/.

[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, pp. 770–778, IEEE Computer Society, 2016.

[21] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in ECCV (4) (B. Leibe, J. Matas, N. Sebe, and M. Welling, eds.), vol. 9908 of Lecture Notes in Computer Science, pp. 630–645, Springer, 2016.

[22] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.

[23] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, Jan. 2014.

[24] X. Li, S. Chen, X. Hu, and J. Yang, “Understanding the disharmony between dropout and batch normalization by variance shift,” CoRR, vol. abs/1801.05134, 2018.

[25] E. Eichten, K. Gottfried, T. Kinoshita, K. D. Lane, and T. M. Yan, “Charmonium: Comparison with experiment,” Phys. Rev. D, vol. 21, pp. 203–233, Jan 1980.

[26] Y. Sumino, “QCD potential as a Coulomb-plus-linear potential,” Physics Letters B, vol. 571, no. 3, pp. 173–183, 2003.

A MSE LOSS CURVES

B SCCLMAE LOSS CURVES
