Computer program source code is an abundant and accessible form of data from which machine learning algorithms could learn to perform many useful software development tasks, including variable name suggestion, code completion, bug finding, security vulnerability identification, code quality assessment, or automated test generation. But despite the similarities between natural language and source code, deep learning methods for Natural Language Processing (NLP) have not been straightforward to apply to learning problems on source code (Allamanis et al., 2017).
There are many reasons for this, but two central ones are:
1. Code’s syntactic structure is unlike natural language.
While code contains natural language words and phrases in order to be human–readable, code is not meant to be a read like a natural language text. Code is written in a rigid syntax with delimiters that may open and close dozens of lines apart; it consists in great part of references to faraway lines and different files; and it describes computations that proceed in an order often quite distinct from its written order.
2. Code is written using an open vocabulary. Natural language is mostly composed of words from a large but closed (a.k.a. fixed–size and unchanging) vocabulary. Standard NLP methods can thus perform well by fixing a large vocabulary of words before training, and labeling the few words they encounter outside this vocabulary as “unknown”. But in code every new variable, class, or method declared requires a name, and this abundance of names leads to the use of many obscure words: abbreviations, brand names, technical terms, etc.1 A model must be able to reason about these newly–coined words to understand code.
The second of these issues is significant. To give one indication: 28% of variable names contain out–of–vocabulary words in the test set we use in our experiments below. But more broadly, the open vocabulary issue in code is an acute example of a fundamental challenge in machine learning: how to build models that can reason over unbounded domains of entities, sometimes called “open–set” learning. Despite this, the open vocabulary issue in source code has received relatively little attention in prior work.
The first issue, however, has been the focus of much prior work. A common strategy in these works is to represent source code as an Abstract Syntax Tree (AST) rather than as linear text. Once in this graph–structured format, code can be passed as input to models like Recursive Neural Networks or Graph Neural Networks (GNNs) that can, in principle, exploit the relational structure of their inputs and avoid the difficulties of reading code in linear order (Allamanis et al., 2018).
Our contribution: In this paper we extend such AST– based models for source code in order to address the open vocabulary issue. We do so by introducing a Graph–Structured Cache (GSC) to handle out–of–vocabulary words. The GSC represents vocabulary words as additional nodes in the AST as they are encountered and connects them with the edges to where they are used in the code. We then process the AST+GSC with a GNN to produce outputs. See Figure 1.
We empirically evaluated the utility of a Graph–Structured Cache on two tasks: a code completion (a.k.a. fill–in–the– blank) task and a variable naming task. We found that using a GSC improved performance on both tasks at the cost of an approximately 30% increase in training time. More precisely: even when using hyperparameters optimized for the baseline model, adding a GSC to a baseline model improved its accuracy by at least 7% on the fill–in–the–blank task and 103% on the variable naming task. We also report a number of ablation results in which we carefully demonstrate the necessity of each part of the GSC to a model’s performance.
2.1. Representing Code as a Graph
Given their prominence in the study of programming languages, Abstract Syntax Trees (ASTs) and parse trees are a natural choice for representing code and have been used extensively. Often models that operate on source code consume ASTs by linearizing them (usually with a depth–first traversal) as in (Amodio et al., 2017), (Liu et al., 2017), or (Li et al., 2017) or by using AST paths as input features as in (Alon et al., 2018), but they can also be processed by deep learning models that take graphs as input, as in (White et al., 2016) and (Chen et al., 2018) who use Recursive Neural Networks (RveNNs) (Goller & Kuchler, 1996) on ASTs. RveNNs are models that operate on tree–topology graphs, and have been used extensively for language modeling (Socher et al., 2013) and on domains similar to source code, like mathematical expressions (Zaremba et al., 2014; Arabshahi et al., 2018). They can be considered a special case of Message Passing Neural Networks (MPNNs) in the framework of (Gilmer et al., 2017): in this analogy RveNNs are to Belief Propagation as MPNNs are to Loopy Belief Propagation. They can also be considered a special case of Graph Networks in the framework of (Battaglia et al., 2018). ASTs also serve as a natural basis for models that generate code as output, as in (Maddison & Tarlow, 2014), (Yin & Neubig, 2017), (Rabinovich et al., 2017), (Chen et al., 2018), and (Brockschmidt et al., 2019).
Data–flow graphs are another type of graphical representation of source code with a long history (Krinke, 2001), and they have occasionally been used to featurize source code for machine learning (Chae et al., 2017).
Most closely related to our work is the work of (Allamanis et al., 2018), on which our model is heavily based. (Alla- manis et al., 2018) combine the data–flow graph and AST representation strategies for source code by representing code as an AST augmented with extra labeled edges indicating semantic information like data– and control–flow between variables. These augmentations yield a directed multigraph rather than just a tree,2 so in (Allamanis et al., 2018) a variety of MPNN called a Gated Graph Neural Network (GGNN) (Li et al., 2016) is used to consume the Augmented AST and produce an output for a supervised learning task.
Graph-based models that are not based on ASTs are also sometimes used for analyzing source code, like Conditional Random Fields for joint variable name prediction in (Ray- chev et al., 2015).
2.2. Reasoning about Open Sets
The question of how to gracefully reason over an open vocabulary is longstanding in NLP. Character–level embeddings are a typical way deep learning models handle this issue, whether used on their own as in (Kim et al., 2016), or in conjunction with word–level embedding Recurrent Neural Networks (RNNs) as in (Luong & Manning, 2016), or in conjunction with an n–gram model as in (Bojanowski et al., 2017). Another approach is to learn new word embeddings on–the–fly from context (Kobayashi et al., 2016). Caching novel words, as we do in our model, is yet another strategy (Grave et al., 2017) and has been used to augment N–gram models for analyzing source code (Hellendoorn & Devanbu, 2017).
In terms of producing outputs over variable–sized input and outputs, also known as open–set learning, attention-based pointer mechanisms were introduced in (Vinyals et al., 2015) and have been used for tasks on code, e.g. in (Bhoopchand et al., 2016) and (Vasic et al., 2019). Such methods have been used to great effect in NLP in e.g. (Gulcehre et al., 2016) and (Merity et al., 2017). The latter’s pointer sentinel mixture model is the direct inspiration for the readout function we use in the Variable Naming task below.
Using graphs to represent arbitrary collections of entities and their relationships for processing by deep networks has been widely used (Johnson, 2017; Bansal et al., 2017; Pham et al., 2018; Lu et al., 2017), but to our knowledge we are the first to use a graph–building strategy for reasoning (at train and test time) about an open vocabulary of words.
3.1. Abstract Syntax Trees
An Abstract Syntax Tree (AST) is a graph — specifically an ordered tree with labeled nodes — that is a representation of some written computer source code. There is a 1–to–1 relationship between source code and an AST of that source code, modulo comments and whitespace in the written source code.
Typically the leaves of an AST correspond to the tokens written in the source code, like variable and method names, while the non–leaf nodes represent syntactic language constructs like function calls or class definitions. The specific node labels and construction rules of ASTs can differ between or within languages. The first step in Figure 1 shows an example.
3.2. Graph Neural Networks
The term Graph Neural Network (GNN) refers to any deep, differentiable model that takes graphs as input. Many GNNs have been presented in the literature, and several nomenclatures have been proposed for describing the computations they perform, in particular in (Gilmer et al., 2017) and (Battaglia et al., 2018). Here we give a brief recapitulation of supervised learning with GNNs using the Message Passing Neural Network framework from (Gilmer et al., 2017).
A GNN is trained using pairs (G, y) where G = (V, E) is a graph defined by its vertices V and edges E, and y is a label. y can be any sort of mathematical object: scalar, vector, another graph, etc. In the most general case, each graph in the dataset can be a directed multigraph, each with a different number of nodes and different connectivity. In each graph, each vertex has associated features
, and each edge
has features
A GNN produces a prediction for the label y of a graph G = (V, E) by the following procedure:
1. A function S is used to initialize a hidden state vector for each vertex
as a function of the vertex’s features (e.g., if the
are words, S could be a word embedding function):
2. For each round t out of T total rounds:
(a) Each vertex receives the vector
is the sum of “messages” from its neighbors, each produced by a function
(b) Each vertex updates its hidden state based on the message it received via a function
3. A function R, the “readout function”, produces a prediction based on the hidden states generated during the message passing (usually just those at from time T):
GNNs differ in how they implement all these functions are differentiable and most are parameterized, so the model is trainable via stochastic gradient descent of a loss function on
Our model consumes an input instance of source code and produces an output for a supervised learning task via the following five steps, sketched in Figure 1:
1. Parse the source code (snippet, file, repository, version control history, etc.) into an Abstract Syntax Tree.
2. Add edges of varying types (details in Supplementary Table 4) to this AST that represent semantic information like data– and control– flow, in the spirit of (Allamanis et al., 2018). Also add the reversed version of all edges with their own edge type. This results in a directed multigraph called an Augmented AST.
3. Further augment the Augmented AST by adding a Graph– Structured Cache. That is, add a node to the Augmented AST for each vocabulary word encountered in the input instance. Then connect each such “cache node” with an edge (of edge type WORD USE) to all variables whose names contain its word.
4. Vectorize the Augmented AST + GSC graph into a form suitable for a GNN. (That is, perform Step 1 from Section 3.2.) Each AST node that doesn’t represent a variable is vectorized as a learned embedding of the language construct it represents, e.g. Parameter, Method Declaration, etc. Each cache node and each node that represents a variable is vectorized as a learned linear map of the concatenation of a type embedding and a name embedding. The name embedding is a Character– Level Convolutional Neural Network (CharCNN) (Zhang et al., 2015) embedding of the word/name the node contains. The type embedding is a learned embedding of the name of the Java type of the token it contains, e.g. int, a user–defined class, etc., with cache nodes having their own unique Cache Node type.
5. Process the graph with a GNN, as per Section 3.2. (That is, perform Steps 2 and 3 from Section 3.2.) The readout functions differ depending on the task and are described in the Experiments section below.
Figure 1. Our model’s procedure for consuming a single input instance of source code and producing an output for a supervised learning task.
Our main contribution to previous works is the addition of Step 3, the Graph–Structured Cache step. The combination of relational information from the cache nodes’ connections and lexical information from these nodes’ CharCNN embeddings allows the model to, in principle, flexibly reason about words it never saw during training, but also recognize words it did. E.g. it could potentially see a class named “getGuavaDictionary” and a variable named “guava dict” and both (a) utilize the fact that the word “guava” is common to both names despite having never seen this word before, and (b) exploit learned representations for words like “get”, “dictionary”, and “dict” that it has seen during training.
We evaluated our model, described in Section 4, on two supervised tasks: a Fill–In–The–Blank task and a Variable Naming task. For each task, we compare our model to others that differ in how they parse the code and how they treat the words they encounter. Table 1 details the different variations of the procedure in Section 4 against which we compare our model.
Code to reproduce all experiments is available online.3 4
5.1. Data and Implementation Details
We used Java source code as the data for our experiments as it is among the most popular programming languages in use today (TIOBE, 2018; Github, 2017). To construct our dataset, we randomly selected 18 of the 100 most popular Java repos from the Maven repository5 to serve as training data. (See Supplementary Table 5 for the list.) Together these repositories contain about 500,000 non–empty, non– comment lines of code. We checked for excessive code duplication in our dataset (Lopes et al., 2017) using CPD6 and found only about 7% of the lines to be contiguous, duplicated code blocks containing more than 150 tokens.
We randomly chose 3 of these repositories to sequester as an “Unseen Repos” test set. We then separated out 15% of the files in the remaining 15 repositories to serve as our “Seen Repos” test set. The remaining files served as our training set, from which we separated 15% of the datapoints to act as a validation set.
Our data preprocessor builds on top of the open–source Javaparser7 library to generate ASTs of our source code and then augment the ASTs with the edges described in Supplementary Table 4. We used Apache MXNet8 as our deep learning framework. All hidden states in the GNN contained 64 units; all GNNs ran for 8 rounds of message passing; all models used a 2–layer CharCNN with max– pooling to perform the name embedding; all models were optimized using the Adam optimizer (Kingma & Ba, 2015); all inputs to the GNNs were truncated to a maximum size of 500 nodes centered on the <FILL-IN-THE-BLANK> or <NAME-ME> tokens, as in (Allamanis et al., 2018). About 53% of input graphs were larger than 500 nodes before truncation. The only regularization we used was early stopping — early in our experiments we briefly tried and dropout regularization, but saw no improvements.
We performed only a moderate amount of hyperparameter optimization, but all of it was done on the baseline models to avoid biasing our results in favor of our model. Specifically, we tuned all hyperparameters on the Closed Vocab baseline model, and also did a small amount of extra learning rate exploration for the Pointer Sentinel baseline model to try to maximize its performance.
5.2. The Fill–In–The–Blank Task
In this task we randomly selected a single usage of a variable in some source code, replaced it with a <FILL-IN-THE-BLANK> token, and then asked the model to predict what variable should have been there. An example instance from our dataset is shown in Figure 2. This task is a simplified formulation of the VARMISUSE task from (Allamanis et al., 2017) that accomplishes the same goal of finding misused variables in code.
The models indicate their prediction for what variable should go in the blank by pointing with neural attention over all the nodes in the AugAST. This means all training and test instances only considered cases where the obfuscated variable appears somewhere else in the code. Single uses are rare however, since in Java variables must be declared before they are used. It also means there are sometimes multiple usages of the same, correct variable to which a model can point to get the right answer. In our dataset 78% of variables were used more two times, and 33% were used more than four times.
The models compute the attention weightings for each Augmented AST node i differently depending on the readout function of the GNN they use. Models using a GGNN as their GNN component, as all those in Table 2 do, compute the attention weightings as per (Li et al., 2016):
where the fs are MLPs, is the hidden state of node v after t message passing iterations,
is the sigmoid function, and
is elementwise multiplication. The DTNN and RGCN GNNs compute the attention weightings as per (Sch¨utt et al., 2017):
where f is a single hidden layer MLP. The models were trained using a binary cross entropy loss computed across the nodes in the graph.
The performance of models using our GSC versus those using other methods is reported in Table 2. For context, a baseline strategy of random guessing among all variable nodes within an edge radius of 8 of the <FILL-IN-THE-BLANK> token achieves an accuracy of 0.22. We also compare the performance of different GNNs in Supplementary Table 6.
Table 1. Nomenclature used in Experiments section. Each abbreviation describes a tweak/ablation to our full model as presented in Section 4. Using this nomenclature, our full model as described in Section 4 and shown in Figure 1 would be an “AugAST–GSC” model.
Table 2. Accuracy on the Fill–In–The–Blank task. Our model is the AugAST–GSC. The first number in each cell is the accuracy of the model (1.0 is perfect accuracy), where a correct prediction is one in which the graph node that received the maximum attention weighting by the model contained the variable that was originally in the <FILL-IN-THE-BLANK> spot. The second, parenthetical numbers are the top–5 accuracies, i.e. whether the correct node was among those that received the 5 largest attentions weightings from the model. See Table 1 for explanations of the abbreviations. All models use Gated Graph Neural Networks as their GNN component.
5.3. The Variable Naming Task
In this task we replaced all usages of a name of a particular variable, method, class, or parameter in the code with the special token <NAME-ME>, and asked the model to produce the obfuscated name (in the form of the sequence of words that compose the name). An example instance from our dataset is shown in Figure 3. This task is the same as the
To produce a name from the output of the GNN, our models used the readout function of (Allamanis et al., 2018). This readout function computes the mean of the hidden states of the <NAME-ME> nodes and passing it as the initial hidden state to a 1–layer Gated Recurrent Unit (GRU) RNN (Cho et al., 2014). This GRU is then unrolled to produce words in
Figure 2. Example of a model’s procedure for completing the Fill–In–The–Blank task. Each Fill–In–The–Blank instance is created by replacing a single usage of a variable (n, in this example) with the special token <FILL-IN-THE-BLANK>. The model then processes the code as depicted in Figure 1. To produce outputs, the model’s readout function computes a soft–attention weighting over all nodes in the graph; the model’s output is the variable at the node on which it places maximal attention. In this example, if the model put maximal attention weighting on any of the green–highlighted variables, this would be a correct output. If maximal attention is placed on any other node, it would be an incorrect output. Only in–scope usages of a variable are counted as correct.
Figure 3. Example of a model’s procedure for completing the Variable Naming task. Each Variable Naming instance is created by replacing all uses of some variable (expectedLength, in this example) with a special <NAME-ME> token. The model then processes the code as depicted in Figure 1. To produce outputs, the model takes the mean of the <NAME-ME> nodes’ hidden states (depicted here in orange), uses them as the initial hidden state of a Recurrent Neural Network, and unrolls this RNN to produce a name as a sequence of words.
Table 3. Accuracy on the Variable Naming task. Our model is the AugAST–GSC. The first number in each cell is the accuracy of the model (1.0 is perfect accuracy), where we consider a correct output to be exact reproduction of the full name of the obfuscated variable (i.e. all the words in the name and then a EOS token). The second, parenthetical numbers are the top–5 accuracies, i.e. whether the correct full name was among the 5 most probable sequences output by the model. See Table 1 for explanations of the abbreviations. All models use Gated Graph Neural Networks as their GNN component.
its predicted name, in the style of a traditional NLP decoder. We used a fixed length unrolling of 8 words, as 99.8% of names in our training set were 8 or fewer words long. The models were trained by cross entropy loss over the sequence of words in the name.
To decode each hidden state output of the GRU h into a probability distribution over words w, the Closed Vocab and CharCNN models pass h through a linear layer and a softmax layer with output dimension equal to the number of words in their closed vocabularies (i.e. a traditional decoder output for NLP). In contrast, the GSC model not only has access to a fixed–size vocabulary but can also produce words by pointing to cache nodes in its Graph–Structured Cache. Specifically, it uses a decoder architecture inspired by the Pointer Sentinel Mixture Model of (Merity et al., 2017): the probability of a word w being the GSC decoder’s output given that the GRU’s hidden state was h is
where is a conditional probability distribution over cache nodes in the GSC and the sentinel s, and
is a conditional probability distribution over words in a closed vocabulary.
is computed by passing the hidden states of all cache nodes and the sentinel node through a single linear layer and then computing the softmax dot–product attention of these values with h.
is computed as the softmax of a linear mapping of h to indices in a closed vocabulary, as in the Closed Vocab and CharCNN models. If there is no cache node for w in the Augmented AST or if w is not in the model’s closed dictionary then
and
are 0, respectively.
The performance of our GSC versus other methods is reported in Table 3. More granular performance statistics are reported in Supplementary Table 8. We also compare the performance of different GNNs in Supplementary Table 7.
As can be seen in Tables 2 and 3, the addition of a GSC improved performance on all tasks. Our full model, the AugAST–GSC model, outperforms the other models tested and does comparatively well at maintaining accuracy between the Seen and Unseen test repos on the Variable Naming task.
To some degree the improved performance from adding the GSC is unsurprising: its addition to a graph–based model is adding extra features and doesn’t remove any information or flexibility. Under a satisfactory training regime, a model could simply learn to ignore it if it is unhelpful, so its inclusion should never hurt performance. The degree to which it helps, though, especially on the Variable Naming task, suggests that a GSC is well worth using for some tasks, whether on source code or in NLP tasks in general. Moreover, the fact that the Pointer Sentinel approach shown in Table 3 performs noticeably less well than the full GSC approach suggests that the relational aspect of the GSC is key. Simply having the ability to output out–of–vocabulary words without relational information about their usage, as in the Pointer Sentinal model, is insufficient for our task.
The downside of using a GSC is the computational cost. Our GSC models ran about 30% slower than the Closed Vocab models. Since we capped the graph size at 500 nodes, the slowdown is presumably due to the large number of edges to and from the graph cache nodes. Better support for sparse operations on GPU in deep learning frameworks would be useful for alleviating this downside.
In the near term, there remain a number of design choices to explore regarding AST– and GNN– models for processing source code. Adding information about word order to the GSC might improve performance, as might constructing the vocabulary out of subwords rather than words. It also might help to treat variable types as the GSC treats words: storing them in a GSC and connecting them with edges to the variables of those types; this could be particularly useful when working with code snippets rather than fully compilable code. For the Variable Naming task, there are also many architecture choices to be explored in how to produce a sequence of words for a name: how to unroll the RNN, what to use as the initial hidden state, etc.
In the longer term, given that all results above show that augmenting ASTs with data– and control–flow edges improves performance, it would be worth exploring other static analysis concepts from the Programming Language and Software Verification literatures and seeing whether they could be usefully incorporated into Augmented ASTs. Better understanding of how Graph Neural Networks learn is also crucial, since they are central to the performance of our model and many others. Additionally, the entire domain of machine learning on source code faces the practical issue that many of the best data for supervised learning on source code — things like high–quality code reviews, integration test results, code with high test coverage, etc. — are not available outside private organizations.
Many thanks to Miltos Allamanis and Hyokun Yun for their advice and useful conversations.
Allamanis, M., Barr, E. T., Devanbu, P., and Sutton, C. A survey of machine learning for big code and naturalness. arXiv:1709.06182 [cs], 2017. URL https://arxiv. org/abs/1709.06182.
Allamanis, M., Brockschmidt, M., and Khademi, M. Learning to represent programs with graphs. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum? id=BJOFETxR-.
Alon, U., Zilberstein, M., Levy, O., and Yahav, E. A gen- eral path-based representation for predicting program properties. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Im-
plementation, PLDI 2018, pp. 404–419, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5698-5. doi: 10.1145/3192366.3192412. URL http://doi.acm. org/10.1145/3192366.3192412.
Amodio, M., Chaudhuri, S., and Reps, T. Neural Attribute Machines for Program Generation. arXiv:1705.09231 [cs], May 2017. URL http://arxiv.org/abs/ 1705.09231.
Arabshahi, F., Singh, S., and Anandkumar, A. Combining symbolic expressions and black-box function evaluations for training neural programs. In International Conference on Learning Representations, 2018. URL https:// openreview.net/forum?id=Hksj2WWAW.
Bansal, T., Neelakantan, A., and McCallum, A. Relnet: End-to-end modeling of entities & relations. arXiv:1706.07179 [cs], 2017. URL http://arxiv. org/abs/1706.07179.
Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez- Gonzalez, A., Zambaldi, V. F., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., aglar G¨ulehre, Song, F., Ballard, A. J., Gilmer, J., Dahl, G. E., Vaswani, A., Allen, K. R., Nash, C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y., and Pascanu, R. Relational inductive biases, deep learning, and graph networks. abs/1806.01261, 2018. URL https://arxiv.org/abs/1806.01261.
Bhoopchand, A., Rockt¨aschel, T., Barr, E. T., and Riedel, S. Learning python code suggestion with a sparse pointer network. abs/1611.08307, 2016. URL https: //arxiv.org/abs/1611.08307.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. En- riching word vectors with subword information. TACL, 5: 135–146, 2017.
Brockschmidt, M., Allamanis, M., Gaunt, A. L., and Polo- zov, O. Generative code modeling with graphs. International Conference on Learning Representations, abs/1805.08490, 2019. URL https://arxiv.org/ abs/1805.08490.
Chae, K., Oh, H., Heo, K., and Yang, H. Automatically generating features for learning program analysis heuristics for c-like languages. Proc. ACM Program. Lang., 1(OOPSLA):101:1–101:25, October 2017. ISSN 2475-1421. doi: 10.1145/3133925. URL http: //doi.acm.org/10.1145/3133925.
Chen, X., Liu, C., and Song, D. Tree-to-tree neural net- works for program translation, 2018. URL https: //openreview.net/forum?id=rkxY-sl0W.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1179. URL http://www.aclweb. org/anthology/D14-1179.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1263–1272, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr. press/v70/gilmer17a.html.
Github. The State of the Octoverse 2017, 2017. URL https://octoverse.github.com/.
Goller, C. and Kuchler, A. Learning task-dependent distributed representations by backpropagation through structure. In Neural Networks, 1996., IEEE International Conference on, volume 1, pp. 347–352 vol.1, Jun 1996. doi: 10.1109/ICNN.1996.548916.
Grave, E., Ciss´e, M., and Joulin, A. Unbounded cache model for online language modeling with open vocabulary. In NIPS, 2017.
Gulcehre, C., Ahn, S., Nallapati, R., Zhou, B., and Ben- gio, Y. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 140–149. Association for Computational Linguistics, 2016. doi: 10.18653/v1/P16-1014. URL http: //www.aclweb.org/anthology/P16-1014.
Hellendoorn, V. J. and Devanbu, P. T. Are deep neural networks the best choice for modeling source code? In ESEC/SIGSOFT FSE, 2017.
Johnson, D. D. Learning Graphical State Transitions. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum? id=HJ0NvFzxl.
Kim, Y., Jernite, Y., Sontag, D., and Rush, A. M. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pp. 2741–2749. AAAI Press, 2016. URL http://dl.acm.org/citation. cfm?id=3016100.3016285.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. URL https://arxiv.org/ abs/1412.6980.
Kobayashi, S., Tian, R., Okazaki, N., and Inui, K. Dy- namic Entity Representation with Max-pooling Improves Machine Reading. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, June 2016. Association for Computational Linguistics. URL http: //www.aclweb.org/anthology/N16-1099.
Krinke, J. Identifying similar code with program depen- dence graphs. In Proceedings of the Eighth Working Conference on Reverse Engineering (WCRE’01), WCRE ’01, pp. 301–, Washington, DC, USA, 2001. IEEE Computer Society. ISBN 0-7695-1303-4. URL http://dl. acm.org/citation.cfm?id=832308.837142.
Li, J., Wang, Y., King, I., and Lyu, M. R. Code Com- pletion with Neural Attention and Pointer Networks. arXiv:1711.09573 [cs], November 2017. URL http: //arxiv.org/abs/1711.09573.
Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated Graph Sequence Neural Networks. In International Conference on Learning Representations, 2016. URL https://arxiv.org/abs/1511.05493.
Liu, C., Wang, X., Shin, R., Gonzalez, J. E., and Song, D. X. Neural code completion. 2017. URL https://openreview.net/forum? id=rJbPBt9lg¬eId=rJbPBt9lg.
Lopes, C. V., Maj, P., Martins, P. H., Saini, V., Yang, D., Zitny, J., Sajnani, H., and Vitek, J. D´ej`avu: a map of code duplicates on github. PACMPL, 1:84:1–84:28, 2017.
Lu, Z., Cui, H., Liu, X., Yan, Y., and Zheng, D. Object-oriented Neural Programming (OONP) for Document Understanding. arXiv:1709.08853 [cs], September 2017. URL http://arxiv.org/abs/1709. 08853. arXiv: 1709.08853.
Luong, M.-T. and Manning, C. D. Achieving Open Vo- cabulary Neural Machine Translation with Hybrid WordCharacter Models. arXiv:1604.00788 [cs], April 2016. URL http://arxiv.org/abs/1604.00788.
Maddison, C. J. and Tarlow, D. Structured generative models of natural source code. pp. II–649–II– 657, 2014. URL http://dl.acm.org/citation. cfm?id=3044805.3044965.
Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer Sentinel Mixture Models. In International Conference on Learning Representations, 2017. URL https:// openreview.net/pdf?id=Byj72udxe.
Pham, T., Tran, T., and Venkatesh, S. Graph Memory Networks for Molecular Activity Prediction. arXiv:1801.02622 [cs], 2018. URL http://arxiv. org/abs/1801.02622.
Rabinovich, M., Stern, M., and Klein, D. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1139–1149. Association for Computational Linguistics, 2017. doi: 10.18653/v1/P17-1105. URL http: //www.aclweb.org/anthology/P17-1105.
Raychev, V., Vechev, M., and Krause, A. Predicting program properties from ”big code”. In Proceedings of the 42Nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’15, pp. 111–124, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3300-9. doi: 10.1145/2676726.2677009. URL http: //doi.acm.org/10.1145/2676726.2677009.
Schlichtkrull, M., Kipf, T. N., Bloem, P., Berg, R. v. d., Titov, I., and Welling, M. Modeling Relational Data with Graph Convolutional Networks. arXiv:1703.06103 [cs, stat], March 2017. URL http://arxiv.org/abs/ 1703.06103.
Sch¨utt, K. T., Arbabzadah, F., Chmiela, S., M¨uller, K. R., and Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. In Nature communications, 2017.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL http: //www.aclweb.org/anthology/D13-1170.
TIOBE. TIOBE Index for September 2018, 2018. URL https://www.tiobe.com/tiobe-index/.
Vasic, M., Kanade, A., Maniatis, P., Bieber, D., and Singh, R. Neural program repair by jointly learning to localize and repair. April 2019. URL https://arxiv.org/ abs/1904.01720v1.
Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2692–2700. Curran
Associates, Inc., 2015. URL http://papers.nips. cc/paper/5866-pointer-networks.pdf.
White, M., Tufano, M., Vendome, C., and Poshyvanyk, D. Deep learning code fragments for code clone detection. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 87–98, Sept 2016. URL publications/C5.pdf.
Yin, P. and Neubig, G. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 440–450. Association for Computational Linguistics, 2017. doi: 10.18653/v1/P17-1041. URL http://www.aclweb. org/anthology/P17-1041.
Zaremba, W., Kurach, K., and Fergus, R. Learning to dis- cover efficient mathematical identities. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 1278–1286. Curran Associates, Inc., 2014.
Zhang, X., Zhao, J., and LeCun, Y. Character-level convo- lutional networks for text classification. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 649–657. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/ 5782-character-level-convolutional-networks-for-text-classification. pdf.
Table 4. Edge types used in Augmented ASTs. The initial AST is constructed using the AST and NEXT TOKEN edges, and then the remaining edges are added. In other words, the the “AST” model from Table 1 uses a graph that contains only the AST and NEXT TOKEN edge types (and WORD USE if it also uses a GSC), while the “AugAST” model contains all the edge types below. The reversed version of every edge is also added as its own type (e.g. reverse AST, reverse LAST READ) to let the GNN message passing occur in both directions.
Table 5. Repositories used in experiments. All were taken from the Maven repository (https://mvnrepository.com/). Entries are in the form “group/repository name/version”.
Table 6. Accuracy (and top–5 accuracy) on the Fill–In–The–Blank task, depending on which type of GNN the model uses. See Table 1 for explanations of the abbreviations. All models use AugAST as their code representation.
Table 7. Accuracy (and top–5 accuracy) on the Variable Naming task, depending on which type of GNN the model uses. See Table 1 for explanations of the abbreviations. All models use AugAST as their code representation.
Table 8. Extra information about performance on the Variable Naming task. Entries in this table are of the form “subword accuracy, edit distance, edit distance divided by real name length”. The edit distance is the mean of the character–wise Levenshtein distance between the produced name and the real name.