Every programmer has experienced the situation where they know what they want to do, but do not have the ability to turn it into a concrete implementation. For example, a Python programmer may want to “sort my list in descending order,” but not be able to come up with the proper syntax sorted(my list, reverse=True) to realize his intention. To resolve this impasse, it is common for programmers to search the web in natural language (NL), find an answer, and modify it into the desired form (Brandt et al., 2009, 2010). However, this is time-consuming, and thus the software engineering literature is ripe with methods to directly generate code from NL descriptions, mostly with hand-engineered methods highly tailored to specific programming languages (Balzer, 1985; Little and Miller, 2009; Gvero and Kuncak, 2015).
In parallel, the NLP community has developed methods for data-driven semantic parsing, which attempt to map NL to structured logical forms executable by computers. These logical forms can be general-purpose meaning representations (Clark and Curran, 2007; Banarescu et al., 2013), formalisms for querying knowledge bases (Tang and Mooney, 2001; Zettlemoyer and Collins, 2005; Berant et al., 2013) and instructions for robots or personal assistants (Artzi and Zettlemoyer, 2013; Quirk et al., 2015), among others. While these methods have the advantage of being learnable from data, compared to the programming languages (PLs) in use by programmers, the domain-specific languages targeted by these works have a schema and syntax that is relatively simple.
Recently, Ling et al. (2016) have proposed a data-driven code generation method for high-level, general-purpose PLs like Python and Java. This work treats code generation as a sequence-to-sequence modeling problem, and introduce methods to generate words from character-level models, and copy variable names from input descriptions. However, unlike most work in semantic parsing, it does not consider the fact that code has to be well-defined programs in the target syntax.
In this work, we propose a data-driven syntaxbased neural network model tailored for generation of general-purpose PLs like Python. In order to capture the strong underlying syntax of the PL, we define a model that transduces an NL statement into an Abstract Syntax Tree (AST; Fig. 1(a), ) for the target PL. ASTs can be deterministically generated for all well-formed programs using standard parsers provided by the PL, and thus give us a way to obtain syntax information with minimal engineering. Once we generate an AST, we can use deterministic generation tools to convert the AST into surface code. We hypothesize that such a structured approach has two benefits.
Table 1: Example production rules for common Python statements (Python Software Foundation, 2016)
First, we hypothesize that structure can be used to constrain our search space, ensuring generation of well-formed code. To this end, we propose a syntax-driven neural code generation model. The backbone of our approach is a grammar model () which formalizes the generation story of a derivation AST into sequential application of actions that either apply production rules (
emit terminal tokens (
). The underlying syntax of the PL is therefore encoded in the grammar model a priori as the set of possible actions. Our approach frees the model from recovering the underlying grammar from limited training data, and instead enables the system to focus on learning the compositionality among existing grammar rules. Xiao et al. (2016) have noted that this imposition of structure on neural models is useful for semantic parsing, and we expect this to be even more important for general-purpose PLs where the syntax trees are larger and more complex.
Second, we hypothesize that structural information helps to model information flow within the neural network, which naturally reflects the recursive structure of PLs. To test this, we extend a standard recurrent neural network (RNN) decoder to allow for additional neural connections which reflect the recursive structure of an AST (As an example, when expanding the node
Fig. 1(a), we make use of the information from both its parent and left sibling (the dashed rectangle). This enables us to locally pass information of relevant code segments via neural network connections, resulting in more confident predictions.
Experiments () on two Python code generation tasks show 11.7% and 9.3% absolute improvements in accuracy against the state-of-the-art system (Ling et al., 2016). Our model also gives competitive performance on a standard semantic parsing benchmark.
Given an NL description x, our task is to generate the code snippet c in a modern PL based on the intent of x. We attack this problem by first generating the underlying AST. We define a probabilistic grammar model of generating an AST y given x: p(y|x). The best-possible AST is then given by
is then deterministically converted to the corresponding surface code
While this paper uses examples from Python code, our method is PLagnostic.
Before detailing our approach, we first present a brief introduction of the Python AST and its underlying grammar. The Python abstract grammar contains a set of production rules, and an AST is generated by applying several production rules composed of a head node and multiple child nodes. For instance, the first rule in Tab. 1 is used to generate the function call Fig. 1(a). It consists of a head node of type Call, and three child nodes of type expr, expr* and keyword*, respectively. Labels of each node are noted within brackets. In an AST, non-terminal nodes sketch the general structure of the target code, while terminal nodes can be categorized into two types: operation terminals and variable terminals. Operation terminals correspond to basic arithmetic operations like AddOp.Variable terminal nodes store values for variables and constants of built-in data types2. For instance, all terminal nodes in Fig. 1(a) are variable terminal nodes.
Before detailing our neural code generation method, we first introduce the grammar model at its core. Our probabilistic grammar model defines the generative story of a derivation AST. We factorize the generation process of an AST into sequential application of actions of two types:
• APPLYRULE[r] applies a production rule r to the current derivation tree;
Figure 1: (a) the Abstract Syntax Tree (AST) for the given example code. Dashed nodes denote terminals. Nodes are labeled with time steps during which they are generated. (b) the action sequence (up to ) used to generate the AST in (a)
• GENTOKEN[v] populates a variable terminal node by appending a terminal token v.
Fig. 1(b) shows the generation process of the target AST in Fig. 1(a). Each node in Fig. 1(b) indicates an action. Action nodes are connected by solid arrows which depict the chronological order of the action flow. The generation proceeds in depth-first, left-to-right order (dotted arrows represent parent feeding, explained in
Formally, under our grammar model, the probability of generating an AST y is factorized as:
where is the action taken at time step
is the sequence of actions before t. We will explain how to compute Eq. (2) in
. Put simply, the generation process begins from a
and proceeds by the model choosing APPLYRULE actions to generate the overall program structure from a closed set of grammar rules, then at leaves of the tree corresponding to variable terminals, the model switches to GENTOKEN actions to generate variables or constants from the open set. We describe this process in detail below.
3.1 APPLYRULE Actions
APPLYRULE actions generate program structure, expanding the current node (the frontier node at time step ) in a depth-first, left-to-right traversal of the tree. Given a fixed set of production rules, APPLYRULE chooses a rule r from the subset that has a head matching the type of
and uses
by appending all child nodes specified by the selected production. As an example, in Fig. 1(b), the rule
expands the frontier node Call at time step
its three child nodes expr, expr* and keyword* are added to the derivation.
APPLYRULE actions grow the derivation AST by appending nodes. When a variable terminal node (e.g., str) is added to the derivation and becomes the frontier node, the grammar model then switches to GENTOKEN actions to populate the variable terminal with tokens.
Unary Closure Sometimes, generating an AST requires applying a chain of unary productions. For instance, it takes three time steps (to generate the sub-structure
(a). This can be effectively reduced to one step of APPLYRULE action by taking the closure of the chain of unary productions and merging them into a single rule:
str. Unary closures reduce the number of actions needed, but would potentially increase the size of the grammar. In our experiments we tested our model both with and without unary closures (
3.2 GENTOKEN Actions
Once we reach a frontier node that corresponds to a variable type (e.g.,
are used to fill this node with values. For general-purpose PLs like Python, variables and constants have values with one or multiple tokens. For instance, a node that stores the name of a function (e.g., sorted) has a single token, while a node that denotes a string constant (e.g.,
) could have multiple tokens. Our model copes with both scenarios by firing GENTOKEN actions at one or more time steps. At each time step, GENTOKEN appends one terminal token to the current frontier variable node. A special </n> token is used to “close” the node. The grammar model then proceeds to the new frontier node.
Terminal tokens can be generated from a pre-defined vocabulary, or be directly copied from the input NL. This is motivated by the observation that the input description often contains out-of-vocabulary (OOV) variable names or literal values that are directly used in the target code. For instance, in our running example the variable name my list can be directly copied from the the input at . We give implementation details in
We estimate action probabilities in Eq. (2) using attentional neural encoder-decoder models with an information flow structured by the syntax trees.
4.1 Encoder
For an NL description x consisting of n words , the encoder computes a context sen- sitive embedding
using a bidirectional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997), similar to the setting in (Bahdanau et al., 2014). See supplementary materials for detailed equations.
4.2 Decoder
The decoder uses an RNN to model the sequential generation process of an AST defined as Eq. (2). Each action step in the grammar model naturally grounds to a time step in the decoder RNN. Therefore, the action sequence in Fig. 1(b) can be interpreted as unrolling RNN time steps, with solid arrows indicating RNN connections. The RNN maintains an internal state to track the generation process (), which will then be used to compute action probabilities
4.2.1 Tracking Generation States
Our implementation of the decoder resembles a vanilla LSTM, with additional neural connections (parent feeding, Fig. 1(b)) to reflect the topological structure of an AST. The decoder’s internal hidden state at time step , is given by:
where is the LSTM update function. [:] denotes vector concatenation.
will then be used to compute action probabilities
in Eq. (2). Here,
is the embedding of the previous action.
is a context vector retrieved from
Figure 2: Illustration of a decoder time step (t = 9)
input encodings via soft attention.
vector that encodes the information of the parent action.
denotes the node type embedding of the current frontier node
. Intuitively, feeding the decoder the information of
helps the model to keep track of the frontier node to expand. Action Embedding
We maintain two action embedding matrices,
. Each row in
) corresponds to an embedding vector for an action APPLYRULE
Context Vector
The decoder RNN uses soft attention to retrieve a context vector
from the input encodings
pertain to the prediction of the current action. We follow Bahdanau et al. (2014) and use a Deep Neural Network (DNN) with a single hidden layer to compute attention weights. Parent Feeding
Our decoder RNN uses additional neural connections to directly pass information from parent actions. For instance, when computing
, the information from its parent action step
will be used. Formally, we define the parent action step
as the time step at which the frontier node
is generated. As an example, for
, its parent action step
is the node
, which is generated at
APPLYRULE
We model parent information sources: (1) the hidden state of parent action
and (2) the embedding of parent action
the concatenation. The parent feeding schema en- ables the model to utilize the information of parent code segments to make more confident predictions. Similar approaches of injecting parent information were also explored in the SEQ2TREE model in Dong and Lapata (2016)4.
4.2.2 Calculating Action Probabilities
In this section we explain how action probabilities are computed based on
APPLYRULE The probability of applying rule r as the current action is given by a softmax5:
where is a non-linearity
e(r) the one-hot vector for rule r.
GENTOKEN As in can be generated from a predefined vocabulary or copied from the input, defined as the marginal probability:
The selection probabilities are given by
ability of generating v from the vocabulary,
, is defined similarly as Eq. (4), except that we use the GENTOKEN embedding matrix
, and we concatenate the context vector
as input. To model the copy probability, we follow recent advances in modeling copying mechanism in neural networks (Gu et al., 2016; Jia and Liang, 2016; Ling et al., 2016), and use a pointer network (Vinyals et al., 2015) to compute the probability of copying the i-th word from the input by attending to input representations
where is a DNN with a single hidden layer. Specifically, if
is an OOV word (e.g., my list, which is represented by a special <unk> token in encoding), we directly copy the actual word
the derivation.
4.3 Training and Inference
Given a dataset of pairs of NL descriptions code snippets
into its AST
decompose
into a sequence of oracle actions under the grammar model. The model is then optimized by maximizing the log-likelihood of the oracle action sequence. At inference time, we use beam search to approximate the best AST
in Eq. (1). See supplementary materials for the pseudo-code of the inference algorithm.
Table 2: Statistics of datasets and associated grammars
5.1 Datasets and Metrics
HEARTHSTONE (HS) dataset (Ling et al., 2016) is a collection of Python classes that implement cards for the card game HearthStone. Each card comes with a set of fields (e.g., name, cost, and description), which we concatenate to create the input sequence. This dataset is relatively difficult: input descriptions are short, while the target code is in complex class structures, with each AST having 137 nodes on average.
DJANGO dataset (Oda et al., 2015) is a collection of lines of code from the Django web framework, each with a manually annotated NL description. Compared with the HS dataset where card implementations are somewhat homogenous, examples in DJANGO are more diverse, spanning a wide variety of real-world use cases like string manipulation, IO operations, and exception handling.
IFTTT dataset (Quirk et al., 2015) is a domain-specific benchmark that provides an interesting side comparison. Different from HS and DJANGO which are in a general-purpose PL, programs in IFTTT are written in a domain-specific language used by the IFTTT task automation App. Users of the App write simple instructions (e.g., If Instagram.AnyNewPhotoByYou Then Dropbox.AddFileFromURL) with NL descriptions (e.g., “Autosave your Instagram photos to Dropbox”). Each statement inside the If or Then clause consists of a channel (e.g., Dropbox) and a function (e.g., simple structure results in much more concise ASTs (7 nodes on average). Because all examples are created by ordinary Apps users, the dataset is highly noisy, with input NL very loosely connected to target ASTs. The authors thus provide a high-quality filtered test set, where each example is verified by at least three annotators. We use this set for evaluation. Also note IFTTT’s grammar has more productions (Tab. 2), but this does not imply that its grammar is more complex. This is because for HS and DJANGO terminal tokens are generated by GENTOKEN actions, but for IFTTT, all the code is generated directly by APPLYRULE actions.
Metrics As is standard in semantic parsing, we measure accuracy, the fraction of correctly generated examples. However, because generating an exact match for complex code structures is nontrivial, we follow Ling et al. (2016), and use tokenlevel BLEU-4 with as a secondary metric, defined as the averaged BLEU scores over all examples.7
5.2 Setup
Preprocessing All input descriptions are tokenized using NLTK. We perform simple canonicalization for DJANGO, such as replacing quoted strings in the inputs with place holders. See supplementary materials for details. We extract unary closures whose frequency is larger than a threshold k (k = 30 for HS and
Configuration The size of all embeddings is 128, except for node type embeddings, which is 64. The dimensions of RNN states and hidden layers are 256 and 50, respectively. Since our datasets are relatively small for a data-hungry neural model, we impose strong regularization using recurrent dropouts (Gal and Ghahramani, 2016), together with standard dropout layers added to the inputs and outputs of the decoder RNN. We validate the dropout probability from {0, 0.2, 0.3, 0.4}. For decoding, we use a beam size of 15.
5.3 Results
Evaluation results for Python code generation tasks are listed in Tab. 3. Numbers for our sys-
Table 3: Results on two Python code generation tasks. Results previously reported in Ling et al. (2016).
tems are averaged over three runs. We compare primarily with two approaches: (1) Latent Predictor Network (LPN), a state-of-the-art sequence-to-sequence code generation model (Ling et al., 2016), and (2) SEQ2TREE, a neural semantic parsing model (Dong and Lapata, 2016). SEQ2TREE generates trees one node at a time, and the target grammar is not explicitly modeled a priori, but implicitly learned from data. We test both the original SEQ2TREE model released by the authors and our revised one (SEQ2TREE–UNK) that uses unknown word replacement to handle rare words (Luong et al., 2015). For completeness, we also compare with a strong neural machine translation (NMT) system (Neubig, 2015) using a standard encoder-decoder architecture with attention and unknown word replacement8, and include numbers from other baselines used in Ling et al. (2016). On the HS dataset, which has relatively large ASTs, we use unary closure for our model and SEQ2TREE, and for DJANGO we do not.
System Comparison As in Tab. 3, our model registers 11.7% and 9.3% absolute improvements over LPN in accuracy on HS and DJANGO. This boost in performance strongly indicates the importance of modeling grammar in code generation. For the baselines, we find LPN outperforms others in most cases. We also note that SEQ2TREE achieves a decent accuracy of 13.6% on HS, which is due to the effect of unknown word replacement, since we only achieved 1.5% without it. A closer
Figure 3: Performance w.r.t reference AST size on DJANGO
Figure 4: Performance w.r.t reference AST size on HS
comparison with SEQ2TREE is insightful for understanding the advantage of our syntax-driven approach, since both SEQ2TREE and our system output ASTs: (1) SEQ2TREE predicts one node each time step, and requires additional “dummy” nodes to mark the boundary of a subtree. The sheer number of nodes in target ASTs makes the prediction process error-prone. In contrast, the APPLYRULE actions of our grammar model allows for generating multiple nodes at a single time step. Empirically, we found that in HS, SEQ2TREE takes more than 300 time steps on average to generate a target AST, while our model takes only 170 steps. (2) SEQ2TREE does not directly use productions in the grammar, which possibly leads to grammatically incorrect ASTs and thus empty code outputs. We observe that the ratio of grammatically incorrect ASTs predicted by SEQ2TREE on HS and DJANGO are 21.2% and 10.9%, respectively, while our system guarantees grammaticality.
Ablation Study We also ablated our bestperforming models to analyze the contribution of each component. “–frontier embed.” removes the frontier node embedding from the decoder RNN inputs (Eq. (3)). This yields worse results on DJANGO while gives slight improvements in accuracy on HS. This is probably because that the grammar of HS has fewer node types, and thus the RNN is able to keep track of
without depending on its embedding. Next, “–parent feed.” removes the parent feeding mechanism. The accuracy drops significantly on HS, with a marginal deterioration on DJANGO. This result is interesting because it suggests that parent feeding is more important when the ASTs are larger, which will be the case when handling more complicated code generation tasks like HS. Finally, removing the pointer network (“–copy terminals”) in GENTO-
Table 4: Results on the noise-filtered IFTTT test set of “>3 agree with gold annotations” (averaged over three runs), our model performs competitively among neural models.
KEN actions gives poor results, indicating that it is important to directly copy variable names and values from the input.
The results with and without unary closure demonstrate that, interestingly, it is effective on HS but not on DJANGO. We conjecture that this is because on HS it significantly reduces the number of actions from 173 to 142 (c.f., Tab. 2), with the number of productions in the grammar remaining unchanged. In contrast, DJANGO has a broader domain, and thus unary closure results in more productions in the grammar (237 for DJANGO vs. 100 for HS), increasing sparsity. Performance by the size of AST We further investigate our model’s performance w.r.t. the size of the gold-standard ASTs in Figs. 3 and 4. Not surprisingly, the performance drops when the size of the reference ASTs increases. Additionally, on the HS dataset, the BLEU score still remains at around 50 even when the size of ASTs grows to 200, indicating that our proposed syntax-driven approach is robust for long code segments. Domain Specific Code Generation Although this is not the focus of our work, evaluation on IFTTT brings us closer to a standard semantic parsing setting, which helps to investigate similarities and differences between generation of more complicated general-purpose code and and more limiteddomain simpler code. Tab. 4 shows the results, following the evaluation protocol in (Beltagy and Quirk, 2016) for accuracies at both channel and full parse tree (channel + function) levels. Our full model performs on par with existing neural network-based methods, while outperforming other neural models in full tree accuracy (82.0%). This score is close to the best classical method (LR), which is based on a logistic regression
Table 5: Predicted examples from HS (1st) and DJANGO. Copied contents (copy probability > 0.9) are highlighted.
model with rich hand-engineered features (e.g., brown clusters and paraphrase). Also note that the performance between NMT and other neural models is much closer compared with the results in Tab. 3. This suggests that general-purpose code generation is more challenging than the simpler IFTTT setting, and therefore modeling structural information is more helpful. Case Studies We present output examples in Tab. 5. On HS, we observe that most of the time our model gives correct predictions by filling learned code templates from training data with arguments (e.g., cost) copied from input. However, we do find interesting examples indicating that the model learns to generalize beyond trivial copying. For instance, the first example is one that our model predicted wrong — it generated code block A instead of the gold B (it also missed a function definition not shown here). However, we find that the block A actually conveys part of the input intent by destroying all, not some, of the minions. Since we are unable to find code block A in the training data, it is clear that the model has learned to generalize to some extent from multiple training card examples with similar semantics or structure.
The next two examples are from DJANGO. The first one shows that the model learns the usage of common API calls (e.g., os.path.join), and
how to populate the arguments by copying from inputs. The second example illustrates the dif-ficulty of generating code with complex nested structures like lambda functions, a scenario worth further investigation in future studies. More examples are attached in supplementary materials. Error Analysis To understand the sources of errors and how good our evaluation metric (exact match) is, we randomly sampled and labeled 100 and 50 failed examples (with accuracy=0) from DJANGO and HS, resp. We found that around 2% of these examples in the two datasets are actually semantically equivalent. These examples include: (1) using different parameter names when defining a function; (2) omitting (or adding) default values of parameters in function calls. While the rarity of such examples suggests that our exact match metric is reasonable, more advanced evaluation metrics based on statistical code analysis are definitely intriguing future work.
For DJANGO, we found that 30% of failed cases were due to errors where the pointer network failed to appropriately copy a variable name into the correct position. 25% were because the generated code only partially implementated the required functionality. 10% and 5% of errors were due to malformed English inputs and preprocessing errors, respectively. The remaining 30% of examples were errors stemming from multiple sources, or errors that could not be easily categorized into the above. For HS, we found that all failed card examples were due to partial implementation errors, such as the one shown in Table 5.
Code Generation and Analysis Most existing works on code generation focus on generating code for domain specific languages (DSLs) (Kush- man and Barzilay, 2013; Raza et al., 2015; Man- shadi et al., 2013), with neural network-based approaches recently explored (Parisotto et al., 2016; Balog et al., 2016). For general-purpose code generation, besides the general framework of Ling et al. (2016), existing methods often use language and task-specific rules and strategies (Lei et al., 2013; Raghothaman et al., 2016). A similar line is to use NL queries for code retrieval (Wei et al., 2015; Allamanis et al., 2015). The reverse task of generating NL summaries from source code has also been explored (Oda et al., 2015; Iyer et al., 2016). Finally, there are probabilistic models of source code (Maddison and Tarlow, 2014; Nguyen et al., 2013). The most relevant work is Allama- nis et al. (2015), which uses a factorized model to measure semantic relatedness between NL and ASTs for code retrieval, while our model tackles the more challenging generation task.
Semantic Parsing Our work is related to the general topic of semantic parsing, where the target logical forms can be viewed as DSLs. The parsing process is often guided by grammatical formalisms like combinatory categorical grammars (Kwiatkowski et al., 2013; Artzi et al., 2015), dependency-based syntax (Liang et al., 2011; Pasupat and Liang, 2015) or task-specific formalisms (Clarke et al., 2010; Yih et al., 2015; Krishnamurthy et al., 2016; Misra et al., 2015; Mei et al., 2016). Recently, there are efforts in designing neural network-based semantic parsers (Misra and Artzi, 2016; Dong and Lapata, 2016; Nee- lakantan et al., 2016; Yin et al., 2016). Several approaches have be proposed to utilize grammar knowledge in a neural parser, such as augmenting the training data by generating examples guided by the grammar (Kocisk´y et al., 2016; Jia and Liang, 2016). Liang et al. (2016) used a neural decoder which constrains the space of next valid tokens in the query language for question answering. Finally, the structured prediction approach proposed by Xiao et al. (2016) is closely related to our model in using the underlying grammar as prior knowledge to constrain the generation process of derivation trees, while our method is based on a unified grammar model which jointly captures production rule application and terminal symbol generation, and scales to general purpose code generation tasks.
This paper proposes a syntax-driven neural code generation approach that generates an abstract syntax tree by sequentially applying actions from a grammar model. Experiments on both code generation and semantic parsing tasks demonstrate the effectiveness of our proposed approach.
We are grateful to Wang Ling for his generous help with LPN and setting up the benchmark. We also thank Li Dong for helping with SEQ2TREE and insightful discussions.
Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gor- don, and Yi Wei. 2015. Bimodal modelling of source code and natural language. In Proceedings of ICML. volume 37.
David Alvarez-Melis and Tommi S. Jaakkola. 2017. Tree-structured decoding with doubly recurrent neural networks. In Proceedings of ICLR.
Yoav Artzi, Kenton Lee, and Luke Zettlemoyer. 2015. Broad-coverage CCG semantic parsing with AMR. In Proceedings of EMNLP.
Yoav Artzi and Luke Zettlemoyer. 2013. Weakly su- pervised learning of semantic parsers for mapping instructions to actions. Transaction of ACL 1(1).
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. 2016. Deepcoder: Learning to write programs. CoRR abs/1611.01989.
Robert Balzer. 1985. A 15 year perspective on automatic programming. IEEE Trans. Software Eng. 11(11).
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, LAW-ID@ACL.
I. Beltagy and Chris Quirk. 2016. Improved seman- tic parsers for if-then statements. In Proceedings of ACL.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In Proceedings of EMNLP.
Joel Brandt, Mira Dontcheva, Marcos Weskamp, and Scott R. Klemmer. 2010. Example-centric programming: integrating web search into the development environment. In Proceedings of CHI.
Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, and Scott R. Klemmer. 2009. Two studies of opportunistic programming: interleaving web foraging, learning, and writing code. In Proceedings of CHI.
Stephen Clark and James R. Curran. 2007. Widecoverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics 33(4).
James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world’s response. In Proceedings of CoNLL.
Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of ACL.
Yarin Gal and Zoubin Ghahramani. 2016. A theoret- ically grounded application of dropout in recurrent neural networks. In Proceedings of NIPS.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of ACL.
Tihomir Gvero and Viktor Kuncak. 2015. Interactive synthesis using free-form queries. In Proceedings of ICSE.
Sepp Hochreiter and J¨urgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8).
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of ACL.
Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of ACL.
Tom´as Kocisk´y, G´abor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. 2016. Semantic parsing with semi-supervised sequential autoencoders. In Proceedings of EMNLP.
Jayant Krishnamurthy, Oyvind Tafjord, and Aniruddha Kembhavi. 2016. Semantic parsing to probabilistic programs for situated question answering. In Proceedings of EMNLP.
Nate Kushman and Regina Barzilay. 2013. Using se- mantic unification to generate regular expressions from natural language. In Proceedings of NAACL.
Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke S. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the EMNLP.
Tao Lei, Fan Long, Regina Barzilay, and Martin C. Ri- nard. 2013. From natural language specifications to program input parsers. In Proceedings of ACL.
Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2016. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. CoRR abs/1611.00020.
Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of ACL.
Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tom´as Kocisk´y, Fumin Wang, and Andrew Senior. 2016. Latent predictor networks for code generation. In Proceedings of ACL.
Greg Little and Robert C. Miller. 2009. Keyword pro- gramming in java. Autom. Softw. Eng. 16(1).
Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.
Chris J. Maddison and Daniel Tarlow. 2014. Structured generative models of natural source code. In Proceedings of ICML. volume 32.
Mehdi Hafezi Manshadi, Daniel Gildea, and James F. Allen. 2013. Integrating programming by example and natural language programming. In Proceedings of AAAI.
Hongyuan Mei, Mohit Bansal, and Matthew R. Wal- ter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Proceedings of AAAI.
Dipendra K. Misra and Yoav Artzi. 2016. Neural shift- reduce CCG semantic parsing. In Proceedings of EMNLP.
Dipendra Kumar Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. 2015. Environment-driven lexicon induction for high-level instructions. In Proceedings of ACL.
Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2016. Neural programmer: Inducing latent programs with gradient descent. In Proceedings of ICLR.
Graham Neubig. 2015. lamtram: A toolkit for language and translation modeling using neural networks. http://www.github.com/neubig/lamtram.
Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen. 2013. A statistical semantic language model for source code. In Proceedings of ACM SIGSOFT.
Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation (T). In Proceedings of ASE.
Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. 2016. Neuro-symbolic program synthesis. CoRR abs/1611.01855.
Panupong Pasupat and Percy Liang. 2015. Composi- tional semantic parsing on semi-structured tables. In Proceedings of ACL.
Python Software Foundation. 2016. Python abstract grammar. https://docs.python.org/2/library/ast.html.
Chris Quirk, Raymond J. Mooney, and Michel Galley. 2015. Language to code: Learning semantic parsers for if-this-then-that recipes. In Proceedings of ACL.
Mukund Raghothaman, Yi Wei, and Youssef Hamadi. 2016. SWIM: synthesizing what i mean: code search and idiomatic snippet synthesis. In Proceedings of ICSE.
Mohammad Raza, Sumit Gulwani, and Natasa Milic- Frayling. 2015. Compositional program synthesis from natural language and examples. In Proceedings of IJCAI.
Lappoon R. Tang and Raymond J. Mooney. 2001. Us- ing multiple clause constructors in inductive logic programming for semantic parsing. In Proceedings of ECML.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Proceedings of NIPS.
Yi Wei, Nirupama Chandrasekaran, Sumit Gulwani, and Youssef Hamadi. 2015. Build- ing bing developer assistant. Technical report. https://www.microsoft.com/en- us/research/publication/building-bing-developer- assistant/.
Chunyang Xiao, Marc Dymetman, and Claire Gardent. 2016. Sequence-based structured prediction for semantic parsing. In Proceedings of ACL.
Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of ACL.
Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. 2016. Neural enquirer: Learning to query tables in natural language. In Proceedings of IJCAI.
Luke Zettlemoyer and Michael Collins. 2005. Learn- ing to map sentences to logical form structured clas-sification with probabilistic categorial grammars. In Proceedings of UAI.
Suppose the input natural language description x consists of denote the embedding of
. We use two LSTMs to process x in forward and backward order, and get the sequence of hidden states
the two directions:
where are standard LSTM up- date functions. The representation of the i-th word,
, is given by concatenating
Given an NL description, we approximate the best AST in Eq. 1 using beam search. The inference procedure is listed in Algorithm 1.
We maintain a beam of size K. The beam is initialized with one hypothesis AST with a single root node (line 2). At each time step, the decoder enumerates over all hypotheses in the beam. For each hypothesis AST, we first find its frontier node is a non-terminal node, we collect all syntax rules
as the head node to the actions set (line 10). If
is a variable terminal node, we add all terminal tokens in the vocabulary and the input description as candidate actions (line 13). We apply each candidate action on the current hypothesis AST to generate a new hypothesis (line 15). We then rank all newly generated hypotheses and keep the top-K scored ones in the beam. A complete hypothesis AST is generated when it has no frontier node. We then convert the top-scored complete AST into the surface code (lines 18-19).
We remark that our inference algorithm can be implemented efficiently by expanding multiple hypotheses (lines 5-16) simultaneously using mini-batching on GPU.
Infrequent Words We replace word types whose frequency is lower than d with a special <unk> token (, 3 for HS and 2 for IFTTT).
Canonicalization We perform simple canonicalization for the DJANGO dataset: (1) We observe that input descriptions often come with quoted string literals (e.g., verbose name is a string ‘cache entry’). We therefore replace quoted strings with indexed placeholders using regular expression. After decoding, we run a postprocessing step to replace all placeholders with their actual values. (2) For descriptions with cascading variable reference (e.g., call method self.makekey), we append after the whole variable
name with tokens separated by ‘.’ (e.g., append self and makekey after self.makekey). This gives the pointer network flexibility to copy either partial or whole variable names.
Generate Oracle Action Sequence To train our model, we generate the gold-standard action sequence from reference code. For IFTTT, we simply parse the officially provided ASTs into sequences of APPLYRULE actions. For HS and DJANGO, we first convert the Python code into ASTs using the standard ast module. Values inside variable terminal nodes are tokenized by space and camel case (e.g., ClassName is tokenized to Class and Name). We then traverse the AST in pre-order to generate the reference action sequence according to the grammar model.
We provide extra decoding examples from the DJANGO and HS datasets, listed in Table 6 and Table 7, respectively. The model heavily relies on the pointer network to copy variable names and constants from input descriptions. We find the source of errors in DJANGO is more diverse, with most incorrect examples resulting from missing arguments and incorrect words copied by the pointer network. Errors in HS are mostly due to partially or incorrectly implemented effects. Also note that the first example in Table 6 is semantically correct, although it was considered incorrect under our exact-match metric. This suggests more advanced evaluation metric that takes into account the execution results in future studies.
Table 6: Predicted examples from DJANGO dataset. Copied contents (copy probability > 0.9) are highlighted
Table 7: Predicted card examples from HS dataset. Copied contents (copy probability > 0.9) are highlighted.