One great advantage of electronic storage of documents is the ease with which we can edit them, and edits are performed in a wide variety of contents. For example, right before a conference deadline, papers worldwide are finalized and polished, often involving common fixes for grammar, clarity and style. Would it be possible to automatically extract rules from these common edits? Similarly, program source code is constantly changed to implement new features, follow best practices and fix bugs. With the widespread deployment of (implicit) version control systems, these edits are quickly archived, creating a major data stream that we can learn from.
In this work, we study the problem of learning distributed representations of edits. We only look at small edits with simple semantics that are more likely to appear often and do not consider larger edits; i.e., we consider “add definite articles” rather than “rewrite act 2, scene 3.” Concretely, we focus on two questions: i) Can we group semantically equivalent edits together, so that we can automatically recognize common edit patterns? ii) Can we automatically transfer edits from one context to another? A solution to the first question would yield a practical tool for copy editors and programmers alike, automatically identifying the most common changes. By leveraging tools from program synthesis, such groups of edits could be turned into interpretable rules and scripts (Rolim et al., 2017). When there is no simple hard rule explaining how to apply an edit, an answer to the second question would be of great use, e.g., to automatically rewrite natural language following some stylistic rule.
We propose to handle edit data in an autoencoder-style framework, in which an “edit encoder” is trained to compute a representation of an edit
, and a “neural editor”
is trained to construct
from the edit representation and
. This framework ensures that the edit representation is semantically meaningful, and a sufficiently strong neural editor allows this representation to not be specific to the changed element. We experiment with various neural architectures that can learn to represent and apply edits and hope to direct the attention of the research community to this new and interesting data source, leading to better datasets and stronger models.
Briefly, the contributions of our paper are: (a) in Sect. 2, we present a new and important machine learning task on learning representations of edits (b) we present a family of
Figure 1: Given an edit (Edit 1) of computes an edit representation vector. Using that representation vector the neural editor
applies the same edit to a new
. The code snippets shown here are real code change examples from the roslyn open-source compiler project.
models that capture the structure of edits and compute efficient representations in Sect. 3 (c) we create a new source code edit dataset, and release the data extraction code at
and the data at . (d) we perform a set of experiments on the learned edit representations in Sect. 4 for natural language text and source code and present promising empirical evidence that our models succeed in capturing the semantics of edits.
In this work, we are interested in learning to represent and apply edits on discrete sequential or structured data, such as text or source code parse trees1. Figure 1 gives a graphical overview of the task, described precisely below.
Edit Representation Given a dataset of edits is the original version of some object and
its edited form (see upper half of Figure 1 for an example), our goal is to learn a representation function
that maps an edit operation
to a real-valued edit representation
A desired quality of
is for the computed edit representations to have the property that semantically similar edits have nearby representations in
. Having distributed representations also allows other interesting downstream tasks, e.g., unsupervised clustering and visualization of similar edits from large-scale data (e.g. the GitHub commit stream), which would be useful for developing human-assistance toolkits for discovering and extracting emerging edit patterns (e.g. new bug fixes or emerging “best practices” of coding).
Neural Editor Given an edit representation function , we want to learn to apply edits in a new context. This can be achieved by learning a neural editor
that accepts an edit representation
and a new input
and generates
This is illustrated in the lower half of Figure 1.
We cast the edit representation problem as an autoencoding task, where we aim to minimize the reconstruction error of for the edited version
given the edit representation
original version
. By limiting the capacity of
’s output and allowing the model to freely use information about
, we are introducing a “bottleneck” that forces the overall framework to not simply treat
as an encoder of
. The main difference from traditional autoencoders is that in our setup, an optimal solution requires to re-use as much information as possible from
Figure 2: (a) Graph representation of statement u = x + x. Rectangular (resp. rounded) nodes denote tokens (resp. non-terminals). (b) Sequence of tree decoding steps yielding x + x - 23, where x + x is copied (using the TREECP action) from the context graph in (a).
to make the most of the capacity of . Formally, given a probabilistic editor function
neural network and a dataset
we seek to minimize the negative likelihood loss
Note that this loss function can be interpreted in two ways: (1) as a conditional autoencoder that encodes the salient information of an edit, given and (2) as an encoder-decoder model that encodes
and decodes
conditioned on the edit representation
. In the rest of this section, we discuss our methods to model
as neural networks.
3.1 NEURAL EDITOR
As discussed above, should use as much information as possible from
, and hence, an encoder-decoder architecture with the ability to copy from the input is most appropriate. As we are primarily interested in edits on text and source code in this work, we explored two architectures: a sequence-to-sequence model for text, and a graph-to-tree model for source code, whose known semantics we can leverage both on the encoder as well as on the decoder side. Other classes of edits, for example, image manipulation, would most likely be better served by convolutional neural models.
Sequence-to-Sequence Neural Editor First, we consider a standard sequence-to-sequence model with attention (over the tokens of ). The architecture of our sequence-to-sequence model is similar to that of Luong et al. (2015), with the difference that we use a bidirectional LSTM in the encoder and a token-level copying mechanism (Vinyals et al., 2015) that directly copies tokens into the decoded sequence. Whereas in standard sequence-to-sequence models the decoder is initialized with the representation computed by the encoder, we initialize it with the concatenation of encoder output and the edit representation. We also feed the edit representation as input to the decoder LSTM at each decoding time step. This allows the LSTM decoder to take the edit representation into consideration while generating the output sequence.
Graph-to-Tree Neural Editor Our second model aims to take advantage of the additional structure of . To achieve this, we combine a graph-based encoder with a tree-based decoder. We use T(x) to denote a tree representation of an element, e.g., the abstract syntax tree (AST) of a fragment of source code. We extend T(x) into a graph form G(x) by encoding additional relationships (e.g., the “next token” relationship between terminal nodes, etc.) (see Figure 2(a)). To encode the elements of
into vector representations, we use a gated graph neural network (GGNN) (Li et al., 2015). Similarly to recurrent neural networks for sequences (such as biRNNs), GGNNs compute a representation for each node in the graph, which can be used in the attention mechanisms of a decoder. Additionally, we use them to obtain a representation of the full input
, by computing their weighted average following the strategy of Gilmer et al. (2017) (i.e., computing a score for each node, normalizing scores with a softmax, and using the resulting values as weights).
Our tree decoder follows the semantic parsing model of Yin & Neubig (2018), which sequentially generate a tree as a series of expansion actions
. The probability of taking an action is modeled as
, where s is the input (a sequence of words in the original semantic
Figure 3: Sequence (a) and graph (b) representation of edit of v.F = x + x to u = x + x.
parsing setting) and is the partial tree that has been generated so far. The model of Yin & Neubig (2018) mainly uses two types of actions: EXPANDR expands the current non-terminal using a grammar rule, and GENTERM generates a terminal token from a vocabulary or copies a token from
. The dependence on the partial tree
is modeled by an LSTM cell which is used to maintain state throughout the generation procedure. Additionally, the LSTM receives the decoder state used to pick the action at the parent node as an additional input (“parent-feeding”). This process illustrated in Figure 2(b).
We extend this model to our setting by replacing the input sequence ; concretely, we condition the decoder on the graph-level representation computed for
. Additionally, we use the change representation
as an additional input to the LSTM initial state and at every decoding step. Based on the observation that edits to source code often manipulate the syntax tree by moving expressions around (e.g. by nesting statements in a conditional, or renaming a function while keeping its arguments), we extend the decoding model of Yin & Neubig (2018) by adding a facility to copy entire subtrees from the input. For this, we add a decoder action TREECP. This action is similar to standard copying mechanism known from pointer networks (Vinyals et al., 2015), but instead of copying only a single token, it copies the whole subtree pointed to.
However, adding the TREECP action means that there are many correct generation sequences for a target tree. This problem appears in token-copying as well, but can be easily circumvented by marginalizing over all correct choices at each generation step (by normalizing the probability distribution over allowed actions to sum up those that have the same effect). In the subtree-copying setting, the lengths of action sequences representing different choices may differ. In our implementation we handle this problem during training by simply picking the generation sequence that greedily selecting TREECP actions.
3.2 EDIT REPRESENTATION
To compute a useful edit representation, a model needs to focus on the differences between and
. A risk in our framework is that
degenerates into an encoder for
into a decoder. To avoid this, we need to follow the standard autoencoder trick, i.e. it is important to limit the capacity of the result of
by generating the edit representation in a low-dimensional space
. This acts as a bottleneck and encodes only the information that is needed to reconstruct
from
. We again experimented with both sequence-based and graph-based representations of edits.
Sequence Encoding of Edits Given (resp.
) as sequence of tokens
(resp.
), we can use a standard (deterministic) diffing algorithm to compute an alignment of tokens in the two sequences. We then use extra symbols
for padding, + for additions,
for deletions,
for replacements, and = for unchanged tokens to generate a single sequence representing both
and
. This is illustrated in Figure 3(a). By embedding the three entries in each element of the sequence separately and concatenating their representation, they can be fed into a standard sequence encoder whose final state is our desired edit representation. In this work, we use a biLSTM. Graph Encoding of Edits As in the graph-to-tree neural editor, we represent
and
. We combine these trees into a graph representation
by merging both trees into one graph, using “Removed”, “Added” and “Replaced” edges. To connect the two trees, we compute the same alignment as in the sequence case, connecting leaves that are the same and each replaced leaf to its replacement. We also propagate this information up in the trees, i.e., two inner nodes are connected by “=” edges if all their descendants are connected by “=” edges. This is illustrated in Figure 3(b). Finally, we also use the same “+” / “-” / “
” / “=” tags for the initial node representation, computing it as the concatenation of the string label (i.e. token or non-terminal name) and the embedding of the tag. To obtain an edit representation, we use a GGNN unrolled for a fixed number of timesteps and again use the weighted averaging strategy of Gilmer et al. (2017).
Evaluating an unsupervised representation learning method is challenging, especially for a newly defined task. Here, we aim to evaluate the quality of the learned edit representations with a series of qualitative and quantitative metrics on natural language and source code.
4.1 DATASETS AND CONFIGURATION
Natural Language Edits We use the WikiAtomicEdits (Faruqui et al., 2018) dataset of pairs of short edits on Wikipedia articles. We sampled 1040K edits from the English insertion portion of the dataset and split the samples into 1000K/20K/20K train-valid-test sets.
Source Code Edits To obtain a dataset for source code, we clone a set of 54 C# projects on GitHub and collected a GitHubEdits dataset (see Appendix A for more information). We selected all changes in the projects that are no more than 3 lines long and whose surrounding 3 lines of code before and after the edited lines have not been changed, ensuring that the edits are separate and short. We then parsed the two versions of the source code and take as and
the code that belongs to the top-most AST node that contains the edited lines. Finally, we remove trivial changes such variable renaming, changes within comments or formatting changes. Overall, this yields 111 724 edit samples. For each edit we run a simple C# analysis to detect all variables and normalize variable names such that each unique variable within
has a unique normalized name V0, V1, etc. This step is necessary to avoid the sparsity of data induced by the variety of different identifier naming schemes. We split the dataset into 91,372 / 10,176 / 10,176 samples as train/valid/test sets.
Additionally, we introduce a labeled dataset of source code edits by using C# “fixers”. Fixers are small tools built on top of the C# compiler, used to perform common refactoring and modernization tasks (e.g., using new syntactic sugar). We selected 16 of these fixers and ran them on 6 C# projects to generate a small C#Fixers dataset of 2,878 edit pairs with known semantics. We present descriptions and examples of each fixer in Appendix A.
Configuration Throughout the evaluation we use a fixed size of 512 for edit representations. The size of word embeddings and hidden states of encoding LSTMs is 128. The dimensionality of the decoding LSTM is set to 256. Details of model configuration can be found in Sect. A.
When generating the target , our neural editor model can optionally take as input the context of the original input
(e.g., the preceding and succeeding code segments surrounding
), whose information could be useful for predicting
. For example, in source code edits the updated code snippet
may reuse variables defined in the preceding snippet. In our code experiments, we use a standard bidirectional LSTM network to encode the tokenized 3 lines of code before and after
as context. The encoded context is used to initialize the decoder, and as an additional source for the pointer network to copy tokens from.
4.2 QUALITY OF EDIT REPRESENTATIONS
First, we study the ability of our models to encode edits in a semantically meaningful way.
Figure 4: t-SNE visualization of edits from 13 C# fixers, where point color indicates the fixer. Labels indicate the id of the fixer, see main text. Visualizing Edits on Fixers Data In a first experiment, we train our sequential neural editor model on our GitHubEdits data and then compute representations for the edits generated by the C# fixers. A t-SNE visualization (Maaten & Hinton, 2008) of the encodings is shown in Figure 4. For this visualization, we randomly selected 100 examples from the edits of each fixer (if that fixer has more than 100 samples) and discarded fixer categories with less than 40 examples. Readers are referred to Appendix A for detailed descriptions of each fixer category. We find that our model produces dense clusters for simple or distinctive code edits, e.g. fixer RCS1089 (using the ++ or -- unary operators instead of a binary operator (e.g., ), and fixer CA2007 (adding .ConfigureAwait(false) for await statements). We also analyzed cases where (1) the edit examples from the same fixer
are scattered, or (2) the clusters of different fixers overlap with each other. For example, the fixer RCS1077 covers 12 different aspects of optimizing LINQ method calls (e.g., type casting, counting, etc.), and hence its edits are scattered. On the other hand, fixers RCS1146 and RCS1206 yield overlapping clusters, as both fixers change code to use the ?. operator. Fixers RCS1207 (change a lambda to a method group, e.g. ) and RCS1021 (simplify lambda expressions, e.g.
) are similar, as both inline lambda expressions in two different ways. Analysis yields that the representation is highly dependent on surface tokens. For instance, IDE004 (removing redundant type casts, e.g.
) and RCS1207 (removing explicit argument lists) yield overlapping clusters, as both involve deleting identifiers wrapped by parentheses.
Human Evaluation on Encoding Natural Language WikiAtomicEdits In a second experiment, we test how well neighborhoods in edit representation space correspond to semantic similarity. We computed the five nearest neighbors of 200 randomly sampled seed edits from our training set, using both our trained sequence-to-sequence editing model with sequential edit encoder, as well as a simple bag-of-words baseline based on TF-IDF scores. We then rated the quality of the retrieved neighbors on a scale of 0 (“unrelated edit”), 1 (“similar edit”) and 2 (“semantically or syntactically same edit”). Details of the annotation schema is included in Sect. E. We show the (normalized) discounted cumulative gain (DCG, Manning et al. (2008)) for the two models at the top of Tab. 1 (higher is better). The relevance scores indicate that our neural model clearly outperforms the simplistic baseline. Tab. 1 also presents two example edits with their nearest neighbors. Example 1 shows that the neural edit models succeeded in representing syntactically and semantically similar edits, while the bag-of-words baseline relies purely on surface token overlap. Interestingly, we also observed that the edit representations learned by the neural editing model on WikiAtomicEdits are somewhat sensitive to position, i.e. the position of the inserted tokens in both the seed edit and the nearest neighbors is similar. This is illustrated in Example 2, where the second (“senegalese striker”) and the third (“republican incumbent”) nearest neighbors returned by the neural model have similar editing positions as the seed edit, while they are semantically diverse.
4.3 EDIT ENCODER PERFORMANCE
To evaluate the performance of our two edit encoders discussed in Sect. 3.2 and disentangle it from the choice of neural editor, we train various combinations of our neural editor model and manually evaluate the quality of the edit representation. More specifically, we trained our neural editor models on GitHubEdits and randomly sampled 200 seed edits and computed their 3 nearest neighbors using each end-to-end model. We then rated the resulting groups using the same 0-2 scale as above. The resulting relevance scores are shown in Tab. 2.
Table 1: Natural language human evaluation results and 3 nearest neighbors. Inserted text
Example 1 neural editing model returns syntactically and semantically similar edits. Example 2 Neural edit representations are sensitive to position.
Table 2: Relevance scores of human evaluation on GitHubEdits data. Acc.@1 denotes the ratio that the 1-nearest neighbor has a score 2.
Comparing the sequential edit encoders trained with Seq2Seq and Graph2Tree editors, we found that the edit encoder trained with the Graph2Tree objective performs better. We hypothesize that this is because the Graph2Tree editor better captures structural-level information about an edit. For instance, Example 1 in Tab. 3 removes explicit type casting. The Seq2Seq editor has difficulty distinguishing this type of edit, confusing it with changes of lambda expressions to method groups (1st and 2nd nearest neighbors) since both two types of edits involve removing paired parentheses.
Surprisingly, we found that the graph-based edit encoder does not outperform the sequence-based encoder. However, we observe that the graph edit encoder in many cases tends to better capture high-level and abstract structural edit patterns. Example 2 in Tab. 3 showcases a seed edit that swaps two consecutive declarations, which corresponds to swapping the intermediate Expression nodes representing each statement on the underlying AST. In this case, the graph edit encoder is capable of grouping semantically similar edits, while it seems to be more difficult for the sequential encoder
Table 3: Two example source code edits and their nearest neighbors based on the edit representations computed by each model.
Table 4: Test performance of different neural editors.
encoder to capture the edit pattern. On the other hand, we found that the graph edit encoder often fails to capture simpler, lexical level edits (e.g., Example 1). This might suggest that terminal node information is not effectively propagated, an interesting issue worth future investigation.
4.4 PRECISION OF NEURAL EDITORS
Finally, we evaluate the performance of our end-to-end system by predicting the edited input and the edit representation. We are interested in answering two research questions: First, how well can our neural editors generate
given the gold-standard edit representation
? Second, and perhaps more interestingly, can we use the representation of a similar edit
to generate
that edit to
To answer the first question, we trained our neural editor models on the WikiAtomicEdits and the GitHubEdits dataset, and evaluate the performance of encoding and applying edits on test sets. For completeness, we also evaluated the performance of our neural editor models with a simple “Bag-of-Edits” edit encoding scheme, where is modeled as the concatenation of two vectors, each representing the sum of the embeddings of added and deleted tokens in the edit, respectively. This edit encoding method is reminiscent of the model used in Guu et al. (2017) for solving a different task of language modeling by marginalizing over latent edits, which we will elaborate in Sect. 5. Tab. 4
Table 5: Transfer learning results on C# fixers data, averaged across all fixer categories.
lists the evaluation results. With our proposed sequence- and graph-based edit encoders, our neural editor models achieve reasonable end-to-end performance, surpassing systems using bag-of-edits representations. This is because many edits are context-sensitive and position-sensitive, requiring edit representation models that go beyond the bag-of-edits scheme to capture those effects (more analysis is included in Appendix B). Interestingly, on the GitHubEdits dataset, we find that the Seq2Seq editor with sequential edit encoder registers the best performance. However, it should be noted that in this set of experiments, we encode the gold-standard edit to predict
. As we will show later, better performance with the gold-standard edit does not necessarily imply better (more generalizable) edit representation. Nevertheless, we hypothesize that the higher accuracy of the Seq2Seq edit is due to the fact that a significant proportion of edits in this dataset is small and primarily syntactically simple. Indeed we find that 69% of test examples have a token-level edit distance of less than 5.
To answer the second question, we use the trained neural editors from the previous experiment, and test their performance in a “one-shot” transfer learning scenario. Specifically, we use our high-quality C#Fixers dataset, and for each fixer category F of semantically similar edits, we randomly select a seed edit , and use its edit representation
to predict the updated code for all examples in F, i.e., we have
. This task is highly non-trivial, since a fixer category could contain more than hundreds of edit examples collected from different C# projects. Therefore, it requires the edit representations to generalize and transfer well, while being invariant of local lexical information like specific method names. To make the experimental evaluation more robust to noise, for each fixer category F, we randomly sample 10 seed edit pairs
, compute their edit representations and use them to predict the edited version of the examples in F and evaluate accuracy of predicting the exact final version. We then report the best score among the 10 seed representations as the performance metric on F.
Tab. 5 summarizes the results and also reports the upper bound performance when using the gold-standard edit representation to predict
, and an approximation of the “lower bound” accuracies using pre-trained Seq2Seq and Graph2Tree models without edit encoders. We found that our neural Graph2Tree editor with the sequential edit encoder significantly outperforms the Seq2Seq editor, even though Seq2Seq performs better when using gold-standard edit representations. This suggest that the edit representations learned with the Graph2Tree model generalize better, especially for edits discussed in Sect. 4.2 that involve syntactic variations like RCS1021 (lambda expression simplification, 7.8% vs. 30.7% for Seq2Seq and Graph2Tree), and RCS1207 (change lambdas to method groups, 7.1% vs. 26.2%). Interestingly, we also observe that Seq2Seq outperforms the Graph2Tree model for edits with trivial surface edit sequences, where the Graph2Tree model does not have a clear advantage. For example, on RCS1015 (use nameof operator, e.g.
), the accuracies for Seq2Seq and Graph2Tree are 40.0% (14/35) and 28.6% (10/35), resp. We include more analysis of the results in Appendix C.
Edits have recently been considered in NLP, as they represent interesting linguistic phenomena in language modeling and discourse (Faruqui et al., 2018; Yang et al., 2017a). Specifically, Guu et al. (2017) present a generative model of natural language sentences via editing prototypes. Our work shares with Guu et al. (2017) in that (1) the posterior edit encoding model in Guu et al. (2017) is similar to our baseline “bag-of-edits” encoder in Sec. 4.4, and (2) the sequence-to-sequence sentence generation model given the prototype and edit representation is reminiscent of our Seq2Seq editor. In contrast, our work directly focuses on discriminative learning of representing edits and applying the learned edits for both sequential (NL) and structured (code) data. Another similar line of research is “retrieve-and-edit” models for text generation (Hashimoto et al., 2018), where given an input data x, the target prediction y is generated by editing a similar target that is retrieved based on the similarity between its source
and the input x. While these models typically require an “editor” component to generate the output by exploiting the difference between similar inputs, they usually use the simpler bag-of-edits representations (Wu et al., 2019), or implicitly capture it via end-to-end neural networks (Contractor et al., 2018). To our best knowledge, there is not any related work that classifies or otherwise explicitly represents the differences over similar input, with the exception of differential recurrent neural networks used for action recognition in videos (Veeriah et al., 2015; Zhuang et al., 2018). This is a substantially different task, as the data includes a temporal component as well.
Source code edits are a widely studied artifact. Specialized software, such as git, is widely used to store source code revision histories. Nguyen et al. (2013) studied the repetitiveness of source code changes by identifying identical types of changes using a deterministic differencing tool. In contrast, we employ on a neural network to cluster similar changes together. Rolim et al. (2017) use such clusters to synthesize small programs that perform the edit. The approach is based on Rolim et al. (2018) extract manually designed syntactic features from code and cluster over multiple changes to find repeatable edit rules. Similarly, Paletov et al. (2018) extract syntactic features specifically targeting edits in cryptography API protocols. In this work, we try to avoid hand-designed features and allow a neural network to learn the relevant aspects of a change by directly giving as input the original and final version of a changed code snippet.
Modeling tree generation with machine learning is an old problem that has been widely studied in NLP. Starting with Maddison & Tarlow (2014), code generation has also been considered as a tree generation problem. Close to our work is the decoder of Yin & Neubig (2017) which we use as the basis of our decoder. The work of Chen et al. (2018) is also related, since it provides a tree-to-tree model, but focuses on learning a single translation tasks and cannot be used directly to represent multiple types of edits. Both Yin & Neubig (2017) and Chen et al. (2018) have copying mechanism for single tokens, but our subtree copying mechanism is novel.
Autoencoders (see Goodfellow et al. (2016) for an overview) have a long history in machine learning. Variational autoencoders (Kingma & Welling, 2013) are similar to autoencoders but instead of focusing on the learned representation, they aim to create accurate generative probabilistic models. Most (variational) autoencoders focus on encoding images but there have been works that autoencode sequences, such as text (Dai & Le, 2015; Bowman et al., 2015; Yang et al., 2017b) and graphs (Si- monovsky & Komodakis, 2018; Liu et al., 2018). Conditional variational autoencoders (Sohn et al., 2015) have a related form to our model (with the exception of the KL term), but are studied as generative models, whereas we are primarily interested in the edit representation.
In this work, we presented the problem of learning distributed representation of edits. We believe that the dataset of edits is highly relevant and should be studied in more detail. While we have presented a set of initial models and metrics on the problem and obtained some first promising results, further development in both of these areas is needed. We hope that our work inspires others to work on this interesting problem in the future.
We would like to thank Rachel Free for her insightful comments and suggestions.
Ron Artstein and Massimo Poesio. Inter-coder agreement for computational linguistics. Computational Linguistics, 34:555–596, 2008.
Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for program translation. arXiv preprint arXiv:1802.03691, 2018.
Danish Contractor, Vineet Kumar, Sachindra Joshi, and Gaurav Pandey. Exemplar encoder-decoder for neural conversation generation. In ACL, 2018.
Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in neural information processing systems, pp. 3079–3087, 2015.
Manaal Faruqui and Dipanjan Das. Identifying Well-formed Natural Language Questions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018.
Manaal Faruqui, Ellie Pavlick, Ian Tenney, and Dipanjan Das. WikiAtomicEdits: A multilingual corpus of Wikipedia edits for modeling language and discourse. In EMNLP, 2018.
Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning (ICML), 2017.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http: //www.deeplearningbook.org.
Kelvin Guu, Tatsunori B Hashimoto, Yonatan Oren, and Percy Liang. Generating sentences by editing prototypes. arXiv preprint arXiv:1709.08878, 2017.
Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S. Liang. A retrieve-and-edit frame- work for predicting structured outputs. In NeurIPS, 2018.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L. Gaunt. Constrained graph variational autoencoders for molecule design. In Neural Information Processing Systems (NIPS), 2018.
Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, 2015.
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
Chris Maddison and Daniel Tarlow. Structured generative models of natural source code. In International Conference on Machine Learning (ICML), 2014.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Sch¨utze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.
Hoan Anh Nguyen, Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N Nguyen, and Hridesh Rajan. A study of repetitiveness of code changes in software evolution. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, pp. 180–190. IEEE Press, 2013.
Rumen Paletov, Petar Tsankov, Veselin Raychev, and Martin Vechev. Inferring crypto API rules from code changes. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 450–464. ACM, 2018.
Reudismam Rolim, Gustavo Soares, Loris D’Antoni, Oleksandr Polozov, Sumit Gulwani, Rohit Gheyi, Ryo Suzuki, and Bj¨orn Hartmann. Learning syntactic program transformations from examples. In Proceedings of the 39th International Conference on Software Engineering, pp. 404–415. IEEE Press, 2017.
Reudismam Rolim, Gustavo Soares, Rohit Gheyi, and Loris D’Antoni. Learning quick fixes from code repositories. arXiv preprint arXiv:1803.03806, 2018.
Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. arXiv preprint arXiv:1802.03480, 2018.
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.
Vivek Veeriah, Naifan Zhuang, and Guo-Jun Qi. Differential recurrent neural networks for action recognition. In Proceedings of the IEEE international conference on computer vision, pp. 4041– 4049, 2015.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700, 2015.
Yu Ping Wu, Furu Wei, Shaohan Huang, Zhoujun Li, and Ming Zhou. Response generation by context-aware prototype editing. 2019.
Diyi Yang, Aaron Halfaker, Robert Kraut, and Eduard Hovy. Identifying semantic edit intentions from revisions in wikipedia. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 2000–2010, 2017a.
Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139, 2017b.
Pengcheng Yin and Graham Neubig. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pp. 440–450, 2017.
Pengcheng Yin and Graham Neubig. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. In Conference on Empirical Methods in Natural Language Processing (EMNLP) Demo Track, 2018.
Naifan Zhuang, The Duc Kieu, Guo-Jun Qi, and Kien A Hua. Deep differential recurrent neural networks. arXiv preprint arXiv:1804.04192, 2018.
A DATASETS AND CONFIGURATION
WikiAtomicEdits We randomly sampled 1040K insertion examples from the English portion of
WikiAtomicEdits (Faruqui et al., 2018) dataset, with a train, development and test splits of 1000K,
20K and 20K.
GitHubEdits We cloned the top 54 C# GitHub repositories based on their popularity (Tab. 8). For
each commit in the master branch, we collect the previous and updated versions of the source
code, and extract all consecutive lines of edits that are smaller than three lines, and with at least three
preceding and successive lines that have not been changed. We then filter trivial changes such as
variable and identifier renaming, and changes happened within comments. We also limit the number
of tokens for each edit to be smaller than 100, and down-sample edits whose frequency is larger than
30. Finally, we split the dataset by commit ids, ensuring that there are no edits in the training and
testing (development) sets coming from the same commit. Tab. 6 lists some statistics of the dataset.
Table 6: Statistics of the GitHubEdits Dataset
C#Fixers We selected 16 C# fixers from Roslyn4 and Roslynator5, and ran them on 6 C#
projects to generate a small, high-quality C# fixers dataset of 2 878 edit pairs with known se-
mantics. Table 7 lists the detailed descriptions for each fixer category. And more informa-
tion can be found at https://github.com/JosefPihrt/Roslynator/blob/master/
Network Configuration Throughout the experiments, we use a fixed edit representation size of
512. The dimensionality of word embedding, the hidden states of the encoder LSTMs, as well as the
gated graph neural network is 128, while the decoder LSTM uses a larger hidden size of 256. For the
graph-based edit encoder, we used a two-layer graph neural network, with 5 information propagation
steps at each layer. During training, we performed early stopping, and choose the best model based
on perplexity scores on development set. During testing, we decode a target element using a
beam size of 5.
4http://roslyn.io 5https://github.com/JosefPihrt/Roslynator
Table 7: Descriptions of fixer categories in C#Fixers dataset
Table 8: Our C# GitHub dataset projects
B CLUSTERING EXPERIMENTS
To qualitatively evaluate the quality of the learned edit representations. We use the models trained
on the WikiAtomicEdits and GitHubEdits datasets to cluster natural language and code edits. We
run K-Means clustering algorithm on 0.5 million sampled edits from WikiAtomicEdits, and all 90K
code edits from GitHubEdits, producing 50 000 and 20 000 clusters for each dataset.
Tab. 9 and Tab. 10 list some example clusters on WikiAtomicEdits and GitHub datasets, respectively.
Due to the size of clusters, we omit out-liners and present distinctive examples from each cluster. On
the WikiAtomicEdits dataset, we found clusters whose examples are semantically and syntactically
similar. More interestingly, on the source code data, we find representative clusters that relate to
idiomatic patterns and best practices of programming. The clustering results produced by our model
would be useful for programming synthesis toolkits to generate interpretable code refractory rules,
which we leave as interesting future work.
Finally, we remark that the clustering results indicate that the encoding of edits is context-sensitive and
position-sensitive for both natural language and source code data. For instance, the WikiAtomicEdits
examples we present in Tab. 9 clearly indicate that semantically similar insertions also share similar
editing positions. This is even more visible in code edits (Tab. 10). For instance, in the first example
in Tab. 10, Equal() can be changed to Empty() only in the Assert namespace (i.e., the context).
These examples demonstrate that it is important for an edit encoder to capture the contextual and
positional information in edits, a property that cannot be captured by simple “bag-of-edits” edit
representation methods.
Table 9: Example clusters on WikiAtomicEdits data using representations learned by a neural Seq2Seq editor with sequential edit encoder
currently playing for stevenage borough . 2. audrey brown ( born 24 may , 1913 ) is a british athlete who competed mainly in
the 100 metres . 3. alice rumph was a painter , etcher , and teacher . 4. mark
taufua is an australian professional rugby league player . 5. monique
lamoureux ( born july 3 , 1989 ) is an american ice hockey player .
1. mid-state regional airport , also known as mid-state airport ,
is a small airport on in rush township , centre county in pennsylvania in the united states .
2. islamic culture , also known as saracenic culture ,
is a term primarily used in secular academia to describe the cultural practices common to historically islamic peoples .
3. birds of prey , also known as raptors ,
are birds that hunt for food primarily via flight , using their keen senses , especially vision .
4. tetyana styazhkina , also written as tetyana stiajkina ,
( ; born april 10 , 1977 ) is a ukrainian cycle racer who rides for the chirio forno d’asolo team .
5. acid jazz , also known as club jazz ,
is a musical genre that combines elements of jazz , soul , funk , disco and hip hop .
1. the douro fully enters portuguese territory just after the confluence with the gueda river ; once the douro enters portugal , major population centres are less frequent along the river
2. mochou lake and mochou lake park are located at 35 hanzhongmen da jie in the jianye district of nanjing , china , west to qinhuai river
3. reiner gamma is an albedo feature that is located on the oceanus procellarum , to the west of the reiner crater on the moon
4. she made a brief return to the screen in ” parrish ” ( 1961 ) , playing the supporting role of mother which received little attention by the press
5. he was involved in a few storylines , including one where he broke his toe and had a heart attack after he was pushed by a mugger in the market
working to clear the landslide . 2. at the docks ,
hikaru attempts to befriend the tiger , but finds that it dislikes humans . 3.
about the second ,
i do know they exist , but the question is whether they are considered a
genre outside of japan . 4. in the battle ,
shirou uses his reality marble , unlimited blade works and defeats gilgamesh . 5.
in the game ,
red is a curious 11 - year - old boy from pallet town .
Table 10: Example clusters on GithubEdits data using representations learned by a Graph2Tree editor with sequential edit encoder. Locally defined variable names are canonicalized.
Table 11: Break-down performance results on the transfer learning task. See Tab. 7 for descriptions of each fixer category.
C BREAK-DOWN ANALYSIS OF TRANSFER LEARNING RESULTS
Tab. 11 lists the detailed evaluation results for the transfer learning experiments discussed in
Sect. 4.4. We refer readers to Tab. 7 for detailed descriptions of each fixer category. The neu-
ral Graph2Tree editor outperforms the Seq2Seq editor (both with sequential edit encoders) on 10
out of 16 fixer categories. However, we found that there are categories where both end-to-end sys-
tem under-performs, even though the upper-bound accuracy is high (e.g. RCS1077, RCSRCS1197,
RCS1207, RCS1032). While improving the generalization ability of the neural editor models to
achieve better transfer learning performance is an important future work, we remark that this task
is indeed non-trivial. First, some fixer categories cover a broad range of similar edits, which could
not be captured by a single seed edit. xSecond, some categories contain syntactically or semanti-
cally complex refactoring rules. For instance, RCS1207 converts method groups into anonymous
functions, involving changing multiple positions of the source code, which might not be trivially
captured by the sequential edit encoder from a single example edit. Additionally, RCS1197 re-
quires reasoning about a chain of expressions. It turns sb.Append(s1 + s2 + . . . + sN) into
sb.Append(s1).Append(s2).[. . .]Append(sN)), which our current models are unable to
reason about. More interestingly, we found that there are cases where the edits are syntactically
simple, but could be semantically more difficult to learn. For instance, RCS1032 is about removing
redundant parentheses from expressions. Although the edit pattern might seem to be syntactically
simple at the AST level (replacing a ParethesizedExpressionSyntax node by its child
node), determining which pair of parentheses is actually redundant in an expression (e.g. (a + b)
* (c / d)) is semantically non-trivial to learn from a single edit example. We believe that further
advances in (general) learning from source code are required to correctly handle these cases.
D IMPACT OF TRAINING SET SIZE
To evaluate the data efficiency of our proposed approach, we tested the end-to-end performance of
our neural editor model (Sect. 4.4, Tab. 4) with varying amount of training data. Tab. 12 lists the
results. We found both Graph2Tree and Seq2Seq editors are relatively data efficient. They registered
around 90% of the accuracies achieved using the full training set with only 60% of the training data.
E DETAILS OF HUMAN EVALUATION
As discussed in Sect. 4.2, we performed human evaluation to rate the qualities of neighboring edits
given a seed edit. The annotation instructions on GithubEdits and WikiAtomicEdits datasets are
listed below. The annotation was carried out by three authors of this paper, and we anonymized the
source of systems that generated the output. The three-way Fleiss’ kappa inter-rater agreement is
, which shows moderate agreement (Artstein & Poesio, 2008), an agreement level that is
also used in other annotation tasks in NLP (Faruqui & Das, 2018).
Table 12: Test performance of end-to-end experiments with varying amount of training data.
Table 13: Annotation Instruction for GitHubEdits Data
The changed constituents in the seed edit and the neighboring edit are applied to the similar positions of the original sentence, serving the same syntactic and semantic role. For example,
• Neighbor Assert.True(V0.GetString(V0.GetBytes(LITERAL)) == V1.ContainingAssembly.Identity.CultureName);
The following edit is a related edit of the first example above, as it applies the same simplification (.Where(COND).Func() to .Func(COND)), but for FirstOrDefault instead of Single:
• Seed Edit V0 = V1.Where(V2 => V2.Name == LITERAL).Single();
V0 = V1.Single(V2=> V2.Name == LITERAL);
• Neighbor V0 = V1.Where(V2 => V3.ReportsTo == V2.EmployeeID).FirstOrDefault();
V0 = V1.FirstOrDefault(V2 => V3.ReportsTo == V2.EmployeeID);
The following edit is a related edit of the second example above, as it also replaces a ternary expression for null checking with the ?. and ?? operators:
• Seed Edit Type V0 = V1 == null ? typeof(object) : V1.GetType();
Type V0 = V1?.GetType() ?? typeof(object);
• Neighbor V0 = V1 != null ? V1.ToList() : new List<TextSpan>();
V0 = V1?.ToList() ?? new List<TextSpan>();
• Seed Edit V0.State = V1;
V0.SetState(VAR1);
• Neighbor V0.Quantity = V1;
V0.SetQuantity(V1);
Table 14: Annotation Instruction for WikiAtomEdits Data
The changed constituents in the seed and the neighboring edit are applied to the similar positions of the original sentence, and they play similar syntactic roles. This includes examples like adding a disfunction, adding a complement, prepositional clause or other syntactic constructs with similar phrases or language structures. For example,