With the recent development of neural machine translation (NMT), NMT becomes an attractive option for program language translation [1, 3, 4, 12]. Especially, the language agnostic neural network design in sequence-to-sequence (seq2seq) models [7] is a promising method, as it is currently used in Google. The seq2Seq model can be trained using only a paired translated input without considering the language grammar difference, which makes the same network applicable for different natural languages.
Table 1 shows a comparison between natural languages and programming languages. There are enough similarities but quite a few differences as well. The biggest difference is that natural language translation tolerates grammar errors but program language translation does not. However, that can be corrected through post-processing similar to program debugging after compilation.
In this work, we present a case study of source-to-source translation using neural machine translation (NMT) techniques by translating CUDA to OpenCL. CUDA and OpenCL are both parallel computing programming languages for accelerators. We choose these languages because they share many similarities and provide a good platform to develop translation techniques using NMT and to understand the limitations. Based on the knowledge and techniques from this translation, future work can be expanded to support other program language translation.
The summary of our contribution is as follows. We develop a dataset generation flow for translating CUDA to OpenCL using NMT. To do so, we first write a pair of API usages for CUDA and OpenCL. With the API usages written, we construct usage symbol trees to extract sentences from the CUDA samples and find uncovered API usages. Using the dataset we generated, we train the NMT system, which learns the structural similarity between CUDA and OpenCL, and translate CUDA code to OpenCL code by inference. Finally, we present translation examples and discuss the limitations and feasibility of our current approach.
Figure 1. Overview of workflow [6].
2.1 Neural Machine Translation (NMT)
Neural networks have demonstrated outstanding performance in natural language processing (NLP). For example, the sequence-to-sequence (seq2seq) model [11] presents a large, deep Long Short-Term Memory (LSTM), outperforming a mature SMT system by a sizable margin. It also shows the capability of translating very long sentences.
Motivated by the meaningful success and the similarities between natural and programming languages, we exploit the
Table 1. Comparison between natural languages and programming languages.
seq2seq model to translate programming languages. We take a statement (or statements) as a sentence and translate it to another sentence written in the target language. Since we translate CUDA to OpenCL, in the training dataset, CUDA code becomes source sentences, and OpenCL code becomes target sentences. During an inference phase, we take as input CUDA code and infer OpenCL code.
2.2 CUDA vs. OpenCL
Both CUDA and OpenCL have host code and kernel code. In the host code case, most host API functions have one-to-one correspondence between CUDA and OpenCL. Consider an example of cudaMalloc and clCreateBuffer that have the same meaning between CUDA and OpenCL. Since a function call of cudaMalloc contains all the necessary information to be translated, such as arguments, we can write a function call of clCreateBuffer with a given source code.
Kernel code also has similarities between CUDA and OpenCL. Kernel qualifiers and built-in functions have equivalents, and therefore we replace one program language’s keyword with an equivalent one in the other language. Program translation rules between CUDA and OpenCL have already been studied [8], [5], and we used these rules to train NMT.
In this section, we describe the overview of the NMT workflow. As can be seen in Fig. 1, in addition to an NMT system, we use a pre-processor during the training phase and a pre-/post-processor during the inference phase. Compared to natural languages in which a large, but finite, set of vocabulary exists, programmers use numerous arbitrary variable names such as alphabet characters or abbreviations of variables or functions. By having a pre-/post-processor, we enable arbitrary variable names in programming languages to be translated. This also contributes to minimizing the size of the vocabulary. Both pre-/post-processors are developed as Python scripts, which are explained as follows.
3.1 Pre-processor
A pre-processor performs lexical analyses and variable renaming. First, it reads and tokenizes a given program code. While tokenizing the code, it merges tokens comprise one
Table 2. Mapping tokens to abstract symbols.
statement as a sentence—i.e., even when a statement is written in multiple lines, it puts the tokens together as a sentence. In this way, the pre-processor generates a set of sentences, each of which consists of tokens. Next, we map tokens (from tokenized sentences) to three types of symbols depending on their variable type. This is because we try to reduce the cases in which different sentences are renamed to one sentence. As can be seen in Table 2, we map identifier, string literals, and numeric constants to _id, and each symbol is tagged as a number in ascending order starting from zero—i.e., the first token mapped to _id becomes _id0, and the next one becomes _id1. And, we map operators to _op and data types to _tp with a tagged number in ascending order starting from zero. The numbering for each type has its own order.
Moreover, pre-processing sentences creates a mapping table that contains mapping information between variable names and abstract symbols. This is used later by a post-processor when it replaces the renamed tokens with their original names. Note that we do not rename CUDA/OpenCL APIs since they decide the context of a sentence and which rule to be used to translate it. Finally, the pre-processor is also capable of generating a training dataset. The details of how we generate a dataset are covered in Section 4.
3.2 NMT System
In an NMT system, we exploit a seq2seq model as it has demonstrated outstanding performance in translating long sentences. The NMT system is a component where actual translation occurs. During a training phase, a training dataset, which consists of source sentences for CUDA and target sentences for OpenCL, is pre-processed and fed into the NMT system. During an inference phase, input CUDA code is pre-processed and fed into the NMT system. Then, the NMT system outputs renamed OpenCL code.
Figure 2. Overview of the proposed dataset generation flow.
3.3 Post-processor
A post-processor performs initial name replacement and code restructure. As we train the NMT system with renamed sentences, the output sentences generated from the NMT system also have abstract symbols. Based on the mapping table created by a pre-processor, a post-processor replaces them with their original names. Since the initial variable names from the input code remain in the output code, the translation provides a high-quality code. Finally, based on syntactic rules, it puts appropriate indents between tokens for better readability and generates the final outcome.
Since there is no publicly available dataset for translating CUDA to OpenCL using NMT, we develop a dataset generation flow and generate a dataset from CUDA samples. Fig. 2 shows the overview of our proposed dataset generation flow, and we explain the steps of dataset generation as below.
4.1 Steps of Dataset Generation
We first summarize how we generate a dataset and describe additional details in the following subsections.
• Write a pair of API usages for CUDA and OpenCL. 2) Build usage symbol trees
• Read API usages written. • Tokenize each sentence and rename tokens as abstract symbols.
• Build usage symbol trees that consist of renamed tokens. 3) Gather expressions and find uncovered API us-
• Tokenize each sentence in CUDA samples. • If a sentence includes CUDA APIs, see if the sentence is found in the CUDA usage symbol tree.
• If found, and the API usage has expression nodes, add each expression to a corresponding expression node.
• If not found, write the sentence to a separate file that contains uncovered API usages.
• Read again API usages written. • If an API usage has expression keywords, permutate the expression nodes and add multiple sentences to a dataset.
• If not, add sentences to a dataset without permutation.
4.2 API Usage Generation
Our method requires users to manually write a pair of API usages for CUDA and OpenCL. A CUDA API usage has a sentence pattern to translate, and any variable names can be used in the sentence pattern since they will be renamed as abstract symbols later by a pre-processor. An OpenCL API usage has a sentence pattern that the corresponding CUDA API usage is translated to. Each line of API usages has a one-to-one correspondence with each other—i.e. the first line in the set of CUDA API usages corresponds to the first line in the set of OpenCL API usages.
4.3 Building a Usage Symbol Tree
A pre-processor tokenizes API usages manually written and renames tokens as abstract symbols. Then, it builds usage symbol trees that consist of renamed tokens for each CUDA and OpenCL. By traversing the usage symbol trees with tokens in a sentence, we can easily determine whether a given sentence is covered by our API usages, and based on the outcome we can generate a corresponding OpenCL sentence. If a sentence includes CUDA APIs but is not found in the usage symbol tree, it is considered an uncovered API usage and written to a separate file. The usage symbol trees enable us to maintain sentence patterns to translate and easily generate a new larger dataset when we get new CUDA samples.
Figure 3. Node representation of cudaMemcpy usage.
cudaMemcpy(A_gpu, A, sizeof(double)*NI, cudaMemcpyHost- ToDevice);
cudaMemcpy(B_gpu, B, sizeof(double)*NI*NL, cudaMemcpyHostToDevice);
cudaMemcpy(C_gpu, C, sizeof(double)*NI*NJ*NK, cudaMemcpyHostToDevice);
cudaMemcpy(_expr0, _expr1, _expr2, cudaMemcpyHostToDevice);
Figure 4. Example of using expression keywords.
4.4 Using an Expression Keyword
When we write a pair of API usages, we use an expression keyword, _expr, to represent function parameters. Each one is tagged as a number in ascending order starting from zero. Each expression keyword becomes an expression node in a usage symbol tree. Fig. 3 shows the example of node expression using expression keywords. This reduces manual efforts to write API usages. Consider the example of cudaMemcpy. It takes four parameters in its function call, and each parameter can have various shapes. Fig 4 (a) shows three different sentences. For these sentences, instead of writing three CUDA API usages for each sentence, we write one CUDA API usage with expression keywords, as shown in Fig 4 (b).
4.5 Expression Node Permutation
Expression node permutation is used to increase the size of a dataset. As explained in Section 4, when we define a pair of API usages, we use an expression keyword. Each expression keyword becomes an expression node in a usage symbol tree. When we generate sentences from CUDA samples, instead of simply finding sentences covered by API usages and adding them to a dataset, we collect expressions to the corresponding expression node. After looking through all samples, we permutate each expression node and generate the increased number of sentences. Consider cudaMemcpy in Fig. 3 and assume that we collect n expressions for _expr0, m expressions for _expr1, and k expressions for _expr2. By permutating
Table 3. The number of sentences generated from CUDA benchmarks.
Table 4. Hyper-parameters of the NMT system [7].
each expression node, we produce sentences. Note that _expr symbols are not present in sentences fed into the NMT model. It is only used to generate a dataset and is not used as a vocabulary.
As discussed in Section 4, we extract sentences from CUDA samples and generate a dataset. Our current dataset has 126 lines of API usages, and we use Polybench-gpu-1.0 [10], NVIDIA SDK [9] examples, and Rodinia [2] benchmarks as target translation. We currently support API usages in the evaluated benchmarks. Table 3 presents the number of sentences generated. In the table, Column 2 indicates the number of sentences that include CUDA APIs and are found in our current usage symbol trees, and Column 3 indicates the number of sentences that we generate using the expression node permutation method. With only 126 lines of API usages, we generate 1874 sentences from CUDA samples. We can see that the permutation method increases the number of sentences by 2.6x.
We use the seq2seq model as our NMT model. Table 4 shows the hyper-parameters of the NMT system used. We use the same dataset as training, development, and test dataset. This is because we intend to have the NMT system to produce a correct sentence and understand the limitation of this NMTbased translation approach. In contrast to natural languages, programming languages have a rigid syntax, and therefore we need to generate correct sentences to make the translated code executable. Therefore, we intentionally cause overfitting by using a shared dataset. While we achieve a 99.1 BLEU score, this does not imply that the NMT system has a better capability to infer and translate unseen sentences. However,
A Case Study: Exploiting Neural Machine Translation to Translate CUDA to OpenCL , 2019
it can correctly translate most of the sentences covered by the API usages written by users. For the polybench-gpu-1.0 benchmark, we manually changed about 10 lines out of 200-350 lines for each application and were able to make it run. To fully utilize the potential of using machine learning, we would need a larger dataset and need to split the dataset so that there is no overlap among training, development, and test dataset. We leave this as future work.
Fig 5 shows a translation example of a 2-D matrix multiplication. Fig. 5 (a) and (b) show the CUDA host and kernel code, respectively. Before being fed into the NMT system, they are pre-processed first by a pre-processor; Fig. 5 (c) and (d) show the pre-processed code. If a sentence does not have CUDA APIs, it becomes _line_not_to_translate symbol and is replaced with the original sentences later by a post-processor. After translation using the NMT system, the pre-processed code is translated to OpenCL code that retains the renamed tokens, which can be seen in Fig. 5 (e) and (f). Finally, a post-processor replaces the renamed tokens with their initial names and provides the final OpenCL code. The final output code is shown in Fig. 5 (g) and (h).
In this section, we discuss the limitation of our current approach. Fig 6 presents an example of sentences incorrectly translated.
Long sentence translation Despite the LSTM’s ability to learn long-range temporal dependencies, we observe many sentences in which several words are truncated. In our translation, a sentence fed into the NMT system might be very long as we tokenize sentences in CUDA samples and each token becomes a word. Since many of the tokens are from function parameters, we think a new embedding layer structure needs to be developed to encode parameter parts in a different way. We leave this as future work.
Unseen sentence translation Since we manually write API usages, the dataset generated inevitably has limited function coverage and does not guarantee the successful translation of random programs. Although variable renaming and the use of an expression keyword contribute to increasing coverage with the limited number of API usages, it is still non-trivial to correctly translate a sentence not covered by API usages.
Manual steps for API function mapping We require manual efforts to write a pair of API usages. Considering that most host API functions have one-to-one correspondence between CUDA and OpenCL, most parts of the job in writing API usages would change the function name and the positions of parameters. However, since there are various sentence patterns, we still require efforts to write API usages for those sentences.
void mm2Cuda ( float * A , float * B , float * C ) { cl_mem A_gpu ; ... A_gpu=clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float)*NI*NK, NULL, NULL); clEnqueueWriteBuffer(command_queue, A_gpu, CL_TRUE, 0, sizeof(float)*NI*NK, A, 0, NULL, NULL); ...
Figure 5. Translation example of 2-D matrix multiplication code [6].
Figure 6. Example of sentences incorrectly translated.
Using NMT for program translation Chen et al. [3] propose a novel tree-to-tree neural network and demonstrates higher accuracy for program translation, but it has a limited set of variables and restricts the vocabulary size. Zuo et al. [12] utilize NMT techniques to deal with a cross-architecture code similarity comparison. However, it does not handle high-level language translation.
CUDA to OpenCL translation Martinez et al. [8] translate CUDA to OpenCL at an AST level, and Kim et al. [5] use wrapper functions to translate between CUDA and OpenCL. Compared to those works, we use source-to-source translation and exploit NMT techniques to translate programming langauges.
In this work, we exploited NMT techniques to translate CUDA to OpenCL. To do so, we developed a dataset generation flow and generated a dataset from CUDA benchmarks. Moreover, for training and inference phases, the pre-/post-processor were developed to enable arbitrary variable names to be translated. While our current approach correctly translates most of the sentences covered by the API usages manually written, we discovered several challenges of using NMT for program translation, especially for unseen or long sentences. In this work, the NMT itself has not been changed at all. It is our future work to improve the NMT to make it more program language translation friendly.
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2015).
[2] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC). 44–54. https://doi.org/10.1109/IISWC.2009.5306797
[3] Xinyun Chen, Chang Liu, and Dawn Xiaodong Song. 2018. Tree-to-tree Neural Networks for Program Translation. In NeurIPS.
[4] Kyunghyun Cho, Bart van Merrienboer, ÃĞaglar GülÃğehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP.
[5] J. Kim, T. T. Dao, J. Jung, J. Joo, and J. Lee. 2015. Bridging OpenCL and CUDA: a comparative analysis and translation. In of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–12. https://doi.org/10.1145/2807591. 2807621
[6] Yonghae Kim and Hyesoon Kim. 2019. Translating CUDA to OpenCL for Hardware Generation Using Neural Machine Translation. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2019). IEEE Press, Piscataway, NJ, USA, 285–286. http://dl.acm.org/citation.cfm?id=3314872.3314916
[7] Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. 2017. Neural Machine Translation (seq2seq) Tutorial. https://github.com/tensorflow/nmt (2017).
[8] G. Martinez, M. Gardner, and W. Feng. 2011. CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures. In 2011 IEEE 17th International Conference on Parallel and Distributed Systems. 300–307. https://doi.org/10.1109/ICPADS.2011.48
[9] NVIDIA. 2017. NVIDIA. In CUDA Samples Reference Manual.
[10] Robert Searles Sudhee Ayalasomayajula Scott Grauer-Gray, Lifan Xu and John Cavazos. 2012. Auto-tuning a High-Level Language Targeted
A Case Study: Exploiting Neural Machine Translation to Translate CUDA to OpenCL , 2019
to GPU Codes. In Proceedings of Innovative Parallel Computing (InPar ’12).
[11] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - . MIT Press, Cambridge, MA, USA, 3104–3112. http: //dl.acm.org/citation.cfm?id=2969033.2969173
[12] Fei Zuo, Xiaopeng Li, Zhexin Zhang, Patrick Young, Lannan Luo, and Qiang Zeng. 2018. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. CoRR abs/1808.04706 (2018).