perm2vec: Graph Permutation Selection for Decoding of Error Correction Codes using Self-Attention

2020·Arxiv

Abstract

Abstract

Error correction codes are an integral part of communication applications, boosting the reliability of transmission. The optimal decoding of transmitted codewords is the maximum likelihood rule, which is NP-hard due to the curse of dimensionality. For practical realizations, sub-optimal decoding algorithms are employed; yet limited theoretical insights prevent one from exploiting the full potential of these algorithms. One such insight is the choice of permutation in permutation decoding. We present a data-driven framework for permutation selection, combining domain knowledge with machine learning concepts such as node embedding and self-attention. Significant and consistent improvements in the bit error rate are introduced for all simulated codes, over the baseline decoders. To the best of the authors’ knowledge, this work is the first to leverage the benefits of the neural Transformer networks in physical layer communication systems.

Index Terms—Decoding, error correcting codes, belief propagation, deep learning.

I. INTRODUCTION

Shannon’s well known channel coding theorem [1] states that for every channel a code exists, such that encoded messages can be transmitted and decoded with an error as low as needed while the transmission rate is below the channel’s capacity. The converse is no less crucial: Transmitting with rate beyond the capacity enforces uncertain reliability. For practical applications, latency and computational complexity constrain code size. Thus, structured codes with low complexity encoding and decoding schemes, were devised.

Some structured codes possess a main feature known as the permutation group (PG). The permutations in PG map each codeword to some distinct codeword. This is crucial to different decoders, such as the parallelizable soft-decision Belief Propagation (BP) [2] decoder. It empirically stems from evidence that whereas decoding various corrupted words may fail, decoding a permuted version of the same corrupted words may succeed [3]. For instance, this is exploited in the mRRD [4] and the BPL [5] algorithms, which perform multiple runs over different permuted versions of the same corrupted codewords by trading off complexity for higher decoding gains.

Nonetheless, there is room for improvement since not all permutations are required for successful decoding of a given word: simply a fitting one is needed. Our work deals with obtaining the best fit permutation per word, by removing redundant runs which thus preserve computational resources. Nevertheless, it remains unclear how to obtain this type of permutation as indicated by the authors in [5] who stated in their Section III.A, “there exists no clear evidence on which graph permutation performs best for a given input”. Explicitly, the goal is to approximate a function mapping from a single word to the most probable-to-decode permutation. While analytical derivation of this function is hard, advances in the machine learning field may be of use in the computation of this type of function.

The recent emergence of Deep Learning (DL) has demonstrated the advantages of Neural Networks (NN) in a myriad of communication and information theory applications where no analytical solutions exists [6], [7]. For instance in [8], a tight lower bound on the mutual information between two high-dimensional continuous variables was estimated with NN. Another recurring motive for the use of NN in communications has to do with the amount of data at hand. Several data-driven solutions were described in [9], [10] for scenarios with small amounts of data, since obtaining data samples in the real world is costly and hard to collect on-the-fly. On the other hand, one should not belittle the benefits of unlimited simulated data, see [11], [12].

Lately, two main classes of decoders have been put forward in machine learning for decoding. The first is the class of model-free decoders employing neural network architectures as in [13], [14]. The second is composed of model-based decoders [15]–[19] implementing parameterized versions of classical BP decoders. Currently, the model-based approach dominates, but it suffers from a regularized hypothesis space due to its inductive bias.

Our work leverages permutation groups and DL to enhance the decoding capabilities of constrained model-based decoders. First, a self-attention model (introduced in Section III) [20] is employed to embed all the differentiated group permutations of a code in a word-independent manner, by extracting relevant features. This is done once before the test phase during a preprocess phase. At test time, a trained NN accepts a corrupted word and the embedded permutations and predicts the probability for successful decoding for each permutation. Thereafter, a set of either one, five or ten most-probable-to-decode permutations are chosen, and decoding is carried out on the permuted channel words rather than decoding an arbitrary dataset with all permutations, and empirically choosing the best subset of them.

The presented method is simulated on Bose-Chaudhuri-Hocquenghem (BCH) codes of varying lengths, achieving gains of up to 2.75dB over the random permutation selection baseline.

The remainder of this paper is organized as follows. The background on coding, code permutations, node embedding and self-attention is provided in Section III. The formulation and method of the Graph Permutation Selection (GPS) is presented in Section IV and related work is discussed in Section II. Finally experimental setup and results are detailed in Section V.

II. RELATED WORK

Permutation decoding (PD) has attracted renewed attention [17], [21], [22] given its proven gains for 5G-standard approved polar codes. [21] suggested a novel PD method for these codes. However, the main novelty lies in the proposed stopping criteria for the list-decoder, whereas the permutations are chosen in a random fashion. The authors in [17] presented an algorithm to form a permutation set, computed by fixing several first layers of the underlying structure of the polar decoder, and only permuting the last layers. The original graph is included in this set as a default, with additional permutations added during the process of a limited-space search. Finally we refer to [22] which proposes a successive permutations scheme that finds suitable permutations as decoding progresses. Again, due to the exploding search space, they only considered the cyclic shifts of each layer. This limited-search first appeared in [23].

Most PD methods, like the ones mentioned above, have made valuable contributions. We, on the other hand, see the choice of permutation as the most integral part of PD, and suggest a pre-decoding module to choose the best fitting one. Note however that a direct comparisons between the PD model-based works mentioned and ours are infeasible.

Regarding model-free approaches, we refer in particular to [24] since it integrates permutation groups into a model-free approach. In that paper, the decoding network accepts the syndrome of the hard decisions as part of the input. This way, domain knowledge is incorporated into the model-free approach. We introduce domain knowledge by training the permutation embedding on the parity-check matrix and accepting the permuted syndrome. Furthermore, each word is chosen as a fitting permutation such that the sum of LLRs in the positions of the information-bits is maximized. Note that this approach only benefits model-free decoders. Here as well comparisons are infeasible.

III. BACKGROUND

A. Coding

In a typical communication system, first, a length message is encoded by a generator matrix length . Every codeword isfies is the parity-check matrix (uniquely de-fined by ). Next, the codeword is modulated by the Binary Phase Shift Keying (BPSK) mapping (resulting in a modulated word . After transmission through the additive white Gaussian noise (AWGN) channel, the received word is , where .

At the receiver, the received word is checked for any detectable errors. For that purpose, an estimated codeword calculated using a hard decision (HD) rule: syndrome is all zeros, one outputs and concludes. A non-zero syndrome indicates that channel errors occurred. Then, a decoding function , is utilized with output . One standard soft-decision decoding algorithm is Belief Propagation (BP). BP is a graph-based inference algorithm that can be used to decode corrupted codewords in an iterative manner, working over a factor graph known as the Tanner graph. The BP algorithm operates by passing messages over the nodes of the Tanner graph until convergence or a maximum number of iterations is reached. One property known to effect the convergence of the algorithm is cycles. Cycles in a Tanner graph refer to a subset of nodes connected to each other and inducing a closed loop with every edge appearing once. Messages that are propagated along cycles become correlated after several BP iterations, preventing convergence to the correct posterior distribution and thus reducing overall decoding performance. We refer the interested reader to [25] for a full derivation of the BP for linear codes, and to [26] for more details on the effects of cycles in codes.

B. Permutation Group of a code

codeword exchanges the positions of the entries of :

A permutation automorphism of a given code implies . The group of all automorphism permutations of a code C is denoted , also referred to as the PG of the code. A widely employed family of codes, with known PGs [27], is the BCH family. The PGs of this code are presented in [28, pp. 233]:

with . Thus a total of permutations compose .

One possible way to mitigate the detrimental effects of cycles is by using code permutations. We can apply BP on the permuted received word and then apply the inverse permutation on the decoded word. This can be viewed as applying BP on the original received word with different parity-check matrix. Since there are cycles in the Tanner graph there is no guarantee that the BP will converge to an optimal solution and each permutation enables a different decoding attempt. This strategy has proved to yield to a better convergence and overall decoding performance gains [4], as observed in our experiments, in Section V.

C. Graph Node Embedding

The method we propose uses a node embedding technique for embedding the variable nodes of the code’s Tanner graph, thus taking the code structure into consideration. Specifically, in Sec. IV-B we employ the node2vec [29] method. We briefly describe this method and the reader can refer to the paper for more technical details. The task of node embedding is to encode nodes in a graph as low-dimensional vectors that summarize their relative graph position and the structure of their local neighborhood. Each learned vector corresponds to a node in the graph, and it has been shown that in the learned vector space, geometric relations are captured; e.g., interactions that are modeled as edges between the nodes in the graph. Specifically, node2vec is trained by maximizing the mean probability of the occurrence of subsequent nodes in fixed length sampled random walks. It employs both breadth-first (BFS) and depth-first (DFS) graph searches to produce high quality informative node representations.

D. Self-Attention

An attention mechanism for neural networks that was designed to enable neural models to focus on the most relevant parts of the input. This modern neural architecture allows for the use of weighted averaging to optimize a task objective and to deal with variable sized inputs. When feeding an input sequence into an attention model, the resulting output is an embedded representation of the input. When a single sequence is fed, the attentive mechanism is employed to attend to all positions within the same sequence. This is commonly referred to as the self-attention representation of a sequence. Initially, self-attention modelling was used in conjunction with recurrent neural networks (RNNs) and convolutional neural networks (CNNs) mostly for natural language processing (NLP) tasks. In [30], this setup was first employed and was shown to produce superior results on multiple automatic machine translation tasks.

Recently, an advanced form of modelling attentive relations was introduced. Transformer networks allows modeling intersequence dependencies regardless to the position in the input sequence. [20] demonstrated that machine translation models could achieve state-of-the-art results by solely using this self-attention model. A more recent family of Transformer-based self-attentive models [31]–[33], uses multiple self-attention layers, significantly advanced the state-of-the-art in various linguistic tasks rather than machine translation, e.g. question answering [33], coreference resolution [34] and a variety of linguistic tasks according to the GLUE benchmark [35]. Transformer networks have been employed to non-NLP fields as well, e.g., visual object detection [36], medical imaging [37] and recommender systems [38], [39]

In this work we use self-attention for permutation representation. This mechanism enables better and richer permutation modelling compared to a non-attentive representation. The rationale behind using self-attention comes from permutation distance metrics preservation; a pair of “similar” permutations will have a close geometric self-attentive representation in the learned vector space, since the number of index swaps between permutations only affects the positional embedding additions.

IV. THE DECODING ALGORITHM

A. Problem Formulation and Algorithm Overview

Assume we want to decode a received word encoded by a code C. Picking a permutation from the PG may result in better decoding capabilities. However, executing the decoding algorithm for each permutation within the PG is a computationally prohibitive task especially if the code permutation group is large. An alternative approach involves first choosing the best permutation and only then decoding the corresponding permuted word.

Given a received word , the optimal single permutation is the one that minimizes the bit error rate (BER):

where is the submitted codeword and BER is the Hamming distance between binary vectors. The solution to Eq. (1) is intractable since the correct codeword is not known in the decoding process. We propose a data-driven approach as an approximate solution. The gist of our approach is to estimate the best permutation without applying a tedious decoding process for each code permutation and without relying on the correct codeword . We highlight the key points of our approach below, and elaborate on each one in the rest of this section. Our architecture is depicted in Fig. 1. The main components are the permutation embedding (Section IV-B) and the permutation classifier (Section IV-C). First, the permutation embedding block perm2vec receives a permutation , and outputs an embedding vector Next, the vectors are the input to the permutation classifier that computes an estimation of the probability of word to be successfully decoded by dec. Next, we select the permutation whose probability of successful decoding is maximal:

and decoding is done on . Finally, the decoded word is outputted. The full algorithm is presented in Algorithm 1.

B. Permutation Embedding

Our permutation embedding model consists of two sublayers: self-attention followed by an average pooling layer. To the best of our knowledge, our work is the first to leverage the benefits of the self-attention network in physical layer communication systems.

In [20], positional encodings are vectors that are originally compounded with entries based on sinusoids of varying frequency. They are added as input elements prior to the first self-attention layer, in order to add a position-dependent signal to each embedded token and help the model incorporate the order of the input tokens by injecting information about the relative or absolute position of the tokens. Inspired by this method and other recent NLP works [31]–[33], we used learned positional embeddings which have been shown to yield better performance than the constant positional encodings, but instead of randomly initializing them, we first pre-train node2vec node embeddings over the corresponding code’s Tanner graph. We then take the variable nodes output embeddings to serve as the initial positional embeddings. This helps our model to incorporate some graph structure and to use the code information. We denote by the dimension of the output embedding space (this hyperparameter is set before the node embedding training). It should be noted that any other node embedding model can be trained instead of node2vec which we leave for future work. Self-attention sublayers usually employ

Fig. 1: A schematic architecture of the Graph Permutation Selection (GPS) classifier.

multiple attention heads, but we found that using one attention head was sufficient.

Denote the embedding vector of by and the embedding of the variable nodes by . Note that both are learned, but as stated above, v is initialized with the output of the pre-trained variable node embedding over the code’s Tanner graph. Thereafter, the augmented attention head operates on an input vector sequence, vectors where .

The attention head computes a same-length vector sequence , where . Each encoder’s output vector is computed as a weighted sum of linearly transformed input entries,

where the attention weight coefficient is computed using the softmax function,

of the normalized relative attention between two input vectors and ,

Note that are learned parameters matrices.

Fig. 2: Illustration of the perm2vec model architecture. The input embeddings are the sum of the raw permutation embeddings and the position embeddings, which are initialized using node2vec.

Finally, the vector representation of the permutation is computed by applying the average pooling operation across the sequence of output vectors,

and is passed to the permutation classifier. The permutation embedding is illustrated in Fig. 2.

C. Permutation Classifier

We next describe a classifier that predicts the probability of a successful decoding given received word and a permutation

TABLE I: Values of the hyper-parameters, permutation embed- ding and classifier.

where,

Here stands for concatenation and stands for the Hadamard product. We also define

where and are the learned matrices and and are the learned biases respectively.

Finally, the estimated probability for successful decoding of is computed as follows,

where is the last hidden layer and is the sigmoid function. The Graph Permutation Selection (GPS) algorithm for choosing the most suitable permutation is depicted in Fig. 1

D. Training Details

We jointly train the permutation embedding and the permutation classifier, employing a single decoder dec. The cross entropy loss computed for a single received word is:

where if decoding of was successful under permutation , otherwise . The set of decoders dec used for the dataset generation is described in Section V.

Each mini-batch consists of received words from the generated training dataset. This dataset contains pairs of permuted word together with a corresponding label . We used an all-zero transmitted codeword. Empirically, using only the all-zero word seems to be sufficient for training. Nonetheless, the test dataset is composed of randomly chosen binary codewords , as one would expect, without any degradation in performance. Each codeword is transmitted over the AWGN channel with specified by a given signal-to-noise ratio (SNR), with an equal number of positive examples () and negative examples () in each batch. The overall hyperparameters used for training the perm2vec and the GPS classifier are depicted in Table I.

To pre-train the node embeddings, we used the default hyperparameters suggested in the original work [29] except for the following modifications: number of random walks 2000, walk length 10, neighborhood size 10 and node embedding dimension .

Note that because perm2vec depends solely on a given permutation (per code), all embeddings can be computed once and stored in memory. Then, at test time, determination of depends on the latency of parallelizable forwardpasses of the permutation classifier.

V. EXPERIMENTAL SETUP AND RESULTS

The proposed GPS algorithm is evaluated on five different BCH codes - (31, 16), (63, 36), (63, 45), (127, 64), and (255, 163). As for the decoder dec, we applied GPS on top of the BP (GPS+BP) and on top of a pre-trained WBP (GPS+WBP), trained with the configuration from [41]. For the longest code, (255, 163), we did not apply the WBP algorithm as it raised us memory issues on our available hardware. All decoders are tested with 5 BP iterations and the syndrome stopping criterion is adopted after each iteration. These decoders are based on the systematic parity-check matrices, , since these matrices are commonly used. For comparison, we employ a random permutation selection (from the PG) as a baseline for each decoder - random+BP and random+WBP. In addition, we depict the maximum likelihood results. Maximum likelihood results were taken from [42], [43]. In addition, see [25, Section 1.5] for more theoretical details.

A. Performance Analysis

We assess the quality of our GPS using the BER metric, for different SNR values [dB] when at least 1000 error words occurred. Note that we refer to the SNR as the normalized SNR (), which is commonly used in digital communication. Fig. 3 presents the results for BCH(31,16) and BCH(63,36) and Table II lists the results for all codes and decoders, with our GPS method and random selection. For clarity, in Table II we present the BER negative decimal logarithm only for the baselines, considered as the top-1 results. As can be seen, using our preprocess method outperforms the examined baselines. For BCH(31,16) (Fig. 3a), perm2vec together with BP gains up to 2.75 dB as compared to the random BP and up to 1.8 dB over the random WBP. Similarly, for BCH(63,36) (Fig. 3b), our method outperforms the random BP by up to 2.75 dB and by up to 2.2 dB with respect to WBP. We also observed a

Fig. 3: BER vs. SNR for GPS and random permutation selection. Both BP and WBP are considered.

TABLE II: A comparison of the BER negative decimal logarithm for three SNR values [dB]. Higher is better. We bold the best results and underline the second best ones.

small gap between our method and the maximum likelihood lower bound. The maximal gaps are 0.4 dB and 1.4 dB for BCH(31,16) and BCH(63,36), respectively.

B. Top-𝜅 Evaluation

selecting the output word by the criterion mentioned above. In Fig. 4 the reported results are for BCH(63,45). We observed an improvement of 0.4 dB between and only 0.2 dB between and . Furthermore, the gap between and the BP lower bound is small (0.4 dB). Note that using the BP lower bound is impractical since each BP scales by while our method only scales by Moreover, in our simulations, we found that the latency for 5 BP iterations was 10-100 times greater compared to our classifier’s inference.

C. Embedding Size Evaluation

In Fig. 5 we present the performance of our method using two embedding sizes. We compare our base model, that uses embedding size to the small model that uses embedding size (note that ). Recall that

Fig. 4: BER vs. SNR performance comparison for top-evaluation for BCH(63,45).

Fig. 5: BER vs. SNR performance comparison for embedding size evaluation for various BCH codes.

changing the embedding size also affects the number of parameters in , as in Eq. (3). Using a smaller embedding size causes a slight degradation in performance, but still dramatically improves the random BP baseline. For the shorter BCH(63,36), the gap is 0.5 dB and for BCH(127,64) the gap is 0.2 dB.

D. Evaluating different Parity-Check Matrices

In Fig. 7 we depict the results for applying our method over two different parity-check matrices: the cycle-reduced HCR and the systematic Hsys, both define the codes. These matrices are taken from [44].

By inspecting the number of length-4 cycles in the Tanner graphs induced by these matrices, substantial differences can be noticed, as seen in Fig 6. While HCR induces a low amount of length-4 cycles evenly across all nodes, Hsys induces an imbalanced distribution: the information bits have high amount of length-4 cycles while the parity bits have none.

Fig. 6: Distribution of variable node’s normalized number of length-4 cycles for each BCH code. Note that each code was normalized by the following value (from left to right) - 3042, 1022 and 20214.

As is evident from Fig. 7, our method is able to exploit the structure of Hsys, outperforming the HCR based parity-check matrix by 0.75, 0.4 and 0.6 dB for BCH (63,36), (63,45) and (127,64), respectively.

E. Ablation Study

We present an analysis over a number of facets of our permutation embedding and classifier for BCH (63,36), (63,45) and (127,64). We fixed the BER to and inspected the SNR degradation of various excluded components with respect to our complete model. We present the ablation analysis for our permutation classifier and permutation embedding separately.

Regarding the permutation classifier, we evaluated the complete classifier (described in Section IV-C) against its three partial versions; Omitting the permutation embedding feature vector caused a performance degradation of 1.5 to 2 dB. Note that the permutation still affects both and . Excluding or caused a degradation of 1-1.5 and 2.5-3 dB, respectively. In addition, we tried a simpler feature vector which led to a performance degradation of 1 to 1.5 dB.

For ablating the permutation embedding component, we compared the complete perm2vec (described in Section IV-B) against its two partial versions: omitting the self-attention mechanism decreased performance by 1.25 to 1.75 dB. Initializing the positional embedding randomly instead of using node embedding also caused a performance degradation of 1.25 to 1.75 dB.

The above results illustrate the advantages of our complete method, and, as observed, the importance of the permutation embedding component. Note that we preserved the total number of parameters after each exclusion for fair comparison.

VI. CONCLUSION

We presented a self-attention mechanism to improve decoding of linear error correction codes. For every received noisy word, the proposed model selects a suitable permutation out of the code’s PG without actually trying all the permutation

Fig. 7: Performance comparison for different parity-check matrices.

based decodings. Our method pre-computes the permutations’ representations, thus allowing for fast and accurate permutation selection at the inference phase. Furthermore, our method is independent of the code length and therefore is considered scalable. We demonstrate the effectiveness of perm2vec by showing significant BER performance improvement compared to the baseline decoding algorithms for various code lengths. Future research should extend our method to polar codes, replacing the embedded Tanner graph variable nodes by embedded factor graph variable nodes, and further investigate other codes, e.g., non-binary Reed–Solomon codes. Another possible research direction would be to train both perm2vec and WBP jointly in an end-to-end manner, or consider an iterative training solution that alternates between training the graph permutation selection model and the WBP algorithm (or any other trainable decoder).

ACKNOWLEDGMENT

The authors would like to thank the reviewers for their helpful comments that improved the presentation of the paper.

REFERENCES

[1] C. E. Shannon, “A mathematical theory of communication,” Bell system technical journal, vol. 27, 1948.

[2] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. Elsevier, 2014.

[3] J. Macwilliams, “Permutation decoding of systematic codes,” The Bell System Technical Journal, vol. 43, no. 1, 1964.

[4] I. Dimnik and Y. Be’ery, “Improved random redundant iterative HDPC decoding,” IEEE Transactions on Communications, vol. 57, pp. 1982– 1985, 2009.

[5] A. Elkelesh, M. Ebada, S. Cammerer, and S. ten Brink, “Belief propagation list decoding of polar codes,” IEEE Communications Letters, vol. 57, pp. 1536–1539, 2018.

[6] O. Simeone, “A very brief introduction to machine learning with applications to communication systems,” IEEE Transactions on Cognitive Communications and Networking, vol. 4, pp. 648–664, 2018.

[7] A. Zappone, M. Di Renzo, and M. Debbah, “Wireless networks design in the era of deep learning: Model-based, ai-based, or both?” IEEE Transactions on Communications, vol. 67, pp. 7331–7376, 2019.

[8] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in Proceedings of the International Conference on Machine Learning (ICML), 2018.

[9] A. Caciularu and D. Burshtein, “Blind channel equalization using variational autoencoders,” in International Conference on Communications Workshops (ICC Workshops), 2018.

[10] X. Lin, I. Sur, S. A. Nastase, A. Divakaran, U. Hasson, and M. R. Amer, “Data-efficient mutual information neural estimator,” arXiv preprint arXiv:1905.03319, 2019.

[11] I. Be’ery, N. Raviv, T. Raviv, and Y. Be’ery, “Active deep decoding of linear codes,” IEEE Transactions on Communications, vol. 68, pp. 728–736, 2020.

[12] O. Simeone, S. Park, and J. Kang, “From learning to meta-learning: Reduced training overhead and complexity for communication systems,” arXiv preprint arXiv:2001.01227, 2020.

[13] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learningbased channel decoding,” in Annual Conference on Information Sciences and Systems (CISS), 2017.

[14] H. Kim, Y. Jiang, R. Rana, S. Kannan, S. Oh, and P. Viswanath, “Communication algorithms via deep learning,” International Conference on Learning Representations (ICLR, 2018.

[15] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using deep learning,” in Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016.

[16] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be’ery, “Deep learning methods for improved decoding of linear codes,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, pp. 119–131, 2018.

[17] N. Doan, S. A. Hashemi, M. Mondelli, and W. J. Gross, “On the decoding of polar codes on permuted factor graphs,” in Global Communications Conference (GLOBECOM), 2018.

[18] M. Lian, F. Carpi, C. Häger, and H. D. Pfister, “Learned beliefpropagation decoding with simple scaling and SNR adaptation,” in IEEE International Symposium on Information Theory (ISIT), 2019.

[19] F. Carpi, C. Häger, M. Martalò, R. Raheli, and H. D. Pfister, “Reinforcement learning for channel coding: Learned bit-flipping decoding,” in Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2019.

[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NIPS), 2017.

[21] M. Kamenev, Y. Kameneva, O. Kurmaev, and A. Maevskiy, “A new permutation decoding method for Reed-Muller codes,” in IEEE International Symposium on Information Theory (ISIT), 2019.

[22] S. A. Hashemi, N. Doan, M. Mondelli, and W. J. Gross, “Decoding Reed-Muller and polar codes by successive factor graph permutations,” in International Symposium on Turbo Codes & Iterative Information Processing (ISTC), 2018.

[23] S. B. Korada, “Polar codes for channel and source coding,” EPFL, Tech. Rep. 4461, 2009.

[24] A. Bennatan, Y. Choukroun, and P. Kisilev, “Deep learning for decoding of linear codes-a syndrome-based approach,” in International Symposium on Information Theory (ISIT), 2018.

[25] T. Richardson and R. Urbanke, Modern coding theory. Cambridge university press, 2008.

[26] A. Dehghan and A. H. Banihashemi, “On the tanner graph cycle distribution of random LDPC, random protograph-based LDPC, and random quasi-cyclic LDPC code ensembles,” IEEE Transactions on Information Theory, vol. 64, pp. 4438–4451, 2018.

[27] K. Guenda, “The permutation groups and the equivalence of cyclic and quasi-cyclic codes,” arXiv preprint arXiv:1002.2456, 2010.

[28] F. J. MacWilliams and N. J. A. Sloane, The theory of error-correcting codes. Elsevier, 1977, vol. 16.

[29] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in International Conference on Knowledge Discovery and Data Mining (KDD), 2016.

[30] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations (ICLR), 2015.

[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pretraining of deep bidirectional transformers for language understanding,” in Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.

[32] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.

[33] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.

[34] M. Joshi, O. Levy, L. Zettlemoyer, and D. Weld, “BERT for coreference resolution: Baselines and analysis,” in Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.

[35] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.

[36] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” arXiv preprint arXiv:2005.12872, 2020.

[37] M. C. Lee, O. Oktay, A. Schuh, M. Schaap, and B. Glocker, “Image-and-spatial transformer networks for structure-guided image registration,” in International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2019.

[38] O. Barkan, A. Caciularu, O. Katz, and N. Koenigstein, “Attentive item2vec: Neural attentive user representations,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

[39] O. Barkan, Y. Fuchs, A. Caciularu, and N. Koenigstein, “Explainable recommendations via attentive multi-persona collaborative filtering,” in ACM Conference on Recommender Systems (RecSys), 2020.

[40] O. Barkan, N. Razin, I. Malkiel, O. Katz, A. Caciularu, and N. Koenigstein, “Scalable attentive sentence-pair modeling via distilled sentence embedding,” AAAI, 2020.

[41] E. Nachmani, E. Marciano, D. Burshtein, and Y. Be’ery, “RNN decoding of linear block codes,” arXiv preprint arXiv:1702.07560, 2017.

[42] E. Nachmani, Y. Bachar, E. Marciano, D. Burshtein, and Y. Be’ery, “Near maximum likelihood decoding with deep learning,” arXiv preprint arXiv:1801.02726, 2018.

[43] S. E. Alnawayseh and P. Loskot, “Ordered statistics-based list decoding techniques for linear binary block codes,” EURASIP Journal on Wireless Communications and Networking, vol. 2012, no. 1, p. 314, 2012.

[44] M. Helmling, S. Scholl, F. Gensheimer, T. Dietz, K. Kraft, S. Ruzika, and N. Wehn, “Database of Channel Codes and ML Simulation Results,” www.uni-kl.de/channel-codes, 2019.