b

DiscoverSearch
About
My stuff
perm2vec: Graph Permutation Selection for Decoding of Error Correction Codes using Self-Attention
2020·arXiv
Abstract
Abstract

Error correction codes are an integral part of communication applications, boosting the reliability of transmission. The optimal decoding of transmitted codewords is the maximum likelihood rule, which is NP-hard due to the curse of dimensionality. For practical realizations, sub-optimal decoding algorithms are employed; yet limited theoretical insights prevent one from exploiting the full potential of these algorithms. One such insight is the choice of permutation in permutation decoding. We present a data-driven framework for permutation selection, combining domain knowledge with machine learning concepts such as node embedding and self-attention. Significant and consistent improvements in the bit error rate are introduced for all simulated codes, over the baseline decoders. To the best of the authors’ knowledge, this work is the first to leverage the benefits of the neural Transformer networks in physical layer communication systems.

Index Terms—Decoding, error correcting codes, belief propagation, deep learning.

Shannon’s well known channel coding theorem [1] states that for every channel a code exists, such that encoded messages can be transmitted and decoded with an error as low as needed while the transmission rate is below the channel’s capacity. The converse is no less crucial: Transmitting with rate beyond the capacity enforces uncertain reliability. For practical applications, latency and computational complexity constrain code size. Thus, structured codes with low complexity encoding and decoding schemes, were devised.

Some structured codes possess a main feature known as the permutation group (PG). The permutations in PG map each codeword to some distinct codeword. This is crucial to different decoders, such as the parallelizable soft-decision Belief Propagation (BP) [2] decoder. It empirically stems from evidence that whereas decoding various corrupted words may fail, decoding a permuted version of the same corrupted words may succeed [3]. For instance, this is exploited in the mRRD [4] and the BPL [5] algorithms, which perform multiple runs over different permuted versions of the same corrupted codewords by trading off complexity for higher decoding gains.

Nonetheless, there is room for improvement since not all permutations are required for successful decoding of a given word: simply a fitting one is needed. Our work deals with obtaining the best fit permutation per word, by removing redundant runs which thus preserve computational resources. Nevertheless, it remains unclear how to obtain this type of permutation as indicated by the authors in [5] who stated in their Section III.A, “there exists no clear evidence on which graph permutation performs best for a given input”. Explicitly, the goal is to approximate a function mapping from a single word to the most probable-to-decode permutation. While analytical derivation of this function is hard, advances in the machine learning field may be of use in the computation of this type of function.

The recent emergence of Deep Learning (DL) has demonstrated the advantages of Neural Networks (NN) in a myriad of communication and information theory applications where no analytical solutions exists [6], [7]. For instance in [8], a tight lower bound on the mutual information between two high-dimensional continuous variables was estimated with NN. Another recurring motive for the use of NN in communications has to do with the amount of data at hand. Several data-driven solutions were described in [9], [10] for scenarios with small amounts of data, since obtaining data samples in the real world is costly and hard to collect on-the-fly. On the other hand, one should not belittle the benefits of unlimited simulated data, see [11], [12].

Lately, two main classes of decoders have been put forward in machine learning for decoding. The first is the class of model-free decoders employing neural network architectures as in [13], [14]. The second is composed of model-based decoders [15][19] implementing parameterized versions of classical BP decoders. Currently, the model-based approach dominates, but it suffers from a regularized hypothesis space due to its inductive bias.

Our work leverages permutation groups and DL to enhance the decoding capabilities of constrained model-based decoders. First, a self-attention model (introduced in Section III) [20] is employed to embed all the differentiated group permutations of a code in a word-independent manner, by extracting relevant features. This is done once before the test phase during a preprocess phase. At test time, a trained NN accepts a corrupted word and the embedded permutations and predicts the probability for successful decoding for each permutation. Thereafter, a set of either one, five or ten most-probable-to-decode permutations are chosen, and decoding is carried out on the permuted channel words rather than decoding an arbitrary dataset with all permutations, and empirically choosing the best subset of them.

The presented method is simulated on Bose-Chaudhuri-Hocquenghem (BCH) codes of varying lengths, achieving gains of up to 2.75dB over the random permutation selection baseline.

The remainder of this paper is organized as follows. The background on coding, code permutations, node embedding and self-attention is provided in Section III. The formulation and method of the Graph Permutation Selection (GPS) is presented in Section IV and related work is discussed in Section II. Finally experimental setup and results are detailed in Section V.

Permutation decoding (PD) has attracted renewed attention [17], [21], [22] given its proven gains for 5G-standard approved polar codes. [21] suggested a novel PD method for these codes. However, the main novelty lies in the proposed stopping criteria for the list-decoder, whereas the permutations are chosen in a random fashion. The authors in [17] presented an algorithm to form a permutation set, computed by fixing several first layers of the underlying structure of the polar decoder, and only permuting the last layers. The original graph is included in this set as a default, with additional permutations added during the process of a limited-space search. Finally we refer to [22] which proposes a successive permutations scheme that finds suitable permutations as decoding progresses. Again, due to the exploding search space, they only considered the cyclic shifts of each layer. This limited-search first appeared in [23].

Most PD methods, like the ones mentioned above, have made valuable contributions. We, on the other hand, see the choice of permutation as the most integral part of PD, and suggest a pre-decoding module to choose the best fitting one. Note however that a direct comparisons between the PD model-based works mentioned and ours are infeasible.

Regarding model-free approaches, we refer in particular to [24] since it integrates permutation groups into a model-free approach. In that paper, the decoding network accepts the syndrome of the hard decisions as part of the input. This way, domain knowledge is incorporated into the model-free approach. We introduce domain knowledge by training the permutation embedding on the parity-check matrix and accepting the permuted syndrome. Furthermore, each word is chosen as a fitting permutation such that the sum of LLRs in the positions of the information-bits is maximized. Note that this approach only benefits model-free decoders. Here as well comparisons are infeasible.

A. Coding

In a typical communication system, first, a length  𝑘 binarymessage  𝒎 ∈ {0, 1}𝑘 is encoded by a generator matrix  𝑮 into alength  𝑛 codeword 𝒄 = 𝑮𝑇𝒎 ∈ {0, 1}𝑛. Every codeword  𝒄 sat-isfies  𝑯𝒄 = 0, where 𝑯is the parity-check matrix (uniquely de-fined by  𝑮𝑯𝑇 = 0). Next, the codeword  𝒄is modulated by the Binary Phase Shift Keying (BPSK) mapping (0 → 1, 1 → −1)resulting in a modulated word  𝒙. After transmission through the additive white Gaussian noise (AWGN) channel, the received word is  𝒚 = 𝒙 + 𝒛, where  𝒛 ∼ N (0, 𝜎2𝑧 𝑰𝑛).

At the receiver, the received word is checked for any detectable errors. For that purpose, an estimated codeword  ˆc iscalculated using a hard decision (HD) rule:  ˆ𝑐𝑖 = 1{𝑦𝑖<0}. If thesyndrome  𝒔 = 𝑯 ˆ𝒄is all zeros, one outputs  ˆ𝒄and concludes. A non-zero syndrome indicates that channel errors occurred. Then, a decoding function  dec : 𝒚 → {0, 1}𝑛, is utilized with output ˆ𝒄. One standard soft-decision decoding algorithm is Belief Propagation (BP). BP is a graph-based inference algorithm that can be used to decode corrupted codewords in an iterative manner, working over a factor graph known as the Tanner graph. The BP algorithm operates by passing messages over the nodes of the Tanner graph until convergence or a maximum number of iterations is reached. One property known to effect the convergence of the algorithm is cycles. Cycles in a Tanner graph refer to a subset of nodes connected to each other and inducing a closed loop with every edge appearing once. Messages that are propagated along cycles become correlated after several BP iterations, preventing convergence to the correct posterior distribution and thus reducing overall decoding performance. We refer the interested reader to [25] for a full derivation of the BP for linear codes, and to [26] for more details on the effects of cycles in codes.

B. Permutation Group of a code

image

codeword  𝒄 = (𝑐1, ..., 𝑐𝑛)exchanges the positions of the entries of  𝒄:

image

A permutation  𝜋 is anautomorphism of a given code  C if 𝒄 ∈ Cimplies  𝜋(𝒄) ∈ C. The group of all automorphism permutations of a code C is denoted  𝐴𝑢𝑡(C), also referred to as the PG of the code. A widely employed family of codes, with known PGs [27], is the BCH family. The PGs of this code are presented in [28, pp. 233]:

image

with  𝛼 ∈ {1, . . . , log2(𝑛 + 1)} and 𝛽 ∈ {1, . . . , 𝑛}. Thus a total of  𝑛 log2(𝑛 + 1)permutations compose  𝐴𝑢𝑡(C).

One possible way to mitigate the detrimental effects of cycles is by using code permutations. We can apply BP on the permuted received word and then apply the inverse permutation on the decoded word. This can be viewed as applying BP on the original received word with different parity-check matrix. Since there are cycles in the Tanner graph there is no guarantee that the BP will converge to an optimal solution and each permutation enables a different decoding attempt. This strategy has proved to yield to a better convergence and overall decoding performance gains [4], as observed in our experiments, in Section V.

C. Graph Node Embedding

The method we propose uses a node embedding technique for embedding the variable nodes of the code’s Tanner graph, thus taking the code structure into consideration. Specifically, in Sec. IV-B we employ the node2vec [29] method. We briefly describe this method and the reader can refer to the paper for more technical details. The task of node embedding is to encode nodes in a graph as low-dimensional vectors that summarize their relative graph position and the structure of their local neighborhood. Each learned vector corresponds to a node in the graph, and it has been shown that in the learned vector space, geometric relations are captured; e.g., interactions that are modeled as edges between the nodes in the graph. Specifically, node2vec is trained by maximizing the mean probability of the occurrence of subsequent nodes in fixed length sampled random walks. It employs both breadth-first (BFS) and depth-first (DFS) graph searches to produce high quality informative node representations.

D. Self-Attention

An attention mechanism for neural networks that was designed to enable neural models to focus on the most relevant parts of the input. This modern neural architecture allows for the use of weighted averaging to optimize a task objective and to deal with variable sized inputs. When feeding an input sequence into an attention model, the resulting output is an embedded representation of the input. When a single sequence is fed, the attentive mechanism is employed to attend to all positions within the same sequence. This is commonly referred to as the self-attention representation of a sequence. Initially, self-attention modelling was used in conjunction with recurrent neural networks (RNNs) and convolutional neural networks (CNNs) mostly for natural language processing (NLP) tasks. In [30], this setup was first employed and was shown to produce superior results on multiple automatic machine translation tasks.

Recently, an advanced form of modelling attentive relations was introduced. Transformer networks allows modeling intersequence dependencies regardless to the position in the input sequence. [20] demonstrated that machine translation models could achieve state-of-the-art results by solely using this self-attention model. A more recent family of Transformer-based self-attentive models [31][33], uses multiple self-attention layers, significantly advanced the state-of-the-art in various linguistic tasks rather than machine translation, e.g. question answering [33], coreference resolution [34] and a variety of linguistic tasks according to the GLUE benchmark [35]. Transformer networks have been employed to non-NLP fields as well, e.g., visual object detection [36], medical imaging [37] and recommender systems [38], [39]

In this work we use self-attention for permutation representation. This mechanism enables better and richer permutation modelling compared to a non-attentive representation. The rationale behind using self-attention comes from permutation distance metrics preservation; a pair of “similar” permutations will have a close geometric self-attentive representation in the learned vector space, since the number of index swaps between permutations only affects the positional embedding additions.

A. Problem Formulation and Algorithm Overview

Assume we want to decode a received word  𝒚encoded by a code C. Picking a permutation from the PG  𝐴𝑢𝑡(C)may result in better decoding capabilities. However, executing the decoding algorithm for each permutation within the PG is a computationally prohibitive task especially if the code permutation group is large. An alternative approach involves first choosing the best permutation and only then decoding the corresponding permuted word.

Given a received word  𝒚, the optimal single permutation 𝜋★ ∈ 𝐴𝑢𝑡(C)is the one that minimizes the bit error rate (BER):

image

where  𝒄is the submitted codeword and BER is the Hamming distance between binary vectors. The solution to Eq. (1) is intractable since the correct codeword is not known in the decoding process. We propose a data-driven approach as an approximate solution. The gist of our approach is to estimate the best permutation without applying a tedious decoding process for each code permutation and without relying on the correct codeword  𝒄. We highlight the key points of our approach below, and elaborate on each one in the rest of this section. Our architecture is depicted in Fig. 1. The main components are the permutation embedding (Section IV-B) and the permutation classifier (Section IV-C). First, the permutation embedding block perm2vec receives a permutation  𝜋, and outputs an embedding vector  𝒒 𝜋.Next, the vectors  𝜋(𝒚) and 𝒒 𝜋are the input to the permutation classifier that computes an estimation  𝑝(𝒚, 𝜋)of the probability of word  𝜋(𝒚)to be successfully decoded by dec. Next, we select the permutation whose probability of successful decoding is maximal:

image

and decoding is done on  ˆ𝜋(𝒚). Finally, the decoded word  ˆ𝒄 =ˆ𝜋−1(dec( ˆ𝜋(𝒚)))is outputted. The full algorithm is presented in Algorithm 1.

B. Permutation Embedding

Our permutation embedding model consists of two sublayers: self-attention followed by an average pooling layer. To the best of our knowledge, our work is the first to leverage the benefits of the self-attention network in physical layer communication systems.

In [20], positional encodings are vectors that are originally compounded with entries based on sinusoids of varying frequency. They are added as input elements prior to the first self-attention layer, in order to add a position-dependent signal to each embedded token and help the model incorporate the order of the input tokens by injecting information about the relative or absolute position of the tokens. Inspired by this method and other recent NLP works [31][33], we used learned positional embeddings which have been shown to yield better performance than the constant positional encodings, but instead of randomly initializing them, we first pre-train node2vec node embeddings over the corresponding code’s Tanner graph. We then take the variable nodes output embeddings to serve as the initial positional embeddings. This helps our model to incorporate some graph structure and to use the code information. We denote by  𝑑𝑤the dimension of the output embedding space (this hyperparameter is set before the node embedding training). It should be noted that any other node embedding model can be trained instead of node2vec which we leave for future work. Self-attention sublayers usually employ

image

Fig. 1: A schematic architecture of the Graph Permutation Selection (GPS) classifier.

image

multiple attention heads, but we found that using one attention head was sufficient.

Denote the embedding vector of  𝜋(𝑖)by  u𝑖 ∈ R𝑑𝑤and the embedding of the variable nodes by  v ∈ R𝑑𝑤. Note that both u𝑖 and vare learned, but as stated above, v is initialized with the output of the pre-trained variable node embedding over the code’s Tanner graph. Thereafter, the augmented attention head operates on an input vector sequence,  W = (w1, . . . , w𝑛) of 𝑛vectors where  w𝑖 ∈ R𝑑𝑤, w𝑖 = u𝑖 + v.

The attention head computes a same-length vector sequence P = (p1, . . . , p𝑛), where  p𝑖 ∈ R𝑑𝑝. Each encoder’s output vector is computed as a weighted sum of linearly transformed input entries,

image

where the attention weight coefficient is computed using the softmax function,

image

of the normalized relative attention between two input vectors 𝒘𝑖and  𝒘 𝑗,

image

Note that  𝑸, 𝑲, 𝑽 ∈ 𝑹𝑑𝑤×𝑑𝑝 are learned parameters matrices.

image

Fig. 2: Illustration of the perm2vec model architecture. The input embeddings are the sum of the raw permutation embeddings and the position embeddings, which are initialized using node2vec.

Finally, the vector representation of the permutation  𝜋is computed by applying the average pooling operation across the sequence of output vectors,

image

and is passed to the permutation classifier. The permutation embedding is illustrated in Fig. 2.

C. Permutation Classifier

We next describe a classifier that predicts the probability of a successful decoding given received word  𝒚and a permutation  𝜋

TABLE I: Values of the hyper-parameters, permutation embed- ding and classifier.

image

where,

image

Here  [·]stands for concatenation and  ◦stands for the Hadamard product. We also define

image

where  𝑾1 ∈ R9𝑑𝑝×2𝑑𝑝, 𝑾2 ∈ R2𝑑𝑝×𝑑𝑝, 𝑾3 ∈ R𝑑𝑝×𝑑𝑝/2and 𝑾4 ∈ R𝑑𝑝/2 are the learned matrices and  𝒃1 ∈ R2𝑑𝑝, 𝒃2 ∈ R𝑑𝑝,𝒃3 ∈ 𝑹𝑑𝑝/2and  𝒃4 ∈ Rare the learned biases respectively.

Finally, the estimated probability for successful decoding of 𝜋(𝒚)is computed as follows,

image

where  𝑔(𝒉)is the last hidden layer and  𝜎(·)is the sigmoid function. The Graph Permutation Selection (GPS) algorithm for choosing the most suitable permutation is depicted in Fig. 1

D. Training Details

We jointly train the permutation embedding and the permutation classifier, employing a single decoder dec. The cross entropy loss computed for a single received word  𝒚is:

image

where  𝑑𝒚,𝜋 = 1if decoding of  𝜋(𝒚)was successful under permutation  𝜋, otherwise  𝑑𝒚, 𝜋 = 0. The set of decoders dec used for the dataset generation is described in Section V.

Each mini-batch consists of  𝐾received words from the generated training dataset. This dataset contains pairs of permuted word  (𝒚, 𝜋)together with a corresponding label 𝑑𝒚, 𝜋. We used an all-zero transmitted codeword. Empirically, using only the all-zero word seems to be sufficient for training. Nonetheless, the test dataset is composed of randomly chosen binary codewords  𝒄 ∈ C, as one would expect, without any degradation in performance. Each codeword is transmitted over the AWGN channel with  𝜎𝑧specified by a given signal-to-noise ratio (SNR), with an equal number of positive examples (𝑑 = 1) and negative examples (𝑑 = 0) in each batch. The overall hyperparameters used for training the perm2vec and the GPS classifier are depicted in Table I.

To pre-train the node embeddings, we used the default hyperparameters suggested in the original work [29] except for the following modifications: number of random walks 2000, walk length 10, neighborhood size 10 and node embedding dimension  𝑑𝑤 = 80.

Note that because perm2vec depends solely on a given permutation (per code), all embeddings can be computed once and stored in memory. Then, at test time, determination of  ˆ𝜋depends on the latency of  𝑛 log2(𝑛 + 1)parallelizable forwardpasses of the permutation classifier.

The proposed GPS algorithm is evaluated on five different BCH codes - (31, 16), (63, 36), (63, 45), (127, 64), and (255, 163). As for the decoder dec, we applied GPS on top of the BP (GPS+BP) and on top of a pre-trained WBP (GPS+WBP), trained with the configuration from [41]. For the longest code, (255, 163), we did not apply the WBP algorithm as it raised us memory issues on our available hardware. All decoders are tested with 5 BP iterations and the syndrome stopping criterion is adopted after each iteration. These decoders are based on the systematic parity-check matrices,  𝑯 =�𝑷𝑇 |𝑰𝑛−𝑘�, since these matrices are commonly used. For comparison, we employ a random permutation selection (from the PG) as a baseline for each decoder - random+BP and random+WBP. In addition, we depict the maximum likelihood results. Maximum likelihood results were taken from [42], [43]. In addition, see [25, Section 1.5] for more theoretical details.

A. Performance Analysis

We assess the quality of our GPS using the BER metric, for different SNR values [dB] when at least 1000 error words occurred. Note that we refer to the SNR as the normalized SNR (𝐸𝑏/𝑁0), which is commonly used in digital communication. Fig. 3 presents the results for BCH(31,16) and BCH(63,36) and Table II lists the results for all codes and decoders, with our GPS method and random selection. For clarity, in Table II we present the BER negative decimal logarithm only for the baselines, considered as the top-1 results. As can be seen, using our preprocess method outperforms the examined baselines. For BCH(31,16) (Fig. 3a), perm2vec together with BP gains up to 2.75 dB as compared to the random BP and up to 1.8 dB over the random WBP. Similarly, for BCH(63,36) (Fig. 3b), our method outperforms the random BP by up to 2.75 dB and by up to 2.2 dB with respect to WBP. We also observed a

image

Fig. 3: BER vs. SNR for GPS and random permutation selection. Both BP and WBP are considered.

TABLE II: A comparison of the BER negative decimal logarithm for three SNR values [dB]. Higher is better. We bold the best results and underline the second best ones.

image

small gap between our method and the maximum likelihood lower bound. The maximal gaps are 0.4 dB and 1.4 dB for BCH(31,16) and BCH(63,36), respectively.

B. Top-𝜅 Evaluation

image

selecting the output word by the  𝑎𝑟𝑔𝑚𝑎𝑥criterion mentioned above. In Fig. 4 the reported results are for BCH(63,45). We observed an improvement of 0.4 dB between  𝜅 = 1 and 𝜅 = 5and only 0.2 dB between  𝜅 = 5and  𝜅 = 10. Furthermore, the gap between  𝜅 = 10and the BP lower bound is small (0.4 dB). Note that using the BP lower bound is impractical since each BP scales by  O(𝑛 log 𝑛)while our method only scales by  O(𝑛).Moreover, in our simulations, we found that the latency for 5 BP iterations was 10-100 times greater compared to our classifier’s inference.

C. Embedding Size Evaluation

In Fig. 5 we present the performance of our method using two embedding sizes. We compare our base model, that uses embedding size  𝑑𝑞 = 80to the small model that uses embedding size  𝑑𝑞 = 20(note that  𝑑𝑞 = 𝑑𝑤). Recall that

image

Fig. 4: BER vs. SNR performance comparison for top-𝜅evaluation for BCH(63,45).

image

Fig. 5: BER vs. SNR performance comparison for embedding size evaluation for various BCH codes.

changing the embedding size also affects the number of parameters in  𝑔, as in Eq. (3). Using a smaller embedding size causes a slight degradation in performance, but still dramatically improves the random BP baseline. For the shorter BCH(63,36), the gap is 0.5 dB and for BCH(127,64) the gap is 0.2 dB.

D. Evaluating different Parity-Check Matrices

In Fig. 7 we depict the results for applying our method over two different parity-check matrices: the cycle-reduced HCR and the systematic Hsys, both define the codes. These matrices are taken from [44].

By inspecting the number of length-4 cycles in the Tanner graphs induced by these matrices, substantial differences can be noticed, as seen in Fig 6. While HCR induces a low amount of length-4 cycles evenly across all nodes, Hsys induces an imbalanced distribution: the information bits have high amount of length-4 cycles while the parity bits have none.

image

Fig. 6: Distribution of variable node’s normalized number of length-4 cycles for each BCH code. Note that each code was normalized by the following value (from left to right) - 3042, 1022 and 20214.

As is evident from Fig. 7, our method is able to exploit the structure of Hsys, outperforming the HCR based parity-check matrix by 0.75, 0.4 and 0.6 dB for BCH (63,36), (63,45) and (127,64), respectively.

E. Ablation Study

We present an analysis over a number of facets of our permutation embedding and classifier for BCH (63,36), (63,45) and (127,64). We fixed the BER to  10−3and inspected the SNR degradation of various excluded components with respect to our complete model. We present the ablation analysis for our permutation classifier and permutation embedding separately.

Regarding the permutation classifier, we evaluated the complete classifier (described in Section IV-C) against its three partial versions; Omitting the permutation embedding feature vector  q𝜋caused a performance degradation of 1.5 to 2 dB. Note that the permutation  𝜋still affects both  ℓ′and  s′. Excluding  ℓ′or  s′caused a degradation of 1-1.5 and 2.5-3 dB, respectively. In addition, we tried a simpler feature vector h = [q; ℓ′; s′]which led to a performance degradation of 1 to 1.5 dB.

For ablating the permutation embedding component, we compared the complete perm2vec (described in Section IV-B) against its two partial versions: omitting the self-attention mechanism decreased performance by 1.25 to 1.75 dB. Initializing the positional embedding randomly instead of using node embedding also caused a performance degradation of 1.25 to 1.75 dB.

The above results illustrate the advantages of our complete method, and, as observed, the importance of the permutation embedding component. Note that we preserved the total number of parameters after each exclusion for fair comparison.

We presented a self-attention mechanism to improve decoding of linear error correction codes. For every received noisy word, the proposed model selects a suitable permutation out of the code’s PG without actually trying all the permutation

image

Fig. 7: Performance comparison for different parity-check matrices.

based decodings. Our method pre-computes the permutations’ representations, thus allowing for fast and accurate permutation selection at the inference phase. Furthermore, our method is independent of the code length and therefore is considered scalable. We demonstrate the effectiveness of perm2vec by showing significant BER performance improvement compared to the baseline decoding algorithms for various code lengths. Future research should extend our method to polar codes, replacing the embedded Tanner graph variable nodes by embedded factor graph variable nodes, and further investigate other codes, e.g., non-binary Reed–Solomon codes. Another possible research direction would be to train both perm2vec and WBP jointly in an end-to-end manner, or consider an iterative training solution that alternates between training the graph permutation selection model and the WBP algorithm (or any other trainable decoder).

The authors would like to thank the reviewers for their helpful comments that improved the presentation of the paper.

[1] C. E. Shannon, “A mathematical theory of communication,” Bell system technical journal, vol. 27, 1948.

[2] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. Elsevier, 2014.

[3] J. Macwilliams, “Permutation decoding of systematic codes,” The Bell System Technical Journal, vol. 43, no. 1, 1964.

[4] I. Dimnik and Y. Be’ery, “Improved random redundant iterative HDPC decoding,” IEEE Transactions on Communications, vol. 57, pp. 1982– 1985, 2009.

[5] A. Elkelesh, M. Ebada, S. Cammerer, and S. ten Brink, “Belief propagation list decoding of polar codes,” IEEE Communications Letters, vol. 57, pp. 1536–1539, 2018.

[6] O. Simeone, “A very brief introduction to machine learning with applications to communication systems,” IEEE Transactions on Cognitive Communications and Networking, vol. 4, pp. 648–664, 2018.

[7] A. Zappone, M. Di Renzo, and M. Debbah, “Wireless networks design in the era of deep learning: Model-based, ai-based, or both?” IEEE Transactions on Communications, vol. 67, pp. 7331–7376, 2019.

[8] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in Proceedings of the International Conference on Machine Learning (ICML), 2018.

[9] A. Caciularu and D. Burshtein, “Blind channel equalization using variational autoencoders,” in International Conference on Communications Workshops (ICC Workshops), 2018.

[10] X. Lin, I. Sur, S. A. Nastase, A. Divakaran, U. Hasson, and M. R. Amer, “Data-efficient mutual information neural estimator,” arXiv preprint arXiv:1905.03319, 2019.

[11] I. Be’ery, N. Raviv, T. Raviv, and Y. Be’ery, “Active deep decoding of linear codes,” IEEE Transactions on Communications, vol. 68, pp. 728–736, 2020.

[12] O. Simeone, S. Park, and J. Kang, “From learning to meta-learning: Reduced training overhead and complexity for communication systems,” arXiv preprint arXiv:2001.01227, 2020.

[13] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learningbased channel decoding,” in Annual Conference on Information Sciences and Systems (CISS), 2017.

[14] H. Kim, Y. Jiang, R. Rana, S. Kannan, S. Oh, and P. Viswanath, “Communication algorithms via deep learning,” International Conference on Learning Representations (ICLR, 2018.

[15] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using deep learning,” in Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016.

[16] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be’ery, “Deep learning methods for improved decoding of linear codes,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, pp. 119–131, 2018.

[17] N. Doan, S. A. Hashemi, M. Mondelli, and W. J. Gross, “On the decoding of polar codes on permuted factor graphs,” in Global Communications Conference (GLOBECOM), 2018.

[18] M. Lian, F. Carpi, C. Häger, and H. D. Pfister, “Learned beliefpropagation decoding with simple scaling and SNR adaptation,” in IEEE International Symposium on Information Theory (ISIT), 2019.

[19] F. Carpi, C. Häger, M. Martalò, R. Raheli, and H. D. Pfister, “Reinforcement learning for channel coding: Learned bit-flipping decoding,” in Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2019.

[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NIPS), 2017.

[21] M. Kamenev, Y. Kameneva, O. Kurmaev, and A. Maevskiy, “A new permutation decoding method for Reed-Muller codes,” in IEEE International Symposium on Information Theory (ISIT), 2019.

[22] S. A. Hashemi, N. Doan, M. Mondelli, and W. J. Gross, “Decoding Reed-Muller and polar codes by successive factor graph permutations,” in International Symposium on Turbo Codes & Iterative Information Processing (ISTC), 2018.

[23] S. B. Korada, “Polar codes for channel and source coding,” EPFL, Tech. Rep. 4461, 2009.

[24] A. Bennatan, Y. Choukroun, and P. Kisilev, “Deep learning for decoding of linear codes-a syndrome-based approach,” in International Symposium on Information Theory (ISIT), 2018.

[25] T. Richardson and R. Urbanke, Modern coding theory. Cambridge university press, 2008.

[26] A. Dehghan and A. H. Banihashemi, “On the tanner graph cycle distribution of random LDPC, random protograph-based LDPC, and random quasi-cyclic LDPC code ensembles,” IEEE Transactions on Information Theory, vol. 64, pp. 4438–4451, 2018.

[27] K. Guenda, “The permutation groups and the equivalence of cyclic and quasi-cyclic codes,” arXiv preprint arXiv:1002.2456, 2010.

[28] F. J. MacWilliams and N. J. A. Sloane, The theory of error-correcting codes. Elsevier, 1977, vol. 16.

[29] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in International Conference on Knowledge Discovery and Data Mining (KDD), 2016.

[30] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations (ICLR), 2015.

[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pretraining of deep bidirectional transformers for language understanding,” in Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.

[32] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.

[33] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.

[34] M. Joshi, O. Levy, L. Zettlemoyer, and D. Weld, “BERT for coreference resolution: Baselines and analysis,” in Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.

[35] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.

[36] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” arXiv preprint arXiv:2005.12872, 2020.

[37] M. C. Lee, O. Oktay, A. Schuh, M. Schaap, and B. Glocker, “Image-and-spatial transformer networks for structure-guided image registration,” in International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2019.

[38] O. Barkan, A. Caciularu, O. Katz, and N. Koenigstein, “Attentive item2vec: Neural attentive user representations,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

[39] O. Barkan, Y. Fuchs, A. Caciularu, and N. Koenigstein, “Explainable recommendations via attentive multi-persona collaborative filtering,” in ACM Conference on Recommender Systems (RecSys), 2020.

[40] O. Barkan, N. Razin, I. Malkiel, O. Katz, A. Caciularu, and N. Koenigstein, “Scalable attentive sentence-pair modeling via distilled sentence embedding,” AAAI, 2020.

[41] E. Nachmani, E. Marciano, D. Burshtein, and Y. Be’ery, “RNN decoding of linear block codes,” arXiv preprint arXiv:1702.07560, 2017.

[42] E. Nachmani, Y. Bachar, E. Marciano, D. Burshtein, and Y. Be’ery, “Near maximum likelihood decoding with deep learning,” arXiv preprint arXiv:1801.02726, 2018.

[43] S. E. Alnawayseh and P. Loskot, “Ordered statistics-based list decoding techniques for linear binary block codes,” EURASIP Journal on Wireless Communications and Networking, vol. 2012, no. 1, p. 314, 2012.

[44] M. Helmling, S. Scholl, F. Gensheimer, T. Dietz, K. Kraft, S. Ruzika, and N. Wehn, “Database of Channel Codes and ML Simulation Results,” www.uni-kl.de/channel-codes, 2019.


Designed for Accessibility and to further Open Science