b

DiscoverSearch
About
My stuff
Reinforcement Learning for Nested Polar Code Construction
2019·arXiv
Abstract
Abstract

In this paper, we model nested polar code construction as a Markov decision process (MDP), and tackle it with advanced reinforcement learning (RL) techniques. First, an MDP environment with state, action, and reward is defined in the context of polar coding. Specifically, a state represents the construction of an (N, K) polar code, an action specifies its reduction to an  (N, K − 1)subcode, and reward is the decoding performance. A neural network architecture consisting of both policy and value networks is proposed to generate actions based on the observed states, aiming at maximizing the overall rewards. A loss function is defined to trade off between exploitation and exploration. To further improve learning efficiency and quality, an “integrated learning” paradigm is proposed. It first employs a genetic algorithm to generate a population of (sub-)optimal polar codes for each (N, K), and then uses them as prior knowledge to refine the policy in RL. Such a paradigm is shown to accelerate the training process, and converge at better performances. Simulation results show that the proposed learning-based polar constructions achieve comparable, or even better, performances than the state of the art under successive cancellation list (SCL) decoders. Last but not least, this is achieved without exploiting any expert knowledge from polar coding theory in the learning algorithms.

Index Terms—Polar codes, Nested polar code construction, Markov decision process, Reinforcement learning, Integrated learning

In communication systems, the capacity of an AWGN channel is defined in theory [1]. Classic code construction methods are built upon coding theory, in which code performance can be theoretically modeled in terms of various types of code properties, e.g. minimum distance, decoding threshold, reliability, etc. However, it seems insufficient for us to rely on only these classic coding theory metrics in facing of such practical concerns and application-specific requirements as realistic channel types, decoding latency and complexity and so on.

Recently, artificial intelligence (AI) techniques have been applied to physical layer design. AI techniques can be a tool to design or optimize error correction codes [2], while leaving their legacy encoding and decoding architectures and implementations unchanged. Within a “constructor-evaluator” framework [2], AI algorithms such as policy gradient, genetic algorithm, and actor critic, are capable of constructing linear block codes and polar codes with as good performances as the state of the art. In [3], RL and Monte Carlo tree search (MCTS) are combined to guide edge growth in LDPC code construction. In [4], [5], genetic algorithms are used to design polar codes and LDPC codes. The main difference from [2] is that coding expert knowledge is utilized during the initialization to speed up the learning process.

In this paper, our motivation is to investigate the feasibility of using AI technologies to explore the design space for wireless systems. Channel code, especially polar code, is a good example for this endeavor. We propose novel RL algorithms for designing nested polar codes [6]. Because nested polar code construction (sequential information subchannel selection) is inherently modeled as a Markov decision process (MDP), and RL algorithms can be applied to approach the optimum. To improve training efficiency and code performance, we propose an integrated learning paradigm and various parameter optimization techniques.

The structure of this paper is as following. Section II introduces the preliminaries about polar code construction and nested polar codes. Section III models the nested polar code construction as an MDP task, and solves it with several advanced reinforcement learning algorithms. The integrated learning paradigm is introduced in Section IV. All proposed algorithms are evaluated in Section V in terms of sample efficiency and code performance. Conclusions are given in section VI.

A. Polar code construction

Polar codes [7] are the first class of capacity-achieving codes (under successive cancellation (SC) decoding). For polar codes, physical channels are synthesized to polarized subchannels, with the most reliable ones selected to carry information bits. As a result, an (N, K) polar code is defined by the K most reliable subchannel indices, namely information set I. The remaining  (N −K)subchannel indices are defined as frozen set F. As code length N increases, subchannels polarize to either purely noiseless or completely noisy, where the fraction of noiseless subchannels approaches channel capacity [7]. For binary erasure channel (BEC), subchannel reliability can be efficiently calculated by Bhattacharyya parameter. For general binary-input memoryless channels, density evolution (DE) was applied to estimate subchannel reliability [8], [9], and improved in [10] and analyzed in [11] in terms of complexity. For AWGN channels, Gaussian approximation to density evolution (DE/GA) was proposed [12] to further reduce complexity with negligible performance loss.

To improve the performance of polar codes at finite length, enhanced decoding algorithms are proposed [13], [14]. Among them, SC list (SCL) decoding achieves the best tradeoff among decoding latency, complexity and performance. However, to our best knowledge, for polar codes with SCL-based decoders, theoretically optimal code construction is still an open problem. Existing constructions either directly adopt DE/GA, which are designed for SC rather than SCL, or apply genetic algorithms for SCL decodings [2], [4].

B. Nested polar codes

In practical communication systems where code rate and length adaption is required, efficient code description is mandatory. For example, 5G enhanced mobile broadband (eMBB) [15] supports thousands of polar codes with different (N, K) combinations. It is impossible to store all code configurations separately, due to large overhead. It is much more convenient for description and implementation to impose a nested property [6], so that all polar codes of the same mother code length can be derived from a single nested sequence. Specifically, denote  FN,Kas the frozen set of an (N, K) polar code. FN,N−1, FN,N−2, · · · , FN,0can be constructed sequentially, on condition that  FN,N−1 ⊂ FN,N−2 ⊂ · · · ⊂ FN,0. As seen, a single nested sequence can be obtained as {FN,N−1,setdiff(FN,N−2, FN,N−1), · · · ,setdiff(FN,0, FN,1)}.

image

Fig. 1: Graphical model of nested polar code construction and Markov Decision Process. For nested polar code construction, c denotes a polar code, b denotes a subchannel, e denotes error correction performance with respect to (w.r.t.) c. For MDP, s denotes a state, a denotes an action, and r denotes a reward value w.r.t. the state s.

Nested polar codes are adopted by 5G in the form of a reliability sequence of length  Nmax = 1024[15]. To

construct an (N, K) polar code from the length-Nmaxnested sequence (N = 2n ≤ Nmax), 1) First, a sequence  SeqNof length N is extracted from the length-Nmaxsequence (by taking all indices {i : i ∈ SeqNmax, i < N}while keeping the ordering). 2) Second, the last K entries of  SeqNare selected as the information set.

In this section, we show that nested polar code construction is actually a Markov decision process (MDP) that can be tackled by reinforcement learning. We further discuss some applicable learning algorithms.

A. Constructing nested polar code with MDP

Nested polar code construction can be modeled as an MDP for the following reasons:

According to Markov property of nested polar code construction in Fig. 1, the construction of  (N, N−K−1)polar code  cK+1and its performance  eK+1depend only on that of  (N, N − K)polar code  cKand a subchannel selection  bK;

The goal is to optimize all (N, K) polar codes for K = 1, 2, · · · , N − 1through maximizing an overall performance metric �K eK. To explicitly map the nested polar code construction into an MDP task, we define a base environment (S, A, R):

A state is denoted by a length-N binary vector  sNK ∈S = {0, 1}N, whose support set is  FN,N−K. The initial state  sN0is an all-zero vector corresponding to empty set (FN,N). An action is denoted by an integer  aNK ∈ A ={0, 1, ..., N − 1}, such that  aNK /∈ FN,N−Kand  aNK ∪FN,N−K = FN,N−K−1. The reward value of state  sNKis  rNK ∈ R, representing the performance of the polar code defined by  FN,N−K. The state transfer process is deterministic, i.e., given  sNKand aNK, sNK+1can be determined. The maximum length of an episode is N. A trajectory of the base environment,  (sN0,  aN0, sN1,  aN1,  · · · , sNN−1, aNN−1, sNN), corresponds to the nested polar code construction (ordered sequence)  {aN0,  aN1,  · · ·, aNN−1}. In the following, the superscripts of  sNK, aNKand  rNKare omitted with some abuse of notation. Following the “constructor-evaluator” framework [2], we propose to directly evaluate the rewards through decoding performance. Monte-Carlo (MC) simulations are conducted to output a block error rate (BLER) performance for each code construction. The evaluator implements SCL decoding algorithms, which generate a list of L codewords. We name two types of SCL decoders based on final output selection:

SCL-PM: select the first codeword, i.e, the most likely one with the smallest PM;

SCL-Genie: select the correct codeword, as long as it is among the L surviving ones.

Sufficient decoding error events are counted to obtain an accurate BLER estimation. Then, the reward value is defined as  r ≜ − log10 BLER. 1

B. Reinforcement learning

Nested code construction is actually to search an optimal sequence in a large solution space. RL would help approach the optimum, dragged by a reward. The devised reward, through one real value metric, should represent the performances of all component codes.

Strictly speaking, an RL agent interacts with the MDP environment over discrete timesteps. At each timestep t, the agent observes a state  st, chooses an action  ataccording to its policy  π(at|st)and obtains a reward  rtfrom the environment. The goal of this agent is to optimize its policy in order to maximize the discounted return  Rt = �∞i=0 γirt+iat each timestep. Here discount factor  γ ∈ [0, 1)is introduced to trade off the contribution of immediate and long term rewards to return value.

For the nested polar code construction task, the state space is  2N, the action space is N and the solution space, i.e. trajectory space, is N!. Concerning the large solution space, it is necessary to have sample efficient RL algorithms. Sample efficiency is defined by number of samples used to solve the MDP task, where an MDP sample is a state-action-reward (s, a, r) tuple. In literature, sample efficient RL algorithms include advantage actor critic (A2C), proximal policy optimization (PPO) [16] and actor critic using Kroneckerfactored trust region (ACKTR) [17], etc.

We apply PPO [16] as it is by far the most advanced model-free algorithm. The PPO is an extension of A2C, where a Kullback-Leibler (KL) divergence constraint is imposed between the updated policy and the old policy, i.e. a trust region constraint [18].

For the PPO, the policy loss function is defined,

image

where ˆA(st, at) = R(st, at) − V (st)is the estimate of advantage function for taking action  atat state  st; rt(θ) =

πθold(at|st)is the probability ratio between the updated policy πθand the old policy  πθoldfor taking action  atat state  st; πθ(at|st)is the policy function parameterized by  θ; ǫis a clipping ratio to constrain the probability ratio  rt(θ).

The value loss function is defined,

image

For the advantage estimation ˆA(st, at), a general advantage estimation method (GAE) [19] implements an exponential average among advantage estimations of different steps to trade off between the estimation bias and variance,

image

where  λis the exponential moving average parameter.

A policy function entropy regularization, defined in (4), can be considered in policy loss function to trade off between exploration and exploitation.

image

In the section, we propose an integrated learning method for nested polar code construction to improve the sample efficiency and code performance.

For reinforcement learning algorithms, policy function is initialized to explore all possible MDP trajectories with equal probability. However, for most trajectories in the trajectory space, the accumulated rewards are far worse than optimal one(s). Given prior knowledge about the distributions of actions with large rewards, the policy function can be pretrained to bias the exploration towards trajectories with larger accumulated rewards. Depending on the prior knowledge, this pretraining can significantly accelerate the learning process [20][22].

In the context of polar code construction, we may rely on sub-optimal expert knowledge (e.g., DE/GA constructions) for pretraining, where direct state-action (s, a) pairs (demonstrations) are available. However, genetic algorithm is the best choice to generate a large population of (sub-)optimal code constructions, corresponding to the distribution of states with large rewards. As the genetic algorithm converges, its population already contains code constructions with the best performances. Moreover, the genetic algorithm in [2] does not require any expert knowledge, which means the proposed method also learns everything from scratch.

An integrated learning is proposed in Alg. 1. Firstly, the polar code constructions are generated by genetic algorithm. These constructions are used to produce pretraining examples. Policy function is then pretrained in supervised learning manner. Nested polar code constructions are learned through reinforcement learning, as in section III-B, with the pretrained policy function.

A. Multi-stage genetic algorithm

For each (N, K) pair, we apply the genetic algorithm in [2] to generate a population of (sub-)optimal polar codes. In the original version [2], BLER performance is evaluated at a fixed SNR, where existing code constructions achieve BLER ≈ (10−2 ∼ 10−3). This setting causes a “slow

image

start” problem during the beginning phase when most code constructions result in BLER = 1, which means equally bad. This would confuse the genetic algorithm as it could not distinguish good code constructions from bad ones, and loses the direction of evolution. As a result, the genetic algorithm either stucks at this phase, or converges very slowly.

In this work, we propose a novel multi-stage genetic algorithm based on [2] to improve learning efficiency. The idea is simple, i.e., to adaptively set the evaluating SNR such that the BLER performances of different code constructions can be differentiated. Specifically, the evaluating SNR decreases in a multi-stage manner, by tracking the working SNR (at BLER ≈ 10−2 ∼ 10−3) of the best code construction in the population. The algorithm is detailed in Alg. 2. It enables fast convergence especially for longer codes. As shown in Fig. 2, the construction of a (N = 1024, K = 512) code is learned to perform well under SCL-Genie with list size L = 8. In contrast, a single-stage genetic algorithm fails to converge within a reasonable time period.

image

Fig. 2: The convergence of multi-stage genetic algorithm for learning a (N = 1024, K = 512) code.

B. Pretraining example production

In terms of MDP, the code constructions generated by genetic algorithm represent good states with large reward values. The remaining problem is how to design state-action (s, a) pairs from these constructions. One characteristic of this MDP is that, the state is a collection (set) of history actions taken to reach it. Meanwhile, the order of actions

image

is neither distinguishable from the state, nor relevant to reach the state. Therefore, the state-action (s, a) pairs can be produced based on two intuitions, 1) Given a current state, if the agent can take one action to reach a good state, then this can be a valid state-action pair; 2) If the current state is a good state, a potentially good choice of action can be the ones that has not been taken to reach the current state, while is recorded by some other states with close information length. The process to produce state-action (s, a) pairs from good states is described in Alg. 3.

In this section, we elaborate the model of the reinforcement learning algorithms, and evaluate the sample efficiency of various schemes.

A. Model

For the reinforcement learning algorithms, we use neural networks to represent the policy and value function. The same neural network architecture, shown in Fig. 3, is deployed for all of the nested polar construction tasks. For an input state s, a feed-forward network was used for feature extraction, with two fulled connected layers, with 2N tanh units per

image

layer. This feature layer was shared by policy and value function. For the output layer, the policy function used a linear layer to screen out previously selected subchannels (e.g., by subtracting a larger value from the corresponding entries), and followed by a softmax nonlinearity to generate probability mass function (PMF). The value function used a linear layer to output an estimated value for state s.

image

Fig. 3: Neural network architecture.

One synthesized loss function is used for simultaneous

training of policy and value function,

image

where  LossAand  LossCare the loss functions for actor and critic, and  HAis the entropy for policy function, defined in section III-B. The policy function entropy is subtracted in loss function to encourage exploitation.

We define an action model and a training model, based on the agent’s policy and value function. The action model generates an action and a value estimation for an observed state. For any legal state, the policy function calculates the probability mass function (PMF), based on which an action is randomly sampled. The value function estimates the value for the state. The training model trains the policy and value function.

B. Speed optimizations

Observing that MC simulations are time-consuming, we propose several optimizations to further improve sample efficiency.

1) early termination: A “surrogate environment” is de-fined to early terminate an episode at an ill-defined construction whose current reward is already bad. At the beginning of the learning task, the entropy of agent’s policy is large, therefore explorations are mostly random. From the agent’s perspective, when a state  stis encountered with a reward rt = − log10 BLER ≈ 0(BLER ≈ 1), we observed that the future rewards  rt′(t′ > t)are likely to be close to 0 except for the ones with  t′approaching N. Since the agent can barely improve the policy when the reward values approximate 0, we can define a surrogate environment. From the environment’s perspective, this surrogate environment behaves exactly like the aforementioned base environment, except that when a state with reward value  r < retthris reached, it returns the reward value along with a flag indicating the termination of current episode. This environment is named “base environment with early termination” and is a default option unless otherwise stated.

2) memoization: By definition, a state-reward pair corresponds to the performance of a specific construction. Once explored, it remains unchanged in this MDP task. The most frequently encountered state-reward pairs are memoized for future retrieval. This is shown to effectively reduce the MC simulation burden. In addition, evaluating a code construction with a larger reward value, i.e. small BLER, requires longer MC simulation time. This is because more code blocks are simulated to collect sufficient errors. Therefore, memoization is employed to collect state-reward (s, r) pairs if the reward value  r > rrecthr, such that future MC simulation is skipped if the same state (i.e., code construction) has been evaluated before.

3) vectorized environment: A vectorized environment is defined to improve MC simulation efficiency, which is a collection of  nenvparallel base environments. To guarantee independency among the environments, their random number generator seeds are set to different values. For the vectorized

TABLE I: Default parameters setting

image

environment, action model generates  nenvactions and  nenvvalue estimations based on the observed  nenvstates from each base environment. The training model trains the policy and value function, based on a batch of state-action-return-value (s, a, R, V ) tuples. The batch size  nbatch = nenv·nstep, where  nstepis the timestep number for empirical return value estimation.

C. Reinforcement learning

We conduct a series of experiments under SCL-Genie decoding to investigate the following questions:

1) Which reinforcement learning algorithm is most sample efficient?

2) How to select hyper-parameters to trade off between sample efficiency and convergence performance?

The default parameters are listed in Table I.

Fig. 4 shows the episode rewards of A2C, ACKTR and PPO for 100E3 timesteps. The number of timesteps for return estimation was optimized for each algorithm. PPO outperformed A2C and ACKTR in terms of sample efficiency by a significant margin, and was therefore adopted in the following experiments.

image

Fig. 4: Sample efficiency comparison among A2C, ACKTR, and PPO. For each algorithm, episode rewards from 16 different base environments are plotted.

Fig. 5 shows the episode rewards of base environment with and without early termination. At the beginning of learning when episode reward is below 100, base environment with early termination showed much better sample efficiency, since it saves the MC simulations of a number of trivial samples (with reward values approximating 0). Afterwards, the episode rewards for both base environments showed similar growing speed. This proves that the early termination is effective and has little impact on the learning task except by skipping trivial sample simulations.

image

Fig. 5: Sample efficiency comparison for base environment with and without early termination.

Fig. 6 shows the episode rewards for the amount of policy entropy evolved in loss function. Similar sample efficiency is observed for entropy weight  βe ≤ 0.01. Nevertheless, entropy weight  βe = 0shows slightly better convergence performance.

image

Fig. 6: Sample efficiency comparison in terms of entropy weight.

Fig. 7 shows the episode rewards for discount factor selection. For a smaller discount factor, the sample efficiency is increased since the current return would be affected by shorter future actions. While for this learning task, the convergence performance is not compromised.

In this subsection, we demonstrate that PPO is sample efficient. Early termination in the base environment saved MC simulations for trivial samples. For entropy weight of 0

image

Fig. 7: Sample efficiency comparison in terms of discount factor  γ.

and small discount factor, the sample efficiency is increased without compromising the learning performance.

D. Integrated learning

In this subsection, we evaluate the integrated learning to show its improved sample efficiency.

We first obtained a population of polar codes for each (N, K) pair by genetic algorithm. Then we applied Alg. 3 to generate examples of state-action (s, a) pair. The same policy network architecture is used in integrated learning as in Fig. 3. The policy network was trained on randomly sampled examples with stochastic gradient descent to minimize the training loss function,

image

where  LossApreis the cross entropy between policy output and the (one-hot) action label,  HApreis the entropy value of policy function, with entropy weight  βepre = 1.0. After 20 epoches of training, the policy function is saved for reinforcement learning as described in section III-B.

Fig. 8 shows the episode rewards for 100E3 timesteps for reinforcement learning and integrated learning. It is shown that integrated learning has better sample efficiency as well as larger episode reward values.

E. BLER performance

For nested polar code construction with code length of 256, the error correction performance of the learned codes are compared with those constructed by DE/GA. It should be noted that the comparison is unfair with respect to description and implementation complexity, because the constructions by DE/GA are not necessarily nested.

We consider two MDP tasks with different decoders:

SCL-Genie decoding under AWGN channel

SCL-PM decoding under AWGN channel The same learning method (parameters) are used for both MDP tasks.

image

Fig. 8: Sample efficiency comparison between reinforcement learning and integrated learning.

For SCL-Genie decoding under AWGN channel, nested polar code constructions are learned by reinforcement learning and integrated learning with 1E6 training timesteps. Fig. 9 shows the relative EsN0 value (at BLER of  10−2) for the three constructions. The nested polar constructions learned by both reinforcement learning and integrated learning outperform the case-by-case DE/GA constructions for a majority of cases. Integrated learning exhibits even better performance, with a maximum gain over DE/GA approaching 0.3dB.

image

Fig. 9: Relative performance between polar codes constructed by reinforcement learning and DE/GA for SCL-Genie decoding under AWGN channel.

For SCL-PM decoding under AWGN channel, nested polar code constructions are learned with 100E3 training timesteps. Fig. 10 shows that the learned nested polar constructions outperform the case-by-case DE/GA constructions for almost all information length. The maximum performance gain achieves as large as 1.2dB.

In this paper, the “constructor-evaluator” framework in [2] is adopted to construct polar codes. In particular, we show that constructing nested polar codes can be viewed as a Markov decision process. Thus, reinforcement learning techniques such as A2C and its latest improvements (e.g., PPO and ACKTR) are employed to iteratively optimize the

image

Fig. 10: Relative performance between polar codes constructed by reinforcement learning and DE/GA for SCL-PM decoding under AWGN channel.

“code construction” policy without expert intervention. An MDP environment is set up using the BLER performance as feedback to guide the learning process, which is implemented by neural network based policy and value functions. To facilitate faster and better convergence, a multi-stage genetic algorithm is integrated in the RL algorithms to provide prior knowledge about (sub-)optimal code constructions. We carry out extensive experiments to compare the learning process under various settings. The polar code constructions for both SCL-PM and SCL-Genie decoders are obtained, which exhibit superior performance over classic constructions.

[1] C. E. Shannon, “A mathematical theory of communication”, Bell System Technical Journal, vol. 27, no.34, pp. 379–423, Jul. 1948.

[2] L. Huang, H. Zhang, R. Li, Y. Ge, J. Wang , “AI Coding: Learning to Construct Error Correction Codes”, arXiv:1901.05719 , Jan. 2019.

[3] M. Zhang, Q. Huang, S. Wang, Z. Wang, “Construction of LDPC Codes Based on Deep Reinforcement Learning”, 2018 10th International Conference on Wireless Communications and Signal Processing, Oct 2018.

[4] A. Elkelesh, M. Ebada, S. Cammerer, S. Brink , “Decoder-tailored Polar Code Design Using the Genetic Algorithm”, arXiv:1901.10464 , Jan. 2019.

[5] A. Elkelesh, M. Ebada, S. Cammerer, S. Brink , “Decoder-in-the-Loop: Genetic Optimization-based LDPC Code Design”, arXiv:1903.03128 , Mar. 2019.

[6] S. Korada and R. Urbanke, “Polar codes are optimal for lossy source coding”, Transactions on Information Theory, vol. 56, no. 4, pp. 1751-1768, Mar. 2010.

[7] E. Arikan, “Channel polarization: A method for constructing capacity- achieving codes for symmetric binary-input memoryless channels”, IEEE Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.

[8] R. Mori and T. Tanaka, ”Performance of polar codes with the construc- tion using density evolution”, IEEE Communications Letters, vol. 13, no. 7, pp. 519–521, July 2009.

[9] R. Mori and T. Tanaka, ”Performance and construction of polar codes on symmetric binary-input memoryless channels”, IEEE International Symposium on Information Theory, June 2009.

[10] I. Tal and A. Vardy, “How to construct polar codes”, IEEE Trans. Inf. Theory, vol. 59, no. 10, pp. 6562–6582, July 2013.

[11] R. Pedarsani, S. Hassani, I. Tal and E. Telatar, “On the construction of polar codes”, IEEE International Symposium on Information Theory, July 2011.

[12] P. Trifonov, “Efficient design and decoding of polar codes”, IEEE Transactions on Communications vol. 60, no. 11, pp. 3221–3227, Nov. 2012.

[13] I. Tal and A. Vardy, “List decoding of polar codes”, IEEE International Symposium on Information Theory Proceedings pp. 1–5, 2011.

[14] K. Chen, K. Niu and J. Lin, “Improved Successive Cancellation Decoding of Polar Codes”, IEEE Transactions on Communications vol. 61, no. 8, pp. 3100-3107, August 2013.

[15] 3GPP, “NR; Multiplexing and channel coding”, 3GPP TS 38.212, 15.5.0, Mar. 2019.

[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, “Proximal Policy Optimization Algorithms”, arXiv:1707.06347 , Jul. 2017.

[17] Y. Wu, E. Mansimov, S. Liao, R. Grosse, J. Ba, “Scalable trust- region method for deep reinforcement learning using Kronecker-factored approximation”, arXiv:1708.05144 , Aug. 2017.

[18] J. Schulman, S. Levine, P. Moritz, M. Jordan, P. Abbeel, “Trust Region Policy Optimization”, arXiv:1502.05477 , Feb. 2015.

[19] J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel, “High- Dimensional Continuous Control Using Generalized Advantage Estimation”, arXiv:1506.02438 , Jun. 2015.

[20] D. Silver, A. Huang, C. J. Maddison, et al., “Mastering the game of Go with deep neural networks and tree search”, Nature vol. 529, pp. 484, 2016.

[21] X. Zhang, H. Ma, “Pretraining Deep Actor-Critic Reinforcement Learning Algorithms With Expert Demonstrations”, arXiv:1801.10459 , Jan. 2018.

[22] Y. Gao, H. Xu, J. Lin, F. Yu, S. Levine, T. Darrell, “Reinforcement Learning from Imperfect Demonstrations”, arXiv:1802.05313 , Feb. 2018.


Designed for Accessibility and to further Open Science