Control Regularization for Reduced Variance Reinforcement Learning

2019·Arxiv

Abstract

Dealing with high variance is a significant challenge in model-free reinforcement learning (RL). Existing methods are unreliable, exhibiting high variance in performance from run to run using different initializations/seeds. Focusing on problems arising in continuous control, we propose a functional regularization approach to augmenting model-free RL. In particular, we regularize the behavior of the deep policy to be similar to a policy prior, i.e., we regularize in function space. We show that functional regularization yields a bias-variance trade-off, and propose an adaptive tuning strategy to optimize this trade-off. When the policy prior has control-theoretic stability guarantees, we further show that this regularization approximately preserves those stability guarantees throughout learning. We validate our approach empirically on a range of settings, and demonstrate significantly reduced variance, guaranteed dynamic stability, and more efficient learning than deep RL alone.

1. Introduction

Reinforcement learning (RL) focuses on finding an agent’s policy (i.e. controller) that maximizes long-term accumulated reward. This is done by the agent repeatedly observing its state, taking an action (according to a policy), and receiving a reward. Over time the agent modifies its policy to maximize its long-term reward. Amongst other applications, this method has been successfully applied to control tasks (Lillicrap et al., 2016; Schulman et al., 2015; Ghosh et al., 2018), learning to stabilize complex robots.

In this paper, we focus particularly on policy gradient (PG) RL algorithms, which have become popular in solving continuous control tasks (Duan et al., 2016). Since PG algorithms focus on maximizing the long-term reward through trial and error, they can learn to control complex tasks without a prior model of the system. This comes at the cost of slow, high-variance learning: complex tasks can take millions of iterations to learn. More importantly, variation between learning runs can be very high, meaning some runs of an RL algorithm succeed while others fail depending on randomness in initialization and sampling. Several studies have noted this high variability in learning as a significant hurdle for the application of RL, since learning becomes unreliable (Henderson et al., 2018; Arulkumaran et al., 2017; Recht, 2019). All policy gradient algorithms face the same issue.

We can alleviate the aforementioned issues by introducing a control-theoretic prior into the learning process using functional regularization. Theories and procedures exist to design stable controllers for the vast majority of real-world physical systems (from humanoid robots to robotic grasping to smart power grids). However, conventional controllers for complex systems can be highly suboptimal and/or require great effort in system modeling and controller design. It would be ideal then to leverage simple, suboptimal controllers in RL to reliably learn high-performance policies.

In this work, we propose a policy gradient algorithm, CORE-RL (COntrol REgularized Reinforcement Learning), that utilizes a functional regularizer around a (typically suboptimal) control prior, i.e. a controller designed from any prior knowledge of the system. We show that this approach significantly lowers variance in the policy updates, and leads to higher performance policies when compared to both the baseline RL algorithm and the control prior. In addition, we prove that our policy can maintain control-theoretic stability guarantees throughout the learning process. Finally, we empirically validate our approach using three benchmarks: a car-following task with real driving data, the TORCS racecar simulator, and a simulated cartpole problem. In summary, the main contributions of this paper are as follows:

• We introduce functional regularization using a control prior, and prove that this significantly reduces variance during learning at the cost of potentially increasing bias.

• We provide control-theoretic stability guarantees throughout learning when utilizing a robust control prior.

• We validate experimentally that our algorithm, CORE-RL, exhibits reliably higher performance than the base RL algorithm (and control prior), achieves significant variance reduction in the learning process, and maintains stability throughout learning for stabilization tasks.

2. Related Work

Significant previous research has examined variance reduction and bias in policy gradient RL. It has been shown that an unbiased estimate of the policy gradient can be obtained from sample trajectories (Williams, 1992; Sutton et al., 1999; Baxter & Bartlett, 2000), though these estimates exhibit extremely high variance. This variance can be reduced without introducing bias by subtracting a baseline from the reward function in the policy gradient (Weaver & Tao, 2001; Greensmith et al., 2004). Several works have studied the optimal baseline for variance reduction, often using a critic structure to estimate a value function or advantage function for the baseline (Zhao et al., 2012; Silver et al., 2014; Schulman et al., 2016; Wu et al., 2018). Other works have examined variance reduction in the value function using temporal regularization or regularization directly on the sampled gradient variance (Zhao et al., 2015; Thodoroff et al., 2018). However, even with these tools, variance still remains problematically high in reinforcement learning (Islam et al., 2017; Henderson et al., 2018). Our work aims to achieve significant further variance reduction directly on the policy using control-based functional regularization.

Recently, there has been increased interest in functional regularization of deep neural networks, both in reinforcement learning and other domains. Work by Le et al. (2016) has utilized functional regularization to guarantee smoothness of learned functions, and Benjamin et al. (2018) studied properties of functional regularization to limit function distances, though they relied on pointwise sampling from the functions which can lead to high regularizer variance. In terms of utilizing control priors, work by Johannink et al. (2018) adds a control prior during learning, and empirically demonstrates improved performance. Researchers in Farshidian et al. (2014); Nagabandi et al. (2017) used model-based priors to produce a good initialization for their RL algorithm, but did not use regularization during learning.

Another thread of related work is that of safe RL. Several works on model-based RL have looked at constrained learning such that stability is always guaranteed using Lyapunov-based methods (Perkins & Barto, 2003; Chow et al., 2018; Berkenkamp et al., 2017). However, these approaches do not address reward maximization or they overly constrain exploration. On the other hand, work by Achiam et al. (2017) has incorporated constraints (such as stability) into the learning objective, though model-free methods only guarantee approximate constraint satisfaction after a learning period, not during learning (García & Fernández, 2015). Our work proves stability properties throughout learning by taking advantage of the robustness of control-theoretic priors.

3. Problem Formulation

Consider an infinite-horizon discounted Markov decision process (MDP) with deterministic dynamics defined by the tuple $(S, A, f, r, \gamma)$, where $S$ is a set of states, $A$ is a continuous and convex action space, and $f : S \times A \to S$ describes the system dynamics, which is unknown to the learning agent. The evolution of the system is given by the following dynamical system and its continuous-time analogue,

$$s_{t+1} = f(s_t, a_t) = f^{known}(s_t, a_t) + f^{unknown}(s_t, a_t),$$
$$\dot{s} = f_c(s, a) = f_c^{known}(s, a) + f_c^{unknown}(s, a), \qquad (1)$$

where $f^{known}$ captures the known dynamics, $f^{unknown}$ represents the unknowns, $\dot{s}$ denotes the continuous time-derivative of the state $s$, and $f_c(s, a)$ denotes the continuous-time analogue of the discrete-time dynamics $f(s_t, a_t)$. A control prior can typically be designed from the known part of the system model, $f^{known}$.

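Consider a stochastic policy $\pi_\theta(a|s) : S \times A \to [0, 1]$ parameterized by $\theta$. RL aims to find the policy (i.e. parameters, $\theta$) that maximizes the expected accumulated reward $J(\theta)$:

$$J(\theta) = E_{\tau \sim \pi_\theta}\left[\sum_{t} \gamma^t r(s_t, a_t)\right]. \qquad (2)$$

Here $\tau \sim \pi_\theta$ is a trajectory whose actions and states are sampled from the policy distribution $\pi_\theta(a|s)$ and the environmental dynamics (1), respectively. The function $r(s, a) : S \times A \to \mathbb{R}$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor.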

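This work focuses on policy gradient RL methods, which estimate the gradient of the expected return $J(\theta)$ with respect to the policy based on sampled trajectories. We can estimate the gradient, $\nabla_\theta J$, as follows (Sutton et al., 1999),

$$\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau)\, Q^{\pi_\theta}(\tau)\right] \approx \sum_{i=1}^{N} \sum_{t=1}^{T} \left[\nabla_\theta \log \pi_\theta(s_{i,t}, a_{i,t})\, Q^{\pi_\theta}(s_{i,t}, a_{i,t})\right], \qquad (3)$$

where $Q^{\pi_\theta}(s, a) = E_{\tau \sim \pi_\theta}\left[\sum_{k=0}^{\infty} \gamma^k r(s_{t+k}, a_{t+k}) \mid s_t = s, a_t = a\right]$. With a good Q-function estimate, the term (3) is a low-bias estimator of the policy gradient, and utilizes the variance-reduction technique of subtracting a baseline from the reward. However, the resulting policy gradient still has very high variance with respect to $\theta$, because the expectation in term (3) must be estimated using a finite set of sampled trajectories. This high variance in the policy gradient, $var_\theta[\nabla_\theta J(\theta_k)]$, translates to high variance in the updated policy, $var_\theta[\pi_{\theta_{k+1}}]$, as seen below,

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta_k),$$
$$\pi_{\theta_{k+1}} = \pi_{\theta_k} + \alpha \frac{d\pi_{\theta_k}}{d\theta} \nabla_\theta J(\theta_k) + O(\Delta\theta^2),$$
$$var_\theta[\pi_{\theta_{k+1}}] \approx \alpha^2 \frac{d\pi_{\theta_k}}{d\theta}\, var_\theta[\nabla_\theta J(\theta_k)]\, \frac{d\pi_{\theta_k}}{d\theta}^T \quad \text{for } \alpha \ll 1, \qquad (4)$$

where $\alpha$ is the user-defined learning rate. It is important to note that the variance we are concerned about is with respect to the parameters $\theta$, not the noise in the exploration process.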
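The finite-sample variance described above can be illustrated with a minimal sketch. It assumes a hypothetical one-step problem (not from the paper): a Gaussian policy with mean `theta` and unit standard deviation, and a quadratic reward. The functions `reward`, `pg_estimate`, and `estimator_variance` are illustrative names. The estimator is the score-function (likelihood-ratio) gradient, optionally with a constant baseline subtracted from the reward.

```python
import random
import statistics


def reward(a):
    # Hypothetical reward: highest when the action equals 2.
    return -(a - 2.0) ** 2


def pg_estimate(theta, n_samples, rng, baseline=0.0):
    """Score-function gradient estimate for a Gaussian policy N(theta, 1).

    For this policy, d/dtheta log pi(a) = (a - theta).
    """
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(theta, 1.0)
        total += (a - theta) * (reward(a) - baseline)
    return total / n_samples


def estimator_variance(theta, n_samples, n_runs, baseline=0.0, seed=0):
    """Variance of the gradient estimate across independent runs."""
    rng = random.Random(seed)
    estimates = [pg_estimate(theta, n_samples, rng, baseline)
                 for _ in range(n_runs)]
    return statistics.pvariance(estimates)


if __name__ == "__main__":
    # At theta = 0 the expected reward is -5, used here as the baseline.
    print("no baseline  :", estimator_variance(0.0, 10, 2000))
    print("with baseline:", estimator_variance(0.0, 10, 2000, baseline=-5.0))
    print("more samples :", estimator_variance(0.0, 100, 2000))
```

In this toy setting, both the baseline and a larger batch lower the spread of the gradient estimate, but neither removes it, which is the residual variance the paper targets.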

To illustrate the variance issue, Fig. 1 shows the results of 100 separate learning runs using direct policy search on the OpenAI gym task Humanoid-v1 (Recht, 2019). Though high rewards are often achieved, huge variance arises from random initializations and seeds. In this paper, we show that introducing a control prior reduces learning variability, improves learning efficiency, and can provide control-theoretic stability guarantees during learning.

Figure 1. Performance on humanoid walking task from 100 training runs with different initializations. Results from (Recht, 2019).

4. Control Regularization

The policy gradient allows us to optimize the objective from sampled trajectories, but it does not utilize any prior model. However, in many cases we have enough system information to propose at least a crude nominal controller. Therefore, suppose we have a (suboptimal) control prior, $u_{prior} : S \to A$, and we want to combine our RL policy, $\pi_{\theta_k}$, with this control prior at each learning stage, $k$. Before we proceed, let us define $u_{\theta_k} : S \to A$ to represent the realized controller sampled from the stochastic RL policy $\pi_{\theta_k}(a|s)$ (we will use $u$ to represent deterministic policies and $\pi$ to represent the analogous stochastic ones). We propose to combine the RL policy with the control prior as follows,

$$u_k(s) = \frac{1}{1+\lambda} u_{\theta_k}(s) + \frac{\lambda}{1+\lambda} u_{prior}(s), \qquad (5)$$

where we assume a continuous, convex action space. Note that $u_k(s)$ is the realized controller sampled from stochastic policy $\pi_k$, whose distribution over actions has been shifted by $u_{prior}$. We refer to $u_k$ as the mixed policy, and $u_{\theta_k}$ as the RL policy.
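As a minimal sketch of the mixing step, the snippet below assumes the mixed action is the convex combination of the RL action and the prior action with weights 1/(1+λ) and λ/(1+λ). The names `mix_action` and `variance_ratio` are illustrative and not taken from the authors' code. Because the prior is deterministic, the spread of the mixed action should shrink by (1/(1+λ))² relative to the RL action.

```python
import random
import statistics


def mix_action(u_rl, u_prior, lam):
    """Blend an RL action with a control-prior action.

    lam = 0 returns the pure RL action; large lam approaches the prior.
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return (u_rl + lam * u_prior) / (1.0 + lam)


def variance_ratio(lam, n=1000, seed=0):
    """Empirical var(mixed action) / var(RL action) for a fixed prior action."""
    rng = random.Random(seed)
    u_prior = 0.5  # hypothetical prior action at some fixed state
    rl_actions = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mixed = [mix_action(u, u_prior, lam) for u in rl_actions]
    return statistics.pvariance(mixed) / statistics.pvariance(rl_actions)


if __name__ == "__main__":
    for lam in (0.0, 1.0, 3.0):
        print(lam, variance_ratio(lam), (1.0 / (1.0 + lam)) ** 2)
```

The printed ratio matches (1/(1+λ))² up to floating-point error, since mixing is an affine map of the sampled RL action.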

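Utilizing the mixed policy (5) is equivalent to placing a functional regularizer $u_{prior}$ on the RL policy, $u_{\theta_k}$, with regularizer weight $\lambda$. Let $\pi_{\theta_k}(a|s)$ be Gaussian distributed: $\pi_{\theta_k} = N(\bar{u}_{\theta_k}, \Sigma)$, so that $\Sigma$ describes the exploration noise. Then we obtain the following,

$$\bar{u}_k(s) = \frac{1}{1+\lambda} \bar{u}_{\theta_k}(s) + \frac{\lambda}{1+\lambda} u_{prior}(s), \qquad (6)$$

where the control prior, $u_{prior}$, can be interpreted as a Gaussian prior on the mixed control policy (see Appendix A). Let us define the norm $\|u_1 - u_2\|_\Sigma = (u_1 - u_2)^T \Sigma^{-1} (u_1 - u_2)$.

Lemma 1. The policy $\bar{u}_k(s)$ in Equation (6) is the solution to the following regularized optimization problem,

$$\bar{u}_k(s) = \arg\min_{u} \|u(s) - \bar{u}_{\theta_k}(s)\|_\Sigma + \lambda \|u(s) - u_{prior}(s)\|_\Sigma, \quad \forall s \in S, \qquad (7)$$

which can be equivalently expressed as the constrained optimization problem,

$$\bar{u}_k(s) = \arg\min_{u} \|u(s) - \bar{u}_{\theta_k}(s)\|_\Sigma \quad \text{s.t.} \quad \|u(s) - u_{prior}(s)\|_\Sigma \le \tilde{\mu}(\lambda), \quad \forall s \in S, \qquad (8)$$

where $\tilde{\mu}$ constrains the policy search. Assuming convergence of the RL algorithm, $\bar{u}_k(s)$ converges to the solution,

$$\bar{u}_k(s) = \arg\min_{u} \left\|u(s) - \arg\max_{u_\theta} E_{\tau \sim u_\theta}[r(s, a)]\right\|_\Sigma + \lambda \|u(s) - u_{prior}(s)\|_\Sigma, \quad \forall s \in S, \text{ as } k \to \infty. \qquad (9)$$

This lemma is proved in Appendix A. The equivalence between (6) and (7) illustrates that the control prior acts as a functional regularization (recall that $\bar{u}_{\theta_k}$ solves the reward maximization problem appearing in (9)). The policy mixing (6) can also be interpreted as constraining policy search near the control prior, as shown by (8). More weight on the control prior (higher $\lambda$) constrains the policy search more heavily. In certain settings, the problem can be solved in the constrained optimization formulation (Le et al., 2019).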

4.1. CORE-RL Algorithm

Our learning algorithm is described in Algorithm 1. At the high level, the process can be described as:

• First compute the control prior based on prior knowledge (Line 1). See Section 5 for details on controller synthesis.

• For a given policy iteration, compute the regularization weight, $\lambda$, using the strategy described in Section 4.3 (Lines 7-9). The algorithm can also use a fixed regularization weight, $\lambda$ (Lines 10-11).

• Deploy the mixed policy (5) on the system, and record the resulting states/action/rewards (Lines 13-15).

• At the end of each policy iteration, update the policy based on the recorded state/action/rewards (Lines 16-18).
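The steps above can be sketched as a small training loop. This is a sketch under stated assumptions, not the authors' Algorithm 1: the dynamics `step`, the proportional prior `u_prior`, the linear-Gaussian policy, and the plain REINFORCE-style update (standing in for DDPG, PPO, or TRPO) are all hypothetical choices made to keep the example self-contained. A fixed regularization weight `lam` is used.

```python
import random


def u_prior(s):
    # Hypothetical control prior: simple proportional feedback.
    return -1.0 * s


def step(s, a):
    # Hypothetical toy dynamics and reward (drive the state to zero).
    s_next = s + 0.1 * a
    return s_next, -(s_next ** 2)


def run_episode(theta, lam, rng, horizon=20, sigma=0.3):
    """Deploy the mixed policy and record (state, RL action, reward)."""
    s = 1.0
    trajectory = []
    for _ in range(horizon):
        u_rl = rng.gauss(theta * s, sigma)           # sample the RL policy
        a = (u_rl + lam * u_prior(s)) / (1.0 + lam)  # mix with the prior
        s_next, r = step(s, a)
        trajectory.append((s, u_rl, r))
        s = s_next
    return trajectory


def reinforce_update(theta, trajectory, lr, sigma=0.3):
    """One REINFORCE-style step using undiscounted reward-to-go."""
    reward_to_go = 0.0
    grad = 0.0
    for s, u_rl, r in reversed(trajectory):
        reward_to_go += r
        # d/dtheta log N(u_rl; theta * s, sigma^2) = (u_rl - theta*s) * s / sigma^2
        grad += (u_rl - theta * s) * s / sigma ** 2 * reward_to_go
    return theta + lr * grad / len(trajectory)


def core_rl_sketch(lam, iterations=50, lr=1e-3, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    returns = []
    for _ in range(iterations):
        trajectory = run_episode(theta, lam, rng)
        returns.append(sum(r for _, _, r in trajectory))
        theta = reinforce_update(theta, trajectory, lr)
    return theta, returns


if __name__ == "__main__":
    theta, returns = core_rl_sketch(lam=2.0)
    print("final theta:", theta, "last return:", returns[-1])
```

The only place the prior enters is the mixing line in `run_episode`; the policy update itself is untouched, which is why the underlying RL algorithm can be swapped.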

4.2. Bias-Variance Tradeoff

Theorem 1 formally states that mixing the policy gradient-based controller, $\pi_{\theta_k}$, with the control prior, $\pi_{prior}$, decreases learning variability. However, the mixing may introduce bias into the learned policy that depends on (a) the regularization weight $\lambda$, and (b) the sub-optimality of the control prior. Bias is defined in (10) and refers to the difference between the mixed policy and the (potentially locally) optimal RL policy at convergence.

Theorem 1. Consider the mixed policy (5) where $\pi_{\theta_k}$ is a policy gradient-based RL policy, and denote the (potentially local) optimal policy to be $\pi_{opt}$. The variance (4) of the mixed policy arising from the policy gradient is reduced by a factor $\left(\frac{1}{1+\lambda}\right)^2$ when compared to the RL policy with no control prior.

However, the mixed policy may introduce bias proportional to the sub-optimality of the control prior. If we let $D_{sub} = D_{TV}(\pi_{opt}, \pi_{prior})$, then the policy bias (i.e. $D_{TV}(\pi_k, \pi_{opt})$) is bounded as follows,

$$D_{TV}(\pi_k, \pi_{opt}) \ge D_{sub} - \frac{1}{1+\lambda} D_{TV}(\pi_{\theta_k}, \pi_{prior}),$$
$$D_{TV}(\pi_k, \pi_{opt}) \le \frac{\lambda}{1+\lambda} D_{sub} \quad \text{as } k \to \infty, \qquad (10)$$

where $D_{TV}(\cdot, \cdot)$ represents the total variation distance between two probability measures (i.e. policies). Thus, if $D_{sub}$ and $\lambda$ are large, this will introduce policy bias.

The proof can be found in Appendix B. Recall that $\pi_{prior}$ is the stochastic analogue to the deterministic control prior $u_{prior}$, such that $\pi_{prior}(a|s) = 1(a = u_{prior}(s))$ where 1 is the indicator function. Note that the bias/variance results apply to the policy, not the accumulated reward.

Intuition: Using Figure 2, we provide some intuition for the control regularization discussed above. Note the following:

1) The explorable region of the state space is denoted by the set $S_{st}$, which grows as $\lambda$ decreases and vice versa. This illustrates the constrained policy search interpretation of regularization in the state space.

2) The difference between the control prior trajectory and optimal trajectory (i.e. $D_{sub}$) may bias the final policy depending on the explorable region $S_{st}$ (i.e. $\lambda$). Fig. 2 shows this difference, and its implications, in state space.

3) If the optimal trajectory is within the explorable region, then we can learn the corresponding optimal policy – otherwise the policy will remain suboptimal.

Points 1 and 3 will be formally addressed in Section 5.

Figure 2. Illustration of optimal trajectory vs. control-theoretic trajectory with the explorable set $S_{st}$. (a) With high regularization, set $S_{st}$ is small so we cannot learn the optimal trajectory. (b) With lower regularization, set $S_{st}$ is larger so we can learn the optimal trajectory. However, this also enlarges the policy search space.

4.3. Computing the mixing parameter λ

A remaining challenge is automatically tuning $\lambda$, especially as we acquire more training data. While setting a fixed $\lambda$ can perform well, intuitively, $\lambda$ should be large when the RL controller is highly uncertain, and it should decrease as we become more confident in our learned controller.

Consider the multiple model adaptive control (MMAC) framework, where a set of controllers (each based on a different underlying model) are generated. A meta-controller computes the overall controller by selecting the weighting for different candidate controllers, based on how close the underlying system model for each candidate controller is to the “true” model (Kuipers & Ioannou, 2010). Inspired by this approach, we should weight the RL controller proportional to our confidence in its model. Our confidence should be state-dependent (i.e. low confidence in areas of the state space where little data has been collected). However, since the RL controller does not utilize a dynamical system model, we propose measuring confidence in the RL controller via the magnitude of the temporal difference (TD) error,

$$\delta^\pi(s_t) = r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t), \qquad (11)$$

where $a_t \sim \pi(a|s_t)$ and $a_{t+1} \sim \pi(a|s_{t+1})$. This TD error measures how poorly the RL algorithm predicts the value of subsequent actions from a given state. A high TD-error implies that the estimate of the action-value function at a given state is poor, so we should rely more heavily on the control prior (a high $\lambda$ value). In order to scale the TD-error to a value in the interval $[0, \lambda_{max}]$, we take the negative exponential of the TD-error, computed at run-time,

$$\lambda(s_t) = \lambda_{max}\left(1 - e^{-C|\delta(s_{t-1})|}\right). \qquad (12)$$

The parameters $C$ and $\lambda_{max}$ are tuning parameters of the adaptive weighting strategy. Note that Equation (12) uses $\delta(s_{t-1})$ rather than $\delta(s_t)$, because computing $\delta(s_t)$ requires measurement of state $s_{t+1}$. Thus we rely on the reasonable assumption that $\delta(s_t) \approx \delta(s_{t-1})$, since $s_t$ should be very close to $s_{t-1}$ in practice.

Equation (12) yields a low value of $\lambda$ if the RL action-value function predictions are accurate. This measure is chosen because the (explicit) underlying model of the RL controller is the value function (rather than a dynamical system model). Our experiments show that this adaptive scheme based on the TD error allows better tuning of the variance and performance of the policy.
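A minimal sketch of the adaptive weighting follows. It assumes the weight takes the form λ_max(1 − exp(−C|δ|)), which is consistent with the description of scaling a negative exponential of the TD error into a bounded interval. The function names and the numeric values of `lam_max` and `c` in the usage lines are illustrative assumptions, not values from the paper.

```python
import math


def td_error(r_next, q_next, q_curr, gamma):
    """One-step TD error: r + gamma * Q(s', a') - Q(s, a)."""
    return r_next + gamma * q_next - q_curr


def adaptive_lambda(delta_prev, lam_max, c):
    """Map the previous step's TD error to a weight in [0, lam_max).

    Small |TD error| -> small lambda (trust the RL policy);
    large |TD error| -> lambda near lam_max (lean on the control prior).
    """
    if lam_max < 0 or c < 0:
        raise ValueError("lam_max and c must be non-negative")
    return lam_max * (1.0 - math.exp(-c * abs(delta_prev)))


if __name__ == "__main__":
    delta = td_error(r_next=1.0, q_next=2.0, q_curr=2.5, gamma=0.9)
    print("TD error:", delta)
    print("lambda  :", adaptive_lambda(delta, lam_max=15.0, c=1.0))
```

Using the previous step's TD error keeps the weight computable at run-time, since the current TD error needs the next state.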

5. Control Theoretic Stability Guarantees

In many controls applications, it is crucial to ensure dynamic stability, not just high rewards, during learning. When a (crude) dynamical system model is available, we can utilize classic controller synthesis tools (e.g. LQR, PID, etc.) to obtain a stable control prior in a region of the state space. In this section, we utilize a well-established tool from robust control theory ($H^\infty$ control) to analyze system stability under the mixed policy (5), and prove stability guarantees throughout learning when using a robust control prior.

Our work is built on the idea that the control prior should maximize robustness to disturbances and model uncertainty, so that we can treat the RL control, $u_{\theta_k}$, as a performance-maximizing “disturbance” to the control prior, $u_{prior}$. The mixed policy then takes advantage of the stability properties of the robust control prior, and the performance optimization properties of the RL algorithm. To obtain a robust control prior, we utilize concepts from $H^\infty$ control.
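As a small illustration of synthesizing a stabilizing control prior from a crude model, the sketch below uses scalar LQR as a stand-in; it is not the robust synthesis used in this section. It assumes a hypothetical one-dimensional linear model ds/dt = a·s + b·u with cost weights `q` and `r`, for which the Riccati equation has a closed-form positive solution. The function names are illustrative.

```python
import math


def scalar_lqr(a, b, q, r):
    """Solve the scalar Riccati equation 2*a*p - (b*p)**2 / r + q = 0.

    Returns (p, k): the positive Riccati solution and the feedback gain,
    so that the prior u = -k * s stabilizes ds/dt = a*s + b*u.
    """
    if b == 0 or q <= 0 or r <= 0:
        raise ValueError("need b != 0, q > 0, r > 0")
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    k = b * p / r
    return p, k


def u_prior(s, k):
    # Linear state-feedback control prior.
    return -k * s


def lyapunov_rate(s, a, b, k, p):
    """Time-derivative of V(s) = p * s**2 along the closed-loop dynamics."""
    s_dot = (a - b * k) * s
    return 2.0 * p * s * s_dot


if __name__ == "__main__":
    a, b = 1.0, 1.0                      # open-loop unstable toy model
    p, k = scalar_lqr(a, b, q=1.0, r=1.0)
    print("gain k:", k, "closed-loop pole:", a - b * k)
    print("dV/dt at s = 0.7:", lyapunov_rate(0.7, a, b, k, p))
```

The closed-loop pole is negative and V decreases away from the origin, which is the kind of property the stability analysis relies on.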

Consider the nonlinear dynamical system (1), and let us linearize the known part of the model, $f_c^{known}(s, a)$, around a desired equilibrium point to obtain the following,

$$\dot{s} = A s + B_1 w + B_2 a,$$
$$z = C_1 s + D_{11} w + D_{12} a, \qquad (13)$$

where $w$ is the disturbance vector, and $z$ is the controlled output. Note that we analyze the continuous-time dynamics rather than discrete-time, since all mechanical systems have continuous-time dynamics that can be discovered through analysis of the system Lagrangian. However, similar analysis can be done for discrete-time dynamics. We make the following standard assumption; conditions for its satisfaction can be found in (Doyle et al., 1989).

Assumption 1. An $H^\infty$ controller exists for linear system (13) that stabilizes the system in a region of the state space.

Stability here means that system trajectories are bounded around the origin/setpoint. We can then synthesize an $H^\infty$ controller, $u^{H^\infty}(s) = -Ks$, using established techniques described in (Doyle et al., 1989). The resulting controller is robust with worst-case disturbances attenuated by a factor $\zeta_k$ before entering the output, where $\zeta_k < 1$ is a parameter returned by the synthesis algorithm. See Appendix F for further details on $H^\infty$ control and its robustness properties.

Having synthesized a robust $H^\infty$ controller for the linear system model (13), we are interested in how those robustness properties (e.g. disturbance attenuation by $\zeta_k$) influence the nonlinear system (1) controlled by the mixed policy (5). We rewrite the system dynamics (1) in terms of the linearization (13) plus a disturbance term as follows,

$$\dot{s} = A s + B_2 a + d(s, a), \qquad (14)$$

where $d(s, a)$ gathers together all dynamic uncertainties and nonlinearities. To keep this small, we could use feedback linearization based on the nominal nonlinear model (1).

We now analyze stability of the nonlinear system (14) under the mixed policy (5) using Lyapunov analysis (Khalil, 2000). Consider the Lyapunov function $V(s) = s^T P s$, where $P$ is obtained when synthesizing the $H^\infty$ controller (see Appendix F). If we can define a closed region, $S_{st}$, around the origin such that $\dot{V}(s) < 0$ outside that region, then by standard Lyapunov analysis, $S_{st}$ is forward invariant and asymptotically stable (note $\dot{V}(s)$ is the time-derivative of the Lyapunov function). Since the $H^\infty$ control law satisfies an Algebraic Riccati Equation, we obtain the following relation,

Lemma 2. For any state $s$, satisfaction of the condition,

$$2 s^T P \left(d(s, a) + \frac{1}{1+\lambda} B_2 u_e\right) < s^T \left(C_1^T C_1 + \frac{1}{\zeta_k^2} P B_1 B_1^T P\right) s,$$

implies that $\dot{V}(s) < 0$.

This lemma is proved in Appendix C. Note that $u_e = u_{\theta_k} - u^{H^\infty}$ denotes the difference between the RL controller and control prior, and $(A, B_1, B_2, C_1)$ come from (13). Let us bound the RL control output such that $\|u_e\|_2 \le C_\pi$, and define the set $C = \{s \in S : \|u_e\|_2 \le C_\pi,\ H^\infty \text{ control is stabilizing}\}$. We also bound the “disturbance” $\|d(s, a)\|_2 \le C_D$, for all $s \in C$, and define the minimum singular value $\sigma_m(\zeta_k) = \sigma_{min}\left(C_1^T C_1 + \frac{1}{\zeta_k^2} P B_1 B_1^T P\right)$, which reflects the robustness of the control prior (i.e. larger $\sigma_m$ implies greater robustness). Then using Lemma 2 and Lyapunov analysis tools, we can derive a conservative set that is guaranteed asymptotically stable and forward invariant under the mixed policy, as described in the following theorem (proof in Appendix D).

Theorem 2. Assume a stabilizing $H^\infty$ control prior within the set $C$ for the dynamical system (14). Then asymptotic stability and forward invariance of the set

$$S_{st} = \left\{ s \in C : \|s\|_2 \le \frac{1}{\sigma_m(\zeta_k)} \left( 2 \|P\|_2 C_D + \frac{2}{1+\lambda} \|P B_2\|_2 C_\pi \right) \right\}$$

is guaranteed under the mixed policy (5) for all $s \in C$. The set $S_{st}$ contracts as we (a) increase robustness of the control prior (increase $\sigma_m(\zeta_k)$), (b) decrease our dynamic uncertainty/nonlinearity $C_D$, or (c) increase weighting $\lambda$ on the control prior.

Put simply, Theorem 2 says that all states in $C$ will converge to (and remain within) set $S_{st}$ under the mixed policy (5). Therefore, the stability guarantee is stronger if $S_{st}$ is smaller. The set $S_{st}$ is drawn pictorially in Fig. 2, and essentially dictates the explorable region. Note that $S_{st}$ is not the region of attraction.

Theorem 2 highlights the tradeoff between the robustness parameter, $\zeta_k$, of the control prior, the nonlinear uncertainty in the dynamics, $C_D$, and the utilization of the learned controller, $\lambda$. If we have a more robust control prior (higher $\sigma_m(\zeta_k)$) or better knowledge of the dynamics (smaller $C_D$), we can heavily weight the learned controller (lower $\lambda$) during the learning process while still guaranteeing stability.

While shrinking the set $S_{st}$ and achieving asymptotic stability along a trajectory or equilibrium point may seem desirable, Fig. 2 illustrates why this is not necessarily the case in an RL context. The optimal trajectory for a task typically deviates from the nominal trajectory (i.e. the control-theoretic trajectory), as shown in Fig. 2, where the set $S_{st}$ illustrates the explorable region under regularization. Fig. 2(a) shows that we do not want strict stability of the nominal trajectory, and instead would like limited flexibility (a sufficiently large $S_{st}$) to explore. By increasing the weighting on the learned policy (decreasing $\lambda$), we expand the set $S_{st}$ and allow for greater exploration around the nominal trajectory (at the cost of stability) as seen in Fig. 2(b).

6. Empirical Results

We apply the CORE-RL Algorithm to three problems: (1) cartpole stabilization, (2) car-following control with experimental data, and (3) racecar driving with the TORCS simulator. We show results using DDPG or PPO or TRPO (Lillicrap et al., 2016; Schulman et al., 2017; Schulman et al., 2015) as the policy gradient RL algorithm (PPO + TRPO results moved to Appendix G), though any similar RL algorithm could be used. All code can be found at https://github.com/rcheng805/CORE-RL.

Note that our results focus on reward rather than bias. Bias (as defined in Section 4.2) assumes convergence to a (locally) optimal policy, and does not include many factors influencing performance (e.g. slow learning, failure to converge, etc.). In practice, Deep-RL algorithms often do not converge (or take very long to do so). Therefore, reward better demonstrates the influence of control regularization on performance, which is of greater practical interest.

6.1. CartPole Problem

We apply the CORE-RL algorithm to control of the cartpole from the OpenAI gym environment (CartPole-v1). We modified the CartPole environment so that it takes a continuous input, rather than discrete input, and we utilize a reward function that encourages the cartpole to maintain its position while keeping the pole upright. Further details on the environment and reward function are in Appendix E. To obtain a control prior, we assume a crude model (i.e. linearization of the nonlinear dynamics with the mass and length values), and from this we synthesize an $H^\infty$ controller. Using this control prior, we run Algorithm 1 with several different regularization weights, $\lambda$. For each $\lambda$, we run CORE-RL 6 times with different random seeds.

Figure 4a plots reward improvement over the control prior, which shows that the regularized controllers perform much better than the baseline DDPG algorithm (in terms of variance, reward, and learning speed). We also see that intermediate values of $\lambda$ (i.e. ) result in the best learning, demonstrating the importance of policy regularization.

Figure 4b better illustrates the performance-variance trade-off. For small $\lambda$, we see high variance and poor performance. With intermediate $\lambda$, we see higher performance and lower variance. As we further increase $\lambda$, variance continues to decrease, but the performance also decreases since policy exploration is heavily constrained. The adaptive mixing strategy performs very well, exhibiting low variance through learning, and converging on a high-performance policy.

Figure 3. Stability region for CartPole under mixed policy. (a) Illustration of the stability region for different regularization, $\lambda$. For each $\lambda$ shown, the trajectory goes to and remains within the corresponding stability set throughout training. (b) Size of the stability region in terms of the angle $\theta$, and position $x$. As $\lambda$ increases, we are guaranteed to remain closer to the equilibrium point during learning.

While Lemma 1 proved that the mixed controller (6) has the same optimal solution as optimization problem (7), when we ran experiments directly using the loss in (7), we found that performance (i.e. reward) was worse than CORE-RL and still suffered high variance. In addition, learning with pre-training on the control prior likewise exhibited high variance and had worse performance than CORE-RL.

Importantly, according to Theorem 2, the system should maintain stability (i.e. remain within an invariant set around our desired equilibrium point) throughout the learning process, and the stable region shrinks as we increase $\lambda$. Our simulations exhibit exactly this property as seen in Figure 3, which shows the maximum deviation from the equilibrium point across all episodes. The system converges to a stability region throughout learning, and this region contracts as we increase $\lambda$. Therefore, regularization not only improves learning performance and decreases variance, but can capture stability guarantees from a robust control prior.

6.2. Experimental Car-Following

We next examine experimental data from a chain of 5 cars following each other on an 8-mile segment of a single-lane public road. We obtain position (via GPS), velocity, and acceleration data from each of the cars, and we control the acceleration/deceleration of the car in the chain. The goal is to learn an optimal controller for this car that maximizes fuel efficiency while avoiding collisions. The experimental setup and data collection process are described in (Ge et al., 2018). For the control prior, we utilize a bang-bang controller that (inefficiently) tries to maintain a large distance from the car in front and behind the controlled car. The reward function penalizes fuel consumption and collisions (or near-collisions). Specifics of the control prior, reward function, and experiments are in Appendix E.

For our experiments, we split the data into 10 second “episodes”, shuffle the episodes, and run CORE-RL six times with different random seeds (for several different $\lambda$).

Figure 4a shows again that the regularized controllers perform much better than the baseline DDPG algorithm for the car-following problem, and demonstrates that regularization leads to performance improvements over the control prior and gains in learning efficiency. Figure 4b reinforces that intermediate values of $\lambda$ (i.e. ) exhibit optimal performance. Low values of $\lambda$ exhibit significant deterioration of performance, because the car must learn (with few samples) in a much larger policy search space; the RL algorithm does not have enough data to converge on an optimal policy. High values of $\lambda$ also exhibit lower performance because they heavily constrain learning. Intermediate $\lambda$ allow for the best learning using the limited number of experiments.

Using an adaptive strategy for setting $\lambda$ (or alternatively tuning to an optimal $\lambda$), we obtain high-performance policies that improve upon both the control prior and RL baseline controller. The variance is also low, so that the learning process reliably learns a good controller.

6.3. Driving in TORCS

Finally we run CORE-RL to generate controllers for cars in The Open Racing Car Simulator (TORCS) (Wymann et al., 2014). The simulator provides readings from 29 sensors, which describe the environment state. The sensors provide information like car speed, distance from track center, wheel spin, etc. The controller decides values for the acceleration, steering and braking actions taken by the car.

To obtain a control prior for this environment, we use a simple PID-like linearized controller for each action, similar to the one described in (Verma et al., 2018). These types of controllers are known to have sub-optimal performance, while still being able to drive the car around a lap. We perform all our experiments on the CG-Speedway track in TORCS. For each $\lambda$, we run the algorithm 5 times with different initializations and random seeds.

For TORCS, we plot laptime improvement over the control prior so that values above zero denote improved performance over the prior. The laps are timed out at 150s, and the objective is to minimize lap-time by completing a lap as fast as possible. Due to the sparsity of the lap-time signal, we use a pseudo-reward function during training that provides a heuristic estimate of the agent’s performance at each time step during the simulation (details in Appendix E).

Figure 4. Learning results for CartPole, Car-Following, and TORCS RaceCar Problems using DDPG. (a) Reward improvement over control prior with different set values for $\lambda$ or an adaptive $\lambda$. The right plot is a zoomed-in version of the left plot without variance bars for clarity. Values above the dashed black line signify improvements over the control prior. (b) Performance and variance in the reward as a function of the regularization $\lambda$, across different runs of the algorithm using random initializations/seeds. Dashed lines show the performance (i.e. reward) and variance using the adaptive weighting strategy. Variance is measured for all episodes across all runs. Adaptive $\lambda$ and intermediate values of $\lambda$ exhibit best learning. Again, performance is baselined to the control prior, so any performance value above 0 denotes improvement over the control prior.

Once more, Figure 4a shows that regularized controllers perform better on average than the baseline DDPG algorithm, and that we improve upon the control prior with proper regularization. Figure 4b shows that intermediate values of $\lambda$ exhibit good performance, but using the adaptive strategy for setting $\lambda$ in the TORCS setting gives us the highest-performance policy that significantly beats both the control prior and DDPG baseline. Also, the variance with the adaptive strategy is significantly lower than for the DDPG baseline, which again shows that the learning process reliably learns a good controller.

Note that we have only shown results for DDPG. Results for PPO and TRPO are similar for CartPole and Car-following (different for TORCS), and can be found in Appendix G.

7. Conclusion

A significant criticism of RL is that random seeds can produce vastly different learning behaviors, limiting application of RL to real systems. This paper shows, through theoretical results and experimental validation, that our method of control regularization substantially alleviates this problem, enabling significant variance reduction and performance improvements in RL. This regularization can be interpreted as constraining the explored action space during learning.

Our method also allows us to capture dynamic stability properties of a robust control prior to guarantee stability during learning, and has the added benefit that it can easily incorporate different RL algorithms (e.g. PPO, DDPG, etc.). The main limitation of our approach is that it relies on a reasonable control prior, and it remains to be analyzed how poor a control prior can be while still aiding learning.

Acknowledgements

This work was funded in part by Raytheon under the Learning to Fly program, and by DARPA under the Physics-Infused AI Program.

References

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained Policy Optimization. In International Conference on Machine Learning (ICML), 2017.

Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 2017.

Baxter, J. and Bartlett, P. Reinforcement learning in POMDP’s via direct gradient ascent. International Conference on Machine Learning, 2000.

Benjamin, A. S., Rolnick, D., and Kording, K. Measuring and Regularizing Networks in Function Space. arXiv:1805.08289, 2018.

Berkenkamp, F., Turchetta, M., Schoellig, A. P., and Krause, A. Safe Model-based Reinforcement Learning with Stability Guarantees. In Neural Information Processing Systems (NeurIPS), 2017.

Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based Approach to Safe Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Doyle, J. Robust and Optimal Control. In Conference on Decision and Control, 1996.

Doyle, J., Glover, K., Khargonekar, P., and Francis, B. State-space solutions to standard H2 and H∞ control problems. IEEE Transactions on Automatic Control, 1989. ISSN 00189286. doi: 10.1109/9.29425.

Duan, Y., Chen, X., Schulman, J., and Abbeel, P. Benchmarking Deep Reinforcement Learning for Continuous Control. In International Conference on Machine Learning (ICML), 2016.

Farshidian, F., Neunert, M., and Buchli, J. Learning of closed-loop motion control. In IEEE International Conference on Intelligent Robots and Systems, 2014.

García, J. and Fernández, F. A Comprehensive Survey on Safe Reinforcement Learning. JMLR, 2015.

Ge, J. I., Avedisov, S. S., He, C. R., Qin, W. B., Sadeghpour, M., and Orosz, G. Experimental validation of connected automated vehicle design among human-driven vehicles. Transportation Research Part C: Emerging Technologies, 2018.

Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., and Levine, S. Divide-and-conquer reinforcement learning. In Neural Information Processing Systems (NeurIPS), volume abs/1711.09874, 2018.

Greensmith, E., Bartlett, P., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. JMLR, 2004.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep Reinforcement Learning that Matters. In AAAI Conference on Artificial Intelligence, 2018.

Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. Reproducibility of Benchmarked Deep Reinforcement Learning of Tasks for Continuous Control. In Reproducibility in Machine Learning Workshop, 2017.

Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Aparicio Ojea, J., Solowjow, E., and Levine, S. Residual Reinforcement Learning for Robot Control. arXiv e-prints, art. arXiv:1812.03201, Dec 2018.

Khalil, H. K. Nonlinear Systems (Third Edition). Prentice Hall, 2000.

Kuipers, M. and Ioannou, P. Multiple model adaptive control with mixing. IEEE Transactions on Automatic Control, 2010.

Le, H., Kang, A., Yue, Y., and Carr, P. Smooth Imitation Learning for Online Sequence Prediction. In International Conference on Machine Learning (ICML), 2016.

Le, H. M., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In International Conference on Machine Learning, 2019.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.

Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. arXiv e-prints, art. arXiv:1708.02596, Aug 2017.

Perkins, T. J. and Barto, A. G. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 2003.

Recht, B. A Tour of Reinforcement Learning: The View from Continuous Control. Annual Review of Control, Robotics, and Autonomous Systems, 2(1):253–279, 2019.

Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. Trust Region Policy Optimization. In International Conference on Machine Learning (ICML), 2015.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. International Conference on Learning Representations, 2016.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal Policy Optimization Algorithms. arXiv e-prints, art. arXiv:1707.06347, Jul 2017.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.

Sutton, R., McAllester, D., Singh, S. P., and Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems, 1999.

Thodoroff, P., Durand, A., Pineau, J., and Precup, D. Temporal Regularization for Markov Decision Process. In Advances in Neural Information Processing Systems, 2018.

Verma, A., Murali, V., Singh, R., Kohli, P., and Chaudhuri, S. Programmatically interpretable reinforcement learning. In International Conference on Machine Learning (ICML), 2018.

Weaver, L. and Tao, N. The Optimal Reward Baseline for Gradient-Based Reinforcement Learning. In Uncertainty in Artificial Intelligence (UAI), 2001.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.

Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., Mordatch, I., and Abbeel, P. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines. In International Conference on Learning Representations, 2018.

Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2014.

Zhao, T., Hachiya, H., Niu, G., and Sugiyama, M. Analysis and improvement of policy gradient estimation. Neural Networks, 2012.

Zhao, T., Niu, G., Xie, N., Yang, J., and Sugiyama, M. Regularized Policy Gradients : Direct Variance Reduction in Policy Gradient Estimation. Proceedings of the Asian Conference on Machine Learning, 2015.

Appendix: Control Regularization for Reduced Variance Reinforcement Learning

A. Proof of Lemma 1

Lemma 1. The policy in Equation (6) is the solution to the following regularized optimization problem,

which can be equivalently expressed as the constrained optimization problem:

where $\tilde{\mu}$ constrains the policy search. Assuming convergence of the RL algorithm, $u_k(s)$ converges to the solution,

Proof.

Equivalence between (6) and (16): Let $\pi_{\theta_k}(a|s)$ be a Gaussian distributed policy with mean $\bar{u}_{\theta_k}(s)$ and covariance $\Sigma$. Thus, $\Sigma$ describes exploration noise. From the mixed policy definition (6), we can obtain the following Gaussian distribution describing the mixed policy:

where the second equality follows from the properties of products of Gaussians. Let us define $|\Sigma|$ to be the determinant of $\Sigma$. Then, distribution (19) can be rewritten as the product,

where X(s) is a random variable with P(X(s)) representing the probability of taking action X from state s under policy (6). Further simplifying this PDF, we obtain:

Since the probability P(X(s)) is maximized when the argument of the exponential in Equation (21) is minimized, the maximum probability policy can be expressed as the solution to the following regularized optimization problem,

Therefore the mixed policy from Equation (6) is the solution to Problem (16).
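The equivalence argued above can be checked numerically. The sketch below assumes the regularized objective is the squared covariance-weighted distance to the RL action plus $\lambda$ times the squared covariance-weighted distance to the prior action, with a diagonal covariance. The names `mixed_action`, `regularized_cost`, and all numeric values are hypothetical and chosen only for illustration.

```python
def mixed_action(u_rl, u_prior, lam):
    """Weighted average of the RL action and the prior action (lam >= 0)."""
    return [(a + lam * b) / (1.0 + lam) for a, b in zip(u_rl, u_prior)]


def regularized_cost(u, u_rl, u_prior, lam, inv_var):
    """Squared weighted distance to u_rl plus lam times that to u_prior.

    inv_var holds the diagonal of the inverse exploration covariance
    (a diagonal covariance is an assumption made for this sketch).
    """
    to_rl = sum(w * (x - a) ** 2 for w, x, a in zip(inv_var, u, u_rl))
    to_prior = sum(w * (x - b) ** 2 for w, x, b in zip(inv_var, u, u_prior))
    return to_rl + lam * to_prior


# Hypothetical two-dimensional actions and weights.
u_rl = [1.0, -2.0]
u_prior = [0.0, 0.5]
lam = 3.0
inv_var = [4.0, 0.25]

u_mix = mixed_action(u_rl, u_prior, lam)
cost_at_mix = regularized_cost(u_mix, u_rl, u_prior, lam, inv_var)

# Costs at small perturbations of the mixed action, for comparison.
perturbed_costs = [
    regularized_cost([u_mix[0] + dx, u_mix[1] + dy], u_rl, u_prior, lam, inv_var)
    for dx in (-0.1, 0.0, 0.1)
    for dy in (-0.1, 0.0, 0.1)
    if (dx, dy) != (0.0, 0.0)
]
```

Because the objective is strictly convex in u, every perturbed cost exceeds `cost_at_mix`, which is what one would expect if the weighted average is the minimizer.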

Convergence of (16) to (18): Note that and are parameterized by the same and represent the iterative solution to the optimization problem

at the latest policy iteration. Thus, assuming convergence of the RL algorithm, we can rewrite problem (22) as follows,

Equivalence between (16) and (17): Finally, we want to show that the solutions for regularized problem (16) and the constrained optimization problem (17) are equivalent.

First, note that Problem (16) is the dual to Problem (17), where $\lambda$ is the dual variable. Clearly problem (16) is convex in u. Furthermore, Slater’s condition holds, since there is always a feasible point (e.g. trivially $u = u_{prior}$). Therefore strong duality holds. This means that there exists $\lambda \geq 0$ such that the solution to Problem (17) must also be optimal for Problem (16).

To show the other direction, fix and define R(u) =

. Let us denote as the optimal solution for Problem (16) with (note we can choose supposed is not optimal for Problem (17). Then there exists such that and . Denote the difference in the two rewards by Thus the following relations hold,

This leads to the conditional statement,

For fixed , there always exists such that the condition holds. However, this leads to a contradiction, since we assumed that is optimal for Problem (16). We can conclude then that such that the solution to Problem (16) must be optimal for Problem (17). Therefore, Problems (16) and (17) have equivalent solutions.
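As a concrete instance of this equivalence, the constraint radius that matches a given $\lambda$ can be written in closed form. The derivation below is a sketch that assumes both problems use the same squared weighted norm $\|v\|_\Sigma^2 = v^T \Sigma^{-1} v$; the symbols $u^\star$ and $\tilde{\mu}(\lambda)$ are notation chosen for this illustration.

```latex
\begin{align*}
u^\star &= \arg\min_u \; \|u - u_{\theta_k}\|_\Sigma^2 + \lambda \|u - u_{prior}\|_\Sigma^2
  && \text{(regularized form)} \\
0 &= 2\Sigma^{-1}(u^\star - u_{\theta_k}) + 2\lambda\Sigma^{-1}(u^\star - u_{prior})
  && \text{(stationarity)} \\
u^\star &= \frac{1}{1+\lambda}\, u_{\theta_k} + \frac{\lambda}{1+\lambda}\, u_{prior} \\
u^\star - u_{prior} &= \frac{1}{1+\lambda}\,(u_{\theta_k} - u_{prior}) \\
\tilde{\mu}(\lambda) &:= \|u^\star - u_{prior}\|_\Sigma
  = \frac{1}{1+\lambda}\,\|u_{\theta_k} - u_{prior}\|_\Sigma
\end{align*}
```

Under these assumptions, projecting $u_{\theta_k}$ onto the ball of radius $\tilde{\mu}(\lambda)$ around $u_{prior}$ lands on the same point $u^\star$, and the radius shrinks toward zero as $\lambda$ grows.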

B. Proof of Theorem 1

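Theorem 1. Consider the mixed policy (5) where $\pi_{\theta_k}$ is an RL controller learned through policy gradients, and denote the (potentially local) optimal policy to be $\pi_{opt}$. The variance (4) of the mixed policy arising from the policy gradient is reduced by a factor $\left(\frac{1}{1+\lambda}\right)^2$ when compared to the RL policy with no control prior.

However, the mixed policy may introduce bias proportional to the sub-optimality of the control prior. More formally, if we let $D_{sub} = D_{TV}(\pi_{opt}, \pi_{prior})$, then the policy bias (i.e. $D_{TV}(\pi_k, \pi_{opt})$) is bounded as follows:

$$D_{TV}(\pi_k, \pi_{opt}) \geq D_{sub} - \frac{1}{1+\lambda} D_{TV}(\pi_{\theta_k}, \pi_{prior}),$$
$$D_{TV}(\pi_k, \pi_{opt}) \leq \frac{\lambda}{1+\lambda} D_{sub} \quad \text{as } k \to \infty, \qquad (26)$$

where $D_{TV}(\cdot, \cdot)$ represents the total variation distance between two probability measures (i.e. policies). Thus, if $D_{sub}$ and $\lambda$ are large, this will introduce policy bias.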

Proof. Let us define the stochastic action (i.e. random variable) . Then recall from Equation (4) that assuming a fixed, Gaussian distributed policy,

Based on the mixed policy definition (5), we obtain the following relation between the variances of the mixed policy and the RL policy,

Compared to the variance (4), we achieve a variance reduction when utilizing the same learning rate $\alpha$. Taking the same policy gradient variance from (4), the variance is reduced by a factor of $\frac{1}{(1+\lambda)^2}$ by introducing policy mixing.
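The scaling of the variance can be illustrated with a short simulation. The sketch below assumes a scalar action, a deterministic control prior, and Gaussian samples standing in for the RL action; the distribution parameters, `lam`, and the helper `mixed_action` are hypothetical.

```python
import random
import statistics


def mixed_action(u_rl, u_prior, lam):
    """Scalar mixed action: weighted average of RL and prior actions."""
    return (u_rl + lam * u_prior) / (1.0 + lam)


rng = random.Random(0)  # fixed seed so the run is repeatable
lam = 4.0
u_prior = 0.3  # deterministic prior action (assumed constant here)

# Stand-in for the noisy RL action at a fixed state.
rl_samples = [rng.gauss(1.0, 0.5) for _ in range(10_000)]
mixed_samples = [mixed_action(u, u_prior, lam) for u in rl_samples]

var_rl = statistics.pvariance(rl_samples)
var_mixed = statistics.pvariance(mixed_samples)
ratio = var_mixed / var_rl  # expected to match 1 / (1 + lam) ** 2
```

Since the prior contributes no randomness in this setup, the mixed action is an affine function of the RL action, so the ratio follows directly from how variance scales under multiplication by a constant.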

Lower variance comes at a price: the potential introduction of bias into the policy. Let us define the policy bias as $D_{TV}(\pi_k, \pi_{opt})$, and let us denote $D_{sub} = D_{TV}(\pi_{opt}, \pi_{prior})$. Since the total variation distance, $D_{TV}$, is a metric, we can use the triangle inequality to obtain:

We can further break down the term

This holds for all ), we can obtain the lower bound in (26),

To obtain the upper bound, let the policy gradient algorithm with no control prior achieve asymptotic convergence to the (locally) optimal policy (as proven for certain classes of function approximators in (Sutton et al., 1999)). Denote this policy as $\pi_{opt}$, such that $\pi_{\theta_k} \to \pi_{opt}$ as $k \to \infty$. In this case, we can derive the total variation distance between the mixed policy (5) and the optimal policy as follows,

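Note that this represents an upper bound on the bias, since it assumes that $\pi_{\theta_k}$ is uninfluenced by $\pi_{prior}$ during learning. It shows that $\pi_{opt}$ is a feasible policy, but not necessarily optimal when accounting for regularization with $\pi_{prior}$. Therefore, we can obtain the upper bound:

$$D_{TV}(\pi_k, \pi_{opt}) \leq \frac{\lambda}{1+\lambda} D_{sub} \quad \text{as } k \to \infty.$$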
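Both bounds can be sanity-checked on a toy case. The sketch below replaces the continuous policies with hypothetical discrete distributions over three actions, treats the mixed policy as a weighted mixture of the RL policy and the prior, and uses half the L1 distance as the total variation distance; none of the numbers come from the experiments.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mix(pi_rl, pi_prior, lam):
    """Mixture with weight 1/(1+lam) on pi_rl and lam/(1+lam) on pi_prior."""
    return [(a + lam * b) / (1.0 + lam) for a, b in zip(pi_rl, pi_prior)]


# Hypothetical policies over three actions.
pi_opt = [0.7, 0.2, 0.1]
pi_prior = [0.2, 0.3, 0.5]
pi_rl = [0.5, 0.4, 0.1]  # RL policy before convergence
lam = 2.0

d_sub = total_variation(pi_opt, pi_prior)

# Lower bound: holds for any current RL policy.
bias = total_variation(mix(pi_rl, pi_prior, lam), pi_opt)
lower = d_sub - total_variation(pi_rl, pi_prior) / (1.0 + lam)

# Upper bound: evaluated once the RL policy has converged to pi_opt.
bias_converged = total_variation(mix(pi_opt, pi_prior, lam), pi_opt)
upper = lam / (1.0 + lam) * d_sub
```

In this toy case the converged bias meets the upper bound with equality, which reflects the assumption that the RL policy reaches the unregularized optimum regardless of the prior.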

C. Proof of Lemma 2

Lemma 2. For any state s, satisfaction of the condition,

implies that

Proof. Recall that we are analyzing the Lyapunov function $V(s) = s^T P s$, where P is taken from the Algebraic Riccati Equation (50). Let us take the time derivative of the Lyapunov function as follows:

The second equality comes from the Algebraic Riccati Equation (50), which the dynamics satisfy by design of the controller. From here, it follows directly that if,

D. Proof of Theorem 2

Theorem 2. Assume a stabilizing control prior within the set C for our dynamical system (14). Then asymptotic stability and forward invariance of the set

is guaranteed under the mixed policy (5) for all . The set contracts as we (a) increase robustness of the control prior (increase ), (b) decrease our dynamic uncertainty/nonlinearity , or (c) increase weighting on the control prior.

Proof.

Step (1): Find a set in which Lemma 2 is satisfied.

Consider the condition in Lemma 2. Since the right hand side is positive (quadratic), we can consider a bound on the stability condition as follows,

Clearly any set of s that satisfy condition (35) also satisfy the condition in Lemma 2. To find such a set, we bound the terms in Condition (35) as follows,

where the first inequality follows from the triangle inequality; the second inequality uses our bounds on the disturbance, and control input difference , as well as the Cauchy-Schwarz inequality. Now consider the right hand side of Condition (35). Recall that , the minimum singular value. Then the following holds,

Using the bounds in (36) and (37), we can say that Condition (35) is guaranteed to be satisfied if the following holds,

The set for which this condition (38) is satisfied can be described by,

Recall that C is the set in which the stabilizing controller exists. From Lemma 2, described by the set (39).

Step (2): Establish stability and forward invariance of The Lyapunov function decreases towards the origin, and we have established that the time derivative of the Lyapunov function is negative for s in set (39). Therefore, any state s described by the set (39) (intersected with C) must move towards the origin (i.e. towards ). This follows directly from the properties of Lyapunov functions. Therefore, the set described in (34) must be asymptotically stable and forward invariant for all
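The dependence of the stable set on the prior weighting can be illustrated with a toy simulation. The sketch below is not the system from the paper: it assumes a scalar linear plant, a proportional control prior, an RL action that deviates from the prior by a constant worst-case offset, and a constant disturbance. All parameter names and values are hypothetical.

```python
def final_distance(lam, a=0.5, b=1.0, k=2.0, c_pi=1.0, c_d=0.2,
                   s0=3.0, dt=0.01, steps=5000):
    """Simulate ds/dt = a*s + b*u + c_d with forward Euler; return |s| at the end.

    u is the mixed action; the RL action is the prior plus a fixed
    offset c_pi, standing in for a bounded deviation from the prior.
    """
    s = s0
    for _ in range(steps):
        u_prior = -k * s                      # stabilizing proportional prior
        u_rl = u_prior + c_pi                 # worst-case bounded deviation
        u_mix = (u_rl + lam * u_prior) / (1.0 + lam)
        s += dt * (a * s + b * u_mix + c_d)   # Euler step of the dynamics
    return abs(s)


# Distance from the origin that the state settles at, for several weights.
radii = {lam: final_distance(lam) for lam in (0.0, 1.0, 9.0)}
```

In this toy setup the state settles at a distance of (b*c_pi/(1+lam) + c_d)/(b*k - a) from the origin, so a larger weighting on the prior shrinks the region the state is drawn into, while the disturbance term sets a floor that regularization cannot remove.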

E. Description of Experiments

E.1. Experimental Car-Following

In the original car-following experiments, a chain of 8 cars followed each other on an 8-mile segment of a single-lane public road. We obtain position (via GPS), velocity, and acceleration data from each of the cars. We cut this data into 4 sets of chains of 5 cars, in order to maximize the data available to learn from. We then cut this into 10 second “episodes” (100 data points each). We shuffle these training episodes randomly before each run and feed them to the algorithm, which learns the controller for the chain.
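The episode preparation described above can be sketched as follows. The example assumes each chain's recording is a flat sequence of samples at 10 Hz (so 100 points span 10 seconds) and that leftover samples shorter than one episode are dropped; `make_episodes`, `shuffled_episodes`, and the placeholder data are hypothetical.

```python
import random


def make_episodes(samples, episode_len=100):
    """Split a recording into consecutive fixed-length episodes.

    Trailing samples that do not fill a whole episode are dropped
    (an assumption; the original handling is not specified).
    """
    n_full = len(samples) // episode_len
    return [samples[i * episode_len:(i + 1) * episode_len] for i in range(n_full)]


def shuffled_episodes(episodes, seed):
    """Return a shuffled copy of the episode list for one training run."""
    rng = random.Random(seed)
    order = list(episodes)
    rng.shuffle(order)
    return order


# Placeholder recording: integers stand in for (position, velocity, acceleration) rows.
recording = list(range(1050))
episodes = make_episodes(recording)
run_order = shuffled_episodes(episodes, seed=0)
```

Using a different seed per run gives each run its own episode ordering while keeping every run repeatable.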

The reward function we use in learning is described below:

where , and denote the position of the controlled car, the car in front of it, and the car behind it. Also, a denotes the control action (i.e. acceleration/deceleration), and denotes the velocity of the controlled car. Therefore, the first term represents the fuel efficiency of the controlled car, and the other terms encourage the car to maintain headway from the other cars and avoid collision.

The control prior we utilize is a simple bang-bang controller that (inefficiently) tries to keep the controlled car between the cars in front and behind. It is described by,

where , and denote the velocity of the controlled car, the car in front of it, and the car behind it. We set the constants and . Essentially, the control prior tries to maximize the distance from the car in front and behind, taking into account velocities as well as positions.

E.2. TORCS Racecar Simulator

In its full generality TORCS provides a rich environment with input from up to 89 sensors, and optionally the 3D graphic from a chosen camera angle in the race. The controllers have to decide the values of up to 5 parameters during game play, which correspond to the acceleration, brake, clutch, gear and steering of the car. Apart from the immediate challenge of driving the car on the track, controllers also have to make race-level strategy decisions, like making pit-stops for fuel. A lower level of complexity is provided in the Practice Mode setting of TORCS. In this mode all race-level strategies are removed. Currently, so far as we know, state-of-the-art DRL models are capable of racing only in Practice Mode, and this is also the environment that we use. In this mode we consider the input from 29 sensors, and decide values for the acceleration, steering and brake actions.

The control prior we utilize is a linear controller of the form:

Where is the most recent observation provided by the simulator for a chosen sensor, and N is a predetermined constant. We have one controller for each of the actions: acceleration, steering, and braking.

The pseudo-reward used during training is given by:

Here V is the velocity of the car, $\theta$ is the angle the car makes with the track axis, and trackPos provides the position on the track relative to the track’s center. This reward captures the aim of maximizing the longitudinal velocity, minimizing the transverse velocity, and penalizing the agent if it deviates significantly from the center of the track.
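A reward with these three terms can be sketched as below. The exact expression and weighting used in the experiments are not reproduced here; this form, with unit weights and the off-center penalty scaled by speed, is an assumption modeled on a common choice in TORCS Gym wrappers. The function name and arguments are hypothetical.

```python
import math


def pseudo_reward(speed, angle, track_pos):
    """Reward longitudinal speed; penalize transverse speed and off-center driving.

    speed:     car velocity V
    angle:     angle between the car and the track axis, in radians
    track_pos: signed offset from the track center (0 means centered)
    """
    longitudinal = speed * math.cos(angle)     # progress along the track
    transverse = abs(speed * math.sin(angle))  # sideways motion
    off_center = speed * abs(track_pos)        # deviation from the center line
    return longitudinal - transverse - off_center


aligned = pseudo_reward(10.0, 0.0, 0.0)        # centered, aligned with the track
drifting = pseudo_reward(10.0, 0.3, 0.0)       # same speed, angled to the axis
off_track_center = pseudo_reward(10.0, 0.0, 0.5)
```

With this form, a car that is centered and aligned with the track receives a reward equal to its speed, and any misalignment or offset at the same speed lowers it.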

E.3. CartPole Stabilization

The CartPole simulator is implemented in the OpenAI gym environment (’CartPole-v1’). The dynamics are the same as in the default, as described below,

where the only modification we make is that the force on the cart can take on a continuous value, , rather than 2 discrete values, making the action space much larger. Since the control prior can already stabilize the CartPole, we also modify the reward to characterize how well the control stabilizes the pendulum. The reward function is stated below, and incentivizes the CartPole to keep the pole upright while minimizing movement in the x-direction:

F. Control Theoretic Stability Guarantees

This section covers the same material as Section 5, but goes into more detail on the definitions. Consider the linear dynamical system described by:

where is the disturbance vector, is the control input vector, is the error vector (controlled output), is the observation vector, and is the state vector. The system transfer function is denoted,

where are defined by the system model (46). Let us make the following assumptions,

• The pairs and are stabilizable and observable, respectively.

• The algebraic Riccati equation has positive-semidefinite solution P,

• The algebraic Riccati equation has positive- semidefinite solution

• The matrix is positive definite.

Under these assumptions, we are guaranteed existence of a stabilizing linear $\mathcal{H}_\infty$ controller (Doyle et al., 1989). The closed-loop transfer function from disturbance, w, to controlled output, z, is:

Let $\bar{\sigma}(\cdot)$ denote the maximum singular value of the argument, and recall that the $\mathcal{H}_\infty$ controller solves the problem,

to give us controller . This generates the maximally robust controller so that the worst-case disturbance is attenuated by factor in the system before entering the controlled output. We can synthesize the controller using techniques described in (Doyle et al., 1989).

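The controller is defined as $u(s) = -B_2^T P s$, where P is a positive symmetric matrix satisfying the Algebraic Riccati equation,

$$A^T P + P A + C_1^T C_1 + \frac{1}{\gamma^2} P B_1 B_1^T P - P B_2 B_2^T P = 0, \qquad (50)$$

where $(A, B_1, B_2, C_1)$ are defined in (46). The result is that the control law stabilizes the system with disturbance attenuation $\gamma$.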

Since we are not dealing with a linear system, we need to consider a modification to the dynamics (46) that linearizes the dynamics about some equilibrium point and gathers together all non-linearities and disturbances,

where d(s, a) captures dynamic uncertainty/nonlinearity as well as disturbances. To keep this small, we could use feedback linearization based on our nominal nonlinear model (1), but this is outside the scope of this work.

Consider the Lyapunov function $V(s) = s^T P s$, where P is taken from Equation (50). We can analyze stability of the uncertain system (14) under the mixed policy (5) using Lyapunov analysis. We can utilize Lemma 2 in this analysis (see Appendix C) in order to compute a set such that $\dot{V}(s) < 0$ in a region outside that set. Satisfaction of this condition guarantees forward invariance of that set (Khalil, 2000), as well as its asymptotic stability (from the region for which $\dot{V}(s) < 0$).

By bounding terms as described in Section 5, we can conservatively compute this set, which is shown in Theorem 2. See Appendix D for the derivation of the set (i.e. proof of Theorem 2).

G. PPO + TRPO Results

We also ran all experiments using Proximal Policy Optimization (PPO) or Trust Region Policy Optimization (TRPO) in place of DDPG. The results are shown in Figures 5 and 6. The trends mirror those seen in the main paper using DDPG. Low values of $\lambda$ exhibit significant deterioration of performance, because of the larger policy search space. High values of $\lambda$ also exhibit lower performance because they heavily constrain learning. Intermediate values of $\lambda$ allow for the best learning, with good performance and low variance. Furthermore, adaptive strategies for setting $\lambda$ allow us to better tune the reward-variance tradeoff.

Note that we do not show results for the TORCS Racecar. This is because we were not able to get the baseline PPO or TRPO agent to complete a lap throughout learning. The code for the PPO, TRPO, and DDPG agents for each environment can be found at https://github.com/rcheng805/CORE-RL.

Figure 5. Learning results for CartPole and Car-Following Problems using PPO. (a) Reward improvement over control prior with different set values for $\lambda$ or an adaptive $\lambda$. The right plot is a zoomed-in version of the left plot without variance bars for clarity. Values above the dashed black line signify improvements over the control prior. (b) Performance and variance in the reward as a function of the regularization $\lambda$, across different runs of the algorithm using random initializations/seeds. Dashed lines show the performance (i.e. reward) and variance using the adaptive weighting strategy. Variance is measured for all episodes across all runs. Again, performance is baselined to the control prior, so any performance value above 0 denotes improvement over the control prior.

Figure 6. Learning results for CartPole and Car-Following Problems using TRPO. (a) Reward improvement over control prior with different set values for $\lambda$ or an adaptive $\lambda$. The right plot is a zoomed-in version of the left plot without variance bars for clarity. Values above the dashed black line signify improvements over the control prior. (b) Performance and variance in the reward as a function of the regularization $\lambda$, across different runs of the algorithm using random initializations/seeds. Dashed lines show the performance (i.e. reward) and variance using the adaptive weighting strategy. Variance is measured for all episodes across all runs. Again, performance is baselined to the control prior, so any performance value above 0 denotes improvement over the control prior.
