Fabio Muratore, Christian Eilers and Jan Peters are with the Intelligent Autonomous Systems Group, Technical University Darmstadt, Germany. Fabio Muratore, Christian Eilers and Michael Gienger are with the Honda Research Institute Europe, Offenbach am Main, Germany.
Figure 1: The Quanser Qube used as the evaluation platform for an under-actuated swing-up and balancing task [3].
Physics simulations make it possible to generate vast and diverse amounts of data at low cost. However, sample-based optimization is known to be optimistically biased [1]. The problem is worsened when the data used for optimization does not originate from the same environment, also called domain. In this case, we observe a simulation optimization bias, which leads to an overestimation of the policy’s performance [2]. Generally, there are two ways to overcome the gap between simulation and reality. One can improve the generative model to closely match reality, e.g., by using system identification. Increasing the model’s accuracy has the advantage of leading to controllers with potentially higher performance, since the learner can focus on a single domain. On the downside, this comes with a reduced transferability of the found policy, which is caused by the previously mentioned optimistic bias and is especially pronounced if the model does not include all physical phenomena. Moreover, we might face a situation where it is not affordable to improve the model. Alternatively, one can add variability to the generative model, e.g., by turning the physics simulator’s parameters into random variables. Learning from randomized simulations poses a harder problem for the learner due to the additional variability of the observed data. However, the recent successes in the field of sim-to-real transfer indicate that domain randomization is a promising method [4, 5]. Most state-of-the-art approaches randomize the physics simulator according to a static handcrafted distribution. Even though static randomization is in many cases sufficient to cross the reality gap, it is desirable to automate the process as far as possible. One reason is that hand-tuning the domain parameter distribution becomes increasingly cumbersome for higher dimensions. Moreover, using a fixed distribution does not allow updating the prior knowledge about the uncertainty over domain parameters. Most importantly, closing the feedback loop over the real system will lead to policies with higher performance on the target domain, since the feedback enables the optimization of the domain parameter distribution.
Contributions: we advance the state of the art by introducing Bayesian Domain Randomization (BayRn), a method which is able to efficiently close the reality gap by learning from randomized simulations and adapting the distribution over simulator parameters based solely on real-world returns. The proposed algorithm can be seen as a way to automate finding the source domain distribution in sim-to-real settings, which is typically done by trial and error.
We validate our approach by conducting a sim-to-sim as well as a sim-to-real experiment on an under-actuated nonlinear swing-up task (Figure 1). The sim-to-sim setup examines the domain parameter adaptation mechanism, and shows that BayRn is able to find a specified ground truth parameter set. In the sim-to-real experiment, we compare the performance of a policy trained with BayRn against two baselines. The remainder of this paper is organized as follows: first, we introduce the necessary fundamentals (Section II) for BayRn (Section III). Next, we evaluate the devised method experimentally (Section IV). Subsequently, we put BayRn into context with the related work (Section V). Finally, we conclude and mention possible future research directions (Section VI).
Optimizing control policies for Markov Decision Processes (MDPs) with unknown dynamics is generally a hard problem (Section II-A). It is specifically hard due to the simulation optimization bias [2], which occurs when transferring the policies learned in one domain to another. Adapting the source domain based on real-world data requires a method suited for expensive objective function evaluations. Bayesian Optimization (BO) is a prominent choice for this kind of problem (Section II-B).
A. Markov Decision Process
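Consider a time-discrete dynamical system
$$s_{t+1} \sim \mathcal{P}_{\xi}(s_{t+1} \mid s_t, a_t, \xi), \quad s_0 \sim \mu_{0,\xi}(s_0 \mid \xi), \quad a_t \sim \pi(a_t \mid s_t; \theta), \quad \xi \sim \nu(\xi; \phi),$$
with the continuous state $s_t \in \mathcal{S}_{\xi} \subseteq \mathbb{R}^{n_s}$ and continuous action $a_t \in \mathcal{A}_{\xi} \subseteq \mathbb{R}^{n_a}$ at time step $t$. The environment, also called domain, is instantiated through its parameters $\xi \in \mathbb{R}^{n_\xi}$ (e.g., masses, friction coefficients, or time delays), which are assumed to be random variables distributed according to the probability distribution $\nu \colon \mathbb{R}^{n_\xi} \to \mathbb{R}^{+}$ parametrized by $\phi$. These parameters determine the transition probability density function $\mathcal{P}_{\xi} \colon \mathcal{S}_{\xi} \times \mathcal{A}_{\xi} \times \mathcal{S}_{\xi} \to \mathbb{R}^{+}$ that describes the system’s stochastic dynamics. The initial state $s_0$ is drawn from the start state distribution $\mu_{0,\xi} \colon \mathcal{S}_{\xi} \to \mathbb{R}^{+}$. Together with the reward function $r \colon \mathcal{S}_{\xi} \times \mathcal{A}_{\xi} \to \mathbb{R}$ and the temporal discount factor $\gamma \in [0, 1]$, the system forms an MDP described by the set $\mathcal{M}_{\xi} = \{\mathcal{S}_{\xi}, \mathcal{A}_{\xi}, \mathcal{P}_{\xi}, \mu_{0,\xi}, r, \gamma\}$.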
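The goal of a Reinforcement Learning (RL) agent is to maximize the expected (discounted) return, a numeric scoring function which measures the policy’s performance. The expected discounted return of a stochastic domain-independent policy $\pi(a_t \mid s_t; \theta)$, characterized by its parameters $\theta \in \Theta \subseteq \mathbb{R}^{n_\theta}$, is defined as
$$J(\theta, \xi, s_0) = \mathbb{E}_{\tau \sim p(\tau)}\left[ \sum_{t=0}^{T-1} \gamma^t r(s_t, a_t) \,\middle|\, \theta, \xi, s_0 \right].$$
While learning from experience, the agent adapts its policy parameters. The resulting state-action-reward tuples are collected in trajectories, a.k.a. rollouts, $\tau = \{s_t, a_t, r_t\}_{t=0}^{T-1}$, with $r_t = r(s_t, a_t)$. To keep the notation concise, we omit the dependency on $s_0$.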
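As a small illustration of the return defined above, the following Python sketch computes the discounted return of a single rollout and averages it over several rollouts as a Monte Carlo estimate of the expectation. The function names and the toy reward sequences are assumptions made for this example and are not taken from the experiments.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over one rollout (t = 0, ..., T-1)."""
    ret = 0.0
    # Accumulate backwards so that every step applies the discount once.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def estimate_expected_return(rollouts: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate: average the discounted returns of several rollouts."""
    return sum(discounted_return(r, gamma) for r in rollouts) / len(rollouts)


if __name__ == "__main__":
    toy_rollouts = [[1.0, 1.0, 1.0], [0.0, 0.5, 1.0]]  # hypothetical rewards
    print(estimate_expected_return(toy_rollouts, gamma=0.9))
```

With more rollouts, the average approaches the expectation over trajectories for a fixed policy and domain.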
B. Bayesian Optimization with Gaussian Processes
Bayesian Optimization (BO) is a sequential derivative-free global optimization strategy, which tries to optimize an unknown function $f \colon \mathcal{X} \to \mathbb{R}$ on a compact set $\mathcal{X}$. In order to do so, BO constructs a probabilistic model, typically a Gaussian Process (GP), for $f$. GPs are distributions over functions $f \sim \mathcal{GP}(m, k)$ defined by a prior mean $m \colon \mathcal{X} \to \mathbb{R}$ and a positive definite covariance function $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. This probabilistic model is used to make decisions about where to evaluate the unknown function next. A distinctive feature of BO is to use the complete history of noisy function evaluations $\mathcal{D} = \{x_i, y_i\}_{i=0}^{n}$ with $x_i \in \mathcal{X}$ and $y_i \sim \mathcal{N}(f(x_i), \varepsilon)$, where $\varepsilon$ is the variance of the observation noise. The next evaluation candidate is then chosen by maximizing a so-called acquisition function $a \colon \mathcal{X} \to \mathbb{R}$, which typically balances exploration and exploitation. Prominent acquisition functions are Expected Improvement and Upper Confidence Bound.
Through the use of priors over functions, BO has become a popular choice for sample-efficient optimization of black-box functions that are expensive to evaluate. Its sample efficiency fits well with the algorithm introduced in this paper, where a GP models the relation between the domain distribution’s parameters and the resulting policy’s return estimated from real-world rollouts, i.e., $x \equiv \phi$ and $y \equiv \hat{J}^{\text{real}}(\theta^\star)$. For further information on BO and GPs, we refer the reader to [6] as well as [7].
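To make the interplay of GP posterior and acquisition function more concrete, the following NumPy sketch fits a zero-mean GP with a Matérn 5/2 kernel to a few noisy evaluations and selects the next candidate by maximizing Expected Improvement over a grid. It is a minimal illustration under simplifying assumptions (one-dimensional inputs, fixed length scale and noise variance, grid search instead of a continuous optimizer); all function names and numeric values are hypothetical.

```python
import math

import numpy as np


def matern52(xa: np.ndarray, xb: np.ndarray, lengthscale: float = 0.2) -> np.ndarray:
    """Matern 5/2 kernel matrix between two sets of 1-dim inputs."""
    d = np.abs(xa[:, None] - xb[None, :]) / lengthscale
    return (1.0 + math.sqrt(5.0) * d + 5.0 / 3.0 * d**2) * np.exp(-math.sqrt(5.0) * d)


def gp_posterior(x_train, y_train, x_query, noise_var=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = matern52(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_qt = matern52(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    # The prior variance k(x, x) equals 1 for this kernel.
    var = 1.0 - np.sum(k_qt * np.linalg.solve(k_tt, k_qt.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best_y):
    """Closed-form Expected Improvement for a maximization problem."""
    z = (mean - best_y) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mean - best_y) * cdf + std * pdf


def propose_next(x_train, y_train, candidates):
    """Return the candidate with the highest Expected Improvement."""
    mean, std = gp_posterior(x_train, y_train, candidates)
    ei = expected_improvement(mean, std, best_y=np.max(y_train))
    return float(candidates[int(np.argmax(ei))])


if __name__ == "__main__":
    x_obs = np.array([0.1, 0.5, 0.9])  # hypothetical evaluated inputs
    y_obs = np.array([0.2, 1.0, 0.3])  # hypothetical noisy observations
    print(propose_next(x_obs, y_obs, np.linspace(0.0, 1.0, 101)))
```

The first term of the Expected Improvement rewards candidates with a high predicted mean (exploitation), while the second term rewards candidates with a high predictive uncertainty (exploration).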
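The problem of source domain adaptation based on returns from the target domain can be expressed in a bilevel formulation
$$\phi^\star = \arg\max_{\phi \in \Phi} J^{\text{real}}(\theta^\star(\phi)) \quad \text{with} \tag{1}$$
$$\theta^\star(\phi) = \arg\max_{\theta \in \Theta} \mathbb{E}_{\xi \sim \nu(\xi; \phi)}\left[ J(\theta, \xi) \right], \tag{2}$$
where we refer to (1) and (2) as the upper and lower level optimization problem, respectively. Thus, the two equations state the goal of finding the set of domain distribution parameters $\phi^\star$ that maximizes the return on the real-world target system $J^{\text{real}}(\theta^\star(\phi))$, when used to specify the distribution $\nu(\xi; \phi)$ during training in the source domain. In the following, we abbreviate $J^{\text{real}}(\theta^\star(\phi))$ with $J^{\text{real}}(\theta^\star)$.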
At the core of BayRn, first a policy optimizer, e.g., an RL algorithm, is employed to solve the lower level problem (2) by finding a (locally) optimal policy $\pi(\theta^\star)$ for the current distribution of stochastic environments. This policy is evaluated on the real system for $n_\tau$ rollouts, providing an estimate of the return $\hat{J}^{\text{real}}(\theta^\star)$. Next, the upper level problem (1) is solved using BO, yielding a new domain parameter distribution which is used to randomize the simulator. In this process, the relation between the domain distribution’s parameters $\phi$ and the resulting policy’s return on the real system $\hat{J}^{\text{real}}(\theta^\star)$ is modeled by a GP. The GP’s mean and covariance are updated using all recorded inputs $\phi$ and the corresponding observations $\hat{J}^{\text{real}}(\theta^\star)$. Finally, BayRn terminates when the estimated performance on the target system exceeds $J^{\text{succ}}$, which is the task-specific success threshold. Since the GP requires at least a few (about 5 to 10) samples to provide a meaningful posterior, BayRn has an initialization phase before the loop. In this phase, $n_{\text{init}}$ source domains are randomly sampled from $\Phi$, and subsequently a policy is trained for each of these domains. After evaluating the $n_{\text{init}}$ initial policies, the GP is fed with the inputs $\phi_{1:n_{\text{init}}}$ and the corresponding observations $\hat{J}^{\text{real}}_{1:n_{\text{init}}}$.
The complete BayRn procedure is summarized in Algorithm 1. In principle, there are no restrictions on the choice of algorithms for solving the two stages (1) and (2).
Connection to System Identification: Unlike related methods (Section V), BayRn does not include a term in the objective function that drives the system parameters to match the observed dynamics. Instead, the BO component in BayRn is free to adapt the domain distribution parameters (e.g., mean or standard deviation of a body’s mass) while learning in simulation such that the resulting policies perform well in the target domain. This can be seen as an indirect system identification, since with increasing iteration count the BO process will converge to sampling from a region where the real-world return is high. The sequence of sampled domain distribution parameter sets depends strongly on the acquisition function and the complexity of the given problem. We argue that not including system identification in the upper level objective (1) is sensible for the presented sim-to-real algorithm, since it learns from a randomized physics simulator, which attenuates the benefit of a well-fitted model.
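The overall loop (initialization phase, upper level BO step, lower level policy training, evaluation on the target, termination check) can be sketched on a toy problem as follows. This is not the authors' implementation: the policy training and the real-world evaluation are replaced by hypothetical stubs (`train_policy`, `evaluate_real`), the domain distribution is reduced to a single scalar parameter, and an Upper Confidence Bound acquisition with a squared-exponential kernel is swapped in for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_PHI = 0.7  # hidden "real-world" value, unknown to the algorithm (toy assumption)


def train_policy(phi: float) -> float:
    """Lower level (2), stubbed: the 'policy' is a scalar tuned to the source domain."""
    return phi


def evaluate_real(policy: float, n_rollouts: int = 5) -> float:
    """Estimate the target domain return from a few noisy rollouts (toy stand-in)."""
    returns = 1.0 - (policy - TRUE_PHI) ** 2 + 0.01 * rng.standard_normal(n_rollouts)
    return float(np.mean(returns))


def kernel(xa, xb, lengthscale=0.2):
    """Squared-exponential kernel for 1-dim inputs."""
    return np.exp(-0.5 * ((xa[:, None] - xb[None, :]) / lengthscale) ** 2)


def ucb(x_train, y_train, candidates, beta=2.0, noise_var=1e-3):
    """Upper Confidence Bound from a zero-mean GP fitted to standardized returns."""
    y = (y_train - y_train.mean()) / (y_train.std() + 1e-9)
    k_tt = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_ct = kernel(candidates, x_train)
    mean = k_ct @ np.linalg.solve(k_tt, y)
    var = 1.0 - np.sum(k_ct * np.linalg.solve(k_tt, k_ct.T).T, axis=1)
    return mean + beta * np.sqrt(np.clip(var, 0.0, None))


def bayrn(n_init=5, max_iter=20, success_threshold=0.995):
    candidates = np.linspace(0.0, 1.0, 201)
    # Initialization phase: random source domains, one policy per domain.
    phis = [float(p) for p in rng.uniform(0.0, 1.0, n_init)]
    rets = [evaluate_real(train_policy(p)) for p in phis]
    for _ in range(max_iter):
        # Terminate once the estimated target return exceeds the threshold.
        if max(rets) >= success_threshold:
            break
        # Upper level (1): choose the next domain distribution parameter via BO.
        scores = ucb(np.array(phis), np.array(rets), candidates)
        phi_next = float(candidates[int(np.argmax(scores))])
        # Lower level (2): train in the randomized simulator, then test on the target.
        phis.append(phi_next)
        rets.append(evaluate_real(train_policy(phi_next)))
    best = int(np.argmax(rets))
    return phis[best], rets[best]


if __name__ == "__main__":
    print(bayrn())
```

Note that the GP only ever sees pairs of domain distribution parameters and target domain returns, which is why no trajectory matching term is needed in this loop.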
Table I: Range of domain distribution parameter values used during the sim-to-real experiments. All domain parameters were randomized such that they stayed physically plausible.
mean rotary pole mass
mean pendulum pole mass
mean rotary pole length
mean pendulum pole length
std rotary pole mass
std pendulum pole mass
std rotary pole length
std pendulum pole length
We study Bayesian Domain Randomization (BayRn) on an under-actuated rotary inverted pendulum, also known as the Furuta pendulum (Figure 1), where the task is to swing the pendulum pole into an upright position. First, we set up a simplified sim-to-sim experiment to check if the proposed algorithm’s belief about the domain distribution parameters converges to a specified set of ground truth values. Next, we evaluate BayRn as well as two baseline methods in a sim-to-real experiment. A detailed system description can be found in Appendix A.
A. Experiments Description
Before applying BayRn to a physical system, we conduct a sim-to-sim experiment to examine the domain distribution parameter sampling process of the BO component. In order to provide a (qualitative) visualization, we chose to only randomize the means of the poles’ masses. The hyper-parameters used for executing BayRn are identical to the ones used in the sim-to-real experiment described below.
In our sim-to-real experiment, we compare BayRn with Uniform Domain Randomization (UDR) and Proximal Policy Optimization (PPO) [8]. UDR can be seen as the straightforward way of randomizing a simulator. Each domain parameter is assigned an independent probability distribution, and at the beginning of every rollout a new set of parameters is sampled. BayRn and UDR randomize the same domain parameters with identical nominal values, given by the platform’s data sheet [3]. We chose normal distributions to vary the masses and lengths of both poles (Table I). We chose these domain parameters because the system’s behavior is most sensitive to them.
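The per-rollout sampling scheme described for UDR can be sketched as follows. The nominal values, the relative standard deviation, and the clipping rule are placeholders chosen for this example; the actual values come from the platform’s data sheet and Table I.

```python
import numpy as np

# Hypothetical nominal values (masses in kg, lengths in m), for illustration only.
NOMINAL = {
    "rotary_pole_mass": 0.1,
    "pendulum_pole_mass": 0.025,
    "rotary_pole_length": 0.085,
    "pendulum_pole_length": 0.13,
}
REL_STD = 0.1  # assumed: the standard deviation is 10 % of the nominal value


def sample_domain_params(rng: np.random.Generator) -> dict:
    """Draw one set of domain parameters from independent normal distributions."""
    params = {}
    for name, nominal in NOMINAL.items():
        value = float(rng.normal(loc=nominal, scale=REL_STD * nominal))
        # Keep the sample physically plausible: masses and lengths stay positive.
        params[name] = max(value, 1e-3 * nominal)
    return params


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    for rollout in range(3):
        # A new parameter set is drawn at the beginning of every rollout and
        # would then be passed to the simulator before resetting it.
        print(sample_domain_params(rng))
```

In contrast to this static scheme, BayRn adapts the means and standard deviations of these distributions between policy trainings.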
For each of the three algorithms, we selected the best policy and executed 20 evaluation rollouts on the Quanser Qube (Figure 1). Every rollout ran for 6 s at 100 Hz, collecting 600 time steps with a reward in the range ]0, 1] at every time step. The procedure includes an automatic calibration as well as a controller which drives the Qube to its initial position with the rotary pole centered and the pendulum hanging down. Due to the under-actuated nature of the dynamics, the pendulum has to be swung back and forth a couple of times to put energy into the system before it can be swung up.
Figure 2: Target domain returns (a) and the associated standard deviation (b) modeled by the GP learned with BayRn in a sim-to-sim setting (brighter is higher). The ground truth domain parameters as well as the maximum a posteriori domain parameters found by BayRn are displayed as a red and an orange circle, respectively. The crosses mark the sequence of domain parameter configurations (darker is later).
Regarding BayRn, we used the BO implementation from BoTorch [9]. Notably, we chose the Expected Improvement acquisition function and a zero-mean GP prior with a Matérn 5/2 kernel. For training the GP, all inputs were normalized and the outputs standardized. Additional details on the experiments, such as the chosen hyper-parameters for learning the policies, can be found in Appendix B.
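The input normalization and output standardization mentioned for the GP training can be illustrated with a few lines of NumPy. The bounds and return values below are made up for this example, and the helper names are hypothetical; the sketch only shows the two transformations, not the GP fit itself.

```python
import numpy as np


def normalize_inputs(x: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> np.ndarray:
    """Map every input dimension from [lower, upper] to [0, 1]."""
    return (x - lower) / (upper - lower)


def standardize_outputs(y: np.ndarray) -> np.ndarray:
    """Shift and scale the observations to zero mean and unit variance."""
    return (y - y.mean()) / y.std()


if __name__ == "__main__":
    # Hypothetical domain distribution parameters (rows) within assumed bounds.
    phi = np.array([[0.05, 0.10], [0.10, 0.20], [0.15, 0.30]])
    lower, upper = np.array([0.05, 0.10]), np.array([0.15, 0.30])
    returns = np.array([100.0, 250.0, 400.0])  # hypothetical target domain returns
    print(normalize_inputs(phi, lower, upper))
    print(standardize_outputs(returns))
```

Both transformations keep the GP’s fixed prior assumptions (zero mean, unit-scale inputs and outputs) reasonable even when the domain parameters have very different physical units.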
B. Sim-to-sim Results
As stated in Section III, BayRn was designed without an (explicit) system identification objective. However, we can see from Figure 2a that the GP’s maximum a posteriori domain parameters closely match the ground truth parameters. Moreover, Figure 2b displays how the uncertainty about the target domain return is reduced in the vicinity of the sampled parameter configurations.
Figure 3: Violin plot of returns on the real-world platform for different algorithms. Each algorithm has been evaluated 20 times. The medians are displayed by white circles, and the horizontal lines represent the individual samples. The dashed line at 400 marks an approximate threshold where the task is considered solved, i.e., the pole is stabilized on top in the center.
There are two decisive factors for the domain distribution parameter sampling process: the acquisition function (Algorithm 1 Line 12), and the quality of the found policy (Algorithm 1 Line 14). Concerning the latter, a failed training of the lower level problem (2) is indistinguishable from a successful training of a policy which fails to transfer to the target domain. An easy solution to this problem would be to retrain the policy if the expected return in simulation does not exceed a certain threshold.
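The retraining remedy suggested for failed lower level trainings could look like the following sketch. The callables `train` and `evaluate_in_sim`, the threshold, and the maximum number of attempts are hypothetical placeholders; this check is described in the text only as a possible extension.

```python
from typing import Callable, Tuple, TypeVar

Policy = TypeVar("Policy")


def train_until_solved(
    train: Callable[[int], Policy],
    evaluate_in_sim: Callable[[Policy], float],
    sim_threshold: float,
    max_attempts: int = 3,
) -> Tuple[Policy, float, bool]:
    """Retrain with a new seed while the simulated return stays below the threshold."""
    best_policy, best_return = None, float("-inf")
    for seed in range(max_attempts):
        policy = train(seed)
        sim_return = evaluate_in_sim(policy)
        if sim_return > best_return:
            best_policy, best_return = policy, sim_return
        if sim_return >= sim_threshold:
            return policy, sim_return, True
    # No attempt reached the threshold: report the best one and flag the failure.
    return best_policy, best_return, False


if __name__ == "__main__":
    sim_returns = [100.0, 450.0, 300.0]  # hypothetical outcomes of three trainings
    print(train_until_solved(lambda seed: seed, lambda p: sim_returns[p], 400.0))
```

Only policies that pass this check in simulation would then be evaluated on the target domain, so that a low real-world return can be attributed to the transfer rather than to a failed training.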
C. Sim-to-real Results
Figure 3 visualizes the results of the sim-to-real experiment described in Section IV-A. The discrepancy between the performance of PPO (without domain randomization) and the other algorithms reveals that domain randomization is essential for sim-to-real transferability. Note that all reported policies solved the nominal simulation environment excellently. Comparing BayRn and UDR, we see that both median performances are above the threshold. However, UDR has a significantly higher variance. During the experiments, we noticed that UDR sometimes fails unexpectedly. We suspect a high dependency on the initial state.
Comparing the nominal values and the means among the domain distribution parameters of BayRn’s final iteration, we see that the domain parameters’ means changed by approximately 5 % each. Thus, the maximum a posteriori domain parameters are well within the boundaries of the BO search space (Table I). Even though the individual changes might seem small, in combination they result in significantly different system dynamics. We see this as the reason why the PPO baseline failed to transfer.
A video demonstrating the sim-to-real transfer of the policy learned with BayRn can be found at www.ias.informatik.tu-darmstadt.de/Team/FabioMuratore.
We divide the related research on robot reinforcement learning from randomized simulations into approaches which use static (Section V-A) or adaptive (Section V-B) distributions for sampling the physics parameters. Bayesian Domain Randomization (BayRn) as introduced in Section III belongs to the second category.
A. Domain Randomization with Static Distributions
Learning from a randomized simulator with fixed domain parameter distributions has bridged the reality gap in several cases [4, 10, 2]. Most prominently, the robotic in-hand manipulation reported in [4] showed that domain randomization in combination with careful model engineering and the usage of recurrent neural networks enables direct sim-to-real transfer on an unprecedented difficulty level. Similarly, Lowrey et al. [10] employed Natural Policy Gradient to learn a continuous controller for a positioning task, after carefully identifying the system’s parameters. Their results show that the policy learned from the identified model was able to perform the sim-to-real transfer, but the policies learned from an ensemble of models were more robust to modeling errors. Mordatch et al. [11] used finite model ensembles to run trajectory optimization on a small-scale humanoid robot. In contrast, Peng et al. [12] leveraged model-free RL and recurrent neural network policies trained with experience replay to push an object with a robotic arm. The use of a risk-averse objective function has been explored on MuJoCo tasks in [13]. The authors also provide a Bayesian point of view.
Aside from the previously mentioned methods, Muratore et al. [2] propose a method to estimate the transferability of a policy learned from randomized physics simulations. Moreover, the authors propose a meta-algorithm which provides a probabilistic guarantee on the performance loss when transferring the policy between two domains from the same distribution.
Static domain randomization has also been successfully applied to computer vision problems. A few examples are: (i) object detection [14], (ii) synthetic object generation for grasp planning [15], and (iii) autonomous drone flight [16].
B. Domain Randomization with Adaptive Distributions
Ruiz et al. [17] proposed the meta-algorithm “learning to simulate” which is based on a bilevel optimization problem highly similar to the one of BayRn (1, 2). However, there are two major differences. First, BayRn uses Bayesian optimization on the acquired real-world data to adapt the domain parameter distribution, whereas “learning to simulate” updates the domain parameter distribution using REINFORCE. Second, the approach in [17] has been evaluated in simulation on synthetic data, except for a semantic segmentation task. Thus, there was no dynamics-dependent interaction of the learned policy with the real world.
With SimOpt, Chebotar et al. [5] presented a trajectory-based framework for closing the reality gap. It iteratively adapts the domain parameter distribution’s parameters by minimizing the discrepancy between observations from the real-world system and the simulation. The authors validated their approach on two state-of-the-art sim-to-real robotic manipulation tasks. While BayRn formulates the upper level problem (1) solely based on the real-world returns, SimOpt minimizes a linear combination of the L1 and L2 norm between simulated and real trajectories. Moreover, SimOpt employs Relative Entropy Policy Search to update the simulator’s parameters, thus turning it into an RL problem.
Klink et al. [18] derived a relative entropy RL algorithm that enables the agent to adapt the domain parameter distribution, typically from easy to hard instances. Hence, the overall training procedure can be interpreted as a curriculum learning problem. The authors were able to solve a robotic ball-in-a-cup task by directly transferring the learned policy from simulation to reality. One key difference to BayRn is that the target distribution has to be known beforehand.
The approach called Active Domain Randomization [19] also formulates the adaptation of the domain parameter distribution as an RL problem where different simulation instances are sampled and compared against a reference environment based on the resulting trajectories. This comparison is done by a discriminator which yields rewards proportional to the difficulty of distinguishing the simulated and real environments, hence providing an incentive to generate distinct domains. Using this reward signal, the domain parameters of the simulation instances are updated via Stein Variational Policy Gradient. Mehta et al. evaluated their method in a sim-to-real experiment where a robotic arm had to reach a point in task space.
Paul et al. [20] introduce Fingerprint Policy Optimization which, like BayRn, employs BO to adapt the distribution of domain parameters such that using these for the subsequent training maximizes the policy’s return. At first sight the approaches might look similar, but there is a major difference in how the upper level problem (1) is solved. Fingerprint Policy Optimization models the relation between the current domain parameters, the current policy, and the return of the updated policy with a GP. This design decision requires feeding the policy parameters into the GP, which is prohibitively expensive if done straightforwardly. Therefore, the authors of [20] create abstractions of the policy, so-called fingerprints, for example the Gaussian approximation of the stationary state distribution. These handcrafted features approximate the policy to reduce the input dimension. The authors tested Fingerprint Policy Optimization on sim-to-sim MuJoCo tasks. In contrast, BayRn has been designed without the need to approximate the policy. Moreover, we validated the presented method in a sim-to-real setting.
Unlike the previously mentioned approaches, Yu et al. [21] suggest a policy which conditions on the domain parameters. Since these parameters cannot be assumed to be known, they have to be estimated using online system identification. The implementation is done using a neural network for computing the actions given the states and domain parameters in combination with another neural network to regress the domain parameters from the observed rollouts. Applying this approach to simulated continuous control tasks, the authors showed that adding the online system identification module can enable an adaptation to sudden changes in the environment.
In Ramos et al. [22], likelihood-free inference in combination with mixture density random Fourier networks is employed to perform a fully Bayesian treatment of the simulator’s parameters. Analyzing the obtained posterior over domain parameters, Ramos et al. showed that BayesSim is, in a sim-to-sim setting, able to simultaneously infer different parameter configurations which can explain the observed trajectories. The key difference between BayRn and BayesSim is the objective for updating the domain parameters. While BayesSim maximizes the model’s posterior likelihood, BayRn updates the domain parameters such that the policy’s return on the physical system is maximized.
We have introduced Bayesian Domain Randomization (BayRn), a policy search algorithm tailored to crossing the reality gap. At its core, BayRn learns from a randomized simulator while using Bayesian optimization for adapting the source domain distribution during learning. In contrast to previous work, the presented algorithm constructs a probabilistic model of the relation between domain distribution parameters and the policy’s return after training with these parameters in simulation. Hence, BayRn requires only little interaction with the real-world system. We experimentally validated that the presented approach is able to robustly solve a sim-to-real swing-up task on an under-actuated nonlinear system. Comparing the results against the baselines showed that adapting the domain parameter distribution led to policies with higher median performance and less variance. In future work, we want to investigate how BayRn scales to problems with a higher number of adaptable domain parameters.
Fabio Muratore gratefully acknowledges the financial support from Honda Research Institute Europe.
Jan Peters received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 640554.
[1] B. F. Hobbs and A. Hepenstal, “Is optimization optimistically biased?” Water Resources Research, vol. 25, no. 2, pp. 152–160, 1989.
[2] F. Muratore, M. Gienger, and J. Peters, “Assessing transferability from simulation to reality for reinforcement learning,” PAMI, vol. PP, pp. 1–1, 11 2019.
[3] Quanser, “Quanser platforms,” 2019, www.quanser.com/ (last accessed October 2 2019). [Online]. Available: www.quanser.com/products/
[4] OpenAI et al., “Learning dexterous in-hand manipulation,” ArXiv e-prints, vol. 1808.00177, 2018.
[5] Y. Chebotar et al., “Closing the sim-to-real loop: Adapting simulation randomization with real world experience,” in ICRA, Montreal, QC, Canada, May 20-24, 2019, pp. 8973–8979.
[6] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” in NIPS, Lake Tahoe, Nevada, United States, December 3-6, 2012, pp. 2960–2968.
[7] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning, ser. Adaptive computation and machine learning. MIT Press, 2006.
[8] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” ArXiv e-prints, 2017.
[9] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy, “BoTorch: Programmable bayesian optimization in pytorch,” ArXiv e-prints, 2019.
[10] K. Lowrey, S. Kolev, J. Dao, A. Rajeswaran, and E. Todorov, “Reinforcement learning for non-prehensile manipulation: Transfer from simulation to physical system,” in SIMPAR 2018, Brisbane, Australia, May 16-19, 2018, pp. 35–42.
[11] I. Mordatch, K. Lowrey, and E. Todorov, “Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids,” in IROS, Hamburg, Germany, September 28 - October 2, 2015, pp. 5307–5314.
[12] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in ICRA, Brisbane, Australia, May 21-25, 2018, pp. 1–8.
[13] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine, “Epopt: Learning robust neural network policies using model ensembles,” in ICLR, Toulon, France, April 24-26, 2017.
[14] J. Tobin et al., “Domain randomization for transferring deep neural networks from simulation to the real world,” in IROS, Vancouver, BC, Canada, September 24-28, 2017, pp. 23–30.
[15] ——, “Domain randomization and generative models for robotic grasping,” in IROS, Madrid, Spain, October 1-5, 2018.
[16] F. Sadeghi and S. Levine, “CAD2RL: real single-image flight without a single real image,” in RSS, Cambridge, Massachusetts, USA, July 12-16, 2017.
[17] N. Ruiz, S. Schulter, and M. Chandraker, “Learning to simulate,” ArXiv e-prints, vol. 1810.02513, 2018.
[18] P. Klink, H. Abdulsamad, B. Belousov, and J. Peters, “Self-paced contextual reinforcement learning,” ArXiv e-prints, vol. 1910.02826, 2019.
[19] B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull, “Active domain randomization,” ArXiv e-prints, vol. 1904.04762, 2019.
[20] S. Paul, M. A. Osborne, and S. Whiteson, “Fingerprint policy optimisation for robust reinforcement learning,” ArXiv e-prints, vol. 1805.10662, 2018.
[21] W. Yu, J. Tan, C. K. Liu, and G. Turk, “Preparing for the unknown: Learning a universal policy with online system identification,” in RSS, Cambridge, Massachusetts, USA, July 12-16, 2017.
[22] F. Ramos, R. Possas, and D. Fox, “Bayessim: Adaptive domain randomization via probabilistic inference for robotics simulators,” in Robotics: Science and Systems XV, University of Freiburg, Freiburg im Breisgau, Germany, June 22-26, 2019.
The Furuta pendulum is modeled as an under-actuated nonlinear second-order dynamical system given by the solution of
with the rotary angle and the pendulum angle
, which are defined to be zero when the rotary pole is centered and the pendulum pole is hanging down vertically. While the system’s state is defined as
, the agent receives observations
. The horizontal pole is actuated by commanding a motor voltage (action) a which regulates the servo motor’s torque
. The domain parameters as well as the parameters derived from them are sampled from the distributions specified by the parameters in Table I. We formulate the reward function based on an exponentiated quadratic cost
Thus, the reward is in range ]0, 1] for every time step.
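A reward of this form can be sketched in a few lines of Python. The weights, the goal state, the ordering of the state entries (rotary angle, pendulum angle, and their velocities), and the wrapping of the angle errors are assumptions made for this illustration; they are not the values used in the experiments.

```python
import numpy as np

# Hypothetical weights for illustration only.
Q = np.diag([1.0, 1.0, 0.01, 0.01])  # weights of the state error
R = 0.01  # weight of the action (motor voltage)
GOAL = np.array([0.0, np.pi, 0.0, 0.0])  # rotary pole centered, pendulum upright


def reward(state: np.ndarray, action: float) -> float:
    """Exponentiated negative quadratic cost, which lies in ]0, 1]."""
    err = state - GOAL
    # Wrap the angle errors to [-pi, pi) so that full turns do not add cost.
    err[:2] = (err[:2] + np.pi) % (2.0 * np.pi) - np.pi
    cost = err @ Q @ err + R * action**2
    return float(np.exp(-cost))


if __name__ == "__main__":
    print(reward(GOAL.copy(), 0.0))  # goal state without actuation
    print(reward(np.zeros(4), 1.0))  # pendulum hanging down, small voltage
```

Since the cost is non-negative, the reward equals 1 only at the goal state with zero action and decays towards 0 otherwise. With 600 time steps per rollout, the undiscounted return is therefore bounded by 600, which puts the success threshold of 400 in Figure 3 into perspective.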
The hyper-parameters for training the policies during the experiments in Section IV are given in Table II. The reported values have been tuned but not fully optimized.
Table II: Hyper-parameter values for training the policies in Section IV. The first part of the table lists the hyper-parameters common to the algorithms.