My stuff
MAMPS: Safe Multi-Agent Reinforcement Learning via Model Predictive Shielding

Reinforcement learning is a promising approach to learning control policies for performing complex multi-agent robotics tasks. However, a policy learned in simulation often fails to guarantee even simple safety properties such as obstacle avoidance. To ensure safety, we propose multi-agent model predictive shielding (MAMPS), an algorithm that provably guarantees safety for an arbitrary learned policy. In particular, it operates by using the learned policy as often as possible, but instead uses a backup policy in cases where it cannot guarantee the safety of the learned policy. Using a multi-agent simulation environment, we show how MAMPS can achieve good performance while ensuring safety.

Reinforcement learning [1], [2] has been shown to be a promising technique for learning control policies for complex robotics tasks ranging from autonomous vehicles [3] to home service robots [4], or to “compress” an expensive model predictive controller (MPC) into a much faster neural network policy [5]. A major challenge in using reinforcement learning is safety [6], [7], [8], [9], [10], [11], [12], [13]—control policies learned using reinforcement learning typically do not provide any safety guarantees, even when the safety property is explicitly considered by the learning algorithm.

As a consequence, there has been much interested in algorithms that provide safety guarantees for a learned control policy  ˆπ. We are interested in the setting where  ˆπis learned in simulation, and we want to ensure safety after it is deployed on a robot (assuming that our model of the dynamics is correct). One approach is to formally prove that ˆπis safe [14], [15], [16]. An alternative approach, called shielding, is to synthesize a backup controller  πbackupthat is guaranteed to be safe on some subset of states, which we call recoverable states [6], [7], [12], [17], [18]. Then, the shield policy  πshielduses  πbackupwhenever  ˆπwould bring the robot to an irrecoverable state on the next step; otherwise it uses  ˆπ. A key challenge with existing approaches is that they rely on verifying either  ˆπor  πbackup, which typically does not scale to high-dimensional systems. For systems with obstacles in the environment or other agents that must be encoded in the state, verifying safety quickly becomes intractable.

A promising alternative is model predictive shielding (MPS) [17], [18], [19], which performs shielding on-the-fly instead of ahead-of-time. The intuition is that checking

1W. Zhang and V. Kumar are with the General Robotics, Automation, Sensing & Perception (GRASP) Laboratory, University of Pennsylvania, USA

2O. Bastani is with the Department of Computer and Information Science, University of Pennsylvania, USA {zwenbo, obastani,



Fig. 1. Overview of multi-agent model predictive shielding (MAMPS). On step t, the current state of the multi-agent system is  xt. The red solid box shows the entire MAMPS control policy is  πshieldmodule. There are three basic components in MAMPS: the “current policy”  πb, the closed-loop dynamics  f(π)(x), and the subroutine IsRecoverable (shorted to IsRec). The current policy is a combination of learned policies and recovery policies for different agents. First, MAMPS initializes  πb to ˆπfor each agent. Then, it iteratively determines whether using  πbtransitions the system to a recoverable state  x′. If not, then it switches agents that were unsafe to  πrec(the blue line). Once it has found  b such that f(πb)(x)is recoverable, it can safely output action  πb(xt)(the green line). The red dashed box shows the internal simulation used by MAMPS to check recoverability; note that all dynamics applications in this box are according to a simulation run by the MAMPS algorithm, not according to the true dynamics. The true dynamics are only applied once  πshieldreturns an action  πb(xt).

whether a single state is recoverable (i.e., the next one) is much more efficient, even on-the-fly, than exhaustively checking recoverability for all states ahead-of-time. In particular, to check whether a given state is recoverable, MPS simply simulates the closed-loop dynamics (assumed to be deterministic) with  πbackupand checks whether it is safe. While we focus on deterministic dynamics, there has been recent work extending MPS to stochastic dynamics [19].

In this paper, we study the safety problem for multi-agent systems [20], [21]. In this setting, approaches to safety have been proposed based on restricting the velocities of the agents [22], [23], [24]; however, these approaches only apply to systems that are holonomic [23], [24] or approximately holonomic [22]. Instead, we propose an approach based on MPS, which applies to general robot dynamics. We can in principle treat the multi-agent system as a high-dimensional single-agent system, and then apply MPS to ensure safety. However, this approach can achieve very suboptimal results since even if a single robot is about to be irrecoverable, then the shield uses  πbackupfor every single robot in the system.

We propose an approach, called multi-agent model predictive shielding (MAMPS), summarized in Fig. 1, which avoids this problem by incrementally switching each agent from using  ˆπto using  πbackup. The iterative process is needed since switching one agent to using  πbackupcan cause other agents to subsequently become unsafe. The general workflow is shown in Fig. 1. We prove that this modification preserves the key safety guarantees provided by shielding—i.e., if the system starts in a recoverable state, then using  πshieldkeeps the system safe over an infinite horizon.

Contributions. Our key contributions are: (i) the multi-agent model predictive shielding (MAMPS) algorithm for ensuring safety of a learned policy (Section II & III), (ii) a theoretical guarantee regarding the safety of this algorithm (Section III), and (iii) an experimental evaluation demonstrating how this algorithm outperforms the standard MPS algorithm in the multi-agent setting (Section IV).

In this section, we formulate the problem of shielding a learned policy to ensure safety.

Dynamical system. We consider a deterministic, discrete time dynamical system with continuous states  x ∈ X ⊆RnX, continuous action  ui ∈ Ui ⊆ RnU(for each agent  i ∈[N] = {1, 2, ..., N}), dynamics  f : X ×U1×· · ·×UN → X, and probability distribution  d0over initial states. We assume that  x ∈ Xencodes the state of all the agents as well as goals and obstacles in the environment.

As a running example, consider a system of nonholonomic robots with acceleration control and steering control—i.e., each robot  i ∈ [N]has state  xi =(xi, yi, vi, θi)and action  ui = (ai, δi), where  (xi, yi)is the position of the robot in  R2, viis its velocity,  θiis its heading, aiis its acceleration, and  δiis its steering angle. In addition, we consider N goals with positions  gi = (gi, hi) ∈ R2for each  i ∈ [N], as well as M obstacles with positions zi = (zi, wi) ∈ R2for each  i ∈ [M]. The goal is for agent i to reach goal i. Together, the multi-agent state is x = (x1, ..., xN, g1, ..., gN, z1, ..., zN).

Control policy. Given a vector of control policies  π =(π1, ..., πN), where  πi : X → Uifor each  i ∈ [N], we use f (π)(x, u) = f(x, π1(x), ..., πN(x))to denote the closed-loop dynamics. An infinite-horizon trajectory generated using πfrom initial state  x0 ∈ Xis the sequence  ζ = (x0, x1, ...), where  xt+1 = f (π)(xt). Similarly, given a finite horizon T ∈ N, a finite-horizon trajectory generated using  πfrom initial state  x0 ∈ Xis the sequence  ζ = (x0, x1, ..., xT −1).

Safe states. We assume given sets  Xsafe ⊆ Xindicating that agent  i ∈ [N]is safe. Then, the system as a whole is safe if every agent is safe—i.e.,


The goal is to ensure that the system never transitions to an unsafe state  x ̸∈ Xsafe. Given a trajectory  ζ = (x0, x1, ...)we say  ζis safe if  xt ∈ Xsafefor all  t ≥ 0.

In our example, an agent  i ∈ [N]is safe if it has not collided with an obstacle or any other robots. In particular,


where  rrobotis the robot radius,  robstacleis the obstacle radius, m is a safety margin, and d is the Euclidean distance. As a consequence, the overall system is safe if the above constraints are satisifed for all  i ∈ [N].

Stable controller and stable states. To ensure safety for an infinite horizon, we assume that we are given a control policy  πstableand a subset  Xstable ⊆ Xsafeof states such that using  πstableguarantees safety indefinitely. In particular, we assume given sets  X istable ⊆ X isafeindicating that agent  i ∈ [N]is stable. Then, the system as a whole is stable if every agent is stable—i.e.,


Our key assumption is that for any  x ∈ Xstable, the trajectory ζgenerated using  πstablefrom initial state  x0 = xis safe.

In our running example, we have  x ∈ X istableif  vi = 0and  x ∈ Xsafe—i.e., agent i is at rest and the overall system is safe. Furthermore, the backup control policy is  πi(x) =(0, 0)—i.e., the steering angle and acceleration are both zero. As a consequence, we have  x ∈ Xstableif  vi = 0for every agent  i ∈ [N]and furthermore  x ∈ Xsafe. In other words, in a stable state, all agents are at rest, and the backup control policy keeps them at rest.

Reward. We consider a reward function  r : X × U → Rthat we seek to maximize. Then, given a finite time horizon T, initial states X and an initial state distribution  d0, our goal is to find a policy  ˆπthat maximizes


where  xt+1 = f (π)(xt, ut)and  ut = π(xt). In our example, the reward might be to minimize the distance of each robot to a goal state:


where  gi ∈ R2is a goal state. 1 The rewards can also encode soft constraints that encourage the robot to remain in  Xsafe.

Learned policy. We assume given a vector of policies ˆπ = (ˆπ1, ..., ˆπN), where each policy can be arbitrary—e.g., ˆπcan be learned using the multi-agent deep deterministic policy gradient algorithm (MADDPG) [25], [26].

Shielding problem. A policy  πis safe if for any trajectory ζ = (x0, x1, ...)starting from  x0 ∈ Xstable, ζis safe. 2 Then, our goal is to construct a vector of control policies  πshield

Algorithm 1 Compute the MAMPS policy for state x. function MAMPS(x):


that is safe. To try and maximize reward, our construction of  πshieldleverages  ˆπ—in particular, it tries to maximize the number of states  x ∈ Xfor which  πshield(x) = ˆπ(x).

We propose an extension of model-predictive shielding (MPS) [18] to the setting of multi-agent systems, which we call multi-agent model predictive shielding (MAMPS).

Background on MPS. Recall that  πstablecan guarantee safety for an infinite horizon starting from any stable state x ∈ Xstable. Thus, we can use the shielding approach using πstable. In particular, suppose we start at a state  x ∈ Xstable. To decide whether to use  ˆπor  πstable, we check if  x′ =f (ˆπ)(x) ∈ Xstable. If so, then it is safe to use  ˆπ(x), since πstableis guaranteed to be safe starting from  x′, so we can continue to guarantee safety. Otherwise, we use  πstable(x)(which is guaranteed to be safe).

Constructing a stable controller  πstablealong with stable states  Xstablethat is large can be a challenging problem. In our example, the proposed set of stable states (i.e., states where all robots are at rest) has zero volume in the state space. Thus, using shielding with  πstableas the backup policy will result in poor performance; indeed, in our example, the robots will never be able to move. The idea behind MPS is to expand the


Fig. 2. Comparison of MAMPS to the naïve approach. Image 1 is the initial state. In this state, the middle agent will collide with the obstacle if it uses the learned policy  ˆπ. The naïve approach treats the system as a single-agent system and uses MPS [18]. In this approach, all agents will switch to using the recovery policy  πrec(Image 2). In contrast, when MAMPS iteratively checks whether successively more conservative configurations can transition the system to a recoverable state. In this example, MAMPS first switches the middle agent to using the recovery policy (Image 3). However, this change causes a new problem—the middle agent will now collide with the right-most agent. Thus, in the second iteration, MAMPS switches the right-most agent to using the recovery policy (Image 4). This choice transitions the system to a recoverable state, so MAMPS returns these actions. In this example, MAMPS allows the left-most agent to use the learned policy, whereas the naïve approach switches it to the recovery policy. Thus, MAMPS can achieve a significantly higher reward.

set of states for which we can guarantee safety for an infinite horizon by using a recovery policy  πrecto try and transition the system to a state  x′ ∈ Xstable. More precisely, given a state x, suppose that there exists T such that the finite-horizon trajectory  ζ = (x0, x1, ..., xT −1)generated using  πrecfrom x0 = xis safe and reaches  Xstable(i.e.,  xT −1 ∈ Xstable). Then, we can guarantee safety starting from x by first using πrecfor  T−steps, and then using  πstableafterwards; we call such a state x recoverable, since it can be recovered to a stable state using  πrec. Thus, using this combination of  πrecand  πstableas the backup policy substantially expands the set of states where we can use  ˆπ.

We can in principle apply this approach to multi-agent systems (with centralized control), where we treat the system as a single high-dimensional system. However, this approach can work poorly—if even a single agent needs to switch to the recovery policy, then every agent must be switched to using the recovery policy. In contrast, our proposed algorithm, multi-agent MPS (MAMPS), considers different choices of policy for different robots. For example, Fig. 2 shows an instance where the naïve approach of treating the system as a single-agent system and using MPS causes all the agents to switch to the recovery policy, whereas our MAMPS algorithm avoids this failure.

Recovery policy. As with MPS, our approach uses a recovery policy  πrec = (π1rec, ..., πNrec)that tries to transition the system from any state  x ∈ Xto a stable state  x′ ∈ Xstable. This policy can be manually specified or learned using reinforcement learning methods. Like  ˆπ, our algorithm works with any choice of recovery policies. In our example, we can use the policy  πirecthat decelerates the agent as fast as possible until it reaches a stop. Once all robots come to a stop (assuming no robot hits an obstacle or another robot), then the system is in a stable state.

Backup policy. The backup policy  πbackupis a combination of the recovery policy and the stable policy. In particular, for agent  i ∈ [N], it uses  πistableif agent i is stable (i.e., x ∈ X istable), and uses  πirecotherwise:


for each  i ∈ [N].

Recoverable states. Specific to the multi-agent setting, we can decompose the notion of recoverability and define recoverability for an individual agent. In particular, a state x ∈ Xis recoverable for agent  i ∈ [N], denoted  x ∈ X irec, if there exists T such that the finite-horizon trajectory  ζ =(x0, x1, ..., xT −1)generated from  x0 = xusing  πbackupis (i) safe for agent i (i.e.,  xt ∈ X isafefor all  t ∈ {0, 1, ..., T − 1}), and (ii) reaches a stable state for agent i (i.e.,  xT −1 ∈ X istable). Then, the set of recoverable states is


We say a state  x ∈ Xrecis recoverable. It is easy to see that if x is recoverable, then there exists  T ≤ Tmaxsuch that the finite-horizon trajectory  ζ = (x0, x1, ..., xT −1)generated from  x0 = xusing  πbackupis (i) safe (i.e.,  xt ∈ Xsafefor all  t ∈ {0, 1, ..., T − 1}), and (ii) reaches a stable state (i.e.,  xT −1 ∈ Xstable). In this definition,  Tmax ∈ Nis a hyperparameter that bounds the length of the trajectory we need to check if recoverability holds, making it feasible to check recoverability in simulation.

In particular, we can check whether a state is recoverable (either for an individual agent or for overall) by simulating πbackup. This check concludes that x is recoverable if and only if x is actually recoverable, then x is guaranteed to be recoverable. Algorithm 2 performs this check, and returns a vector  ρ = (ρ1, ..., ρN) ∈ {0, 1}N, where  ρi ∈ {0, 1}indicates whether  x ∈ X irec(ρi = 1) or not (ρi = 0).

Multi-agent model predictive shielding. Our algorithm, multi-agent model predictive shielding (MAMPS), 3 chooses whether to use the learned policy  ˆπior the recovery policy πirecfor each agent  i ∈ [N]. In contrast to the MPS approach described above, which either uses  ˆπifor every agent i or uses  πirecfor every agent i, MAMPS considers different choices of learned policy or recovery policy for different agents. We represent the possible choices by a configuration b = (b1, ..., bN) ∈ B = {0, 1}N. In particular,  bi ∈ {0, 1}indicates whether to use  ˆπi(bi = 1) or  πirec(bi = 0). For any b ∈ B, we use  πbto denote the corresponding combination of policies for each agent—i.e.,


Our key insight is that we can use  πbfor any configuration b ∈ Bas long as  f (πb)(x) ∈ Xrec. Since we can check recoverability of any state x using simulation, we can simply enumerate over all configurations  b ∈ Bto find one that satisfies this condition. If there are multiple choices of b, then we want to choose the one that maximizes  ∥b∥1—i.e., the one that maximizes the number of agents  ˆπusing their learned control policy. Thus, we want to compute


where I is the indicator function (taking values in {0, 1}).

For systems with many agents, iterating over all combinations  b ∈ Bcan become very expensive, since |B| = 2Nis exponential in the number of agents. Especially for systems with limited computational resources, computing b∗can be intractable. Thus, MAMPS instead solves (1) approximately using a greedy iterative search strategy. In particular, MAMPS starts off by considering the best possible candidate configuration b = (1, 1, ..., 1)—i.e., every agent uses the learned policy. Then, it checks whether  f (πb) ∈ Xrec. If so, then we can use  πb. Otherwise, there are agents for which x is irrecoverable—i.e.,


For agents  i ∈ I, MAMPS switches to using the backup policy—i.e.,  bi = 0.

Note that switching an agent from  bi = 1to  bi = 0may cause a different agent that was previously recoverable to become irrecoverable. As an example, the illustration of the MAMPS policy in Fig. 2 shows a case where switching one agent causes another to become irrecoverable. Thus, we have to again check for additional agents that have become irrecoverable. We iteratively perform this process until we find a configuration b such that  f (πb)(x) ∈ Xrec. Thus, we know that  f (πbackup)(x) ∈ Xrec, so we can safely use  πb.

The full MAMPS algorithm is shown in Algorithm 1. For convenience, when representing b, this algorithm uses true instead of 1, and false instead of 0. One subtlety is that while  ∥b∥1is monotonically decreasing in this algorithm, but it nevertheless get “stuck” at some point b that is not guaranteed to be safe, but there are also no agents switching from  bi = 1to  bi = 0. Thus, the algorithm includes a check to see whether b converges. In this case, it sets b = (0, ..., 0), so  πb = πbackupis guaranteed to be safe. We have the following guarantee (proved in Appendix V-A): Theorem 1: The MAMPS policy  πMAMPSis safe.


Fig. 3. Training curves for the MADDPG algorithm. We show the cases of 3 agents (left) and 4 agents (right).

In our experiments, we aim to answer the following research questions:

How does MAMPS compare to using the learned policy without any shield (in terms of reward and safety)?

How does MAMPS compare to the naïve shield (i.e., treat the system as a single-agent system and use MPS)?

How does the performance of MAMPS vary with respect to the number of agents or obstacles? All experiments are performed on a server with an Intel Xeon Gold 6148 CPU an Nvidia RTX 2080 Ti GPUs.

A. Setup

We perform our experiments using the multi-agent particle environment [26]. This environment consists of a multi-agent system in which a set of N agents is trying to reach a set of N goals. In our setup, each agent is assigned a specific goal that they are trying to reach. There is also a collection of M obstacles that the agents must avoid; in addition, the agents must avoid colliding with one another.

The learned policy  ˆπis trained using multi-agent deep deterministic policy gradients (MADDPG) [25], which uses decentralized actors and a centralized critic. Our reward function is


where the reward for approaching goals is


the bonus reward for arriving at a goal is 4


the penalty for collisions is


λ, λ′, ϵ ∈ R>0are hyperparameters, and  I(C) ∈ {0, 1}is the indicator function, which indicates whether condition C holds. Parameter choices are shown in Appendix V-B.

For the recovery policy  πrec, we use the policy that applies the maximum possible acceleration to decrease the robot’s velocity (i.e., apply the brakes as hard as possible). Note that if the robot uses recovery policy while it is at rest (v = 0), then it will stay in the same state. Better choices may be possible, but we find that this simple choice is very effective.

B. Results

In Fig. 3, we show learning curves for the MADDPG algorithm. As can be seen, it successfully learns how to control each system. In Fig. 4, we compare our approach, MADDPG with MAMPS (MADDPG+MAMPS), with two baselines: (i) using just the learned policy (MADDPG), and (ii) using MADDPG with the naïve shield (MADDPG+Naïve). For completeness, we have show the naïve algorithm in Appendix V-C Our results are averaged over 500 episodes. We compare two metrics: probability of safety (left) and probability of reaching the goal (right). For safety, we measure the fraction of the agent/episodes pairs for which safety is ensured for the entire duration of the episode—i.e.,


where K = 500 is the number of episodes,  δsafeij ∈ {0, 1}indicates whether agent j is safe during the entirety of episode i (i.e.,  xit ∈ X isafe). For reaching the goal, we


Fig. 4. Probability of safety (left) and probability of reaching the goal (right) for multi-agent systems. Here, “3A3O", “3A4O", "4A3O", and “4A4O" represent “3 agents and 3 obstacles", “3 agents and 4 obstacles", “4 agents and 3 obstacles", and “4 agents and 4 obstacles", respectively.

count the fraction of agent/episode pairs for which the agent reaches the goal at any point during the episode—i.e.,


where  δgoalij ∈ {0, 1}indicates whether agent j reaches its goal at any point in episode i.

Ensuring safety. Note that MADDPG alone performs quite poorly in terms of safety—it can guarantee safety less than 50% of the time when there are 4 agents and 4 obstacles. The poor performance in terms of safety happens even though MADDPG includes a penalty for collisions. The difficulty is that there are a huge number of possible config-urations of the state space, and it is not possible to ensure that MADDPG trains the neural network to account for all of them. As expected, for MADDPG, the probability of safety decreases as there are more agents or more obstacles. Finally, both MADDPG+MAMPS and MADDPG+Naïve guarantee the safety of multi-agent for all agents in all episodes.

Reaching goals. As expected, MADDPG+MAMPS achieves its goals less frequently than MADDPG since MADDPG is allowed to have unsafe collisions without affecting this metric. More interestingly, MADDPG+MAMPS substantially outperforms MADDPG+Naïve in terms of performance, often by an order of magnitude. Furthermore, the performance of MADDPG+MAMPS is quite close to the performance of MADDPG alone in settings where there are 3 agents. The relative performance degrades substantially when there are 4 agents, likely because agent-agent collisions increase significantly, which causes agents to use the recovery policy, and therefore fail to reach their goals.

Discussion. Overall, these results demonstrate the substantial promise of the MAMPS approach. In some settings, it is able to guarantee safety without sacrificing very much performance. There is inevitably some tradeoff between safety and good performance. Nevertheless, we believe that there is much potential to improve MAMPS and reduce how much performance must be sacrificed to ensure safety.

We have proposed a novel algorithm, MAMPS, for ensuring the safety of a learned control policy for multi-agent systems. Our algorithm comes with strong theoretical guarantees on safety. Furthermore, our experimental results show how MAMPS can ensure safety without sacrificing much performance, and that MAMPS can substantially outperform a naïve approach. There is much room for future work—e.g., allowing for partially observed environments and closing the gap in performance between MAMPS and the learned policy.


A. Proof of Theorem 1

Note that to prove the theorem statement, it suffices to prove that  x ∈ Xrec, then  f (πshield)(x) ∈ Xrecas well. In particular, by induction, this claim implies that a trajectory (x0, x1, ...)generated using  πshieldfrom initial state  x0 ∈Xrecsatisfies  xt ∈ Xrecfor all  t ≥ 0. Furthermore, by definition we have  Xstable ⊆ Xrec ⊆ Xsafe. Thus, a trajectory (x0, x1, ...)generated using  πshieldstarting from  x0 ∈ Xstablesatisfies  x(t) ∈ Xsafe—i.e.,  πshieldis safe.

Next, we prove the remaining claim. Consider the action u = πb(x)returned by Algorithm 1. Since Algorithm 1 checks that  x′ = f (πb)(x) ∈ Xrec, we are guaranteed that x′is recoverable. The challenge is proving that Algorithm 1 actually returns an action.

First, we show that for the choice b = (0, ..., 0), we have f (πb)(x) ∈ Xrec. To this end, note that in this case, we have πb = πbackup. Then, note that since x is recoverable, we know that there exists  T ∈ {0, 1, ..., Tmax − 1}such that the trajectory  (x0, x1, ..., xT )generated using  πbackupfrom x0 = xsatisfies the following: (i) it is safe (i.e.,  xt ∈ Xsafe


for all  t ∈ {0, 1, ..., T}), and (ii) it reaches  Xstable(i.e.,  xT ∈Xstable). Thus, for  x′ = f (πbackup)(x), consider the trajectory (x1, ..., xT )generated using  πbackupfrom  x1 = x′. Trivially, this trajectory is also safe and reaches  Xstable. Thus,  x′is recoverable, as claimed. Finally, the check in Algorithm 1 eventually considers b = (0, ..., 0); thus, it is guaranteed to terminate. The claim follows.  ■

B. Parameters

Environment parameters. Maximum episode length: 300. Total size of the environment:  [−1, 1]2. Object dimensions:  rrobot = 0.1, robstacle = 0.1, rgoal = 0.05. Allowed accelerations:  [−1, 1]. Allowed velocities: [0, 3]. Time step: 0.025.

MADDPG parameters. Learning rate:  10−3. Discount factor  γ: 0.95. Size of minibatch sample: 1024. Actor and critic networks: 8 fully connected layers with 128 hidden units each for three-agent settings and 10 fully connected layers with 128 hidden units each for four-agent settings.

MAMPS parameters. Maximum trajectory length  Tmaxfor recovery check: 120.

C. Naïve Approach

Algorithm 3 shows the naïve approach of treating the system as a single-agent system, and then using MPS.

[1] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 3389–3396.

[2] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al., “Learning dexterous in-hand manipulation,” arXiv preprint arXiv:1808.00177, 2018.

[3] H. Tanner, A. Jadbabaie, and G. J. Pappas, “Coordination of multiple autonomous vehicles,” in IEEE Mediterranean Conference on Control and Automation, 2003, pp. 869–876.

[4] S. S. Srinivasa, D. Ferguson, C. J. Helfrich, D. Berenson, A. Collet, R. Diankov, G. Gallagher, G. Hollinger, J. Kuffner, and M. V. Weghe, “Herb: a home exploring robotic butler,” Autonomous Robots, vol. 28, no. 1, p. 5, 2010.

[5] S. Levine and V. Koltun, “Guided policy search,” in International Conference on Machine Learning, 2013, pp. 1–9.

[6] J. H. Gillula and C. J. Tomlin, “Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor,” in 2012 IEEE International Conference on Robotics and Automation. IEEE, 2012, pp. 2723–2730.

[7] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, “Reachability-based safe learning with gaussian processes,” in 53rd IEEE Conference on Decision and Control. IEEE, 2014, pp. 1424–1431.

[8] I. Saha, R. Ramaithitima, V. Kumar, G. J. Pappas, and S. A. Seshia, “Automated composition of motion primitives for multi-robot systems from safe ltl specifications,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2014, pp. 1525–1532.

[9] J. F. Fisac, N. F. Lugovoy, V. Rubies-Royo, S. Ghosh, and C. J. Tomlin, “Bridging hamilton-jacobi safety analysis and reinforcement learning,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8550–8556.

[10] S. Liu, M. Watterson, K. Mohta, K. Sun, S. Bhattacharya, C. J. Taylor, and V. Kumar, “Planning dynamically feasible trajectories for quadrotors using safe flight corridors in 3-d complex environments,” IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1688–1695, 2017.

[11] W. Schwarting, J. Alonso-Mora, L. Pauli, S. Karaman, and D. Rus, “Parallel autonomy in automated vehicles: Safe motion generation with minimal intervention,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1928–1935.

[12] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu, “Safe reinforcement learning via shielding,” in ThirtySecond AAAI Conference on Artificial Intelligence, 2018.

[13] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in ai safety,” arXiv preprint arXiv:1606.06565, 2016.

[14] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Advances in neural information processing systems, 2017, pp. 908– 918.

[15] O. Bastani, Y. Pu, and A. Solar-Lezama, “Verifiable reinforcement learning via policy extraction,” in Advances in Neural Information Processing Systems, 2018, pp. 2494–2504.

[16] R. Ivanov, J. Weimer, R. Alur, G. J. Pappas, and I. Lee, “Verisig: verifying safety properties of hybrid systems with neural network controllers,” in Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control. ACM, 2019, pp. 169– 178.

[17] K. P. Wabersich and M. N. Zeilinger, “Linear model predictive safety certification for learning-based control,” CoRR, vol. abs/1803.08552, 2018. [Online]. Available: http://arxiv.org/abs/1803.08552

[18] O. Bastani, “Safe reinforcement learning via online shielding,” arXiv preprint arXiv:1905.10691, 2019.

[19] S. Li and O. Bastani, “Robust model predictive shielding for safe reinforcement learning with stochastic dynamics.” [Online]. Available: https://obastani.github.io/docs/rmps.pdf

[20] J. P. Desai, J. P. Ostrowski, and V. Kumar, “Modeling and control of formations of nonholonomic mobile robots,” 2001.

[21] J. Alonso-Mora, S. Baker, and D. Rus, “Multi-robot navigation in formation via sequential convex programming,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 4634–4641.

[22] J. Alonso-Mora, A. Breitenmoser, M. Rufli, P. A. Beardsley, and R. Siegwart, “Optimal reciprocal collision avoidance for multiple nonholonomic robots,” in DARS, 2010.

[23] J. van den Berg, M. Lin, and D. Manocha, “Reciprocal velocity obstacles for real-time multi-agent navigation,” 05 2008, pp. 1928– 1935.

[24] A. Khan, C. Zhang, S. Li, J. Wu, B. Schlotfeldt, S. Y. Tang, A. Ribeiro, O. Bastani, and V. Kumar, “Learning safe unlabeled multi-robot planning with motion constraints,” arXiv preprint arXiv:1907.05300, 2019.

[25] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” Neural Information Processing Systems (NIPS), 2017.

[26] I. Mordatch and P. Abbeel, “Emergence of grounded compositional language in multi-agent populations,” arXiv preprint arXiv:1703.04908, 2017.

Designed for Accessibility and to further Open Science