Autonomous quadcopter UAVs often suffer from loss of one or multiple propeller(s) mid-flight [1], [2]. Unless the controller is robust enough to enable flight in propeller-deficient condition, the UAV crashes, causing damage to itself as well as the surroundings. This paper proposes a fault detection (FD) system to detect propeller failure mid-flight and reinforcement-learning (RL) based controllers to control the propeller-deficient quadcopter.
Controllers used in quadcopters consist of two loops in the control model (Fig. 1; see Ref. [3]): the outer loop for waypoint tracking and the inner one for stability. The decision-making system proposed in this paper has a similar structure, but with an RL agent in the outer loop and a PD controller in the inner loop. Previous work on propeller loss scenarios [4] has developed separate control-theoretic controllers for quadcopters that have lost 3, 2 (opposing), and 1 propeller. We have done the same for the 2 (opposing) and 1 propeller loss scenarios, but using RL, which allows learning more complex behaviour and is adaptable to different conditions. Earlier methods [5], [6] based on RL were developed for quadcopters with no propeller failure.
Although [4] designed controllers for quadcopters with propeller failure, it lacked an online FD system to switch between controllers during flight. We propose a method using deep learning that detects specific propeller loss using information collected from on-board sensors. No additional sensors are used, thus avoiding addition of any extra weight to the quadcopter. The two systems are combined to achieve both propeller failure detection and controller switching in mid-flight. We show that RL controllers are capable of waypoint tracking even with multiple lost propellers, thus enabling the quadcopter to complete the mission.
Fig. 1: The inner and outer control loop of a quadcopter.
D. Ghose is a professor, at the Guidance, Control, and Decision Systems Laboratory (GCDSL), Dept. of Aerospace, Indian Institute of Science, Bangalore, India. Emails: rohitkumar97@gmail.com;
The RL setup consists of two major components, an agent and an environment. The agent is the decision-maker that gives control commands to the four motors, and the environment is the quadcopter on which the agent acts. The quadcopter changes its position and orientation when acted upon by the agent. Most RL algorithms follow a similar pattern. First, the environment passes the initial state to the agent, which then acts in order to proceed to the next state. The environment then returns the new state along with the reward for the previous action. Based on the reward, the agent learns which state-action pairs maximize the reward. This loop continues until the terminal state is reached.
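The interaction pattern described above can be sketched in a few lines of Python. This is only an illustration: `ToyEnv`, `policy` and `run_episode` are hypothetical names, and the one-dimensional environment is a stand-in for the quadcopter simulator, not the one used in this work.

```python
class ToyEnv:
    """Hypothetical 1-D stand-in for the quadcopter environment."""

    def __init__(self, horizon=5):
        self.horizon = horizon  # number of steps before the terminal state
        self.t = 0
        self.state = 0.0

    def reset(self):
        """Return the initial state (1 m away from the origin)."""
        self.t = 0
        self.state = 1.0
        return self.state

    def step(self, action):
        """Apply the action and return (next_state, reward, done)."""
        self.state += action
        self.t += 1
        reward = -abs(self.state)  # closer to the origin is better
        done = self.t >= self.horizon
        return self.state, reward, done


def policy(state):
    """Placeholder agent: move halfway toward the origin."""
    return -0.5 * state


def run_episode(env, policy):
    """Run the agent-environment loop until the terminal state is reached."""
    state = env.reset()
    transitions = []
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward))
        state = next_state
    return transitions
```

The stored `(state, action, reward)` tuples are the kind of data an RL algorithm would consume to improve the policy.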
We use a model-free RL algorithm with deterministic policy, as we do not intend to learn the complex dynamics of the environment. This is unlike model-based RL algorithms which need to learn the complete state transition probability from the pair of current state and action to the next state.
The paper is organized as follows. Section II discusses previous related work. Section III describes the proposed RL decision-making system for quadcopter control with failed propellers. Section IV describes the fault-detection system, and Section V combines these two systems. A thorough comparison of results is given in Section VI, and conclusions are drawn in Section VII.
In [6], model-based RL was used to find an optimal policy for the altitude control loop, yielding a stable controller for a quadcopter. In [5], a PD controller was used for stability and an RL agent for waypoint tracking. The RL agent also learns many different complexities such as maintaining orientation and stability. In [7], only attitude control was achieved using RL, focusing on the inner loop as a first step. These systems do not address propeller-loss conditions.
Fig. 2: Fault detection system and RL controllers
Fault-tolerant control systems for single propeller loss have been proposed in [8], [9], which focus on the angular velocity along the vertical axis and then carry out path following using three propellers. This idea was used in [4], which derived the control equations and constraints required to fly a quadcopter with only 3, 2 or 1 functioning propeller. They showed take-off and waypoint tracking but not mid-flight propeller loss detection or switching between controllers. Similarly, in [10], hover conditions are derived for scenarios with 3, 2 or 1 propeller lost. In [11], a combined fault detection and control scheme was developed for the loss of a single actuator only. The papers [8], [9], [10], [11] use control-theoretic methods and not RL. Fault detection, diagnosis, and control for unmanned rotorcraft systems, from a control-theoretic perspective, are surveyed in [12]. Other related work addresses real-time structural health monitoring [13] and offline propeller fault detection using neural networks [14] and spectral analysis [15] as an operational check before flight.
In contrast, we present a learning based combined FD system for multiple propeller failures and an RL adaptive controller to stabilize the quadcopter post-failure.
Three different RL-based controllers, for no propeller loss, 1 propeller loss, and 2 propeller loss, are designed. An FD system using recurrent neural networks (RNNs) to identify the propeller(s) that have failed mid-flight is also implemented. After fault detection, we switch to the appropriate controller mid-flight, thus enabling control and waypoint tracking even after the loss of 1 or 2 propellers. A schematic of our system can be seen in Fig. 2. Out of the two control loops in the quadcopter controller, shown in Fig. 1, the inner loop is PD-control based and the outer loop is RL-control based.
Fig. 3: The policy network and the value function network. The cyclic allotment of the output is also shown.
A. Quadcopter control using RL
There exist several policy optimization algorithms, such as PPO (Proximal Policy Optimization) [16], DDPG (Deep Deterministic Policy Gradient) [17] and TRPO (Trust Region Policy Optimization) [18]. Among these, DDPG has convergence issues, and TRPO is complex and computationally intensive. Hence, we selected PPO, which has been found to be simpler and computationally efficient [5].
The agent consists of two networks for training, a value network and a policy network. Both networks have 2 hidden layers of 64 nodes with tanh activation function. The network architecture is given in Fig. 3. The input is the quadcopter state, which is an 18-element vector s (all quantities are defined in the inertial frame):
$$s = [R_f,\ x,\ y,\ z,\ \dot{x},\ \dot{y},\ \dot{z},\ \omega_x,\ \omega_y,\ \omega_z] \quad (1)$$
where $R_f$ is the flattened form of the quadcopter's rotation matrix (9 elements), (x, y, z) is the quadcopter position, ($\dot{x}, \dot{y}, \dot{z}$) are the linear velocities, and ($\omega_x, \omega_y, \omega_z$) are the angular velocities of the quadcopter. The output is an n-element vector, where n is the number of functional propellers. We used the Huber loss function [19] for the loss calculation of the value network and standard gradient descent [20] for the policy network.
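As an illustration of how the 18-element input could be assembled, the following minimal sketch concatenates the flattened rotation matrix with the position, linear velocity and angular velocity. The function name `build_state` and the row-major flattening order are assumptions made for this example; the source does not specify the ordering.

```python
def build_state(R, position, linear_velocity, angular_velocity):
    """Assemble the 18-element state: 9 rotation entries + 3 + 3 + 3.

    R is a 3x3 rotation matrix given as nested lists. Row-major
    flattening is an assumption of this sketch.
    """
    flat_R = [R[i][j] for i in range(3) for j in range(3)]
    state = (flat_R + list(position)
             + list(linear_velocity) + list(angular_velocity))
    if len(state) != 18:
        raise ValueError("expected an 18-element state vector")
    return state


# Example: level attitude, hovering 5 m above the origin, at rest.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hover_state = build_state(identity, (0.0, 0.0, 5.0),
                          (0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
```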
In order to ensure safety during learning, the episode (trajectory) that comes close to violating a safety constraint can be terminated or given a large cost, as in [21].
The value function is trained using Monte-Carlo (MC) samples that are obtained from the on-policy trajectories. The terminal value, that is, the tail cost of the trajectory, is taken from the current value function:
$$v_i = \sum_{t=i}^{T-1} \gamma^{t-i} r_t + \gamma^{T-i} V(s_T \mid \eta) \quad (2)$$
where $\eta$ are the parameters of the approximated value function, T is the length of the trajectory, $\gamma$ is the discount factor, and r is the reward as given in (3) below.
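A minimal sketch of how such Monte-Carlo targets with a bootstrapped tail could be computed is given below. The function name `value_targets` is hypothetical, and `terminal_value` stands for the current value-function estimate at the last state of the trajectory; the backward recursion is one common way to evaluate the discounted sum, not necessarily the one used in our implementation.

```python
def value_targets(rewards, terminal_value, gamma):
    """Discounted return for every timestep of one trajectory.

    The tail beyond the last reward is approximated by `terminal_value`,
    i.e. the current value estimate of the final state.
    """
    targets = [0.0] * len(rewards)
    running = terminal_value
    # Walk backwards so each target reuses the one after it.
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        targets[i] = running
    return targets
```

For example, with rewards `[1, 1, 1]`, a terminal value of 10 and a discount factor of 0.5, the last target is 1 + 0.5 × 10 = 6, and the earlier targets follow by the same recursion.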
Policy Optimization: As mentioned, we have used PPO, a policy-optimization-based algorithm, given in Algorithm 1. The details of PPO can be found in [16]. We define our policy as $\pi_\theta$ with parameters $\theta$.
5: Update the value-function parameters multiple times using Huber loss.
6: Update $\theta$ once using standard gradient descent.
7: end while
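For reference, the Huber loss [19] used in the value update can be written as follows. This is a generic sketch; the threshold `delta=1.0` is an assumed default and is not a value reported in this paper.

```python
def huber_loss(prediction, target, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error * error
    return delta * (error - 0.5 * delta)


def mean_huber_loss(predictions, targets, delta=1.0):
    """Average Huber loss over a batch of value predictions."""
    losses = [huber_loss(p, t, delta) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)
```

Compared with a squared error, the linear tail limits the influence of occasional very large value errors, which can occur with randomly initialized trajectories.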
The quadcopter simulation environment takes actions as input and returns the updated state of the quadcopter along with the rewards. These states and rewards are stored and used to train the RL network. The actions are given by the partially trained RL network. Multiple environments can be run in parallel, which helps in exploration when running the Monte-Carlo simulations. Propeller failure was simulated as mentioned in Section IV-C.
During policy optimization, the policy is trained with the origin of the inertial frame as the target waypoint. During operation, the origin of the inertial frame is shifted to the target waypoint. This is done so that the policy need not be explicitly trained on waypoint tracking. The quadcopter is initialized in a random, normally distributed state (that is, random position, orientation, angular velocity, and linear velocity) with a reasonable bound such that we can easily explore the feasible state space. We initialized the parameters by sampling from a truncated Gaussian distribution, with separate limits for the position (in meters), the angular velocity (in rad/sec), and the linear velocity (in meters/sec). The orientation is a quaternion vector with 4 elements, each of which is sampled from N(0, 1), and the vector is then normalized. We use large limits to train the agent on extreme conditions and thus ensure higher levels of robustness. Each epoch of the training is done on 500 trajectories, each of which has 500 timesteps. The control frequency is 100 Hz and therefore each trajectory lasts 5 seconds. The 3 controllers (4, 3 and 2 propeller controllers) were trained for 4500 epochs. The stopping criterion was value loss < 0.0001. Each training session took 14 hours on an Nvidia GeForce 960M with 2 GB memory.
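The random initialization described above can be sketched as follows. The limits `pos_limit`, `ang_vel_limit` and `lin_vel_limit`, and the choice of standard deviation as half the limit, are placeholder assumptions for illustration only and are not the values used in training. Truncation is done here by simple rejection sampling.

```python
import math
import random


def truncated_gauss(rng, sigma, limit):
    """Draw from N(0, sigma^2), rejecting samples outside [-limit, limit]."""
    while True:
        x = rng.gauss(0.0, sigma)
        if abs(x) <= limit:
            return x


def random_unit_quaternion(rng):
    """Sample 4 elements from N(0, 1) and normalize to unit length."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def sample_initial_state(rng, pos_limit=2.0, ang_vel_limit=1.0,
                         lin_vel_limit=1.0):
    """Random initial state; all limits are hypothetical placeholders."""
    def vec3(limit):
        return [truncated_gauss(rng, limit / 2.0, limit) for _ in range(3)]

    return {
        "position": vec3(pos_limit),               # meters
        "orientation": random_unit_quaternion(rng),
        "angular_velocity": vec3(ang_vel_limit),   # rad/sec
        "linear_velocity": vec3(lin_vel_limit),    # meters/sec
    }
```

Normalizing four independent N(0, 1) samples yields a quaternion that is uniformly distributed over orientations, which is why the orientation needs no explicit limit.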
As mentioned, the quadcopter uses a PD controller to maintain stability. The motor output of the RL network is converted into force and torque using the quadcopter dynamics, and these are added to the force and torque output of the PD controller, respectively. The simulator then applies the total force and torque to the quadcopter model. The PD controller alone is insufficient, but it helps in avoiding extreme movements and therefore aids in stabilizing the learning process. Without it, the quadcopter simply goes out of bounds due to the random state initialization. The PD controller is given as [5]
$$\tau_b = k_p R^T q + k_d R^T \omega$$
where $\tau_b$ is the virtual torque produced on the main body as a result of the thrust forces, $q$ is the Euler orientation vector, $R$ is the rotation matrix, and $\omega$ is the angular velocity. The gains $k_p$ and $k_d$ take one pair of values for the x and y directions and a second pair for the z direction, the latter being one-sixth of the values used for the x and y directions, as prescribed in [5]. The reward at any time t, equation (3), is defined in terms of the current position, the current angular velocities, and the angle between the quadcopter's vertical axis and the z-axis of the inertial frame. Position has a high coefficient since waypoint tracking is the priority of the RL agent. A constant discount factor $\gamma$ is used.
We have used the same network to train different controllers for no propeller loss, 1 propeller loss, 2 propeller loss with some important modifications for each case. These are discussed in subsequent sections.
B. Quadcopter control with no propeller loss
The RL agent is similar to that defined above. The output of the agent is 4 action values, which are the motor speeds ($w = [w_1, w_2, w_3, w_4]$). These values are combined with the values from the PD controller to give the final output.
C. Quadcopter control with one propeller loss
For the quadcopter to be controllable, it was shown in [4] that $n_z \neq 0$, where $n = (n_x, n_y, n_z)$ is the unit vector governed by the differential equation $\dot{n} = -\omega^B \times n$ (where $\omega^B$ is the quadcopter's angular velocity in the body reference frame). Hence, we conclude that the quadcopter should rotate about a body axis whose vertical component is not 0. Fig. 4 shows the simulation of the quadcopter with 2 propellers lost. The red dot shows the first target waypoint and the blue dot shows the shifted waypoint.
The RL agent gives 3 action outputs ($w = [w_1, w_2, w_3]$). These outputs are assigned to the propellers in a cyclic manner, starting from the first working propeller. Fig. 3 shows this for 1 and 2 propeller loss. Here, with propeller 3 lost, output 1 of the 3 propeller network is assigned to propeller 1, output 2 to propeller 2, and output 3 to propeller 4.
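The assignment of network outputs to working propellers can be sketched as below. This reflects one reading of the scheme, consistent with the example above: propellers are visited in index order, failed ones are skipped, and they receive a zero command. The function name `assign_outputs` and the zero command for failed propellers are assumptions of this sketch.

```python
def assign_outputs(outputs, failed, num_propellers=4):
    """Map the n network outputs onto the n working propellers.

    `failed` is a set of 1-based propeller indices that are lost.
    Returns a dict {propeller index: motor command}.
    """
    working = [p for p in range(1, num_propellers + 1) if p not in failed]
    if len(outputs) != len(working):
        raise ValueError("one output is needed per working propeller")
    commands = {p: 0.0 for p in range(1, num_propellers + 1)}
    for propeller, value in zip(working, outputs):
        commands[propeller] = value
    return commands
```

The same helper covers the two-propeller case, for example `assign_outputs([0.5, 0.6], {2, 4})` drives propellers 1 and 3 only.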
Fig. 4: Simulation of the quadcopter with 2 failed propellers
D. Quadcopter control with two opposite propeller loss
Similar to the previous section, the controllability conditions for a quadcopter with two opposite propellers lost follow [4]. Note that we are not considering the failure of two adjacent propellers, which is an unsolved problem in the literature. The RL agent gives two outputs ($w = [w_1, w_2]$), which act on the 2 opposing functional motors. These outputs are assigned in a similar cyclic manner as mentioned in the previous subsection and shown in Fig. 3.
Recurrent Neural Networks (RNNs) [22] have been used to exploit the temporal relationship between elements of a sequence by recursively processing ('unrolling') each element. For 1 and 2 lost propellers, the quadcopter exhibits a unique set of states at each timestep, which can be classified by an RNN. One problem with RNNs is the exponential growth or decay of the gradient vector for long sequences during training, which prevents learning long-distance correlations in the sequence. Therefore, we have used LSTM (Long Short Term Memory) [23], which does not have this issue.
A. Fault-detection
For the remainder of this paper, the nomenclature m → n is used to denote failure cases, where m is the number of functional propellers before the fault occurs, and n is the number of functional propellers after the fault. For example, 4 → 3 denotes the quadcopter going from 4 functional propellers to 3 functional propellers after a propeller failure.
The task is to map the quadcopter states (as defined in (1)) to possible propeller failure outcomes. There are 5 possible outcomes when going from 4 functional propellers to 3 functional propellers: no propeller lost, or one of the 4 propellers lost. The output is therefore a vector $y \in \mathbb{R}^5$ whose elements denote the probability of each of the five outcomes.
When going from 3 functional propellers to 2 functional propellers, we have considered only the opposite propeller failure option. Therefore, $y \in \mathbb{R}^2$. We can now denote the neural network as a function f trained to map a window of T consecutive states to the output vector:
$$f : (s_{t-T+1}, \dots, s_t) \mapsto y$$
B. Network architecture
1) 4 → 3 fault-detection (FD) network: It consists of 96 LSTM cells in the first layer and 64 in the second layer (see Fig. 5). The second layer passes into the output feedforward layer, which has 5 nodes for the 5 possible classes/outcomes. The optimizer used is Stochastic Gradient Descent (SGD) [20] with momentum. Momentum allows the network to converge over a wide range of learning rates [24], thus allowing us to prototype multiple networks quickly.
2) 3 → 2 fault-detection (FD) network: It consists of 96 LSTM cells in the first layer and 32 in the second layer (see Fig. 6). It passes through the feedforward layer with 2 nodes for the 2 possible classes. The optimizer used is Adam [25].
Fig. 6: FD network for the 3 → 2 detection system.
The training parameters are listed in Table I. Note that we trained both networks using both SGD and Adam. The final optimizer chosen for each network was the one that gave the better final training accuracy. We chose a computationally cheap network with only two LSTM layers for two reasons. Firstly, the network would ultimately need to run in real time on the drone, which generally has a light CPU/GPU. Secondly, even small networks can learn very complex contextual information. Our network gave highly accurate results, and thus there was no need for a more complicated network.
We have used a window size (T) of 100 for the 4 → 3 FD network and 200 for the 3 → 2 FD network, as shown in Figs. 5 and 6, respectively. Using a lower window size of 50 and 100 for the two networks, respectively, gave lower accuracy in the predictions because more timesteps were needed to establish the relationship between quadcopter behavior and propeller loss. A larger window length is likely to increase the fault-detection time as well, which is undesirable.
The first 150 timesteps for the 4 → 3 FD network and 250 timesteps for the 3 → 2 FD network are skipped because the quadcopter is initialized in a random state and we do not want the network to learn the initial erratic behaviour that occurs until stabilization. Instead, we want the network to learn the erratic behaviour when a propeller loss occurs. Thus, the
TABLE I: Training parameters for the 2 networks
For the 4 → 3 FD network, the 4 propeller controller was used to control the quadcopter even with a propeller loss mid-flight. The data was collected from 500 simulations of the propeller loss scenario and the network was trained on this data. For the 3 → 2 FD network, the same procedure was followed, but the quadcopter started with 3 working propellers and lost 1 propeller while it was in flight and being controlled by the 3 propeller controller. In simulation, propeller failure was implemented by turning off either one or two of the propellers. The data collected were position, orientation, angular velocity and linear velocity. The labels were derived from the number of working propellers and encoded using one-hot encoding.
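The data preparation described above can be illustrated with a short sketch that skips the initial transient, slices a trajectory into fixed-length windows, and one-hot encodes the class labels. The function names are hypothetical, and labelling each window with the class at its final timestep is an assumption of this example rather than a detail stated in the paper.

```python
def one_hot(index, num_classes):
    """Encode a class index as a one-hot list."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec


def make_windows(states, labels, window, skip):
    """Slice one trajectory into overlapping training windows.

    The first `skip` timesteps are discarded. Each sample pairs
    `window` consecutive states with the label at the window's last
    timestep (an assumption of this sketch).
    """
    samples = []
    for end in range(skip + window, len(states) + 1):
        samples.append((states[end - window:end], labels[end - 1]))
    return samples


# For the 4 -> 3 network one would use window=100 and skip=150.
```

Here `states` would hold the per-timestep state vectors and `labels` the per-timestep class indices, for example 0 for no loss and 1 to 4 for the lost propeller in the 4 → 3 case.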
The complete system, combining the fault detection and RL controller, is shown in Fig. 2. We assume that the quadcopter starts with 4 working propellers. The RL-agent-based controller for 4 propellers is engaged. The 4 → 3 FD network checks for propeller failure at every loop. There are four cases here, one for each of the four propellers of the quadcopter, that is, either propeller 1 fails or propeller 2 fails and so on. Once a propeller failure is detected, the failed propeller is recorded and the controller for 3 propellers is engaged. From this point onward, the 3 → 2 FD network takes over and checks for the second propeller loss. Similar to the above, if it detects the second (opposing) propeller failure, it switches to the 2 propeller controller.
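The switching logic can be summarized as a small state machine. In the sketch below, the class encodings (0 for no loss, 1 to 4 for the lost propeller in the first network, 0 or 1 in the second) and the propeller numbering in which 1/3 and 2/4 are opposite pairs are assumptions made for illustration; the FD network outputs are passed in as already-decoded class indices.

```python
def opposite(propeller):
    """Opposing propeller, assuming 1/3 and 2/4 are opposite pairs."""
    return (propeller + 1) % 4 + 1


class FaultTolerantSwitch:
    """Tracks failed propellers and selects the active controller."""

    def __init__(self):
        self.failed = []  # 1-based indices of lost propellers

    def active_controller(self):
        """Number of propellers the active controller is trained for."""
        return 4 - len(self.failed)

    def step(self, fd_4to3_class, fd_3to2_class):
        """Update with the latest FD outputs and return the controller.

        fd_4to3_class: 0 = no loss, 1..4 = index of the lost propeller.
        fd_3to2_class: 0 = no further loss, 1 = opposing propeller lost.
        """
        if not self.failed:
            # Only the 4 -> 3 network is consulted with 4 propellers.
            if fd_4to3_class != 0:
                self.failed.append(fd_4to3_class)
        elif len(self.failed) == 1:
            # After the first loss, only the 3 -> 2 network is consulted.
            if fd_3to2_class == 1:
                self.failed.append(opposite(self.failed[0]))
        return self.active_controller()
```

Calling `step` once per control loop returns 4, 3 or 2, which indexes the controller to engage for that loop.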
A. Removing offset
Deep RL is subject to the bias-variance trade-off. The PPO algorithm has low variance but suffers from large bias. We also observe a small but constant offset between the quadcopter's position and the required position. One likely reason could be the bias in the function approximator. Since it is a constant offset, we handle it using a moving average filter with a window of 15 timesteps, which averages the quadcopter's actual position within this window. This could also have been done by computing the integral error. Since our required position is the origin of the inertial frame ([0, 0, 0]), and our model is trained for that position, we add the moving average value to the quadcopter's actual position.
Fig. 7: Quadcopter with no propeller loss. Zoomed height plot is also shown.
Let the quadcopter's actual position at time instant t be $(x(t), y(t), z(t))$, and let $(\bar{x}(t), \bar{y}(t), \bar{z}(t))$ be the moving averages in the x, y, and z directions. Then,
$$p_{\text{offset}} = [\bar{x}(t),\ \bar{y}(t),\ \bar{z}(t)]$$
Now add $p_{\text{offset}}$ to the quadcopter's actual position to obtain the quadcopter's observed position.
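A minimal sketch of this offset handling is given below, following the description literally: the moving average of the last 15 actual positions (expressed relative to the target at the origin) is added to the current actual position to form the observed position passed to the policy. The class name `OffsetFilter` is hypothetical, and averaging over fewer samples until the window fills is an assumption of this sketch.

```python
from collections import deque


class OffsetFilter:
    """Moving-average offset correction over a fixed window."""

    def __init__(self, window=15):
        # Old positions drop out automatically once the window is full.
        self.history = deque(maxlen=window)

    def observe(self, position):
        """Return the observed position for the current actual position."""
        self.history.append(tuple(position))
        n = len(self.history)
        offset = tuple(sum(p[i] for p in self.history) / n
                       for i in range(3))
        return tuple(position[i] + offset[i] for i in range(3))
```

Because the correction accumulates past position error, it plays a role similar to the integral term mentioned above.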
We evaluated the performance of the 3 controllers individually, as well as in combination, while the quadcopter is performing waypoint tracking. We plotted its position (x, y, z) and angular velocity to demonstrate waypoint tracking and stability, respectively. The quadcopter is initialized at the origin and is directed to reach a height (z) of 5 m, simulating a take-off. After 10 seconds, the target waypoint is shifted by 1 m in the positive Y direction. In the propeller loss scenarios, the propeller is turned off manually and the time taken to regain stability is calculated.
A. Implementation for no propeller loss case
As can be seen from Fig. 7, the policy is able to track the waypoint accurately while keeping the quadcopter stable with near 0 angular velocities. It also accommodates the waypoint shift. These results are comparable with [5].
Fig. 8: One propeller lost.
TABLE II: Comparison with [4] for one and two propeller failed quadcopter.
B. Implementation for one propeller loss case
As shown in [4], on losing a single propeller, the quadcopter loses one degree of freedom and to maintain stability, a non-zero angular velocity about the vertical axis has to be enforced. Fig. 8 shows the position and angular velocity of the quadcopter when it takes off with a single broken propeller. As can be seen from the graphs, the quadcopter has a constant yaw rate of approximately 1.0 rad/s. The graphs also demonstrate that even after losing a propeller, the quadcopter is able to track the waypoint shift occurring at 10 seconds into the simulation. There is a constant offset in position which is solved as described in subsection V-A.
Comparing with [4], we observe far fewer oscillations in the X position and a much lower angular frequency along the Z-axis in our case (see Fig. 8) than those shown in Fig. 4 of [4]. We also demonstrate taking off with one failed propeller. The maximum errors in waypoint tracking for [4] and our quadcopter are tabulated in Table II.
Fig. 9: Two propellers lost.
C. Implementation for two propeller loss case
Fig. 9 shows results for the loss of two propellers on a quadcopter system. There is a constant yaw rate of around 1.2 rad/s, which is quite close to the yaw rate of the one-propeller-lost system. The offset in position is larger compared to the single-propeller-lost case, but can be handled using the method described in subsection V-A. The graphs demonstrate that the quadcopter is stable and is able to carry out waypoint tracking even with a waypoint shift. We can, therefore, adjust the target waypoints in a similar manner to perform a soft landing in real-life situations.
Making a similar comparison between Fig. 9 and [4], we again see high-frequency oscillations, and a high angular frequency along the Z-axis, in Fig. 5 of [4]. We also demonstrate taking off with two failed propellers. The maximum errors in waypoint tracking for [4] and our quadcopter are tabulated in Table II.
D. Comparison of the quadcopter dynamics
We would like to point out that the vehicle data in [4] are largely similar to those of our quadcopter, which makes the comparison feasible, although they do differ in some aspects. For the sake of completeness, these data are given below, with the data from [4] in parentheses. Mass: 0.4 (0.5) kg; Arm length: 0.17 (0.17) m; Drag coefficient: $16 \times 10^{?}$ ($2.75 \times 10^{?}$) N m s rad$^{-1}$; $I_{xx}$ and $I_{yy}$: $7 \times 10^{?}$ ($3.2 \times 10^{?}$) kg m$^2$; $I_{zz}$: $12 \times 10^{?}$ ($5.5 \times 10^{?}$) kg m$^2$.
Fig. 10: First propeller lost in mid-flight.
TABLE III: Average time taken to detect propeller failures
E. Integrating fault detection with the control agents
1) Transition from 4 working propellers to 3 working propellers: Fig. 10 shows the behavior of the quadcopter when a propeller fails mid-flight. The FD system identifies the failed propeller and switches to the appropriate control agent. The 2 vertical lines in the graphs represent the actual time at which the propeller failed and the time at which the fault was detected, respectively. The graph shows a 1 second delay between the failure and its detection. Table III shows the average time of detection for 5 runs with failure occurring at random timesteps.
2) Transition from 3 working propellers to 2 working propellers: Fig. 11 shows the behavior of the quadcopter when the second propeller also fails mid-flight. The FD system identifies the failed propeller and switches to the appropriate control agent. The two vertical lines in the graphs represent the actual time at which the propeller failed and the time at which the fault was detected, respectively. As can be seen, there is a delay of around 2.24 seconds between the propeller failure and its detection. From Table III, we can also see the average time of detection for 5 runs, in which the failure occurs at random timesteps.
Fig. 11: Three propeller quadcopter with second propeller (opposing) lost in mid-flight.
TABLE IV: The failure rate (in 500 runs) for independent control agents and various propeller loss cases.
F. Failure rate calculation
In this paper, the quadcopter is considered to have failed whenever it hits the ground. The failure rate is recorded from 500 runs with random initialization of orientation, position, and linear and angular velocity. The orientation was sampled uniformly in SO(3) and the other quantities were sampled uniformly within bounded ranges. For the 4, 3, and 2 propeller scenarios, the target height was fixed at 5 meters above the ground. The results are given in Table IV. The random initialization in the isolated 4, 3, and 2 propeller scenarios caused some quadcopters to start in an irrecoverable state, for example, upside down or with a very high linear velocity towards the ground. Recovery from these states becomes harder due to the lost degree of freedom and insufficient thrust, thus increasing the failure rate.
Fig. 12 shows the failure rate when propellers are lost mid-flight for both the 4 → 3 and 3 → 2 cases. The target height of the quadcopter in the 500 runs was evenly distributed between 0.5 m and 1.5 m for both the 4 → 3 and 3 → 2 scenarios. These heights were chosen based on Figs. 10 and 11, which show an approximate drop in height of 0.5 m for both the first and second propeller failure. From Fig. 12, we are able to find a height threshold at which the failure recovery system is not able to switch in time and stop the quadcopter from crashing to the ground. Videos showing the experiments in detail and the implementation code are available in the given link.
Fig. 12: Failure rate of 4 → 3 and 3 → 2 in 500 runs spread across a height range of 1 meter.
In this paper, we have proposed a system for mid-flight failure detection and control in case of multiple propeller loss in a quadcopter. Firstly, we showed how RL agents can learn to control quadcopters with 0, 1 and 2 (opposing) propeller(s) lost. We showed that the quadcopter learned to do waypoint tracking while maintaining stability, even with 1 and 2 (opposing) propeller(s) failed. Secondly, we developed a novel FD system using deep learning which can detect the failure of the propeller(s) and switch to the appropriate controller. This method requires only the previous states of the quadcopter and is able to detect the propeller loss within 2.5 seconds, thus removing the need for, and the maintenance of, any additional sensor hardware on the quadcopter. We have also shown, in simulation, that the detection and switching can happen in real time, preventing the quadcopter from crashing and enabling it to either land or continue its mission.
Future work could replace the PD-controller-based inner loop with an RL agent to make the whole system model-free and completely remove the need to develop a mathematical model of the quadcopter. One can also use transfer learning for training the controllers for propeller loss. This may allow training the other controllers using fewer trajectories. The implementation of this system on a physical quadcopter is the next step of this research work. This can be done by transferring the weights from simulation to a physical quadcopter, where the network would be trained further. A motion capture system can provide fairly accurate values of position and orientation for training the network.
[1] A. Birk et al., “Safety, security, and rescue missions with an unmanned aerial vehicle (UAV),” Journal of Intelligent & Robotic Systems, vol. 64, no. 1, pp. 57–76, Oct 2011.
[2] D. Erdos, A. Erdos, and S. E. Watkins, “An experimental UAV system for search and rescue challenge,” IEEE Aerospace and Electronic Systems Magazine, vol. 28, no. 5, pp. 32–37, May 2013.
[3] N. Cao and A. F. Lynch, “Inner-outer loop control for quadrotor UAVs with input and state constraints,” IEEE Transactions on Control Systems Technology, vol. 24, no. 5, pp. 1797–1804, Sept 2016.
[4] M. W. Mueller and R. D’Andrea, “Stability and control of a quadrocopter despite the complete loss of one, two, or three propellers,” in IEEE Int. Conf. on Robotics and Automation (ICRA), May 2014, pp. 45–52.
[5] J. Hwangbo et al., “Control of a quadrotor with reinforcement learning,” IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2096–2103, Oct 2017.
[6] S. L. Waslander et al., “Multi-agent quadrotor testbed control design: integral sliding mode vs. reinforcement learning,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, Aug 2005, pp. 3712–3717.
[7] W. Koch et al., “Reinforcement learning for UAV attitude control,” ACM Trans. on Cyber-Physical Systems, vol. 3, no. 2, pp. 22:1–22:21, Feb. 2019.
[8] A. Akhtar, S. L. Waslander, and C. Nielsen, “Fault tolerant path following for a quadrotor,” in IEEE Conf. on Decision and Control, Dec 2013, pp. 847–852.
[9] A. Freddi, A. Lanzon, and S. Longhi, “A feedback linearization approach to fault tolerance in quadrotor vehicles,” IFAC Proceedings, vol. 44, no. 1, pp. 5413 – 5418, 2011.
[10] M. W. Mueller and R. D’Andrea, “Relaxed hover solutions for multicopters: Application to algorithmic redundancy and novel vehicles,” The Int. J. of Robotics Research, vol. 35, no. 8, pp. 873–889, 2016.
[11] N. Nguyen and S. Hong, “Fault diagnosis and fault-tolerant control scheme for quadcopter UAVs with a total loss of actuator,” Energies, vol. 12, no. 6, p. 1139, Mar 2019.
[12] Y. Zhang et al., “Development of advanced FDD and FTC techniques with application to an unmanned quadrotor helicopter testbed,” Journal of the Franklin Institute, vol. 350, no. 9, pp. 2396 – 2422, 2013.
[13] Y. K. Yap, “Structural health monitoring for unmanned aerial systems,” Master’s thesis, EECS Department, UC, Berkeley, 2014.
[14] G. Iannace, G. Ciaburro, and A. Trematerra, “Fault diagnosis for UAV blades using artificial neural network,” Robotics, vol. 8, p. 59, Jul 2019.
[15] B. Ghalamchi and M. Mueller, “Vibration-based propeller fault diagnosis for multicopters,” in Int. Conf. on Unmanned Aircraft Systems (ICUAS), June 2018, pp. 1041–1047.
[16] J. Schulman et al., “Proximal policy optimization algorithms,” arXiv:1707.06347 [cs.LG], Aug 2017.
[17] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv:1509.02971 [cs.LG], Sept 2015.
[18] J. Schulman et al., “Trust region policy optimization,” in Int. Conf. on Machine Learning, (ICML), vol. 37, July 2015, pp. 1889–1897.
[19] P. J. Huber, “Robust estimation of a location parameter,” The Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73–101, 1964.
[20] J. Kiefer and J. Wolfowitz, “Stochastic estimation of the maximum of a regression function,” Ann. Math. Statist., vol. 23, no. 3, pp. 462–466, 1952.
[21] R. Cheng et al., “End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks,” in AAAI, 2019, pp. 3387–3395.
[22] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179 – 211, 1990.
[23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[24] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural Networks, vol. 12, no. 1, pp. 145 – 151, 1999.
[25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Int. Conf. on Learning Representations (ICLR), Dec 2015.