Aggressive Perception-Aware Navigation using Deep Optical Flow Dynamics and PixelMPC

2020·Arxiv

Abstract

Abstract

Recently, vision-based control has gained traction by leveraging the power of machine learning. In this work, we couple a model predictive control (MPC) framework to a visual pipeline. We introduce deep optical flow (DOF) dynamics, which is a combination of optical flow and robot dynamics. Using the DOF dynamics, MPC explicitly incorporates the predicted movement of relevant pixels into the planned trajectory of a robot. Our implementation of DOF is memory-efficient, data-efficient, and computationally cheap so that it can be computed in real-time for use in an MPC framework. The suggested Pixel Model Predictive Control (PixelMPC) algorithm controls the robot to accomplish a high-speed racing task while maintaining visibility of the important features (gates). This improves the reliability of vision-based estimators for localization and can eventually lead to safe autonomous flight. The proposed algorithm is tested in a photorealistic simulation with a high-speed drone racing task.

Supplementary video: https://youtu.be/NzL2YRcOh I

Index Terms—Model Learning for Control, Optimization and Optimal Control, Visual Servoing, Visual Tracking, Visual-Based Navigation

I. INTRODUCTION

WE introduce a novel mechanism which combines visioninto a model predictive control (MPC) framework.Deep learning (DL)-based perceptual control using end-to-end imitation learning has shown great success in many robotics disciplines including autonomous driving [1], [2], [3], manipulation [4], and autonomous drone flying [5], [6]. In this paper, instead of taking a fully end-to-end approach ([5], [6]), we deploy the power of DL in novel system modeling. In a traditional (not end-to-end) navigation, DLaided vision pipeline played a big role in detecting objects and obstacles as a perception module and sometimes as a part of state estimation (e.g. VSLAM [7]). A controller then performed its task of navigation, avoidance, or tracking using the information provided from the vision part [8]. The visual object tracking or visual servoing technologies have been developed over the past few decades and can be found in some commercial drone products. However, most of the work in literature [5], [6], [9] are all based on reactive controllers; the robot turns left if the object is on the rightside of a robot’s view, and vice versa. This reactive visual servoing requires the drone to fly at a slow speed or hover

Manuscript received: September, 10, 2019; Revised December, 6, 2019; Accepted January, 6, 2020.

This paper was recommended for publication by Editor Eric Marchand upon evaluation of the Associate Editor and Reviewers’ comments. This work was supported by NASA.

The authors are with the Autonomous Control and Decision Systems Laboratory, Georgia Institute of Technology, Atlanta, GA 30332, USA.

Fig. 1: MPC-predicted future pixel trajectory (Green) of a target pixel, the center of a gate. PixelMPC computes the optimal control which accomplishes a racing task and drives the target pixel to the center of the image.

until it finishes servoing. Here we propose a predictive visual tracking controller for high-speed racing with a data-driven optical flow dynamics model composed of optical flow and robot dynamics.

In a drone racing scenario, the optical flow mostly comes from a moving camera and a static environment. Since the controller moves the robot through space, the changes in scene, the optical flow, can be thought of as indirect dynamics.

Recently, there has been a lot of progress in DL-based optical flow techniques [10], [11], [12]. However, all prior work relies on large convolutional neural networks with a lot of parameters to estimate the optical flow of the entire image. In our work, the application of the optical flow is to predict the relative motion of a ‘single’ pixel, so we use a small fullyconnected feedforward network.

The main problem we address in this paper is the visibility/field of view of a moving camera, especially when it comes to high-speed racing. The more the robot observes through a camera, the more information we use to perform accurate state estimation and navigation. Therefore, it is important to control the robot to see more information, for example, by pitching up or rolling/yawing. However, this conflicts with the high-speed flying task for a drone because a quadrotor needs to pitch down to fly at a high speed and this results in losing more visual information.

To solve the problem of limitation in the field of view by visual servoing, [13], [14] proposed a Sequential Quadratic Programming-based approach where the visibility is formulated in hard constraints. However, these methods do not fit into our problem formulation which requires real-time planning and control.

In the visual servoing literature, to the best of our knowledge, the real-time predictive controllers used for a visual tracking task are [15], [16]. Although [15] formulated an

c2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. 1

MPC problem for a viewpoint optimization, the goal of the paper was controlling the drone to stabilize a gimbal to get a good quality of a video. In [16], the most relevant work to us, the authors derived the target pixel velocity based on the information of the relative 3D position (x, y, z) of the target and the robot. With the pixel velocity information, the authors were able to form an MPC problem along with vision and perform visual object tracking control in a predictive way.

However, in our work, we implement a data-driven deeplearning approach, that does not require any prior information of camera intrinsics, extrinsics, or the 3D global position of the target. Instead, our algorithm requires an object detector that detects a target in image space. Thanks to the great success in the field of computer vision, we can use real-time object detectors [17], [18] with GPUs. Although our method requires prior knowledge of the image of the targets (gates) and a trained detector, we believe this is less restrictive than full knowledge of the global 3D position of features like in [16]. Furthermore, we believe our case is less restrictive since our proposed approach can be used for any moving target objects located anywhere in the scene.

In summary, the contributions of this work are twofold:

• We introduce data-driven Deep Optical Flow (DOF) dynamics, learned from the optical flow of consecutive images and robot dynamics. DOF dynamics are efficient in memory and computation.

• We introduce the Pixel Model Predictive Control (PixelMPC) algorithm which predicts the relative motion of pixels by actuating the robot to visually track important features (targets) while accomplishing the high-level tasks (e.g. racing or chasing). The algorithm makes the vision-based state estimation more robust as it explicitly allows the control algorithm to prioritize visual information.

The remaining of the paper is organized as follows: In Section II, we briefly review some preliminaries used in our work. The DOF dynamics are introduced in Section III and in Section IV, we introduce our PixelMPC algorithm. Section V details vision-based drone racing and state estimation experiments with analysis and comparisons of the proposed methods. Finally, we conclude and discuss future directions in Section VI.

II. PRELIMINARIES

In this section, we provide the building blocks of the proposed Pixel Model Predictive Control (PixelMPC): MPC, drone dynamics model, and the optical flow.

A. Model Predictive Optimal Control

Model Predictive Control (MPC)-based optimal controllers (e.g. Model Predictive Path Integral (MPPI) [19]) provide planned control trajectories given an initial state and a cost function by solving the optimal control problem. An optimal control problem whose objective is to minimize a task-specific cost function J(X, U) can be formulated as follows:

subject to dynamics

where represents the system states, represents the control, is the state cost at the final time , l is the running cost, and V is the value function. By solving this local optimization problem, we get the optimal control sequences. This can be solved in a receding horizon fashion in an MPC framework and it allows us to have a real-time optimal controller with feedback.

In our work, a sampling-based receding-horizon stochastic optimization algorithm, MPPI controller [19] is used as an MPC controller. We chose MPPI for several reasons, first off being the generality of cost functions and dynamics allowed. Most variants of MPC require us to have a convex cost function and first or second-order approximations of the dynamics. MPPI has neither of these requirements. Therefore we can directly encode our task into the cost function without any modifications to the high-level objective. Second, MPPI has been shown to be highly successful at aggressive autonomous racing on ground vehicles with general cost functions and neural network dynamics [19].

For a short summary of MPPI algorithmically, it samples N trajectories by applying noise into the control channels and forward propagating the dynamics. Each sample can be rolled out in parallel, and then each corresponding trajectory and cost are combined to generate a final control vector. The optimization can be run K times to further refine the solution before executing it. The previous control solution is used as the center value of the Gaussian sampling to warm start the optimization each round.

B. Quadrotor Dynamics

We use the quadrotor dynamics model provided in the FlightGoggles simulator [20] used in this paper. The defined 10 states are , where is the world-coordinate position vector, is the vehicle attitude unit quaternion vector, and

is the world-coordinate linear velocity vector. The vehicle dynamics are given by

where g is the gravitational acceleration, m is the quadrotor mass, is the rotation matrix from body to world frame, is the total thrust, is the aerodynamic drag, and is the stochastic force vector to capture unmodeled dynamics (e.g. vibrations and turbulance). The rotation matrix from body to world frame is

and the relation between quaternions and the angular rates is

where the angular rates and are part of the control inputs we used along with the total thrust . The control U is . We make a small assumption here that the model immediately follows the control inputs, especially the angular rates. Indeed, the quadrotor in the FlightGoggles takes U as an input and the low-level PID controller controls the robot to follow the commands. Since we directly input the angular rates, we do not use the dynamics of the angular rates, described in [20] when we propagate the model in MPC. The robot dynamics we used in this paper is also described in Fig. 4.

C. Optical Flow

Optical flow estimates the instantaneous motion of objects and features in a visual scene from a sequence of ordered images. The motion comes from the relative motion between an observer and a scene. In our case, the motion comes from a moving observer (a camera attached on a robot) and a static environment. To compute the optical flow, two strict assumptions are required: 1) The brightness of any observed object point on images is constant over time, 2) In the image plane, neighborhood points move similarly with similar velocity. The first constraint can be written as:

where I represents the intensity of a pixel (u, v) and represent the displacement of the pixel position between two consecutive images observed at time t and . This equation can be written in a form of Taylor series by assuming that the movement is small:

which results in

and , only with one equation, so all the optical flow calculation methods make additional assumptions to estimate the actual flow.

We used one of the most popular algorithms [21] to calculate the dense optical flow. The algorithm approximates each neighborhood of both frames by quadratic polynomials.

Fig. 2: Ground truth optical flow and the DOF prediction in the case of a quadrotor flying forward, pitching down. Deep optical flow provides more smooth optical flow compared to the ground truth. DOF predicts a single pixel’s optical flow instead of the whole image’s flow.

The details of the algorithm can be found in [21] and the implementation of the algorithm is available in OpenCV [22]. For a better calculation of the dense optical flow, we used a sequence of downsized gray-scaled images instead of original RGB images. The parameters used for calculating optical flow with [21] were: pyr scale=0.5, levels=10, winsize=51, iterations=15, poly=5, poly=1.1. The visualization of the dense optical flow as a vector or in color can be found in Fig. 2 and the supplementary video includes the optical flow of a full run of racing.

III. DEEP OPTICAL FLOW DYNAMICS

By taking advantage of the algorithms [21] calculating the optical flow, deep optical flow learning becomes selfsupervised learning, which does not require any manual labeling. Our proposed neural network-based Deep Optical Flow (DOF) dynamics have two major selling points:

1) Computationally efficient: DOF dynamics predict an optical flow/vector of a single pixel while most of the DLbased optical flow [10], [11], [12] predicts the next timestep’s image of optical flow, with the same size of the input images. This allows us to have a very small network, so we can use the model in a real-time optimal controller that performs optimization within 20-50ms. If we build a U-Net-like convolutional neural network, which predicts an image from an input image, we have to propagate the deep CNN every timestep in MPC framework to generate optical flow, which is computationally very expensive and slow. For the parameters used in the paper, our MPC algorithm samples over a million times per second.

2) Data-efficient: Given an image, size of WH, DOF can use WH data points for training, whereas typical DL-based optical flow [10], [11], [12] only uses a single data point (an image of the whole optical flow).

DOF dynamics predict, just like typical robot dynamics models, the derivative of the states. Here, in DOF, it predicts the velocity of a pixel. DOF takes 3 components as input: pixel state (position) , control actions U, and robot orientation q. The pixel position means the position in (u, v) coordinate system on the image plane, where the top left corner is the origin (0, 0). Control actions command angular velocities in x, y, z frame and total thrust, which affects both robot motion/acceleration and the image stream. The main point here in the DOF input is the robot orientation part. We incorporate the orientation of the robot into the DOF dynamics

Fig. 3: Optical flow depending on the robot orientation and control. For a legend of the colormap, please refer to Fig. 2a.

because even with the same control input, the optical flow changes depending on the roll, pitch, and yaw angles of the robot as shown in Fig. 3.

We train the DOF dynamics with a neural network (NN) model to predict the magnitude l and the angle of a single optical flow/vector. By defining the state of the pixel , we can write the optical flow as

where l and are the optical flow vector component, predicted from the DOF. Therefore, the final DOF dynamics is

where the PolarToEuler mapping is Eq. (11).

Algorithm 1 describes the training process of DOF dynamics. In the first for-loop of Algorithm 1, the robot state can be either ground truth or estimated states. The first for-loop describes collecting training dataset of robot states, optimal control actions, images, and optical flow between two consecutive images. Then the following for-loops update the weights and biases of DOF dynamics model with respect to the mean squared error (MSE) loss between the target magnitude and the angle of optical flow and the prediction.

We normalize the pixel state into [0.0, 1.0][0.0, 1.0] space and do regression. This allows the original discrete image space [0, W][0, H] to be a continuous 2D space [0.0, 1.0][0.0, 1.0] and same for the pixel state space, as well.

We designed a feed-forward NN with 5 layers having [10, 128, 128, 128, 2] neurons per each, where 10 is for an input layer and 2 is for the output. The Rectified Linear Unit (ReLU) function, f(x) = max(0, x), is used for the activation function in layers 1-4, and the output layer has a linear activation. All the layers are fully connected with regularization via 10% of dropouts. The motivation behind using the stated number of neurons was to achieve real-time performance with MPPI. We had to balance the accuracy of the model with real-time constraints. A more accurate model would need more neurons and more layers, but this would prevent real-time usage in MPPI. We empirically choose an architecture that was accurate and could still be run in real-time. These two goals conflicts: To achieve a more accurate model, we need more neurons and more layers, but this would result in slow inference time. As a result, we empirically chose the numbers to achieve both goals. For training the neural network, the Adam [23] optimizer was used with Tensorflow [24].

The usage of the trained model can be found in Algorithm 2. Given a center of the object (u, v) from a Detector (e.g. YOLOv3 [17]), a trained DOF dynamics model takes the center position (u, v), robot orientation, and control action as an input. The output of the trained DOF dynamics is the magnitude l and the angle of a predicted optical flow of that single point (u, v). From predicted l and , the velocity of the single point is calculated as in Eq. (11).

We have included a comparison table, Table I, that shows the differences in runtimes of our optical flow prediction with another state-of-the-art network. However, even though our DOF dynamics approach cares about accuracy, our primary constraint was speed. Therefore, we compared our network with the fastest and smallest of state-of-the-art networks, the SpyNet[12]. We refer to Table 9 in [25] for benchmark results for optical flow. The table shows the accuracy and the runtime of the state-of-the-art approaches ([10], [11], [12], etc). Note

TABLE I: The runtime comparison of our DOF dynamics NN and the state-of-the-art whole-image based optical flow prediction NN, the SpyNet [12]. is the number of samples (batch) used in a sampling-based controller, MPPI, to propagate in parallel with GPU and is for multi-step prediction (MPC), which requires sequential computation. OOM (Out of Memory) shows that the full-size optical flow prediction cannot be run in some cases. The runtime was measured with Intel Xeon(R) CPU E5-1650 v4 @ 3.60GHz x 12 CPU and NVIDIA GeForce GTX 1060 6GB.

that the total number of parameters in DOF dynamics NN is 34,690, whereas the SpyNet has 1,200,250 parameters. We tested DOF dynamics both with a single pixel prediction case and the whole-image prediction case (31,212 pixels). We clearly see that for multi-step prediction (), running DOF dynamics for a whole image () to predict the optical flow is too slow (3 Hz) and does not fit into real-time MPC algorithms. Since the SpyNet requires the input image pairs to have width and height to be multiple of 32, we resized the image to have a similar size as our training data set: 192160=30,720 pixels. From Table I, it is apparent that the single pixel approach with our DOF dynamics can only fit into a real-time “sampling-based” MPC framework.

We believe comparing the accuracy of our method and the standard full-image optical flow method is unfair because both approaches use different information to predict the optical flow. While the full-image approach uses more perceptual information, our DOF approach uses more non-visual information; the robot orientation and controls.

We report the prediction error of our DOF dynamics on the test dataset in Average Endpoint Error (AEE) of 2.45. The endpoint error calculates the Euclidean distance between the ground truth optical flow vectors and the predicted vectors. In the optical flow literature, depending on the training dataset, the state-of-the-art methods report AEE of 0.5-10.0.

IV. PIXEL MODEL PREDICTIVE CONTROL

In this chapter, we introduce Pixel Model Predictive Con- trol (PixelMPC) algorithm for visual object tracking and autonomous racing. PixelMPC literally predicts the future state trajectory of a “pixel model”, the deep optical flow (DOF) dynamics, and calculates the optimal control sequence (Fig. 1).

Assuming we have a visual object detector, for example, detecting custom classes of objects using You Look Only Once (YOLO) [17] algorithm. Given some detected objects, we can predict the future trajectories of their center points/pixels [u, v]. For a visual tracking task, one cost function for the optimal control of the DOF dynamics can be the L1 distance between the object pixel position [u, v] and the center of the image. This L1 cost function will force the pixel to be close to the center of the image O:

This cost function is reasonable for visual object tracking because the closer the target is to the center of the image, the longer we observe the target. In addition, the center of the image has the least distortion, which means the lowest information lost.

The autonomous racing task-related cost function for a finite-horizon optimal control problem can be designed in a form of Eq. (1). For example, to follow the desired position, orientation, and velocity and :

which control cost is ignored and is an indicator function which returns 1,000 if a robot crashes into a gate or a value between [-1, 1]. A smaller return represents the robot being closer to the desired path. The ordered waypoints (gates) are assumed to be given with the map of the entire racing track (e.g. Fig. 5). Note that this information including the position of the targets is only for the racing task along with a real-time path planning, not for the visual-servoing task. If the task of PixelMPC is similar to [16], where the task is following given waypoints, then prior information of target locations is not required.

Now, the total cost function for the optimization Eq. (2) is formed as

where a new state X is defined as [p, q, v, u; v] = [x, y, z, q, q, q, qz, u, v.

The total dynamics F(X, U) used to optimize Eq. (16) can be written as a combination of two dynamics Eqs. (4)-(7) and Eq. (13). Our formulation allows us to emphasize one task over another by tuning the cost function. If we want to achieve a faster speed instead of more visibility, then we can weight it more heavily.

Algorithm 3 shows the PixelMPC algorithm. Either from a ground truth or a state estimator, we receive a new robot state and an image from a monocular camera. A detector (e.g. YOLOv3 [17]) detects the center of a target (gate, in a racing scenario) (u, v) on the image space and an optimal model predictive controller solves the optimization problem with respect to the total cost J, Eq. (16), with a receding time horizon T. After propagating the combined model dynamics and running the optimization step, we execute the first control action and use the remaining control trajectory solution for the next optimization loop as a warm start. Then again we receive a new robot state with an image and repeat the optimization at a rate of 40 Hz.

Fig. 4: The total model dynamics used in the PixelMPC. The model is composed of the deep optical flow (DOF) dynamics and robot dynamics.

V. EXPERIMENTS/RESULTS

A. Experimental Setup

We tested our algorithm in the FlightGoggles simulation [20], which is developed for agile flight simulation with high fidelity. The racing scenario is from the AlphaPilot–Lockheed Martin AI Drone Racing Innovation Challenge1 (Fig. 5).

We used the quadrotor’s dynamics model introduced in Section II-B Eqs. (4)-(7).

Fig. 5: The race course in the FlightGoggles [20] used in this paper. Image credit: the AlphaPilot–Lockheed Martin AI Drone Racing Innovation Challenge1.

To derive our DOF dynamics from optical flow data, we collected 10 rounds of autonomous flight using a nominal MPPI controller, which took around 30 seconds for each round. To fully explore the state space we varied the target speed between 6m/s and 14m/s across rounds. The timestep in MPPI was 0.025 seconds. In total 14,000 images from a monocular camera along with drone states and controls were collected. The images were each downsized to a size of [204, 153]. This provided 204153=31,212 data points. As a result, around 437 million data points for training DOF were collected from 5 minutes of flying data. The states are collected from ground truth provided in the FlightGoggles simulator.

For object detection, we used one of the state-of-the-art algorithms, YOLOv3 [17], which allows us to detect multiple objects in real-time. 3,000 downsized images were used to train the YOLOv3 model to predict 7 classes of gates in the FlightGoggles racing environment (Fig. 5).

B. Model Predictive Path Integral control (MPPI)

In this work, out of many real-time MPC algorithms, we adopt the sampling-based stochastic optimal control algorithm, the Model Predictive Path Integral control (MPPI) [19]. MPPI allows us to handle stochasticity and it provides the easiness of designing and tuning non-quadratic cost functions, compared to other optimal control algorithms where most of them require a quadratic cost function.

For a drone racing task along with the visual object tracking task, the cost function parameters used for MPPI are =0.025 (=400, =250, =8.0, T=80, and K=1. The control variance had noise profiles: =0.2, =0.2, =0.3, and =2.2. was tuned between 3.0e+6 and 9.0e+6 and the resulting different behaviors are reported in Table II and Table III. The reason why the parameter is chosen 4 orders of magnitude higher than all other cost function parameters is because we only normalized the pixel position term from [0, W][0, H] to [0.0, 1.0][0.0, 1.0]. A total of 512 samples were used to propagate the 12 states with a time horizon of 80 in 40 Hz, which results in a 2 second trajectory. The number of samples depends on the hardware (GPU, CPU, RAM, etc.) and the size of the DOF NN dynamics. The nominal MPPI case for the racing task used the cost function only composed of Eq. (15) and the same parameters described above were used to give a fair comparison.

Fig. 6: A block diagram showing the algorithmic flow of PixelMPC.

Although the drone dynamics we introduced in this paper are the simplified linear dynamics from the simulator [20], any nonlinear dynamics model can also be used as robot dynamics model in PixelMPC.

C. Drone Racing with Object Tracking

We compare the visibility in percentage; how long the robot grabs the target in its view. In the PixelMPC framework, there are some additional DOF dynamics-related parameters we can tune: 1) the time horizon considered for the pixel cost and 2) the cost coefficient means the pixel cost Eq. (14) only penalizes the pixel trajectory within 1.0 second.

Table II shows the time the drone loses the target in sec (). We consider the ‘loss’ as visually losing the target after the robot first sees it. In this experiment, we used the ground truth provided by the FlightGoggles for robot states.

In the nominal case, without considering DOF dynamics in MPPI, the time the robot has less than 50% visibility of a target was .6 sec, which is more than 42% of the total flying time (31.8 sec). With PixelMPC, we can decrease it to 1.5 sec, 4.5% of the flying time (33.5 sec). The time of robot having less than 0% visibility of a target also decreased from 3.6 sec (4% of flying time) to 0.2 sec (less than 1% of flying time). Notice that in both 0% and 50% cases, the 2of the lost time is very large in the nominal case, compared to the PixelMPC cases. This can be explained in Fig. 8, where the plots show how smooth the movement is when we use PixelMPC. Also, compared to the nominal MPPI, PixelMPC showed 29% decrease in linear and angular accelerations in mean, which resulted in a slower speed but it provided much smoother behavior; please see Fig. 8 (Best shown in the supplementary video). However, the smoothness behavior of PixelMPC is a byproduct of the visual target tracking, not the main goal. Also, the visual target tracking cannot be accomplished by simply applying a smoothing/filtering to a controller.

In Table III, we compare the race time for each case to see how much lap time delay we get to pay for more visual information. Table III shows the mean and the 2standard deviation from 10 laps of racing per each case. As expected, the PixelMPC loses lap time by achieving more visibility of the racecourse. However, we believe it is worth to pay 1.7 sec,

TABLE II: The time (sec) the robot visually loses the target. The results are from 10 races per each case in v.s. [] when the target speed was 14m/s. The top-left []=[0.0, 0.0] shows the nominal MPPI running without any DOF dynamics control.

TABLE III: Lap time from 10 races per each case in (sec) v.s. [] when the target speed is 14m/s.

sometimes less than 0.2 sec, to achieve 42% 4.5% decrease in time that the robot loses important information in its view.

We also report DOF dynamics’ multi-step prediction error in Fig. 7. We see the predicted pixel trajectory is shorter than the actual trajectory in general, but the predicted trajectory closely follows the actual trajectory direction-wise. Note that our MPC scheme solves this compounding error problem with feedback and real-time optimization.

D. Vision-based State Estimation with Particle Filter

For state estimation with sensors (IMU, cameras, etc.), having more visual information and smooth flying behavior will benefit the state estimation and result in fewer failures overall. The most likely cause of a collision is an inaccurate state estimate. In a racing scenario, we can still assume that

Fig. 7: Left: Cropped image showing PixelMPC-predicted pixel trajectory (Green) vs. Actual pixel trajectory (Red). Right: Mean absolute error and standard deviation of the multi-step prediction of pixel position on DOF dynamics.

Fig. 8: The total variation of roll, pitch, yaw angles of 25 laps of robot trajectories running the nominal MPPI and the PixelMPC. Both controllers were tested with ground truth (GT) and a particle filter (PF) state estimator. The error bar represents . The smaller total variation of the robot orientation implies less shaky robot behavior.

the racing map, i.e. the gates’ location information is given. Then, one of the biggest challenges will be estimating the robot’s state, to perform accurate path planning and control.

For estimating the robot’s state, we use a particle filter with an observation model using gate information from observed images. The particle filter is run with 6400 particles and uses the GPU to parallelize the motion and sensor updates.

1) Motion Update: The motion update of the particle filter is done by integrating the IMU measurements directly. Additional Gaussian noise is injected into the filter with mean 0 and variance 0.2 directly on position [m]. In addition to that, Gaussian noise is added to the IMU measurements directly both with mean 0 and variance 0.2 for acceleration and variance 0.1 for angular rates. These tunings allow the particle filter to quickly jump to whatever sensor update occurs, but make the state estimate very unstable. The filter’s covariance will quickly balloon without any feature detections.

2) Sensor Update: The only sensor model of the particle filter is to use the nominal locations of the gate corners in the 3D world and back project them into the image plane. Then we find the difference between the detected results and the expected ones. Any missing detection is penalized heavily by W where W is the width of the camera image. Our custom YOLOv3 [17] gate detector is used to generate the detection

of the 2D positions from an image along with a bounding box, which includes the third (depth) information.

Table IV shows that, with the target speed of 14 m/s, the success rate of both cases are the same (80%) but if we increase the target speed to 16 m/s with the same cost parameters, the PixelMPC reports a higher success rate. The failure (crash) cases came from losing target visibility which resulted in the divergence of the state estimation. The 25 trajectories of running PixelMPC (=1.0, =9.0e+6) and the nominal MPPI with a particle filter is shown in Fig. 9. Since the racetrack we used only allows few seconds of flying between two consecutive gates, it is not intuitive to see if the PixelMPC decreases the particle filter covariance because even nominal MPPI could see the target gates very often. Therefore, we did one more straight-line flying test to fully see the effect of PixelMPC on state estimation. In this case, we increased the target speed to 20 m/s, where MPPI has to pitch down a lot to hit the target speed. As soon as the detector detects the gates, the PixelMPC tries to grab the feature in its view and this results in a smaller covariance of the particle filter. The last column of Table IV shows the maximum covariance of position from 25 runs of nominal MPPI and PixelMPC.

TABLE IV: Success rate (%) of 25 laps (14 m/s and 16 m/s) and the success case lap time (sec) of running MPPI and PixelMPC (=1.0, =9.0e+6) with a particle filter. The maximum covariance of position was tested with a target speed 20 m/s on a straight lane.

VI. CONCLUSION

By fusing vision, path planning, and control into a single optimization framework, high-speed racing can be accomplished with more stable state estimation along with more visual information. Our algorithm can be generally used in any camera-based robot system for visual servoing. Testing our algorithm with real hardware will be our next step to move forward, but there is still room for improvement. The suggested deep optical flow (DOF) dynamics does not take the depth/distance of the target pixel and the robot’s velocity information into account. The current DOF approach works well thanks to the generalization property of the deep neural network, but incorporating the target pixel’s depth information will result in a more robust dynamics propagation. Another direction to robustify the suggested dynamics will be propagating the target bounding box, i.e. the 4 corners of it, like a particle filter approach. Lastly, although the constant target velocity settings for racing and other inputs indirectly include the velocity information, directly incorporating the velocity will be helpful also for other non-racing tasks dealing with variable speed and other specific maneuvers.

Fig. 9: 25 Laps of running the nominal MPPI (Left) and the PixelMPC (=9.0e+6 and =1.0) (Right) with a particle filter. The target speed was set to 14 m/s and 16 m/s. The color represents the total covariance in position.

REFERENCES

[1] Y. Pan, C.-A. Cheng, K. Saigol, K. Lee, X. Yan, E. A. Theodorou, and B. Boots, “Agile Autonomous Driving using End-to-End Deep Imitation Learning,” Robotics: Science and Systems, 2018. [Online]. Available: http://www.roboticsproceedings.org/rss14/p56.pdf

[2] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to End Learning for Self-Driving Cars,” apr 2016. [Online]. Available: https://arxiv.org/abs/1604.07316

[3] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to- end simulated driving,” in Thirty-First AAAI Conference on Artificial Intelligence, June 2017.

[4] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-End Training of Deep Visuomotor Policies,” Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016. [Online]. Available: http://jmlr.org/papers/v17/15-522.html

[5] A. Giusti, J. Guzzi, D. Ciresan, F.-L. He, J. P. Rodriguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, D. Scaramuzza, and L. Gambardella, “A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots,” IEEE Robotics and Automation Letters, 2016.

[6] N. Smolyanskiy, A. Kamenev, J. Smith, and S. T. Birchfield, “Toward low-flying autonomous MAV trail navigation using deep neural networks for environmental awareness,” 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4241–4247, 2017.

[7] R. Mur-Artal and J. D. Tard´os, “ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.

[8] E. Kaufmann, M. Gehrig, P. Foehn, R. Ranftl, A. Dosovitskiy, V. Koltun, and D. Scaramuzza, “Beauty and the Beast: Optimal Methods Meet Learning for Drone Racing,” 2019 IEEE International Conference on Robotics and Automation (ICRA), 2019.

[9] E. Kaufmann, A. Loquercio, R. Ranftl, A. Dosovitskiy, V. Koltun, and D. Scaramuzza, “Deep Drone Racing: Learning Agile Flight in Dynamic Environments,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87. PMLR, 29–31 Oct 2018, pp. 133–145.

[10] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Net- works,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.

[11] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[12] A. Ranjan and M. J. Black., “Optical flow estimation using a spatial pyramid network,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[13] B. Penin, R. Spica, P. R. Giordano, and F. Chaumette, “Vision-based minimum-time trajectory generation for a quadrotor UAV,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sep. 2017, pp. 6199–6206.

[14] V. Murali, I. Spasojevic, W. Guerra, and S. Karaman, “Perception-aware trajectory generation for aggressive quadrotor flight using differential flatness,” in 2019 American Control Conference (ACC), July 2019, pp. 3936–3943.

[15] T. Ngeli, J. Alonso-Mora, A. Domahidi, D. Rus, and O. Hilliges, “Real- Time Motion Planning for Aerial Videography With Dynamic Obstacle Avoidance and Viewpoint Optimization,” IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1696–1703, July 2017.

[16] D. Falanga, P. Foehn, P. Lu, and D. Scaramuzza, “PAMPC: Perception- Aware Model Predictive Control for Quadrotors,” in 2018 IEEE International Conference on Intelligent Robots and Systems (IROS), 2018.

[17] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv, 2018.

[18] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real- Time Object Detection with Region Proposal Networks,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 91–99.

[19] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, “Information theoretic MPC for model-based reinforcement learning,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 1714–1721.

[20] W. Guerra, E. Tal, V. Murali, G. Ryou, and S. Karaman, “FlightGoggles: Photorealistic Sensor Simulation for Perception-driven Robotics using Photogrammetry and Virtual Reality,” arXiv, 2019.

[21] G. Farnebck, “Two-Frame Motion Estimation Based on Polynomial Expansion,” in Scandinavian Conference on Image Analysis, vol. 2749, 06 2003, pp. 363–370.

[22] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, 2000.

[23] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” Proceedings of the 3rd International Conference on Learning Representations (ICLR), vol. abs/1412.6980, 2014.

[24] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Man´e, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Vi´egas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large- Scale Machine Learning on Heterogeneous Systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/

[25] E. Ilg, T. Saikia, M. Keuper, and T. Brox, “Occlusions, Motion and Depth Boundaries with a Generic Network for Disparity, Optical Flow or Scene Flow Estimation,” the European Conference on Computer Vision (ECCV), 2018.

Plain Text:

K. Lee, J. Gibson, and E. A. Theodorou, Aggressive Perception-Aware Navigation using Deep Optical Flow Dynamics and PixelMPC, in IEEE Robotics and Automation Letters, 2020.

BibTeX:

@ARTICLE{lee2020pixelmpc, author={Keuntaek {Lee} and Jason {Gibson} and Evangelos A. {Theodorou}}, journal={IEEE Robotics and Automation Letters}, title={{Aggressive Perception-Aware Navigation using Deep Optical Flow Dynamics and PixelMPC}}, year={2020} }