NEMO: Future Object Localization Using Noisy Ego Priors

2019·arXiv

Abstract

I. INTRODUCTION

Predicting the future motion of agents in dynamic environments is a critical component for deployment of navigation and control strategies for autonomous and semi-autonomous systems in next generation mobility systems [1–3]. In particular, for safety critical applications that involve collision mitigation, forecasting future trajectory of agents in the scene must effectively model interactions between agents and account for the ego-motion, scene context, and environment constraints. Although some approaches [4–6] have proposed a deterministic solution based on current and past history of the agents’ motion, future forecast is inherently multi-modal, particularly where multiple paths are plausible.

Other approaches [7–11] have focused on modeling a distribution of all possible paths to tackle the multi-modality of future forecast. However, their predictive distribution is either (i) naively learned in a data driven manner with no consideration for the uncertainty; or (ii) simply generated to sample different types of motion using deep generative models. To provide a more compelling and rigorous solution to the multi-modality of the problem, a distribution of the agents’ behavior should be jointly learned with uncertainty estimates, including uncertainty in the data (i.e., aleatoric), as well as uncertainty in the prediction model (i.e., epistemic). In this way, the predicted modalities can quantify the level of uncertainty/noise, thereby increasing the confidence in the accuracy of the predicted future motions generated from the predictive distribution within the modality.

Fig. 1: Given the uncertainty of future ego-motion over the velocity and yaw rate as a prior, the future trajectory and bounding box of the target agent is sampled from its uncertainty distribution.

Research efforts in single-modal future forecast [12], [13] have shown that uncertainty embedding improves the overall performance for predicting the agents’ future motion. However, [12] restricts its uncertainty to be epistemic and overlooks noise inherent in the dataset, which makes it infeasible to recover from a small number of observations. Also, its problem setting (i.e., aerial RGB imagery as input) does not consider ego-motion from a mobile platform, making it difficult to deploy to autonomous driving (AD) and advanced driving assistance systems (ADAS). In [13], both aleatoric and epistemic uncertainty are considered from the ego-car perspective, where ego-motion as a prior affects the future motion of other agents. However, the uncertainty of ego-motion prediction is not taken into account, which is most critical to accurately forecast the interactive behavior of other agents.

To address the limitations of existing approaches, we propose a multi-modal future forecast framework, NEMO, which aims to (i) model both aleatoric and epistemic uncertainty of the ego-vehicle as well as other agents; (ii) condition future object localization on multiple modes of ego-motion priors, which results in different types of target agents’ behavior; and (iii) apply such a framework to immediate applications of AD and ADAS with a readily available front-facing RGB camera.

Fig. 1 is a visual illustration of the proposed framework. Our approach (NEMO) first models both the aleatoric and epistemic uncertainty of future ego-behavior using the past motion history of the ego-vehicle. Then, the multiple modes of future ego-motion are sampled from the probability distribution with uncertainty estimates. Each modality is provided to the future object localization stream as a prior to assess interactive responses of the target agent with respect to the different types of future ego-motion. We further consider the uncertainty of target agent’s future motion and its multi-modality. An overview of the proposed approach is presented in Fig. 2. In this process, NEMO generates multi-modal future motions of the target over the uncertainty of future ego-motion, which reflects actual egocentric interactions observed in real traffic scenes. For more accurate ego-motion prediction, IMU data synchronized with the video streams in HEV-I [5] is released, which can extend the utility of HEV-I beyond future object localization problem to include visual odometry estimate [14] and other 2D image-based decision making and motion planning tasks [15], [16]. The updated IMU sensor data will be made available at https:

II. RELATED WORK

A. Uncertainty Modeling

Denker et al. [17] and MacKay [18] studied the uncertainty of the model parameters using Bayesian neural networks (BNNs). Recently, Gal et al. [19], [20] have shown that Bayesian inference can be approximated with a traditional network architecture. They model epistemic uncertainty by sampling from the posterior distribution of the learned model using dropout during inference, which is equivalent to approximate Bayesian inference. In addition, Kendall et al. [21] show that aleatoric uncertainty can be captured with a negative log-likelihood loss by having the network output extra parameters for the variance. This enables the network to learn the noise parameters that originate from the noise inherent in the dataset. Following the success of uncertainty modeling in single-modal forecast [12], [13], we embed the uncertainty of future prediction into our multi-modal pipeline.

B. Egocentric Vision

Videos captured from the egocentric perspective are easily available and contain the natural interactions of the ego-agent with the surrounding environment and other agents. Egocentric videos have been widely used in various tasks such as object detection [22], [23], person re-identification [24–26], video summarization [27], gaze prediction [28], and action recognition [29–33]. Recent works have looked into ego-action estimation using the ego-view. Park et al. [34] studied future ego-location estimation using the egocentric view. Su et al. [35] predict future actions for basketball players captured from synchronized multiple views using siamese networks.

The studies in [4], [5], [13] are directly related to future object localization in first-person view. Yagi et al. [4] use human poses as a prior to forecast the future motion of humans, but their model is not applicable to vehicles in traffic scenes. The works in [5], [13] consider driving scenarios, but focus on single-modal localization of road agents. In particular, Yao et al. [5] use object appearance and ground-truth future ego-motion for future localization, but do not consider prediction uncertainty. Bhattacharyya et al. [13] predict the future ego-motion and use the prediction as a prior to localize other agents with uncertainty estimates. However, their approach overlooks the uncertainty in the future ego-motion, which is critical in determining the interactive reactions of other agents and their future behaviors toward the ego-vehicle. In contrast, we address the uncertainty of future ego-motion prediction and introduce noisy ego-priors for multi-modal future object localization.

C. Future Trajectory Forecast

The problem of future trajectory forecast from top-down views has been widely studied. Social-LSTM [7] introduces a social pooling module for interaction encoding, and SocialGAN [9] efficiently improves its performance by replacing the pooling with a multi-layer perceptron. Social-Attention [10] introduces a soft attention mechanism to find more useful interactions. Gated-RN [12] observes spatio-temporal interactions using images and infers relational behavior between agents. Its relational inference is adopted in DROGON [11], which uses intention as a prior for trajectory prediction, focusing on causation between intention and the future motion. SSP [36] predicts all agents’ trajectories in a single shot using composite fields.

However, these methods are not seamlessly applicable to egocentric videos captured from mobile vehicle platforms for the following reasons: (i) unlike the top-down view from a stationary camera, the distance between objects should be jointly assumed from the location and scale in frontal view images; (ii) the interactions between agents are relative to the ego-motion, but their models do not explicitly account for the ego-motion uncertainty; and (iii) their consideration of the uncertainty does not exist or is minimal, which does not provide a comprehensive solution to multi-modal predictions. To address these limitations, we present NEMO for future trajectory forecast from an egocentric view.

III. BAYESIAN UNCERTAINTY MODELING

In this section, we show how aleatoric and epistemic uncertainty can be jointly modeled using a single framework.

A. Aleatoric Modeling

Aleatoric uncertainty comes from inherent noise in the observations due to probabilistic variability. To model this type of uncertainty during training, the network outputs noise parameters $(\mu_t, \Sigma_t)$ at time $t$, where $\mu_t$ denotes the mean and $\Sigma_t$ denotes the covariance matrix for the ground-truth label $y_t$. The covariance matrix is learned using a negative log-likelihood loss function as follows:

We predict $(\mu_t, \Sigma_t)$ at each of the $T$ observed time-steps. Eq. 1 is used to compute how likely the observations are under the posterior distribution parameterized by $(\mu_t, \Sigma_t)$. For numerical stability, zeros in the denominator must be avoided. Thus, we substitute the logarithm of the variance, $\log(\sigma_t^2)$, with a directly predicted term $s_t$, which results in Eq. 2 as follows:
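The displayed forms of Eq. 1 and Eq. 2 are not recoverable from this copy. As a minimal sketch, assuming the per-dimension (diagonal-variance) Gaussian negative log-likelihood described in [21], the two parameterizations can be compared as follows. The function names and the toy values are hypothetical and are not taken from the paper.

```python
import math

def nll_variance_form(y, mu, var):
    """Gaussian NLL with an explicit variance in the denominator."""
    total = 0.0
    for y_t, mu_t, var_t in zip(y, mu, var):
        total += 0.5 * (y_t - mu_t) ** 2 / var_t + 0.5 * math.log(var_t)
    return total / len(y)

def nll_log_variance_form(y, mu, s):
    """Same loss when the network predicts s = log(variance) instead."""
    total = 0.0
    for y_t, mu_t, s_t in zip(y, mu, s):
        total += 0.5 * math.exp(-s_t) * (y_t - mu_t) ** 2 + 0.5 * s_t
    return total / len(y)

# Toy labels, predicted means, and predicted variances (illustrative only).
y = [1.0, 2.0, 3.5]
mu = [0.8, 2.1, 3.0]
var = [0.04, 0.25, 1.0]
s = [math.log(v) for v in var]

loss_a = nll_variance_form(y, mu, var)
loss_b = nll_log_variance_form(y, mu, s)
```

Both functions return the same value for matching inputs, but the log-variance form has no division, so a very small predicted variance cannot produce a division by zero.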

B. Epistemic Modeling

Epistemic uncertainty is caused by the model’s weight parameters being inadequately determined from the observations. Thus, this type of uncertainty can be reduced by taking more measurements. Dropout is well known in the deep learning community, where it was originally used as a regularization method to avoid over-fitting. However, a recent study [20] introduced dropout as a way to learn a distribution over weights that approximates variational inference in Bayesian modeling [19]. Given the dataset (X, Y), the posterior over weights P(w|X, Y) is approximated using a dropout distribution q(w) [37]. During inference, we generate N samples from the distribution q(w) of the network’s learned weight parameters w using dropout. Then, the N noisy outputs are used to compute the variance between the predicted outputs and ground-truth labels at each time-step t. The details are shown in Eq. 3 as follows:

Note that the computation of the mean and variance is performed during inference using dropout.
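The sampling procedure above can be illustrated with a toy model. The sketch below assumes the common Monte Carlo dropout recipe, in which the mean and variance are taken across the N stochastic predictions; it is not the paper's Eq. 3, which is missing from this copy. The linear model, the dropout rate, and all names are hypothetical.

```python
import random

def dropout_forward(x, weights, p_drop, rng):
    """One stochastic pass of a toy linear model with dropout on its inputs."""
    keep = 1.0 - p_drop
    out = 0.0
    for x_i, w_i in zip(x, weights):
        if rng.random() < keep:          # keep this connection
            out += w_i * x_i / keep      # inverted-dropout rescaling
    return out

def mc_dropout(x, weights, p_drop, n_samples, seed=0):
    """Mean and variance over N dropout-perturbed predictions."""
    rng = random.Random(seed)
    preds = [dropout_forward(x, weights, p_drop, rng) for _ in range(n_samples)]
    mean = sum(preds) / n_samples
    var = sum((p - mean) ** 2 for p in preds) / n_samples
    return mean, var

x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -0.2, 0.1, 0.3]
mean_mc, var_mc = mc_dropout(x, w, p_drop=0.5, n_samples=200)
mean_det, var_det = mc_dropout(x, w, p_drop=0.0, n_samples=200)
```

With dropout disabled every pass is identical, so the spread collapses to zero; with dropout enabled the spread across passes serves as the epistemic estimate.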

C. Joint Modeling of Aleatoric and Epistemic Uncertainty

We update the noise parameters $(\mu_t, \Sigma_t)$ by adding the aleatoric uncertainty given in Eq. 2 to the epistemic uncertainty in Eq. 3. The total variance and mean are computed as shown in Eq. 4, using the set of N outputs sampled from the network with weights randomly drawn from the dropout distribution q(w).

As a result, we output the noise parameters for the data posterior distribution together with the learned distribution of the model’s weights q(w) during inference. In practice, different node connections are sampled N times using dropout, and the corresponding aleatoric and epistemic uncertainty is computed using Eq. 4.
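Eq. 4 itself is missing from this copy. One plausible reading, assuming the decomposition used in [21], is that the total variance is the average of the N predicted (aleatoric) variances plus the variance of the N predicted means (epistemic). The sketch below shows that combination for a single scalar output; the function name and the numbers are hypothetical.

```python
def combine_uncertainty(means, aleatoric_vars):
    """Combine N dropout samples of (mean, aleatoric variance) for one output."""
    n = len(means)
    total_mean = sum(means) / n
    # Epistemic part: spread of the sampled means around their average.
    epistemic = sum((m - total_mean) ** 2 for m in means) / n
    # Aleatoric part: average of the variances predicted by each sample.
    aleatoric = sum(aleatoric_vars) / n
    return total_mean, epistemic + aleatoric

means = [1.0, 2.0, 3.0]
alea = [0.5, 0.5, 0.5]
total_mean, total_var = combine_uncertainty(means, alea)
```

In this toy case the sampled means disagree strongly, so the epistemic term contributes more to the total variance than the aleatoric term.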

IV. NEMO FRAMEWORK

The proposed NEMO framework is designed to properly model the uncertainty of future ego-motion, which is most important to determine other agents’ future motion in the egocentric view. As shown in Fig. 2, we divide the future forecast problem into two tasks: future ego-motion prediction and future object localization. For future ego-motion prediction, we first encode the past motion of the ego-vehicle and generate its future motion through the ego-motion decoder. To model the joint uncertainty of ego-motion, the model weights for the ego-motion decoder are drawn from a learned weight distribution. Here, we generate multiple modes of the future ego-motion, conditioned on the past ego-motion, by sampling from the uncertainty distribution over the velocity $v$ and yaw rate $\omega$.

From the other stream, the motion of other agents is encoded using the bounding box encoder and the flow encoder. We then concatenate the encoded results and use the bounding box decoder to learn its weight parameters. To properly model their future behavior with respect to the noisy ego-motion, we use the output of the future ego-motion prediction as a prior. In this way, the bounding box decoder reacts to each modality of the ego-vehicle, while predicting other agents’ future motion. Similar to the joint uncertainty modeling of the ego-motion decoder, the weights for the bounding box decoder are drawn from a learned weight distribution. We estimate the noise parameters for the center and the dimension (w, h) of the bounding box using these weights. Finally, we predict the future bounding boxes B by sampling from the uncertainty distribution conditioned on the concatenation of the past flow encoding and the past bounding box encoding.

A. Future Ego-Motion Prediction

Given the past observations, predicting the future ego-motion must account for multiple possibilities. Thus, we model multi-modal predictions for the ego-motion with uncertainty estimates. We take the past observations of velocity and yaw rate $(v, \omega)$ from IMU odometry over the observed time steps and encode the ego-motion using Gated Recurrent Units (GRU). A Multi-Layer Perceptron (MLP) converts the past ego-motion into the input embedding of the GRU. The output of the GRU-based decoder is a 5-dimensional vector $(\mu_v, \sigma_v, \mu_\omega, \sigma_\omega, \rho)$ at each future time step, where $\mu_v$ is the mean and $\sigma_v$ is the noise in the velocity prediction, $\mu_\omega$ is the mean and $\sigma_\omega$ is the noise in the yaw rate prediction, and $\rho$ is the correlation coefficient between those two dimensions. During inference, we sample the velocity and yaw rate from the uncertainty distribution generated by the noise parameters.

Fig. 2: The proposed NEMO framework. The future ego-motion prediction stream models the uncertainty of future ego-behavior. The future object localization stream encodes past bounding box and flow information to predict future motion of the target agent conditioned on the sampled future ego-motion. The resulting distribution is multi-modal and uncertainty-aware. Encoded features are combined with a concatenation operator.

The input is the past ego-motion and the output is the future ego-motion.
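Sampling a velocity and yaw rate pair from the five predicted parameters (two means, two noise terms, and one correlation coefficient) amounts to drawing from a correlated 2D Gaussian. The sketch below is one standard way to do that and is an illustration, not the authors' implementation; the function name, parameter names, and numeric values are hypothetical.

```python
import math
import random

def sample_ego_motion(mu_v, sigma_v, mu_w, sigma_w, rho, rng):
    """Draw one (velocity, yaw rate) pair from a correlated 2D Gaussian."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    v = mu_v + sigma_v * z1
    # Mixing z1 into the second coordinate induces correlation rho.
    w = mu_w + sigma_w * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return v, w

rng = random.Random(0)
params = (8.0, 0.5, 0.1, 0.02, 0.6)   # illustrative values only
samples = [sample_ego_motion(*params, rng) for _ in range(20000)]

# Empirical statistics of the drawn samples.
n = len(samples)
mean_v = sum(v for v, _ in samples) / n
mean_w = sum(w for _, w in samples) / n
cov = sum((v - mean_v) * (w - mean_w) for v, w in samples) / n
std_v = math.sqrt(sum((v - mean_v) ** 2 for v, _ in samples) / n)
std_w = math.sqrt(sum((w - mean_w) ** 2 for _, w in samples) / n)
corr = cov / (std_v * std_w)
```

Each draw yields one candidate ego-motion for a time step, so repeated draws give the multiple modes that are passed on as priors.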

B. Future Object Localization

We use the past bounding box information and the past ROI-pooled flow information, which are separately processed using their respective GRU encoders. In addition, we use the predicted future ego-motion as a prior to generate the future motion of the target at time t. The output of future object localization is a 10-dimensional vector at each future time step, consisting of two sets of mean and covariance parameters: one for the center and one for the dimension of the bounding box. By assuming two 2D Gaussian functions for the uncertainty, we reduce the number of parameters to regress from 20 (one 4D Gaussian) to 10 (two 2D Gaussians). The output center and dimension of the bounding boxes are sampled from the uncertainty distribution generated by these noise parameters.
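Each 2D Gaussian needs two means, two standard deviations, and one correlation coefficient, which gives 5 parameters per Gaussian and 10 in total. The sketch below shows one way such a vector could be unpacked into two covariance matrices. The parameter ordering and the exp/tanh activations are assumptions chosen for illustration; the paper does not specify them in this copy.

```python
import math

def split_box_output(vec):
    """Split a 10-D output into center and dimension Gaussian parameters."""
    assert len(vec) == 10

    def to_gaussian(p):
        mu_a, mu_b, raw_sa, raw_sb, raw_rho = p
        s_a, s_b = math.exp(raw_sa), math.exp(raw_sb)   # keep std-devs positive
        rho = math.tanh(raw_rho)                        # keep correlation in (-1, 1)
        cov = [[s_a * s_a, rho * s_a * s_b],
               [rho * s_a * s_b, s_b * s_b]]
        return (mu_a, mu_b), cov

    # First five entries: box center; last five: box width and height.
    return to_gaussian(vec[:5]), to_gaussian(vec[5:])

raw = [0.5, 0.4, -3.0, -3.0, 0.2, 0.1, 0.08, -4.0, -4.0, -0.1]
(center_mu, center_cov), (dim_mu, dim_cov) = split_box_output(raw)
det_center = center_cov[0][0] * center_cov[1][1] - center_cov[0][1] ** 2
```

Constraining the correlation to the open interval (-1, 1) keeps each covariance matrix positive definite, so it can be sampled from directly.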

V. EXPERIMENTS

A. Dataset

The HEV-I dataset [5] is publicly available and consists of 2477 vehicles in 230 videos collected from urban driving scenarios. The dataset includes the motion of the ego-vehicle obtained by ORB-SLAM2 [38]. However, the estimated translation is a normalized unit vector which does not recover the full 3D motion of the ego-vehicle. Moreover, the dynamic motion of surrounding agents often causes association errors, which severely affect the rotation estimates. Therefore, we provide IMU odometry decoded from the CAN message of the ego-vehicle. We observed that a drift error is less than 0.2 meters for new IMU odometry as compared to LIDAR odometry for the HEV-I sequences.

B. Implementation

NEMO is trained with a TITAN Xp GPU using the PyTorch framework. We first train the ego-motion prediction stream from scratch. Then, the learned model is jointly optimized with the future object localization stream.

1) Future Ego Motion Prediction: We use a batch size of 32 and a learning rate of 0.001 with the RMSProp optimizer, and a negative log-likelihood loss as the future ego-motion loss. The learning rate is dropped by a factor of 2 after every 20 epochs. The network converges after 100 epochs. For evaluation, we reconstruct the 2D trajectory from the predicted velocity and yaw rate using Eq. 5 with respect to the last observed frame, assuming planar motion.

In Eq. 5, each step is described by a 2D rotation matrix and a 2D translation vector. We use a right-handed coordinate system for the ego-motion. Note that the velocity and yaw rate are converted into a translation in meters and a rotation in degrees between consecutive time steps using the time interval of 0.1 sec.
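The reconstruction step can be sketched as planar dead reckoning. The code below assumes the yaw rate is given in degrees per second, that the x axis points forward and the y axis to the left (one right-handed choice), and that the heading is updated before the translation is applied. These conventions and all names are assumptions for illustration, since Eq. 5 is not recoverable from this copy.

```python
import math

def reconstruct_trajectory(velocities, yaw_rates_deg, dt=0.1):
    """Integrate (v, yaw rate) into 2D positions relative to the last observed frame."""
    x, y, heading = 0.0, 0.0, 0.0          # pose of the last observed frame
    path = []
    for v, yaw_rate in zip(velocities, yaw_rates_deg):
        heading += math.radians(yaw_rate * dt)   # rotation accumulated this step
        step = v * dt                            # translation in meters this step
        x += step * math.cos(heading)            # rotate the forward step into
        y += step * math.sin(heading)            # the reference frame
        path.append((x, y))
    return path

# 20 future steps at 10 m/s: straight ahead, then with a constant left turn.
straight = reconstruct_trajectory([10.0] * 20, [0.0] * 20)
turning = reconstruct_trajectory([10.0] * 20, [45.0] * 20)
```

With zero yaw rate the vehicle ends 20 m straight ahead after 2 seconds, while a constant positive yaw rate bends the path toward positive y.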

2) Future Object Localization: We use input images of size W = 1920 and H = 1200 pixels. The bounding box centers and dimensions are normalized to the range [0, 1]. While training the module, we use a batch size of 32 and a learning rate of 0.001 that is reduced by a factor of 5 after every 20 epochs. Separate weights are applied to the future ego-motion prediction loss of the pre-trained model and to the future object localization loss.
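The two training settings stated above (a step decay of the learning rate and normalization of boxes by the image size) can be written out as follows. This is a minimal sketch that assumes the decay is applied once per 20 completed epochs and that boxes are stored as (center x, center y, width, height) in pixels; both function names are hypothetical.

```python
def step_decay(base_lr, factor, epoch, every=20):
    """Learning rate after dividing by `factor` once per `every` completed epochs."""
    return base_lr / (factor ** (epoch // every))

def normalize_box(cx, cy, w, h, img_w=1920, img_h=1200):
    """Scale a pixel-space box (center, size) to the [0, 1] range."""
    return cx / img_w, cy / img_h, w / img_w, h / img_h

# Ego-motion stream: factor 2. Object localization stream: factor 5.
ego_lr = [step_decay(0.001, 2, e) for e in (0, 20, 40)]
box_lr = [step_decay(0.001, 5, e) for e in (0, 20, 40)]

# A box centered in the image that covers a tenth of each side.
box = normalize_box(960, 600, 192, 120)
```

Under this reading the localization stream's learning rate falls to 0.00004 by epoch 40, compared with 0.00025 for the ego-motion stream.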

Fig. 3: Future ego-motion prediction. (a,b,c) velocity and (d,e,f) yaw rate. Given the past observation and future ground-truth, Const-Vel models are compared with RNN-AE (Ours).

Fig. 4: Future ego-motion prediction using NEMO (RNN-AE) with the uncertainty. (a,b,c) velocity and (d,e,f) yaw rate of the ground-truth and RNN-AE (Ours) are plotted with the uncertainty.

VI. RESULTS

Prior works [5], [13] report that a 1 second prediction time is sufficient for safe operation of a vehicle travelling at speeds up to 25 MPH. However, for natural driving in urban areas, this is an underestimate since we found that vehicles travel at speeds up to 43 MPH in the HEV-I dataset. Thus, we observe the past 1 second and make predictions 2 seconds into the future. We sample k = 10 future predictions from the distribution and report the result with the minimum error, $\min_{i \in \{1, \dots, k\}} \lVert \hat{y}_i - y \rVert$, where $\hat{y}_i$ is the $i$-th trajectory prediction sample and $y$ is the ground-truth trajectory. For evaluation, we compute the Average Distance Error (ADE) and Final Distance Error (FDE) for motion prediction, and Final Intersection over Union (FIOU) for bounding box prediction. The reported ADE/FDE for ego-motion prediction are in units of meters, while those for bounding box prediction are in pixel units.
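The evaluation protocol can be sketched as follows. The code assumes Euclidean distance for ADE/FDE, that the minimum over the k samples is taken separately for each metric, and that boxes are given as corner coordinates for the IoU; these details and all names are assumptions for illustration rather than the authors' evaluation code.

```python
import math

def ade(pred, gt):
    """Average Euclidean distance over all predicted time-steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Euclidean distance at the final predicted time-step."""
    return math.dist(pred[-1], gt[-1])

def best_of_k(samples, gt, metric):
    """Minimum error over the k sampled predictions."""
    return min(metric(s, gt) for s in samples)

def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
samples = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],   # offset by 1 unit everywhere
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.5)],   # exact until the last step
]
min_ade = best_of_k(samples, gt, ade)
min_fde = best_of_k(samples, gt, fde)
final_iou = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

In this toy case the second sample gives the lowest ADE while the first gives the lowest FDE, which shows why the choice between per-metric and per-sample minima matters when comparing methods.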

A. Future Ego Motion Prediction

We use Const-Vel [39] as one of our baselines, where the output for the future 20 time steps is the same as the input observed at time t = 10. The RNN baseline has a GRU-based encoder and decoder, which greatly improves the performance compared to the Const-Vel baseline. For RNN-E, we model the epistemic uncertainty with dropout in the decoder’s MLP layer. RNN-A models aleatoric uncertainty, where we sample 10 trajectories from the learned likelihood distribution parameters. RNN-AE combines aleatoric and epistemic uncertainty modeling. We observe that uncertainty modeling improves performance in all three cases (RNN-E, RNN-A, RNN-AE) compared to their counterparts (Const-Vel, RNN). Overall, RNN-AE performs better than all the baselines as shown in Fig. 3 and Tbl. I, validating the efficacy of uncertainty modeling. The uncertainty estimates are shown in Fig. 4.

Fig. 5: Example scenarios for future object localization. Given the last observation and the future ground-truth, the Const-Vel, RNN-P (ORB), and RNN-P (IMU) models are compared to RNN-AE (Ours). Also, the predicted trajectory of RNN-AE (Ours) is visualized with the ground-truth.

Fig. 6: Qualitative evaluation of NEMO (RNN-AE) with the uncertainty of future object localization. The predicted centers of the bounding box over the prediction horizon are shown as a trajectory with the ground-truth. Also, a bounding box is sampled from the probability distribution (red indicates high probability).

TABLE I: Quantitative results for future ego-motion prediction. ADE/FDE errors are reported in meters.

B. Future Object Localization

We use the Const-Vel baseline for bounding box prediction in pixel coordinates. However, this baseline does not consider the scaling factors in the egocentric videos. For fair comparison, we linearly scale the bounding box dimensions using the transformation of the last two observations. As shown in Tbl. II, linearly scaling the bounding box dimensions improves the FIOU performance. RNN-NP does not use any priors of the ego-motion. Interestingly, we observe improvement in both ADE and FDE, but its FIOU is degraded as compared to the Const-Vel baseline. RNN-P (ORB) uses ORB-SLAM2 [38] based ego-motion as a prior [5], while RNN-P (IMU) uses IMU odometry for ego-motion prediction similar to [13]. We observe that the IMU-based ego-motion, RNN-P (IMU), improves the performance when compared with the ORB-SLAM2-based ego-motion, which validates our claim to use IMU odometry for future object localization. For RNN-AP, we use the pre-trained ego-motion prediction module with aleatoric uncertainty and train jointly with future object localization. Similarly, RNN-EP uses the pre-trained ego-motion prediction module with epistemic uncertainty and is trained jointly with future object localization. Use of these uncertainty models to condition other agents’ motion forecast significantly improves the overall performance. This comparison validates the rationale of our use of the uncertainty to model more robust interactions of other agents with the ego-vehicle. For RNN-A, both the ego-motion prediction and future object localization modules are trained with aleatoric uncertainty. Similarly, RNN-E is trained with epistemic uncertainty for both tasks. These baseline models further decrease the error rate compared to RNN-AP and RNN-EP. Finally, RNN-AE (Ours) models aleatoric and epistemic uncertainty throughout the NEMO pipeline, which has the best performance in predicting future motion of agents as well as others’ bounding box locations and scales. Fig. 5 qualitatively evaluates how NEMO (RNN-AE) performs against other methods, and Fig. 6 visualizes the uncertainty of future object localization. From these results, we conclude that NEMO properly captures the interactive behaviors of road agents with respect to the ego-vehicle with the uncertainty of future forecast in the egocentric view.
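The scaled Const-Vel baseline can be read as a linear extrapolation of both the box center and the box dimensions from the last two observations. The sketch below assumes an additive per-step change and a box stored as (center x, center y, width, height) in pixels; the paper's exact transformation may differ (for example, a multiplicative scale), and the one-pixel lower bound on the size is an addition made here for safety.

```python
def const_vel_boxes(prev_box, last_box, horizon):
    """Extrapolate (cx, cy, w, h) linearly from the last two observed boxes."""
    deltas = [b - a for a, b in zip(prev_box, last_box)]
    future = []
    for step in range(1, horizon + 1):
        box = [v + step * d for v, d in zip(last_box, deltas)]
        box[2] = max(box[2], 1.0)   # keep width and height at least one pixel
        box[3] = max(box[3], 1.0)
        future.append(tuple(box))
    return future

# An approaching vehicle: the box drifts right and grows between frames.
prev_box = (900.0, 600.0, 100.0, 80.0)
last_box = (905.0, 601.0, 102.0, 81.0)
preds = const_vel_boxes(prev_box, last_box, horizon=20)
```

Without the size extrapolation the predicted box would keep its last observed width and height, which lowers the overlap with the ground truth for approaching or receding vehicles.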

TABLE II: Quantitative results for future object localization. ADE/FDE are reported in pixels on an image of size 1200x1920.

Fig. 7: Example scenarios of multi modality for future object localization.

C. Multi Modality

We further evaluate the multi-modal capability of our framework in Fig. 7. To generate multi-modal future trajectories, we sample 10 future positions from the prediction distribution at each time-step and condition the next time-step prediction on the current time-step output. The resulting diverse trajectories at intersection scenarios are shown with different colors. The white vehicle in Fig. 7(a) shows multiple possible motions. Based on the uncertain ego-future, it may turn to its left when the ego-car is turning left (light green), or it may just slow down when the ego-car is going straight (orange). Based on the possible ego-car’s future predictions, the other car’s future object localization presents multiple modes. In Fig. 7(b), the white car can either turn to its left (green) or go straight (cyan) when the ego-car is stopped. As highlighted in these examples, each modality of the uncertain ego-motion properly models interactive reactions of other agents using the proposed framework.
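The sampling scheme described above, where each sampled position is fed back as the input for the next step, can be sketched with a stand-in predictor. `predict_next` below is a hypothetical placeholder for the learned decoder and simply returns a fixed drift and noise level; only the rollout structure is the point of the example.

```python
import random

def predict_next(position):
    """Hypothetical stand-in for the decoder: mean and std of the next position."""
    x, y = position
    return (x + 1.0, y), (0.1, 0.3)

def rollout(start, horizon, rng):
    """Sample one trajectory, feeding each sampled position back as input."""
    traj, pos = [], start
    for _ in range(horizon):
        (mx, my), (sx, sy) = predict_next(pos)
        pos = (rng.gauss(mx, sx), rng.gauss(my, sy))   # sample, then condition on it
        traj.append(pos)
    return traj

rng = random.Random(0)
trajectories = [rollout((0.0, 0.0), horizon=20, rng=rng) for _ in range(10)]
```

Because every step conditions on a sampled value rather than on the mean, small early differences compound over the horizon, which is what lets the ten rollouts spread into distinct modes.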

VII. CONCLUSION

We introduced the NEMO framework to condition future object localization on the uncertainty of future ego-motion priors. For this, we jointly modeled aleatoric and epistemic uncertainty of ego-motion prediction to sample multiple modes of future ego-behavior. Then, each modality was used as a prior to capture interactive reactions of other agents with respect to the different types of ego-motion. We also considered the uncertainty of future object localization as well as its multi-modality. To this end, ablative tests were conducted using the public benchmark dataset, comparing NEMO with the state-of-the-art methods and self-generated baseline models. We observed that combined epistemic and aleatoric uncertainty modeling in both future ego-motion prediction and future object localization achieved the lowest prediction error from both future ego-motion and future bounding box prediction. In the future, we plan to extend our work to rank each of the predicted modes based on the uncertainty measure, which will result in making the system more pragmatic for real world applications.

REFERENCES

[1] D. Vasquez, F. Large, T. Fraichard, and C. Laugier, “High-speed autonomous navigation with motion prediction for unknown moving obstacles,” in 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(IEEE Cat. No. 04CH37566), vol. 1. IEEE, 2004, pp. 82–87.

[2] K. Fujimura and H. Samet, “Time-minimal paths among moving obstacles,” in Proceedings, 1989 International Conference on Robotics and Automation. IEEE, 1989, pp. 1110–1115.

[3] I. Ulrich and J. Borenstein, “Vfh+: Reliable obstacle avoidance for fast mobile robots,” in Proceedings. 1998 IEEE international conference on robotics and automation (Cat. No. 98CH36146), vol. 2. IEEE, 1998, pp. 1572–1577.

[4] T. Yagi, K. Mangalam, R. Yonetani, and Y. Sato, “Future person localization in first-person videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7593–7602.

[5] Y. Yao, M. Xu, C. Choi, D. J. Crandall, E. M. Atkins, and B. Dariush, “Egocentric vision-based future vehicle localization for intelligent driving assistance systems,” arXiv preprint arXiv:1809.07408, 2018.

[6] N. Nikhil and B. Tran Morris, “Convolutional neural network for trajectory prediction,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0–0.

[7] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 961–971.

[8] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, “Desire: Distant future prediction in dynamic scenes with interacting agents,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 336–345.

[9] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social gan: Socially acceptable trajectories with generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2255–2264.

[10] A. Vemula, K. Muelling, and J. Oh, “Social attention: Modeling attention in human crowds,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–7.

[11] C. Choi, A. Patil, and S. Malla, “Drogon: A causal reasoning framework for future trajectory forecast,” arXiv preprint arXiv:1908.00024, 2019.

[12] C. Choi and B. Dariush, “Looking to relations for future trajectory forecast,” arXiv preprint arXiv:1905.08855, 2019.

[13] A. Bhattacharyya, M. Fritz, and B. Schiele, “Long-term on-board prediction of people in traffic scenes under uncertainty,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4194–4202.

[14] A. Patil, S. Malla, H. Gang, and Y.-T. Chen, “The h3d dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes,” arXiv preprint arXiv:1903.01568, 2019.

[15] Y. Mezouar and F. Chaumette, “Path planning for robust image-based control,” IEEE Transactions on Robotics and Automation, vol. 18, no. 4, pp. 534–549, 2002.

[16] K. Hashimoto, T. Kimoto, T. Ebine, and H. Kimura, “Manipulator control with image-based visual servo,” in Proceedings. 1991 IEEE International Conference on Robotics and Automation. IEEE, 1991, pp. 2267–2271.

[17] J. S. Denker and Y. Lecun, “Transforming neural-net output levels to probability distributions,” in Advances in neural information processing systems, 1991, pp. 853–859.

[18] D. J. MacKay, “A practical bayesian framework for backpropagation networks,” Neural computation, vol. 4, no. 3, pp. 448–472, 1992.

[19] Y. Gal and Z. Ghahramani, “Bayesian convolutional neural networks with bernoulli approximate variational inference,” arXiv preprint arXiv:1506.02158, 2015.

[20] ——, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in international conference on machine learning, 2016, pp. 1050–1059.

[21] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” in Advances in neural information processing systems, 2017, pp. 5574–5584.

[22] Y. J. Lee and K. Grauman, “Predicting important objects for egocentric video summarization,” International Journal of Computer Vision, vol. 114, no. 1, pp. 38–55, 2015.

[23] G. Bertasius, H. S. Park, S. X. Yu, and J. Shi, “First person action-object detection with egonet,” arXiv preprint arXiv:1603.04908, 2016.

[24] S. Ardeshir and A. Borji, “Ego2top: Matching viewers in egocentric and top-view videos,” in European Conference on Computer Vision. Springer, 2016, pp. 253–268.

[25] C. Fan, J. Lee, M. Xu, K. Kumar Singh, Y. Jae Lee, D. J. Crandall, and M. S. Ryoo, “Identifying first-person camera wearers in third-person videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5125–5133.

[26] M. Xu, C. Fan, Y. Wang, M. S. Ryoo, and D. J. Crandall, “Joint person segmentation and identification in synchronized first-and third-person videos,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 637–652.

[27] Y. J. Lee, J. Ghosh, and K. Grauman, “Discovering important people and objects for egocentric video summarization,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 1346–1353.

[28] Y. Li, A. Fathi, and J. M. Rehg, “Learning to predict gaze in egocentric video,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3216–3223.

[29] Y. Li, Z. Ye, and J. M. Rehg, “Delving into egocentric actions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 287–295.

[30] M. Ma, H. Fan, and K. M. Kitani, “Going deeper into first-person activity recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1894–1903.

[31] A. Fathi, A. Farhadi, and J. M. Rehg, “Understanding egocentric activities,” in 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 2011.

[32] H. Pirsiavash and D. Ramanan, “Detecting activities of daily living in first-person camera views,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2847–2854.

[33] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, “Fast unsupervised ego-action learning for first-person sports videos,” in CVPR 2011. IEEE, 2011, pp. 3241–3248.

[34] H. Soo Park, J.-J. Hwang, Y. Niu, and J. Shi, “Egocentric future localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4697–4705.

[35] S. Su, J. Pyo Hong, J. Shi, and H. Soo Park, “Predicting behaviors of basketball players from first person videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1501–1510.

[36] I. Dwivedi, S. Malla, B. Dariush, and C. Choi, “Ssp: Single shot future trajectory prediction,” arXiv preprint arXiv:2004.05846, 2020.

[37] Y. Gal, “Uncertainty in deep learning,” Ph.D. dissertation, University of Cambridge, 2016.

[38] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.

[39] C. Schöller, V. Aravantinos, F. Lay, and A. Knoll, “The simpler the better: Constant velocity for pedestrian motion prediction,” arXiv preprint arXiv:1903.07933, 2019.
