The human visual system plays a key role in perceiving and interacting with traffic participants in complicated driving contexts. When looking at a dynamic scene, a driver can rapidly select the objects that are relevant to the driving task and make a control decision for effective and efficient driving. Inspired by this visual selection mechanism, driver’s attention has been studied in recent years in order to understand human driving behavior and ultimately help the driving control system of autonomous vehicles. Existing works focus on pixel-level driver’s attention prediction by mimicking human gaze behavior [17], [22], [25]. However, there are at least two drawbacks of using human gaze: 1) human gaze is sometimes not directly related to the driving task, e.g., drivers may look at billboards out of personal interest; 2) human gaze is sequential, which makes it impossible to capture all the important information at the same time. Moreover, existing works take only the perceived driving video as input and do not consider the effect of the driver’s goal, although the driver’s goal is an essential factor in selecting relevant objects. For example, the objects relevant for making control decisions can be very different when the ego vehicle is turning right versus turning left.
To handle those limitations, we formulate the problem as Object Importance Estimation (OIE) in on-road driving videos. The important objects are defined as the road users, i.e., vehicles and persons, that are relevant for the ego vehicle’s driver to make the vehicle control decision. Our definition ensures that the important objects are directly related to the driving task and that multiple important objects can be captured at the same time. Static semantic driving context, e.g., traffic lights, lane markings and drivable areas, can also influence the driving behavior. However, we focus only on the interactions with the road users and leave the static semantic driving context for future work. Fig. 1 shows an example of the scenario that our work focuses on. Visual dynamics of road users are important for our model to understand the driving scene. Also, the driver’s goal (where the vehicle is going) is essential for object importance estimation. For example, in Fig. 1, if the ego vehicle were turning left instead, the pedestrians on the crosswalk at the right side would not be as important to the ego vehicle.
Fig. 1. The scenario of our work. Bounding boxes with arrows indicate the moving road users, the dotted line shows the planned path of the ego vehicle and the dotted circle encloses the important object. Given the dynamic status of the road users, a driver’s driving-related attention usually lands on the road users that influence the control decision of the driver. Moreover, the attention highly depends on the driving goal of the vehicle.
To solve the proposed OIE problem, we present a novel framework that incorporates both the features of the dynamic road users (visual model) and the driving goal (goal model). In order to evaluate our framework, we collect an on-road driving dataset in the real world and annotate the important objects given the context. To provide more complex interactions between the road users and the ego vehicle, our dataset focuses on traffic intersections. Experiments show that our method largely outperforms the baselines, especially in the scenarios where the ego vehicle is turning left or right, which demonstrates that modeling the driving goal is very important for our task. To explore the possibility of using important objects to improve driving control prediction, we conduct an experiment on binary brake prediction. Results show that binary brake prediction can be improved with object importance information.
Fig. 2. The proposed approach has two branches, i.e., the visual model and the goal model. Object tracking is done for all the road users through the input clip. Visual features of objects are extracted at each time step. The goal model describes the driving goal at each time step using sampled points on the planned path in the real world. A common goal-oriented feature is concatenated with the features of each object at the corresponding time to form the final feature representation. A shared LSTM model is used to predict the importance score for every object given the final features. Objects and their features are differentiated using different colors.
A. Driver’s Attention Prediction
Human Gaze based Approach. Existing works focus on driver’s attention prediction supervised by human gaze information [22], [17], [25]. Tawari and Kang [22] propose a Bayesian framework for driver’s attention prediction in which a fully convolutional network is utilized with only images as input. Palazzi et al. [17] propose a multi-branch model that incorporates RGB, optical flow and semantic segmentation clips, with C3D [23] used to extract features in each branch. Xia et al. [25] propose a driver’s attention framework in which a human weighted sampling strategy is used during training to handle critical situations. Kim et al. [16] explore the idea of using driver’s attention to interpret driving control prediction.
Driver’s Attention Prediction Dataset. Several datasets [20], [24], [7], [18], [1] can be used for driver’s attention prediction, but most of them are either restricted to limited settings or not publicly available. To the best of our knowledge, Dr(eye)ve [1] is the only public on-road driving dataset for the driver’s attention prediction task. It consists of 555,000 frames divided into 74 video sequences. Human gaze is captured by eye tracking glasses and projected onto the corresponding on-road driving video frame. However, it is not suitable for our task, since 1) it has only per-pixel saliency annotations based on human gaze, which cannot be easily converted to important object labels; 2) it contains mostly scenarios of driving on a straight road (the vehicle is mostly keeping itself between lane lines or following another vehicle), which makes it not complicated enough for our task. Driving at traffic intersections is a more appropriate scene for us, since it provides more opportunities for the ego vehicle to interact with other road users.
B. Region based Object Detector
CNN detectors have achieved great success [12], [11], [19], [13], [10], [21], [8], [28]. Region based CNN (R-CNN) is one of the most popular frameworks. Girshick et al. initially proposed the two-stage R-CNN framework in [12], where object proposals are obtained first and then classified into different categories. Later, Fast R-CNN was proposed in [11] to speed up R-CNN [12] via end-to-end training/testing. However, it relies on external object proposal algorithms. Ren et al. present Faster R-CNN [19], which jointly trains the proposal generation and the detection branches in a single framework. Furthermore, He et al. extend Faster R-CNN in [13] and create a unified architecture for joint detection and instance segmentation. Our problem is related to R-CNN in the sense that we also assign scores to the proposed object candidates. However, we estimate object importance under the driving context rather than differentiating object categories, e.g., dog and cat.
The problem is formulated as goal-oriented object importance estimation, where the inputs are an on-road driving video clip and the goal of the ego vehicle. The outputs are the detected objects with importance scores at the last frame of the video clip. The planned path information, which can be obtained from the autonomous driving (AD) path planning module when the vehicle is driving online, is used to represent the goal of the vehicle.
Inspired by the R-CNN frameworks, we propose a two-stage framework which first generates object tracklinks from videos as object proposals and then classifies the proposals into two classes, i.e., important object and background. Different from R-CNN detectors, which generate proposals from static images, we track every object through the input video clip and treat the entire tracklink of an object as a proposal. The reason is that, unlike the general object detection scenario where object categories, e.g., dog and cat, can be determined from a static image alone, the object importance depends on the dynamics of objects through the video.
As mentioned in Sec. I, object importance depends on both the dynamics of the object itself and the driving goal of the ego vehicle. Thus, our method fuses the information from both parts. Due to the good performance of recurrent networks [26], [27], [9] on online action detection tasks, our framework is based on LSTM [15].
Our framework is shown in Fig. 2. The first branch describes our visual model. Multiple object tracking is performed on the input video clip. Thus, for each object candidate $i$, its bounding-box location $l_t^i$ is obtained at each time step $t$. Note that each time step corresponds to one image frame in the input video clip. For each object candidate at every time step, high-dimensional features $f_t^i$ are extracted to represent the appearance, motion and location of the object. We use a feature matrix $F^i = [f_1^i, f_2^i, \ldots, f_n^i]$ to represent each object $i$ in the video, where $n$ is the length of the input clip. Without goal information, an LSTM can be used directly with $F^i$ as the input, and the output is the score $s_t^i$ of being an important object at time $t$. We use it as a baseline in our experiment section.
The second branch shows our goal model. We extract the goal-oriented feature $g_t$ at time $t$ from the AD path planning module. The extracted feature is concatenated with the features of each object in the image to form the final feature representation $gof_t^i = [f_t^i, g_t]$ for the object. The representation for the object within the whole clip is $GoF^i = [gof_1^i, gof_2^i, \ldots, gof_n^i]$. A one-layer LSTM model followed by a fully connected (FC) layer operates on $GoF^i$ to output the importance score for each object $i$ as shown in Eq. 1, where $W$ and $b$ indicate the parameters of the FC layer. A softmax layer is then used to output the corresponding importance probability.
$$s_t^i = W \cdot \mathrm{LSTM}(GoF^i) + b \qquad (1)$$
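As a minimal sketch of the fusion step described above, the snippet below concatenates per-step object features with the shared goal feature and applies an FC layer with softmax to a stand-in hidden state. The recurrent network itself is omitted; the names `build_gof` and `fc_softmax`, the toy dimensions and the weight values are all illustrative assumptions rather than the authors' implementation.

```python
import math


def build_gof(object_features, goal_features):
    """Concatenate each per-step object feature f_t^i with the goal feature g_t."""
    if len(object_features) != len(goal_features):
        raise ValueError("one goal feature is needed per time step")
    return [list(f) + list(g) for f, g in zip(object_features, goal_features)]


def fc_softmax(hidden, weights, bias):
    """Compute softmax(W h + b) over the two classes (important, background)."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy clip: n = 3 time steps, 2-d visual feature, 1-d goal feature.
F_i = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
g = [[1.0], [0.5], [0.0]]
GoF_i = build_gof(F_i, g)

# Hypothetical stand-in for the last LSTM hidden state.
h_n = [0.2, -0.1]
W = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical FC weights, one row per class
b = [0.0, 0.0]
probs = fc_softmax(h_n, W, b)
```

In a full model, `h_n` would come from the LSTM run over `GoF_i`, and the same weights would be shared across all objects in the clip.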
Visual Feature. Appearance, motion and location features are combined to represent the dynamic changes of an object. The appearance feature is extracted from the fc7 layer of Faster R-CNN [19] pretrained on the Pascal VOC2007 [5] and VOC2012 [6] trainval sets with ResNet-101 [14] as the backbone. The appearance feature describes both the appearance of the object and the local context around the object [19]. A histogram of flow [4] with BIN=12 is extracted from each object bounding box as the motion feature. The location feature is represented by $[x_t^i / W_t,\ y_t^i / H_t,\ w_t^i / W_t,\ h_t^i / H_t]$, where $(x_t^i, y_t^i)$, $w_t^i$ and $h_t^i$ indicate the left-top corner of $l_t^i$, its width and height. $W_t$ and $H_t$ indicate the width and height of image $t$. The visual feature, $f_t^i$, is the concatenation of these three features.
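The motion and location parts of the visual feature can be sketched as follows. This is an illustration under stated assumptions: the flow histogram is weighted by flow magnitude and L1-normalized (the exact binning used in the paper is not specified here), boxes are given as top-left corner plus width and height, and the two-element `appearance` vector is a hypothetical stand-in for the fc7 feature.

```python
import math


def location_feature(box, image_width, image_height):
    """Normalize a box (x, y, w, h), with (x, y) the top-left corner, by the image size."""
    x, y, w, h = box
    return [x / image_width, y / image_height, w / image_width, h / image_height]


def flow_histogram(flow_vectors, bins=12):
    """Magnitude-weighted, L1-normalized histogram of flow orientations."""
    hist = [0.0] * bins
    for dx, dy in flow_vectors:
        magnitude = math.hypot(dx, dy)
        if magnitude == 0.0:
            continue  # pixels without motion carry no orientation
        angle = math.atan2(dy, dx) % (2.0 * math.pi)  # map to [0, 2*pi)
        index = min(int(angle / (2.0 * math.pi) * bins), bins - 1)
        hist[index] += magnitude
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist


appearance = [0.3, 0.7]  # hypothetical stand-in for the fc7 appearance feature
motion = flow_histogram([(1.0, 1.0), (-1.0, -1.0), (0.0, 0.0)])
location = location_feature((320.0, 180.0, 64.0, 36.0), 1280.0, 720.0)
visual_feature = appearance + motion + location  # concatenation of the three parts
```

Normalizing the box by the image size keeps the location feature in the same range regardless of camera resolution.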
Goal-oriented Feature. At each time step, the planned path (with regard to distance in the vehicle-centric coordinates) can be obtained from the AD path planning module for an online driving task. As shown in Fig. 3, at each time step, discrete points are uniformly sampled with respect to distance to represent the planned path. Each sampled point is represented by $(x, y)$, which indicates the location of the point in the vehicle-centric coordinates in the real world. The radius of curvature, $R$, is directly related to the turning behavior, so it can be used to represent each point on the path; it can be calculated as in Eq. 2 given the location $(x, y)$. For a straight road, the value of $R$ approaches infinity, which is not appropriate for learning. So, we use the signed inverse radius of curvature, $IR = sign / R$, instead to describe a point on the planned path. At time $t$, $IR_t = [IR(1), IR(2), \ldots, IR(L)]$ is used to represent the whole planned path, where $IR(l)$ indicates the value of $IR$ at the next $l$ distance units and $L$ indicates the maximum future distance our method considers. One FC layer is applied on $IR_t$ to extract the goal-oriented feature, $g_t$.
$$R = \frac{\left(1 + y'^2\right)^{3/2}}{|y''|} \qquad (2)$$
where $y' = dy/dx$, $y'' = d^2y/dx^2$, and $sign = 1$ when turning right and $sign = -1$ when turning left.
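To make the path description concrete, the sketch below estimates the signed inverse radius of curvature from sampled path points with central finite differences. It rests on assumptions that are not taken from the paper: the path is given as lateral offset y over forward distance x, the samples are uniformly spaced in x (rather than in arc length), and a positive second derivative is treated as a right turn. The function name `inverse_radius` is hypothetical.

```python
import math


def inverse_radius(xs, ys):
    """Signed 1/R at the interior samples of a path y(x) with uniform spacing in x."""
    h = xs[1] - xs[0]
    values = []
    for k in range(1, len(xs) - 1):
        d1 = (ys[k + 1] - ys[k - 1]) / (2.0 * h)              # y'
        d2 = (ys[k + 1] - 2.0 * ys[k] + ys[k - 1]) / (h * h)  # y''
        radius_inv = abs(d2) / (1.0 + d1 * d1) ** 1.5         # 1 / R, from Eq. 2
        sign = 1.0 if d2 >= 0 else -1.0                       # assumed: +1 means right
        values.append(sign * radius_inv)
    return values


# Circular arc of radius 20 m bending towards positive y, and a straight path.
xs = [0.1 * k for k in range(51)]
right_turn = [20.0 - math.sqrt(400.0 - x * x) for x in xs]
straight = [0.0 for _ in xs]

ir_right = inverse_radius(xs, right_turn)
ir_left = inverse_radius(xs, [-y for y in right_turn])
ir_straight = inverse_radius(xs, straight)
```

For the arc, every value is close to 1/20 = 0.05, while the straight path gives exactly zero, which illustrates why the inverse radius is better behaved as a network input than R itself.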
A. Object Importance Estimation Dataset
Dataset Description. We collect 743 on-road driving videos at traffic intersections in the real world. Data collection was conducted at two different locations, Mountain View and Sunnyvale, CA, USA, totalling 6.3 hours. Each location contains 3 sessions of data. We believe that intersections contain more complicated driving scenarios and are more challenging for our task, so a short video is trimmed from each of the raw videos. Each short video contains one pass of an intersection (25 meters before and after the intersection). After trimming, 2.7 hours of useful data are obtained. All the annotations and our experiments are conducted on the trimmed videos.
Fig. 3. Illustration of the planned path description. Points are sampled (per distance unit) on the planned path obtained from the AD path planning module. Radius of curvature can be used to describe each point. Thus, a path can be represented by a discrete set of point descriptions.
Annotations. When preparing the important object annotations, an annotator was asked to watch the on-road driving video and imagine he/she was driving the ego vehicle. All the objects that are relevant for the ego vehicle’s control decision are tightly localized using bounding boxes. Note that the annotator was given the driving goal during the process of annotating each video sequence. For each video, important objects are labeled every 30 frames. The frame rate is 30 fps, so labels are acquired every second. Furthermore, in order to understand our performance on different driving goals, i.e., turn left, straight pass and turn right, per-frame goals are annotated. The goal of an image frame is annotated as ‘turn left’ if the vehicle is expected to turn left at the next frame, and so on.
Dataset Preprocessing. Important object labeling may be influenced by traffic signals. For example, when the red light is on, no objects are considered important since none of them will influence the driver’s control decision. Because we consider only the interactions with road users, we remove all the image frames in which no important objects are labeled because of the traffic signals.
Dataset Statistics. After preprocessing, 8,166 image frames are annotated, where 4,268 important objects are obtained.
Among all the labeled frames, 56.6% contain no important objects, 38.3% contain one important object and 5.1% contain multiple important objects. The numbers of annotated frames for turn left, straight pass and turn right are 1004, 6591 and 1016, respectively. The corresponding numbers of important objects are 375, 3573 and 320. Although we focus on traffic intersections, there are still more straight-pass frames than left/right-turn ones, which motivates us to evaluate the models separately for each goal in order to avoid the results being dominated by the straight-pass scenario.
Train/test sets and statistics. The dataset with 6 sessions is grouped into three parts, i.e., P1, P2 and P3. For cross validation, every model is evaluated on each part while being trained on the other two parts. We ensure that the data of each part was collected from different sessions, locations and times, and that the parts have similar numbers of videos and similar category distributions of road users. Tab. I and Fig. 4 show the characteristics of each part. As shown, the parts have very similar statistics.
Fig. 4. Statistics of the split parts. The 1st and 2nd rows show the annotated frame and important object numbers based on different per-frame goals. The 3rd row shows the percentages of vehicles and persons.
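The three-part cross-validation protocol can be sketched as below. The helper name `cross_validation_folds` is hypothetical, and the part names are simply the labels used in the text; the snippet only enumerates which parts are used for training and which one for evaluation in each fold.

```python
def cross_validation_folds(parts):
    """Return (train_parts, test_part) pairs: train on all parts except the held-out one."""
    folds = []
    for held_out in parts:
        train = [p for p in parts if p != held_out]
        folds.append((train, held_out))
    return folds


folds = cross_validation_folds(["P1", "P2", "P3"])
for train_parts, test_part in folds:
    print("train on", train_parts, "and evaluate on", test_part)
```

Each part is used exactly once for evaluation, so per-part results such as those reported for P1, P2 and P3 come from three separately trained models.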
B. Planned Path Approximation
Since the experiments are done in an off-line manner, data from the AD path planning module is not available. To evaluate our method, we recover (approximate) the planned path of our vehicle at a given time step as $\widehat{IR}_t = [\widehat{IR}(1), \widehat{IR}(2), \ldots, \widehat{IR}(L)]$, where $\widehat{IR}(l)$ is calculated as in Eq. 3. We believe that it is easy to replace $\widehat{IR}$ with $IR$ when the AD path planning module is available.
$$\widehat{IR}(l) = \alpha \, \frac{\omega(l)}{v(l)} \qquad (3)$$
where $\omega(l)$ is the angular velocity, obtained from the yaw rate $yr(l)$ (angle per second), and $v(l)$ is the velocity (kilometers per hour) at the next $l$ distance units. Distance is measured in discrete distance units ($L = 40$ units span roughly 10 meters in our experiments). $\alpha$ is a scale number.
Both yaw rate and velocity can be obtained from the CAN bus sensors. Yaw rate values are negative when turning left and positive when turning right.
Examples of $\widehat{IR}$ for left turn, straight and right turn are shown in Fig. 5. As we can see, there are clearly discriminative patterns among the three driving goals, e.g., left turns have negative troughs, right turns have positive crests and straight passes stay around zero.
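The approximation from CAN bus signals can be illustrated with the sketch below, which relies on the kinematic relation that curvature equals angular velocity divided by linear velocity. The unit conversions (degrees per second to radians per second, kilometers per hour to meters per second), the `min_speed` guard for a near-stationary vehicle and the function name are assumptions made for this example, not details confirmed by the text.

```python
import math


def approximate_inverse_radius(yaw_rates_deg_s, speeds_kmh, alpha=1.0, min_speed=0.1):
    """Approximate the signed inverse radius per future step from yaw rate and speed."""
    path = []
    for yaw_rate, speed in zip(yaw_rates_deg_s, speeds_kmh):
        omega = math.radians(yaw_rate)  # assumed unit: degrees/s -> rad/s
        v = speed / 3.6                 # km/h -> m/s
        if abs(v) < min_speed:
            path.append(0.0)            # assumption: no curvature when nearly stopped
        else:
            path.append(alpha * omega / v)
    return path


# Right turn (positive yaw rate), straight driving, left turn (negative yaw rate).
ir_hat = approximate_inverse_radius([10.0, 0.0, -10.0], [36.0, 36.0, 36.0])
```

With a yaw rate of 10 degrees per second at 36 km/h (10 m/s), the value is about 0.0175 per meter, i.e. a turning radius of roughly 57 meters; the sign follows the yaw rate convention stated above.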
C. Baselines
Fig. 5. Examples of the approximated planned path description given different driving goals.
TABLE I OVERALL STATISTICS OF THE SPLIT PARTS (P1, P2 AND P3).
Upperbound. We estimate importance scores for all the object proposals (tracklinks), so the final results depend on the quality of the detection and tracking algorithm. In this baseline, we assign the correct importance label to each proposal link. Thus, it is the upper bound of our method, and all its mistakes are due to detection and tracking errors.
Random Chance. In this baseline, we randomly assign a value in [0, 1] to each proposed tracklink as its importance probability. So, it is the lower bound of our method.
Visual Model. It contains only the first branch of our framework, which has only the visual features as input to the LSTM model. We want to see quantitatively how much the goal information improves the prediction results.
Visual Model-Image. This model does not utilize temporal information and predicts object importance scores by observing only the target image frame. To do that, we replace the LSTM model with one FC layer. This baseline mirrors the standard object detection framework and evaluates how much the temporal information helps.
Goal-Geometry Model. This baseline has the same two-branch structure as our method, except that the appearance feature is removed and only the motion and location features are used. Comparing it with our method shows how well the method performs when the semantic local context is not given.
D. Implementation Details
A tracking-by-detection [2] framework is used to conduct object tracking, where Faster R-CNN [19] with ResNet-101 is used for detection and SORT [3] is used for tracking. Some of the objects may not appear in the first frame or may not last until the end of the clip. We keep only the objects that still exist at the last frame and pad zeros at the front if they do not start at the first frame.
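The track filtering and front padding described above can be sketched as follows. The data layout (a dictionary from 0-based frame index to feature vector) and the function name `build_track_matrix` are assumptions for illustration; filling gaps in the middle of a track with zeros is also an assumption, since the text only specifies padding at the front.

```python
def build_track_matrix(track, clip_length, feature_dim):
    """Turn a track {frame_index: feature} into a clip_length x feature_dim matrix.

    Returns None when the object is absent from the last frame of the clip.
    """
    last_frame = clip_length - 1
    if last_frame not in track:
        return None  # only objects still present at the last frame are kept
    first_frame = min(track)
    rows = []
    for t in range(clip_length):
        if t < first_frame:
            rows.append([0.0] * feature_dim)  # zero padding before the track starts
        else:
            # Assumption: frames missing inside the track are also filled with zeros.
            rows.append(list(track.get(t, [0.0] * feature_dim)))
    return rows


late_start = {2: [1.0, 2.0], 3: [3.0, 4.0]}
early_exit = {0: [1.0, 2.0], 1: [3.0, 4.0]}
padded = build_track_matrix(late_start, clip_length=4, feature_dim=2)
dropped = build_track_matrix(early_exit, clip_length=4, feature_dim=2)
```

Padding at the front keeps every proposal the same length, so all tracks in a clip can be fed to the shared recurrent model as one batch.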
The length of the video clip, n, is set to 30. We set L=40, which is roughly 10 meters in the real world. The scale number in Eq. 3 is set to 1. For the visual model, we set the size of the LSTM hidden layer to 256, and the size of the FC layer in the goal model is set to 16. For the image-based visual model, the FC layer has 1,024 units. A weighted cross-entropy loss is used to optimize our model and all the baselines. The weights for positive and negative samples are inversely proportional to their sample numbers in one training batch.
TABLE II COMPARISON BETWEEN OUR GOAL-VISUAL MODEL AND THE BASELINES IN TERMS OF AVERAGE PRECISION (%) ON turn left (LT), straight pass
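A minimal sketch of a batch-level weighted cross-entropy loss is given below. The specific normalization (each class weight is the reciprocal of its count in the batch, and the loss is divided by the sum of the weights) is an assumption; the text only states that the weights are inversely proportional to the per-batch sample numbers. Function names are hypothetical.

```python
import math


def class_weights(labels):
    """Weights inversely proportional to the number of positives and negatives."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    w_pos = 1.0 / n_pos if n_pos else 0.0
    w_neg = 1.0 / n_neg if n_neg else 0.0
    return w_pos, w_neg


def weighted_cross_entropy(probs, labels, eps=1e-12):
    """Weighted binary cross-entropy; probs are predicted importance probabilities."""
    w_pos, w_neg = class_weights(labels)
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = w_pos if y == 1 else w_neg
        p_true = p if y == 1 else 1.0 - p
        total += -w * math.log(max(p_true, eps))  # clamp to avoid log(0)
        norm += w
    return total / norm if norm else 0.0


labels = [1, 0, 0, 0]  # one important object among four proposals
w_pos, w_neg = class_weights(labels)
loss = weighted_cross_entropy([0.5, 0.5, 0.5, 0.5], labels)
```

With this weighting, the single positive sample contributes as much to the loss as the three negatives combined, which counteracts the imbalance between important objects and background.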
E. Experimental Results
Comparisons between our method, i.e., Goal-Visual Model, and the baselines using average precision (AP) are shown in Tab. II. Our method largely outperforms Random Chance (“by-chance” approach). Comparing Visual Model with Visual Model-Image, we see that the temporal information is essential for our task. Without temporal modelling, the overall AP drops by 32.3%. With the goal information, our Goal-Visual Model outperforms the Visual Model by about 2% in terms of AP.
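For readers less familiar with the metric, average precision can be computed as in the sketch below. It uses the non-interpolated definition (mean of the precision at the rank of each positive), which is an assumption since the exact evaluation protocol is not spelled out here. The hypothetical `num_missed` argument counts labeled important objects that have no proposal at all, which is one way to see why even perfectly scored proposals can stay below 100% AP.

```python
def average_precision(scores, labels, num_missed=0):
    """Non-interpolated AP over scored proposals with binary labels (1 = important)."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    n_pos = sum(labels) + num_missed  # missed objects count as unretrieved positives
    if n_pos == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, k in enumerate(order, start=1):
        if labels[k] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / n_pos


ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
ap_with_miss = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], num_missed=1)
```

In the first call the positives sit at ranks 1 and 3, giving (1/1 + 2/3) / 2; adding one missed object lowers the result because that positive can never be retrieved.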
To evaluate the effectiveness of the local visual scene context, our method is compared with the Goal-Geometry Model. The Goal-Geometry Model captures only the motion and location information of a road user and combines it with the goal of the ego vehicle, without knowing the scene semantics. As shown in Tab. II, our method largely outperforms this baseline, which demonstrates the usefulness of the scene context.
To evaluate our performance on different driving goals, we validate our method and the baselines on turn left, straight pass and turn right frames separately. Intuitively, our goal model should help more on the turn left and turn right cases compared to the straight pass. From the results in Tab. II, our method largely improves the Visual Model by 9% AP for turn left and by 9.5% for turn right.
We are also interested in our performance on different object categories, i.e., person and vehicle. Since we do not have ground truth for the object categories, we generate the class labels using the detection results. We match each labeled important object to the detected object with which it has the largest Intersection over Union (IoU), provided that the IoU > 0.5. It is not guaranteed that every important object will find a match, since the detector is not perfect. However, our experiment shows that around 95% of important objects are matched, so we ignore the small number of unmatched ones. Comparisons between our method and the baselines are shown in Tab. III, which demonstrates that our model outperforms all the baselines in terms of mAP. Specifically, we observe that performance on the ‘person’ category is largely improved with goal information. The Goal-Visual Model improves by around 6% on ‘person’ compared to the Visual Model. This may be due to the fact that most important persons are those who are walking across the road. It is essential for the model to know where the ego vehicle is going in order to infer whether a pedestrian on a certain side is important.
Qualitative results on turn left and turn right are shown in Fig. 6. As shown, knowing the driving goal can help capture important objects on (or coming onto) our future path, e.g., turn left (a)(c)(d) and turn right (d). It can also filter out objects that cannot block our way given their motion and location, e.g., turn left (b) and turn right (a)(b)(c).
Three major failure cases are shown in Fig. 7. The first one is caused by bad detection/tracking results. When the detection of the important object fails, there is no way for our framework to correct it. That is why our upper bound is not 100% AP. The second case is a result of missing global scene context. The comparison shows that, of the two parked cars, one is considered important but the other one is not. Based on our observation, the annotator tends to annotate the parked car if the road is narrow. The third case is due to the lack of communication among road users. For example, if we removed the labeled car in the last image, all the pedestrians would be important. They are not labeled as important because there is a closer car that stops the ego vehicle from hitting them. Since our method does not model the interactions among road users, it is hard for an object to know the status of other objects. Future work is needed to address these three failure cases.
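The IoU-based matching used to assign categories can be sketched as follows. Boxes are assumed to be given as corner coordinates (x1, y1, x2, y2), and `match_labels` is a hypothetical helper that returns the category of the best-overlapping detection for each labeled object, or `None` when no detection exceeds the threshold.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_labels(labeled_boxes, detections, threshold=0.5):
    """For each labeled box, return the category of the detection with the largest
    IoU if that IoU exceeds the threshold, otherwise None."""
    matched = []
    for box in labeled_boxes:
        best_iou, best_category = 0.0, None
        for det_box, category in detections:
            overlap = iou(box, det_box)
            if overlap > best_iou:
                best_iou, best_category = overlap, category
        matched.append(best_category if best_iou > threshold else None)
    return matched


detections = [((1.0, 0.0, 11.0, 10.0), "vehicle"), ((0.0, 0.0, 4.0, 4.0), "person")]
labeled = [(0.0, 0.0, 10.0, 10.0), (100.0, 100.0, 110.0, 110.0)]
categories = match_labels(labeled, detections)
```

The second labeled box overlaps no detection, so it stays unmatched, which corresponds to the small fraction of important objects that are ignored in the per-category evaluation.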
F. Are Road Users Equally Important?
As a proof of concept, we propose a binary brake prediction (BBP) framework with object importance as an input. BBP is a simplified version of the brake prediction task that uses binary labels instead of continuous brake values (which can be obtained from CAN bus data); each binary label is derived from the corresponding continuous brake value. The input of BBP is a video clip and the output is the brake probability of the ego vehicle in the last frame.
We assume that brakes depend only on the interaction between the road users and the ego vehicle, since we have removed the traffic-light-related frames from our dataset. The visual model in Fig. 2 is used to predict the brake score, $s$, of the ego vehicle at time $t$ given road user $i$ in the input video clip. The final brake score is obtained by fusing the predicted scores of all the road users in a weighted-sum manner. Our model uses the predicted importance probability as the weight of each object. Our intuition is that more important objects have bigger impacts on the brake decision. The baseline uses the same weight (0.5) for all the objects to indicate that all objects in the scene contribute equally to the brake decision.
TABLE III COMPARISON BETWEEN OUR MODEL AND THE BASELINES IN TERMS OF MEAN AVERAGE PRECISION (%) BASED ON DIFFERENT OBJECT CATEGORIES.
Fig. 6. Qualitative comparisons between the Visual Model (1st and 3rd rows) and our Goal-Visual Model (2nd and 4th rows) on the turn left and turn right frames. The red circles indicate ground truth, and the blue boxes are the detected objects with their importance scores. For concise visualization, only objects with importance scores above 0.5 are shown.
Fig. 7. Major failure cases of our method. The examples in the 1st column are due to missed detections, those in the 2nd column are due to the lack of global scene context and those in the 3rd column are due to the lack of interaction modeling among road users. Red circles and blue boxes indicate ground truth and our results, respectively.
Experimental results suggest that our method improves over the baseline by 4.3%, 1.7% and 1.3% AP on P1, P2 and P3, respectively, which demonstrates the potential usefulness of object importance.
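The weighted-sum fusion used for brake prediction can be illustrated with the toy example below. The per-object brake scores and importance probabilities are made-up numbers, the function name is hypothetical, and the plain (unnormalized) weighted sum is an assumption, since the text does not state whether the fused score is normalized.

```python
def fuse_brake_scores(brake_scores, weights):
    """Fuse per-object brake scores into one score with a weighted sum."""
    if len(brake_scores) != len(weights):
        raise ValueError("one weight is needed per road user")
    return sum(w * s for w, s in zip(weights, brake_scores))


per_object_scores = [0.9, 0.2, 0.1]  # hypothetical brake scores, one per road user
importance = [0.8, 0.1, 0.05]        # hypothetical predicted importance probabilities

ours = fuse_brake_scores(per_object_scores, importance)
baseline = fuse_brake_scores(per_object_scores, [0.5] * len(per_object_scores))
```

In this example the importance weights let the one road user that matters dominate the fused score, whereas the uniform weight of 0.5 treats all three objects alike.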
We propose a new problem, Object Importance Estimation (OIE) in on-road driving videos, to understand the human visual selection mechanism under the driving context. We present a novel framework to handle the problem in which both the visual dynamics of road users and the goal of the ego vehicle are taken into consideration. To evaluate our approach, we collect an on-road driving dataset and annotate the important objects given the video clip. Experimental results demonstrate the effectiveness of our idea. Moreover, we explore the potential usage of OIE by incorporating it into a binary brake prediction framework. Experiments show that important objects can help to improve the prediction.
[1] S. Alletto, A. Palazzi, F. Solera, S. Calderara, and R. Cucchiara. Dr(eye)ve: a dataset for attention-based tasks with applications to autonomous and assisted driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 54–60, 2016.
[2] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[3] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 3464–3468. IEEE, 2016.
[4] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In European conference on computer vision, pages 428–441. Springer, 2006.
[5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[7] L. Fridman, P. Langhans, J. Lee, and B. Reimer. Driver gaze region estimation without use of eye movement. IEEE Intelligent Systems, 31(3):49–56, 2016.
[8] M. Gao, A. Li, R. Yu, V. I. Morariu, and L. S. Davis. C-wsl: Count-guided weakly supervised localization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 152–168, 2018.
[9] M. Gao, M. Xu, L. S. Davis, R. Socher, and C. Xiong. Startnet: Online detection of action start in untrimmed videos. arXiv preprint arXiv:1903.09868, 2019.
[10] M. Gao, R. Yu, A. Li, V. I. Morariu, and L. S. Davis. Dynamic zoom-in network for fast object detection in large images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[11] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[16] J. Kim and J. F. Canny. Interpretable learning for self-driving cars by visualizing causal attention. In ICCV, pages 2961–2969, 2017.
[17] A. Palazzi, D. Abati, S. Calderara, F. Solera, and R. Cucchiara. Predicting the driver’s focus of attention: the dr(eye)ve project. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[18] N. Pugeault and R. Bowden. How much of driving is preattentive? IEEE Transactions on Vehicular Technology, 64(12):5424–5438, 2015.
[19] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[20] L. Simon, J.-P. Tarel, and R. Brémond. Alerting the drivers about road signs with poor visual saliency. In Proc. 2009 IEEE Intelligent Vehicles Symposium, pages 48–53, 2009.
[21] B. Singh and L. S. Davis. An analysis of scale invariance in object detection–snip. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3578–3587, 2018.
[22] A. Tawari and B. Kang. A computational framework for driver’s visual attention using a fully convolutional architecture. In Intelligent Vehicles Symposium (IV), 2017 IEEE, pages 887–894. IEEE, 2017.
[23] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
[24] G. Underwood, K. Humphrey, and E. Van Loon. Decisions about objects in real-world scenes are influenced by visual saliency before and during their inspection. Vision research, 51(18):2031–2038, 2011.
[25] Y. Xia, D. Zhang, A. Pozdnukhov, K. Nakayama, K. Zipser, and D. Whitney. Training a network to attend like human drivers saves it from common but misleading loss functions. arXiv preprint arXiv:1711.06406, 2017.
[26] M. Xu, M. Gao, Y.-T. Chen, L. S. Davis, and D. J. Crandall. Temporal recurrent networks for online action detection. arXiv preprint arXiv:1811.07391, 2018.
[27] Y. Yao, M. Xu, C. Choi, D. J. Crandall, E. M. Atkins, and B. Dariush. Egocentric vision-based future vehicle localization for intelligent driving assistance systems. arXiv preprint arXiv:1809.07408, 2018.
[28] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis. Learning rich features for image manipulation detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1053–1061, 2018.