Object detection is a crucial task in the computer vision community; it aims to detect different objects in images together with their classes and locations. A camera can give us a better semantic understanding of visual scenes. However, it is not a robust sensor under severe conditions in autonomous driving, such as weak or strong lighting and bad weather, which lead to under- or over-exposed, blurred, or occluded images. All of these would impact the stability and reliability of vision-based autonomous driving systems.
Radar, on the other hand, is relatively more reliable in most harsh environments, e.g., dark, rain, fog, and other low-visibility scenarios. Frequency modulated continuous wave (FMCW) radar operates in the millimeter-wave (MMW) band (30-300 GHz), whose frequencies are lower than those of visible light, and thus has the following properties: 1) MMW has great capability to penetrate through fog, smoke, and dust; 2) the huge bandwidth and high working frequency give FMCW radar great range detection ability. Despite these good properties, it is still very challenging to allow radar to distinguish different classes of objects because 1) semantic information from radio signals is difficult to obtain; 2) features to distinguish different objects usually cannot be extracted from one single radar frame; 3) many surrounding obstacles, like buildings and trees, result in a large amount of noise.
Figure 2: Our vision-radio cross-modal supervision framework. First, we use vision information to detect and 3D localize the objects. Then, the results from vision are transferred to the radio domain as supervision for the RODNet. Finally, the RODNet learns to detect objects purely from radio signals. Overall, in good lighting/weather conditions, the vision-based solution can handle object detection, while in vision-fail conditions, radio signals are utilized as the only reliable input to our system.
In this paper, we address the object detection problem using an innovative machine learning method, which takes advantage of a cross-modal supervision mechanism between the camera and radar. This method can accurately detect objects purely based on the radio signals collected by an FMCW radar. More specifically, we propose a novel radio object detection network (RODNet), which takes radar reflection range-azimuth heatmaps (RAMaps, discussed in Section 3.1) as the input and estimates object confidence maps (ConfMaps, discussed in Section 3.2). The ground truth for training is systematically labeled by a camera-based object detection and 3D localization system. From the ConfMaps, we can infer the object classes and locations through the proposed post-processing method, called location-based non-maximum suppression (L-NMS, discussed in Section 4.2).
To efficiently train our RODNet, we implement a vision-radio cross-modal supervision approach, where the rich information extracted from the camera is used to label the ground truth systematically for radio signals. First, camera-based object detection is conducted [32, 18] to detect objects in 2D images. Then, the detected objects are tracked to obtain smoothed trajectories after robust regression. The representative point of each object, denoting an object point on the ground, is then projected to 3D camera coordinates by an object 3D localization technique [41]. After that, the 3D object locations are transformed into radar range-azimuth (polar) coordinates, i.e., bird's-eye view, with sensor calibration. Thus, the object knowledge learned from vision is transferred from the camera domain to the radar domain, resulting in good semantic information for cross-modal supervision. The flowchart of this cross-modal supervision is shown in Figure 2.
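As an illustration of the last transformation step, the following minimal sketch maps a ground point from camera coordinates to radar range-azimuth coordinates. It assumes a camera frame with x pointing right and z pointing forward, and a purely translational calibration; the function name and the `radar_offset` parameter are hypothetical and not part of the described system.

```python
import math

def camera_to_range_azimuth(x, z, radar_offset=(0.0, 0.0)):
    """Map a ground point from camera coordinates to radar (range, azimuth).

    x: lateral position in meters (positive to the right of the camera).
    z: forward depth in meters.
    radar_offset: hypothetical (dx, dz) position of the radar expressed in
        the camera frame, standing in for the sensor calibration.
    """
    # Shift the point into a frame centered on the radar.
    xr = x - radar_offset[0]
    zr = z - radar_offset[1]
    # Range is the bird's-eye-view distance to the radar.
    rng = math.hypot(xr, zr)
    # Azimuth is measured from the forward axis; 0 rad is straight ahead.
    azimuth = math.atan2(xr, zr)
    return rng, azimuth
```

A real calibration would generally also include a rotation between the two sensors; only the translation is modeled here to keep the sketch short.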
We train and evaluate the RODNet using our self-collected dataset, CRUW, which contains more than 400K vision-radio synchronized frames from various autonomous driving scenarios. Instead of using radar points as the data format, we choose radar frequency data (RAMaps), which contain object speed and texture information. The benefits of using RAMaps are described in Section 3.1. Our CRUW dataset is, to the best of our knowledge, the first dataset containing synchronized stereo videos and radar frequency data for autonomous driving. In extensive experiments, our RODNet achieves 77.75% mAP in various scenarios, whether or not the objects are visible to the cameras.
Overall, our main contributions are the following:
• A novel radio-only object detection network, RODNet, for accurate object detection in autonomous driving without vision information.
• A vision-radio cross-modal supervision framework for training the radio object detection network without laborious and inconsistent human labeling.
• A new dataset, containing both the camera and radar data, which is valuable for camera/radar cross-modal learning tasks.
• A new evaluation metric for radio object detection tasks.
The rest of this paper is organized as follows. Related works for vision and radio data learning are presented in Section 2. The proposed vision-radio cross-modal supervision framework is introduced in Section 3, with training and inference of our RODNet being explained in Section 4. In Section 5, we introduce our CRUW dataset used for our training and evaluation. Then, our implementation details, evaluation metrics, and experimental results are shown in Section 6. Finally, we conclude our work in Section 7.
2.1. Learning of Vision Data
Image-based object detection and tracking. Image-based object detection [17, 32, 18, 10, 25, 31, 24, 14] aims to detect every object with its class and precise bounding box location from RGB images, which is fundamental and crucial for camera-based autonomous driving. Given the object detections, most tracking algorithms focus on exploiting the association between the detected objects in consecutive frames, the so-called tracking-by-detection framework [8, 42, 38, 40]. Among them, the TrackletNet Tracker (TNT) [40] is an effective and robust tracker to perform multiple object tracking (MOT) of the detected objects with a static or moving camera. Once the same objects among several consecutive images are associated, the missing and erroneous detections can be recovered or corrected, resulting in better subsequent 3D localization performance. Thus, we incorporate this tracking technique into the vision part of our framework.
Visual object 3D localization. Object 3D localization has attracted much interest in the autonomous driving community [35, 36, 27, 28, 6]. One idea is to localize vehicles by estimating their 3D structures using a CNN, e.g., 3D bounding boxes [27] and 3D keypoints [28, 6, 22]. These methods then use a pre-defined 3D vehicle model to solve the deformations, resulting in accurate 3D keypoints as well as the vehicle location. Another idea [35, 36] is to develop a real-time monocular structure-from-motion (SfM) system, taking into account different kinds of cues, including SfM cues (3D points and ground plane) and object cues (bounding boxes and detection scores). Although these two kinds of works achieve favorable performance in object 3D localization, they only work for vehicles since only the vehicle structure information is considered. To address this limitation, an accurate and robust object 3D localization system, based on the detected and tracked bounding boxes of objects, is proposed in [41]; it is reported to work for most common moving objects in road scenes, such as cars, pedestrians, and cyclists. Thus, we adopt this 3D localization system as our camera annotation method.
2.2. Learning of Radar Data
Radar object classification. Significant research in radar object classification has demonstrated its feasibility as a good alternative when cameras fail to provide good results [19, 5, 13, 23, 11]. With handcrafted feature extraction, Heuel et al. [19] classify objects using a support vector machine (SVM) to distinguish cars and pedestrians. Angelov et al. [5], in contrast, use neural networks to extract features from the short-time Fourier transform (STFT) heatmap, evaluating on three classes, i.e., car, pedestrian, and cyclist. However, the above methods only focus on classification, which assumes that a single object has already been appropriately identified in the scene, and are therefore not applicable to complex autonomous driving scenarios. Recently, a radar object detection method was proposed in [15], which combines a statistical detection algorithm, CFAR [33], with a neural network classifier, VGG16 [34]. However, it easily produces many false positives, i.e., obstacles detected as objects. Besides, the laborious human annotations required by this method are usually impossible to obtain.
Cross-modal learning and supervision. Recently, the concept of cross-modal learning has been discussed in the deep learning community [21, 39, 30, 20]. This concept aims to transfer or fuse information between two different signal modalities in order to help train neural networks. Specifically, RF-Pose [43] introduces the cross-modal supervision idea into wireless signals to achieve human pose estimation based on radio signals in the WiFi frequency range. As human annotations for wireless signals are nearly impossible to obtain, RF-Pose uses a computer vision technique, i.e., OpenPose [12], to generate human pose annotations from the camera. Then, a neural network is trained to predict human poses purely from the radio signals and can even accurately estimate poses when the human is behind a wall. As for autonomous driving applications, the authors of [26] propose a vehicle detection method using LiDAR information for cross-modal learning. However, our work is different from theirs in the following aspects: 1) their work only considers vehicles as the target object class, while we detect pedestrians, cyclists, and cars; 2) they use LSTM to handle temporal information, while we use 3D convolutions; 3) the scenario of their dataset is mostly highway without noisy obstacles, which is easier for radio object detection, while we deal with all kinds of traffic scenarios.
Our proposed vision-radio cross-modal supervision framework (Figure 2) can be divided into two parts, i.e., vision and radio. The vision part acts as a teacher, using rich semantic information from cameras to perform image-based object detection, tracking, and 3D localization. The radio part consists of a student network, described in Section 4, which detects objects purely from the input radio signals in the format of RAMaps, addressed in Section 3.1. This RODNet is trained using the object information from vision as the supervision, which is achieved through the confidence map defined in Section 3.2.
3.1. Radio Signal Processing and Properties
As our student network takes radar data as the input, we need to understand the signal characteristics of the radio signals. In this section, we first introduce our radar data processing steps and then analyze some radio signal properties which are useful for our task.
We use a range-azimuth heatmap representation, named RAMap, to represent our radar signal reflections. A RAMap can be described as a bird's-eye view, where the x-axis shows azimuth (angle) and the y-axis shows range (distance). An FMCW radar transmits continuous chirps and receives the reflected echoes from the obstacles. After the echoes are received and processed, we apply the fast Fourier transform (FFT) to the samples to estimate the range of the reflections. Then, a low-pass filter (LPF) is utilized to remove the high-frequency noise across all chirps in each frame, with frames produced at a rate of 30 FPS. After the LPF, we conduct a second FFT across the receiver antennas to estimate the azimuth of the reflections and obtain the final RAMaps.
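The processing chain above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the raw data layout `(n_chirps, n_rx, n_samples)`, the function name `compute_ramap`, and the number of azimuth bins are hypothetical, and a plain average over chirps stands in for the low-pass filter, whose exact design is not specified in the text.

```python
import numpy as np

def compute_ramap(adc, n_angle_bins=64):
    """Build one complex range-azimuth map from raw FMCW samples.

    adc: complex array of shape (n_chirps, n_rx, n_samples) for one frame
         (an assumed layout, not taken from the paper).
    Returns an array of shape (n_samples, n_angle_bins): rows are range
    bins, columns are azimuth bins.
    """
    # 1) Range FFT along the fast-time samples of every chirp.
    range_fft = np.fft.fft(adc, axis=2)
    # 2) Stand-in low-pass filter: average over the chirps of the frame.
    filtered = range_fft.mean(axis=0)  # (n_rx, n_samples)
    # 3) Angle FFT across the receive antennas, zero-padded to n_angle_bins.
    angle_fft = np.fft.fft(filtered, n=n_angle_bins, axis=0)
    # Put zero azimuth in the middle of the azimuth axis.
    angle_fft = np.fft.fftshift(angle_fft, axes=0)
    return angle_fft.T
```

The output stays complex, so the real and imaginary parts remain available to the network as separate channels.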
There is another radar data format, radar points, as mentioned in [9, 7]. Typically, radar points can be derived from our RAMaps using peak detection [33], but the process is not reversible. In this work, we utilize radar frequency data instead of radar points as our data format because of the following special properties, which need to be considered in our object detection task.
• Rich speed information. According to the principle of the radio signal, rich speed information is included in radar data. The speed and its variation over time carry texture information, target motion details, etc. For example, the speed information of a non-rigid body, like a pedestrian, is usually messy, while for a rigid body, like a car, it should be more consistent due to the Doppler effect. In order to utilize this information, we need to consider multiple consecutive radar frames, instead of one single frame, as the system input.
• Inconsistent resolution. Radar usually has high resolution in range but low resolution in angle due to the limitations of radar specifications, like the number of antennas and the distance between them.
• Different representation. Radio signals are usually represented as complex numbers containing frequency information. This kind of data is not commonly handled by a typical neural network.
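To make the inconsistent-resolution property concrete, the sketch below evaluates two standard FMCW relations: range resolution c/(2B) for sweep bandwidth B, and an approximate angular resolution of 2/(N cos θ) radians for a uniform array of N antennas with half-wavelength spacing. The bandwidth and antenna count in the usage note are hypothetical values chosen for illustration, not the specifications of the radar used in this work.

```python
import math

C_LIGHT = 3.0e8  # speed of light in m/s (rounded)

def range_resolution(bandwidth_hz):
    """Range resolution in meters for a chirp sweeping bandwidth_hz."""
    return C_LIGHT / (2.0 * bandwidth_hz)

def angular_resolution(n_antennas, theta_rad=0.0):
    """Approximate angular resolution in radians.

    Assumes a uniform linear array with half-wavelength element spacing;
    theta_rad is the look direction measured from boresight.
    """
    return 2.0 / (n_antennas * math.cos(theta_rad))
```

For example, a hypothetical 4 GHz sweep gives `range_resolution(4e9)` of about 0.0375 m, while a hypothetical 8-antenna array gives `angular_resolution(8)` of 0.25 rad (roughly 14 degrees) at boresight, which shows why the range axis is much finer than the azimuth axis.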
3.2. Camera Annotation Generation
A camera is a 2D sensor that projects the 3D world into 2D images. On the other hand, radar has the capability of obtaining 3D information. To build a bridge between camera and radar, we take advantage of a recent work on an accurate and robust system for visual object 3D localization [41]. This system takes a CNN-inferred depth map as the input, combines it with CNN detections of visual objects and multi-object tracking results, and estimates object classes as well as their 3D locations relative to the camera. This 3D location can serve as a bridge between the two different sensors.
After objects are accurately localized in the 3D coordinates, we need to transform the results into a proper representation that is compatible with our RODNet. Following the idea in human pose estimation [12] of using heatmaps to represent human joint locations, we define the confidence map (ConfMap) in range-azimuth coordinates to represent object locations. One set of ConfMaps has multiple channels, and each channel represents one specific class label. The value at a pixel in the c-th channel represents the probability of an object with class c existing at that range-azimuth location. A naïve idea to generate ConfMaps is to set the ConfMap values at the object 3D locations from cameras to 1. However, this idea is neither good for extracting features from the whole reflection pattern in RAMaps nor robust to errors in the camera annotations. Thus, we also set the ConfMap values around the object locations using a Gaussian distribution. The mean of the Gaussian distribution is the object location, and the variance is related to the object class and 3D distance (see the ConfMaps in Figure 2).
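A minimal sketch of this ConfMap generation is given below. The text only states that the variance depends on the object class and 3D distance, so the rule in `example_sigma` (per-class base widths, wider blobs for nearer objects) is a hypothetical placeholder, as are the function names and the pixel-coordinate input format.

```python
import numpy as np

def make_confmaps(objects, n_classes, height, width, sigma_fn):
    """Render Gaussian confidence maps on a range-azimuth grid.

    objects: list of (class_id, row, col) tuples, where row indexes range
        and col indexes azimuth (pixel units, an assumed input format).
    sigma_fn: callable (class_id, row) -> standard deviation in pixels.
    Returns an array of shape (n_classes, height, width) in [0, 1].
    """
    confmaps = np.zeros((n_classes, height, width), dtype=np.float32)
    rows, cols = np.mgrid[0:height, 0:width]
    for class_id, r0, c0 in objects:
        sigma = sigma_fn(class_id, r0)
        blob = np.exp(-((rows - r0) ** 2 + (cols - c0) ** 2) / (2.0 * sigma ** 2))
        # Overlapping objects of the same class keep the larger confidence.
        confmaps[class_id] = np.maximum(confmaps[class_id], blob)
    return confmaps

def example_sigma(class_id, range_row):
    """Hypothetical variance rule: larger classes and nearer objects are wider."""
    base = [2.0, 3.0, 4.0][class_id]  # placeholder per-class widths
    return base * (1.0 + 10.0 / (range_row + 10.0))
```

Taking the element-wise maximum rather than the sum keeps every value within [0, 1], which matches the interpretation of ConfMap values as probabilities.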
In this section, the proposed radio object detection network (RODNet) is introduced. It takes a snippet of RAMaps as input, passes it through a 3D CNN, and outputs a set of ConfMaps. Then, a post-processing method, called location-based non-maximum suppression (L-NMS), is adopted to recover the final detections from the ConfMaps.
4.1. RODNet
According to Section 3.1, the RODNet needs to have the following capabilities: 1) Extract temporal information; 2) Handle multiple spatial scales; 3) Be able to feed in complex number data. Thus, we design the RODNet as follows:
• We use 3D convolution layers in our RODNet to extract both spatial and temporal information.
• We implement 3D autoencoder, 3D stacked hourglass, and 3D stacked hourglass with temporal inception layers for our RODNet architecture to handle the radar data with various spatial and temporal scales.
• For radar data as complex numbers, we treat the real and imaginary as two different channels in one RAMap.
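The last design choice can be illustrated with a short NumPy sketch that turns a complex RAMap snippet into a two-channel real tensor. The `(T, H, W)` input layout and the channel-first output layout are assumptions made for this example.

```python
import numpy as np

def complex_to_channels(ramap_snippet):
    """Split complex RAMaps into real and imaginary channels.

    ramap_snippet: complex array of shape (T, H, W), i.e., T consecutive
        RAMaps (an assumed layout).
    Returns a float32 array of shape (2, T, H, W); channel 0 holds the real
    part and channel 1 holds the imaginary part.
    """
    stacked = np.stack([ramap_snippet.real, ramap_snippet.imag], axis=0)
    return stacked.astype(np.float32)
```

A channel-first layout of this kind is what 3D convolution layers in common deep learning frameworks expect once a batch dimension is added in front.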
The three different network architectures for the RODNet are shown in Figure 3, named 3D Convolution-Deconvolution (RODNet-CDC), 3D stacked hourglass (RODNet-HG), and 3D stacked hourglass with temporal inception (RODNet-HGwI), respectively. RODNet-CDC is a shallow network adapted from [43], but we also squeeze the features in the temporal domain to better extract temporal information. It contains six 3D convolutional layers and three 3D transpose convolutional layers. The RODNet-HG is adapted from [29], but we replace the 2D convolution layers with 3D convolution layers and adjust the parameters for our task. As for the RODNet-HGwI, we replace the 3D convolution layers in each hourglass with temporal inception layers [37] with different temporal kernel scales (5, 9, 13) to extract temporal features of different lengths from the RAMaps.
Figure 3: The illustration of architectures for our three RODNet models.
Overall, our RODNet is fed with a snippet of RAMaps $R$ with dimension $(2, \tau, w, h)$ and predicts ConfMaps $\hat{D}$ with dimension $(C, \tau, w, h)$, where $\tau$ represents the snippet length, $C$ is the number of object classes, and $w$ and $h$ are the width and height of the RAMaps or ConfMaps, respectively. Thus, RODNet predicts separate ConfMaps for individual radar frames. With systematically derived camera annotations, we train our RODNet using binary cross entropy loss,
$$\ell = -\sum_{c} \sum_{i,j} \left[ D^{c}_{i,j} \log \hat{D}^{c}_{i,j} + \left(1 - D^{c}_{i,j}\right) \log\left(1 - \hat{D}^{c}_{i,j}\right) \right]. \quad (1)$$
Here, $D$ represents the ConfMaps generated from camera annotations, $\hat{D}$ represents the ConfMaps prediction, $(i, j)$ represents the pixel indices, and $c$ is the class label.
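A minimal NumPy sketch of this binary cross entropy loss follows, assuming the standard form summed over classes and pixels. The function name, the summation (rather than averaging), and the clipping constant `eps` are choices made for this example.

```python
import numpy as np

def bce_loss(target, pred, eps=1e-7):
    """Binary cross entropy between annotated and predicted ConfMaps.

    target: ConfMaps from camera annotations, values in [0, 1].
    pred: predicted ConfMaps with the same shape, values in [0, 1].
    eps: clipping constant that keeps the logarithms finite (an
        implementation detail assumed here).
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    per_pixel = target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)
    # Sum over all classes and pixel indices.
    return -per_pixel.sum()
```

Because the Gaussian-shaped targets are soft labels, the loss is minimized when the prediction equals the target value at each pixel, not only at 0 or 1.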
4.2. Recover Detections from ConfMaps
After training, the RODNet can predict ConfMaps from the given RAMaps. However, to obtain the final detections, a post-processing step is still required. Here, we adopt the idea of non-maximum suppression (NMS), which is frequently used in image-level object detection to remove redundant bounding boxes from the detectors. Traditionally, NMS uses intersection over union (IoU) as the criterion to determine whether a bounding box should be removed. However, there is no bounding box definition in our problem. Thus, we define a novel metric, called object location similarity (OLS), to take the role of IoU, which describes the relationship between two detections considering their distance, classes, and scale information on ConfMaps. More specifically,
$$\mathrm{OLS} = \exp\left\{ \frac{-d^{2}}{2 (s \kappa_{c})^{2}} \right\}, \quad (2)$$
where $d$ is the distance (in meters) between the two points on the RAMap, $s$ is the object distance from the sensors, representing object scale information, and $\kappa_{c}$ is a per-class constant that represents the error tolerance for class $c$, which can be determined by the average object size of the corresponding class. Moreover, we empirically tune $\kappa_{c}$ to make OLS distributed reasonably between 0 and 1. The idea behind OLS can be described as follows. We interpret OLS as a Gaussian probability, where the distance $d$ acts as the bias and $(s \kappa_{c})^{2}$ acts as the variance.
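The OLS computation can be sketched in a few lines of Python, assuming the Gaussian form exp(-d^2 / (2 (s κ)^2)) suggested by the description above, with d the distance between the two points, s the object distance from the sensors, and κ the per-class tolerance. The function name and the example values in the usage note are hypothetical.

```python
import math

def ols(d, s, kappa):
    """Object location similarity between two locations.

    d: distance in meters between the two points.
    s: distance in meters of the object from the sensors (scale term).
    kappa: per-class error tolerance constant.
    """
    return math.exp(-d ** 2 / (2.0 * (s * kappa) ** 2))
```

With `s = 10.0` and `kappa = 0.1`, two coincident points give an OLS of 1, and the value decays towards 0 as `d` grows; the same offset `d` is penalized less for a farther object because `s` enlarges the variance.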
After OLS is defined, we propose a location-based NMS (L-NMS), whose procedure can be summarized as follows:
1) Get all the peaks in all $C$ channels in ConfMaps within a window as a peak set $P = \{p_n\}_{n=1}^{N}$.
2) Pick the peak $p^{*} \in P$ with the highest confidence and remove it from the peak set. Calculate OLS with each of the remaining peaks $p_i$, where $p_i \neq p^{*}$.
3) If OLS between $p^{*}$ and $p_i$ is greater than a threshold, remove $p_i$ from the peak set.
4) Repeat Steps 2 and 3 until the peak set becomes empty.
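The procedure can be sketched in Python as follows. The sketch assumes that the peaks of Step 1 have already been extracted and converted to metric coordinates with the radar at the origin, that OLS has the Gaussian form exp(-d^2 / (2 (s κ)^2)), and that the scale and tolerance of the currently picked peak are used; the dictionary keys, the `kappa_per_class` mapping, and the default threshold are hypothetical.

```python
import math

def ols(d, s, kappa):
    """Object location similarity, assumed Gaussian in the distance d."""
    return math.exp(-d ** 2 / (2.0 * (s * kappa) ** 2))

def l_nms(peaks, kappa_per_class, threshold=0.3):
    """Location-based NMS over ConfMap peaks.

    peaks: list of dicts with keys 'x', 'y' (meters, radar at the origin),
        'cls' (class id), and 'conf' (ConfMap value at the peak).
    kappa_per_class: mapping from class id to its tolerance constant.
    Returns the kept peaks, ordered by decreasing confidence.
    """
    remaining = sorted(peaks, key=lambda p: p["conf"], reverse=True)
    kept = []
    while remaining:
        # Step 2: pick the most confident peak and remove it from the set.
        best = remaining.pop(0)
        kept.append(best)
        s = max(math.hypot(best["x"], best["y"]), 1e-6)
        kappa = kappa_per_class[best["cls"]]
        # Step 3: drop every remaining peak that is too similar to it.
        survivors = []
        for p in remaining:
            d = math.hypot(p["x"] - best["x"], p["y"] - best["y"])
            if ols(d, s, kappa) <= threshold:
                survivors.append(p)
        # Step 4: repeat until the peak set is empty.
        remaining = survivors
    return kept
```

Each picked peak is stored as a final detection before its neighbours are suppressed, which mirrors the way conventional NMS keeps the highest-scoring box.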
Moreover, during the inference stage, we can send overlapping RAMap snippets into the RODNet, which provides several ConfMap predictions for a single radar frame. Then, we merge these different ConfMaps together to obtain the final ConfMap results. This scheme can improve the system robustness and can be considered a performance-speed trade-off, which is further discussed in Section 6.2.
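One simple way to merge predictions from overlapping snippets is to average, per frame, all ConfMaps that cover that frame. The text does not specify the merging rule, so averaging is an assumption of this sketch, as are the function name and the `(C, T, H, W)` prediction layout.

```python
import numpy as np

def merge_overlapping(snippet_preds, starts, n_frames):
    """Average ConfMap predictions from overlapping snippets.

    snippet_preds: list of arrays of shape (C, T, H, W), one per snippet.
    starts: index of the first radar frame covered by each snippet; every
        snippet is assumed to fit inside the sequence.
    n_frames: total number of radar frames in the sequence.
    Returns an array of shape (C, n_frames, H, W).
    """
    c, t, h, w = snippet_preds[0].shape
    total = np.zeros((c, n_frames, h, w), dtype=np.float64)
    count = np.zeros(n_frames, dtype=np.int64)
    for pred, start in zip(snippet_preds, starts):
        total[:, start:start + t] += pred
        count[start:start + t] += 1
    # Frames not covered by any snippet keep a zero prediction.
    count = np.maximum(count, 1)
    return total / count[None, :, None, None]
```

A smaller stride between snippet start indices gives more predictions per frame at the cost of more forward passes, which is the performance-speed trade-off mentioned above.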
Among the existing datasets for autonomous driving [16, 3, 4, 9, 7], only nuScenes [9] and the Oxford RobotCar Dataset [7] include radar data. However, their format is 3D radar points, which do not contain the useful speed information that is needed for many radar learning tasks (mentioned in Section 3.1).
In order to efficiently train and evaluate our RODNet using radar frequency data, we collect a new dataset, the CRUW dataset. Our sensor platform contains a pair of stereo cameras [1] and two perpendicular 77 GHz FMCW radar antenna arrays [2]. The sensors, assembled and mounted together as shown in Figure 4 (b), are well-calibrated and synchronized.
The CRUW dataset contains more than 3 hours of camera/radar data at 30 FPS (about 400K frames) under several different autonomous driving scenarios, including campus road, city street, highway, parking lot, etc. Some sample vision data are shown in Figure 4 (c). The data are collected in two different views, i.e., driver front view and driver side view, to ensure that our method is applicable to different perspective views for autonomous driving. Besides, we also collect several vision-fail scenarios where the image quality is poor, i.e., dark, strong light, blur, etc. These data are only used for testing to illustrate that our method can still be reliable when vision techniques fail.
Figure 4: Our dataset collection vehicle, sensor platform and some driving scenes.
Figure 5: Illustration for our CRUW dataset distribution.
The object distribution in CRUW is shown in Figure 5. The statistics only consider the objects within the radar field of view. There are about 260K objects in the CRUW dataset in total, of which 92% are used for training and 8% for testing. The average number of objects in each frame is similar between the training and testing data. Within the testing set, there are more pedestrians in the easy testing set and more cars in the hard testing set.
For the ground truth needed for evaluation purposes, we annotate 10% of the visible data and 100% of the vision-fail data. The annotation is performed on RAMaps by labeling the object classes and locations according to the corresponding images and the RAMap reflection magnitude.
6.1. Evaluation Metrics
To evaluate the performance, we utilize our proposed OLS (see Eq. 2 in Section 4.2), replacing the role of IoU in image-level object detection, to determine whether a detection result can be matched with a ground truth. During the evaluation, we first calculate OLS between each detection result and ground truth in every frame. Then, we use different OLS thresholds from 0.5 to 0.9 with a step of 0.05 and calculate the average precision (AP) and average recall (AR) for the different OLS thresholds. Here, we use AP and AR to represent the average values among the different OLS thresholds, and use $AP^{OLS}$ and $AR^{OLS}$ to represent the values at a certain OLS threshold. Overall, we use AP and AR as our main evaluation metrics for the radio object detection task.
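The thresholding scheme can be illustrated with the simplified sketch below, which computes plain precision and recall for one frame at each OLS threshold using greedy, confidence-ordered matching. It is not the full AP/AR computation (which would integrate over confidence levels and aggregate over frames and classes); the function name and the matching rule are assumptions of this example.

```python
import numpy as np

# OLS thresholds 0.5, 0.55, ..., 0.9 as described in the text.
OLS_THRESHOLDS = np.linspace(0.5, 0.9, 9)

def precision_recall_at(ols_matrix, confidences, threshold):
    """Precision and recall for one frame at one OLS threshold.

    ols_matrix[i, j]: OLS between detection i and ground truth j.
    confidences[i]: confidence of detection i.
    Each ground truth can be matched by at most one detection.
    """
    n_det, n_gt = ols_matrix.shape
    if n_det == 0 or n_gt == 0:
        return 0.0, 0.0
    gt_taken = np.zeros(n_gt, dtype=bool)
    true_pos = 0
    # Visit detections from the most to the least confident.
    for i in np.argsort(-np.asarray(confidences)):
        # Mask out ground truths that are already matched.
        candidates = np.where(~gt_taken, ols_matrix[i], -1.0)
        j = int(np.argmax(candidates))
        if candidates[j] >= threshold:
            gt_taken[j] = True
            true_pos += 1
    return true_pos / n_det, true_pos / n_gt
```

Evaluating `precision_recall_at` for every value in `OLS_THRESHOLDS` and averaging the results gives a rough analogue of how the threshold-averaged metrics are formed.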
6.2. Radio Object Detection Results
We train our RODNet using the training data with camera annotations in the CRUW dataset. For testing, we perform inference and evaluation on the human-annotated visible data. The quantitative results are shown in Table 1. Because related work in this area is very limited, we compare our results with the following radar-only baselines: 1) a decision tree using some handcrafted features from radar data [15]; 2) a radar object classification network [5] appended after the CFAR detection algorithm; 3) the radio object detection method mentioned in [15]. To evaluate the performance under different scenarios, we split the test set into three difficulty levels, i.e., easy, medium, and hard. The AR from [15] is relatively stable across the three test sets, but the AP drops considerably. This is caused by the large number of false positives detected by this kind of radar-only algorithm. Compared with [15], our RODNet performs substantially better on both the AP and AR metrics, especially for AP on the medium and hard data, which shows the robustness of our RODNet to noisy scenarios. Besides, we compare the performance of our different implemented RODNet models using the evaluation metrics introduced in Section 6.1. From the results, our RODNet achieves its best performance of 77.75% mAP with the RODNet-HGwI model.
Moreover, the AP values under different OLS thresholds are analyzed in Table 2. Although the mAP of the RODNet-CDC is relatively low, its AP at a loose OLS threshold is still good, which means the RODNet-CDC is somewhat weak in localization accuracy but good at classification.
Some qualitative results are shown in Figure 1, where we can see that the RODNet can accurately localize and classify multiple objects in different scenarios. The examples on the left of Figure 1 are scenarios that are relatively clean with less noise on the RAMaps, while the ones on the right are more complex, with different kinds of obstacles, like trees, traffic signs, walls, etc. In particular, in the second-to-last example, we can see high reflections on the right of the RAMap, which come from the walls. The resulting ConfMap shows that the RODNet does not recognize them as any object, which is quite promising.
Real-time implementation is important for autonomous driving applications. As mentioned in Section 4.2, we use different overlapping lengths during the inference, running on an NVIDIA TITAN XP, and report the time consumed in Figure 6. The results illustrate that the RODNet-CDC reaches its best performance-speed trade-off at 51.09% mAP with a running time of 14 ms per frame, the RODNet-HG at 70.83% mAP with 65 ms per frame, and the RODNet-HGwI at 73.98% mAP with 89 ms per frame.
Table 1: Radio object detection comparison evaluated on CRUW dataset. (* denotes self-implementation.)
Table 2: Detection performance under different OLS thresholds. (* denotes self-implementation.)
Figure 6: Performance-speed trade-off for the RODNet real-time implementation.
Figure 7: Performance of vision-based and our RODNet on “Hard” testing set with different localization error tolerance.
We also compare the performance of our RODNet and the vision-based method [41] on both visible and vision-fail data. The results are shown in Figure 7 with respect to different OLS thresholds. Here, the thresholds represent different localization error tolerances for the detection results. From the figure, the performance of the vision-based method drops significantly given a tighter OLS threshold, while our RODNet shows its strength and robustness in localization performance. Moreover, the RODNet can still maintain its performance on vision-fail data, where the vision-based method becomes totally useless.
6.3. Ablation Studies
Spatio-temporal features learned by the RODNet. After the RODNet is well-trained, we would like to analyze the features learned from the radar data. In Figure 9, we show two different kinds of feature maps, i.e., the features after the first convolution layer and the features before the last layer. These feature maps are generated by cropping some randomly chosen objects from the original feature maps and averaging them into one. From the visualization, we notice that the feature maps are similar in the beginning, but they become more discriminative at the end of the RODNet.
Length of radar data required for good detection. Because our RODNet takes RAMap snippets as the input, we would like to know how long a snippet is required to obtain good detection results. Thus, we try different lengths of the snippets and evaluate their AP on our test set. The results are shown in Table 3. From the results, AP is very low with short input snippets because of insufficient radar information from a short temporal period. For RODNet-HG, the detection AP is highest at a length of 16, while for RODNet-HGwI, we obtain the best performance at a length of 32.
Figure 8: Examples illustrate strengths and limitations of our RODNet.
Figure 9: Feature extraction through the RODNet.
Table 3: Performance for different RAMap snippet lengths.
6.4. Strengths and Limitations
RODNet Strengths. Some examples to illustrate the RODNet's advantages are shown in Figure 8 (a). First, the RODNet maintains similar performance in some severe conditions, like during the night, as shown in the first example. Moreover, the RODNet can handle some occlusion cases where the camera usually fails. In the second example, two pedestrians are nearly fully occluded in the image, but our RODNet can still detect both of them. This is because they are separate from the radar's point of view. Last but not least, the RODNet has a wider field of view (FoV) than vision so that it can see more information. As shown in the third example, only a small part of the car is visible in the camera view, so it can hardly be detected from the camera side, but the RODNet can successfully detect it.
RODNet Limitations. Some failure cases are shown in Figure 8 (b). The failures fall mainly into three categories, i.e., nearby objects, huge objects, and noisy surroundings. When two objects are very close to each other, the RODNet often fails to distinguish them due to the limited resolution of radar. As shown in the first example, the RAMap patterns of the two pedestrians intersect, so that our result only shows one pedestrian detected. Another problem is that, for huge objects like buses and trains, the RODNet often detects one object as multiple objects, as shown in the second example. Lastly, the RODNet is sometimes affected by noisy surroundings. In the third example, there is no object in the view, but the RODNet detects the obstacles as several cars. The last two problems should be solvable with a larger training dataset.
Object detection is crucial in autonomous driving and many other areas. The computer vision community has been focusing on this topic for decades and has come up with many good solutions. However, vision-based detection still suffers under many severe conditions. This paper proposed a novel object detection method based purely on radar information, which is more robust than vision. The proposed RODNet can accurately and robustly detect objects in various autonomous driving scenarios, even during the night or in bad weather. Moreover, this paper presented a new way to learn radio information using cross-modal supervision, which can potentially improve the role of radar in autonomous driving applications.
This work was partially supported by CMMB Vision – UWECE Center on Satellite Multimedia and Connected Vehicles. The authors would also like to thank the colleagues and students in Information Processing Lab at UWECE for their help and assistance on the dataset collection, processing, and annotation works.
[1] Flir systems. https://www.flir.com/. 5
[2] Texas instruments. http://www.ti.com/. 5
[3] Apollo scape dataset. http://apolloscape.auto/, 2018. 5
[4] Waymo open dataset: An autonomous driving dataset. https://www.waymo.com/open, 2019. 5
[5] A. Angelov, A. Robertson, R. Murray-Smith, and F. Fio- ranelli. Practical classification of different moving targets using automotive radar and deep neural networks. IET Radar, Sonar Navigation, 12(10):1082–1089, 2018. 3, 6, 7
[6] J. A. Ansari, S. Sharma, A. Majumdar, J. K. Murthy, and K. M. Krishna. The earth ain’t flat: Monocular reconstruction of vehicles on steep and graded roads from a moving camera. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8404–8410. IEEE, 2018. 3
[7] D. Barnes, M. Gadd, P. Murcutt, P. Newman, and I. Pos- ner. The oxford radar robotcar dataset: A radar extension to the oxford robotcar dataset. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, 2020. 4, 5
[8] P. Bergmann, T. Meinhardt, and L. Leal-Taixe. Tracking without bells and whistles. arXiv preprint arXiv:1903.05625, 2019. 3
[9] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019. 4, 5
[10] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018. 3
[11] P. Cao, W. Xia, M. Ye, J. Zhang, and J. Zhou. Radar-id: human identification based on radar micro-doppler signatures using deep convolutional neural networks. IET Radar, Sonar & Navigation, 12(7):729–734, 2018. 3
[12] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7291–7299, 2017. 3, 4
[13] S. Capobianco, L. Facheris, F. Cuccoli, and S. Marinai. Vehicle classification based on convolutional networks applied to fmcw radar signals. In Italian Conference for the Traffic Police, pages 115–128. Springer, 2017. 3
[14] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6569–6578, 2019. 3
[15] X. Gao, G. Xing, S. Roy, and H. Liu. Experiments with mmwave automotive radar test-bed. In Asilomar Conference on Signals, Systems, and Computers, 2019. 3, 6, 7
[16] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013. 5
[17] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015. 3
[18] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 2, 3
[19] S. Heuel and H. Rohling. Two-stage pedestrian classification in automotive radar systems. In 2011 12th International Radar Symposium (IRS), pages 477–484, Sep. 2011. 3
[20] L. Jing and Y. Tian. Self-supervised visual feature learning with deep neural networks: A survey. arXiv preprint arXiv:1902.06162, 2019. 3
[21] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015. 3
[22] H. Kim, B. Liu, and H. Myung. Road-feature extraction using point cloud and 3d lidar sensor for vehicle localization. In 2017 14th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), pages 891–892. IEEE, 2017. 3
[23] J. Kwon and N. Kwak. Human detection by neural networks using a low-cost short-range doppler radar sensor. In 2017 IEEE Radar Conference (RadarConf), pages 0755–0760. IEEE, 2017. 3
[24] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 3
[25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016. 3
[26] B. Major, D. Fontijne, A. Ansari, R. Teja Sukhavasi, R. Gowaikar, M. Hamilton, S. Lee, S. Grzechnik, and S. Subramanian. Vehicle detection with automotive radar using deep learning on range-azimuth-doppler tensors. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019. 3
[27] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka. 3d bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7074–7082, 2017. 3
[28] J. K. Murthy, G. S. Krishna, F. Chhaya, and K. M. Krishna. Reconstructing vehicles from a single image: Shape priors for road scene understanding. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 724–731. IEEE, 2017. 3
[29] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European conference on computer vision, pages 483–499. Springer, 2016. 5
[30] Y. Qi, Y.-Z. Song, H. Zhang, and J. Liu. Sketch-based image retrieval via siamese convolutional neural network. In 2016 IEEE International Conference on Image Processing (ICIP), pages 2460–2464. IEEE, 2016. 3
[31] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. 3
[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015. 2, 3
[33] M. A. Richards. Fundamentals of radar signal processing. Tata McGraw-Hill Education, 2005. 3, 4
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 3
[35] S. Song and M. Chandraker. Robust scale estimation in real-time monocular sfm for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1566–1573, 2014. 3
[36] S. Song and M. Chandraker. Joint sfm and detection cues for monocular 3d localization in road scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3734–3742, 2015. 3
[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015. 5
[38] Z. Tang and J.-N. Hwang. Moana: An online learned adaptive appearance model for robust multiple object tracking in 3d. IEEE Access, 7:31934–31945, 2019. 3
[39] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pages 4534–4542, 2015. 3
[40] G. Wang, Y. Wang, H. Zhang, R. Gu, and J.-N. Hwang. Exploit the connectivity: Multi-object tracking with trackletnet. In Proceedings of the 27th ACM International Conference on Multimedia, pages 482–490. ACM, 2019. 3
[41] Y. Wang, Y.-T. Huang, and J.-N. Hwang. Monocular visual object 3d localization in road scenes. In Proceedings of the 27th ACM International Conference on Multimedia, pages 917–925. ACM, 2019. 2, 3, 4, 7
[42] L. Yang, Y. Fan, and N. Xu. Video instance segmentation. arXiv preprint arXiv:1905.04804, 2019. 3
[43] M. Zhao, T. Li, M. Abu Alsheikh, Y. Tian, H. Zhao, A. Torralba, and D. Katabi. Through-wall human pose estimation using radio signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7356–7365, 2018. 3, 5