Figure 5. The backbone network architecture of HVNet. Multiscale features are first fused shallowly in the main stream network, and then fused deeply in the proposed FFPN network.
AVFEO layer, each feature of is projected to a pixel
in pseudo image
, given as:
where is the target scale, | denotes exact division and mod denotes the modulo operation.
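The projection equation itself did not survive in this copy, so the following is only a minimal sketch of the division/modulo idea described above. It assumes a row-major flattened cell index and a known grid width; the names index_to_pixel, pixel_to_index and grid_width are hypothetical and not taken from the paper.

```python
from typing import Tuple


def index_to_pixel(flat_index: int, grid_width: int) -> Tuple[int, int]:
    """Split a flattened cell index into (row, col).

    Assumption: cells are numbered row by row, so the row is the exact
    (integer) division by the grid width and the column is the remainder.
    """
    return flat_index // grid_width, flat_index % grid_width


def pixel_to_index(row: int, col: int, grid_width: int) -> int:
    """Inverse mapping, useful to check the round trip."""
    return row * grid_width + col


if __name__ == "__main__":
    row, col = index_to_pixel(7, 3)
    print(row, col, pixel_to_index(row, col, 3))
```

Which of the two results is treated as the x coordinate and which as the y coordinate depends on the layout convention of the pseudo-image.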
3.3. Efficient Index-based Implementation
The Hybrid Voxel Feature Extraction module is an index-based learning system, where the irregular graph data (point cloud) is grouped by physical correlation. To handle the sparse structure, we propose the HSV operation, which transforms the sparse matrix into a dense matrix with corresponding indexes. As verified by experiments, the efficiency of HVNet hinges heavily on the index strategy of U and on the parallel stream processing schedules. Therefore, the key index-based propagation operators, Scatter and Gather, are implemented on GPUs.
Gather conducts the sparse point data propagation within each voxel and behaves as a tensor slice according to the cursor vector c. For implementation, the tensor slice operation in PyTorch [19] is fast enough for Gather.
Scatter manipulates all values from the source tensor into the output at the indices specified in c along a given axis. In our approach, Scatter Mean is used in voxel-wise attention and Scatter Max is used in the AVFE and AVFEO layers. Take Scatter Max as an example: $\mathrm{out}_i = \max\left(\mathrm{out}_i, \max(\{\mathrm{src}_j \mid c_j = i\})\right)$, where out and src are the output and input respectively, and c is the ‘group-index’ that references the location of src. In the implementation of Scatter, an atomic lock on GPU global memory is used to ensure the consistency of arg max results for reproducibility.
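The behaviour of Gather, Scatter Max and Scatter Mean can be illustrated with a small CPU reference in plain Python. This is only a sketch for a one-dimensional feature, not the GPU implementation described above; the function names and the num_groups argument are hypothetical. Ties in the maximum are resolved in favour of the lowest source index, which is one simple way to make the arg max deterministic.

```python
from typing import List, Tuple


def gather(values: List[float], cursor: List[int]) -> List[float]:
    """Copy the per-group value back to every point (a tensor slice by index)."""
    return [values[group] for group in cursor]


def scatter_max(src: List[float], cursor: List[int],
                num_groups: int) -> Tuple[List[float], List[int]]:
    """Per-group maximum of src and the index of the element that produced it."""
    out = [float("-inf")] * num_groups
    argmax = [-1] * num_groups
    for j, (value, group) in enumerate(zip(src, cursor)):
        # Strict '>' keeps the first maximum, so the result is deterministic.
        if value > out[group]:
            out[group] = value
            argmax[group] = j
    return out, argmax


def scatter_mean(src: List[float], cursor: List[int],
                 num_groups: int) -> List[float]:
    """Per-group mean of src; empty groups are reported as 0.0."""
    sums = [0.0] * num_groups
    counts = [0] * num_groups
    for value, group in zip(src, cursor):
        sums[group] += value
        counts[group] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]


if __name__ == "__main__":
    src = [1.0, 5.0, 3.0, 5.0, 2.0]
    cursor = [0, 1, 0, 1, 2]  # group (voxel) index of each point
    voxel_max, voxel_argmax = scatter_max(src, cursor, 3)
    print(voxel_max, voxel_argmax)
    print(scatter_mean(src, cursor, 3))
    print(gather(voxel_max, cursor))
```

In this toy example each entry of cursor plays the role of the voxel that a point falls into, so Scatter reduces point features to voxel features and Gather broadcasts voxel features back to points.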
3.4. Backbone Network
We use a 2D convolutional backbone network for further detection. The backbone network has two main parts: 1) the main stream network that aggregates multi-scale features; 2) the FFPN network that refines the feature map and generates multi-class features at the same time.
Main stream network with multi-scale feature aggregation. The main stream is shown in the orange part of Fig. 5 and is characterized by a concatenation of several blocks. In Block 2 and Block 3, the first layer has a stride to reduce the spatial resolution of features. Given pseudo-image feature set
where
is taken as input of Block 1 and
is aggregated in Block 2 by tensor concatenation with the output of the first layer in Block 2. More scales can be added in a similar way. We take the output of the last three blocks
and
Feature fusion pyramid network. The FPN [15] has proved to be an effective means of multi-scale feature embedding. Due to the sparse data distribution in point clouds and the small resolution of , the FPN structure plays a more important role in our approach. Therefore, an enhanced feature fusion pyramid network is proposed, whose main structure is shown in Fig. 5. Instead of fusing features from top to bottom layer by layer as in [15], we first concatenate features at the smallest scale to obtain an intermediate feature
, represented as:
where + means tensor concatenation, 1 means the indicator function, is the number of
represents the deconv function of each input feature map
for scale alignment and
denotes the conv function before being concatenated together. The class-specific pyramid features are given as
, where
denotes the conv layers with various strides. Compared with F-SSD [12], we fuse features in two stages: a) layer-by-layer fusion in Eq. 4 and b) a downsampling convolutional chain. Furthermore, class-specific pyramid features can be obtained within one forward propagation, where
for Pedestrian class,
for Cyclist class and
for Car class respectively.
3.5. Detection Head and Loss Design
We use the detection head of SSD [17] to detect 3D objects, as in [10]. In our setting, the positive anchors are selected by matching to the ground truth by Rotated Intersection over Union (RIoU) [4, 33] in bird’s eye view (BEV). Each pyramid feature is wrapped by three parallel branches, which are
convolution layers
,
and
to get classification probability, location offsets and height regression respectively, whose output channel numbers are
for location corner offsets in
Table 1. Performance in bird’s eye view on the KITTI test set. Here ‘L’ denotes LiDAR input and ‘I’ denotes RGB image. We compare with detectors on the KITTI leaderboard evaluated at 40 recall positions. Methods are divided into three types: LiDAR & image, two-stage LiDAR only and one-stage. Bold results are the best among all methods and blue results are the best among one-stage methods.
BEV and for z center and height. Different from most voxel-based methods [10, 35, 24] that predict center x, y and , HVNet utilizes location corner offsets relative to anchors in BEV as the localization objective, represented as
is a vector of
during propagation. Suppose that the location branch
predicts the offset
, then the localization loss is given by
.
As to the classification branch , given the class probability
of an anchor, focal loss [16] is used to handle the imbalance between positive and negative samples, represented as
. Given prediction z, h from
, the vertical loss is denoted as
.
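For reference, the standard binary focal loss of [16] can be sketched as follows. This is the generic formulation for a single anchor and a single class, not necessarily the exact form used in HVNet, whose equation is not legible in this copy; the default values of alpha and gamma below are the defaults of [16] and are used here purely for illustration.

```python
import math


def focal_loss(p: float, is_positive: bool,
               alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for one anchor.

    p is the predicted probability of the positive class. Easy examples
    (p_t close to 1) are down-weighted by the factor (1 - p_t) ** gamma.
    """
    p_t = p if is_positive else 1.0 - p
    alpha_t = alpha if is_positive else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # A confident correct prediction contributes far less than a wrong one.
    print(focal_loss(0.9, True), focal_loss(0.1, True))
```

With gamma set to 0 the modulating factor disappears and the expression reduces to an alpha-weighted cross entropy.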
Therefore, the total loss is given by
4. Experiments
HVNet is evaluated on the challenging KITTI benchmark [5]. We first introduce the setup details of HVNet in Sec. 4.1. In Sec. 4.2, we compare HVNet with state-of-the-art methods. A detailed ablation study is also provided in Sec. 4.3 to verify the validity of each component.
4.1. Setup
Dataset. The KITTI dataset [5] consists of 7481 training images and 7518 test images, together with the corresponding point clouds, covering the categories Car, Pedestrian and Cyclist.
Metric. KITTI’s metric is defined as average precision (AP) over 40 recall positions on the PR curve [4]. Labels are divided into three subsets (Easy, Moderate, Hard) on the basis of object size, occlusion and truncation levels. The leaderboard rank is based on the results at Moderate difficulty.
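As an illustration of the metric, the sketch below computes an interpolated AP over 40 recall positions. It assumes the positions are 1/40, 2/40, ..., 1 and that the precision at each position is the maximum precision reached at any recall greater than or equal to it; the official KITTI evaluation code may differ in details, and the function name is hypothetical.

```python
from typing import List


def average_precision_r40(recalls: List[float], precisions: List[float],
                          num_positions: int = 40) -> float:
    """Interpolated average precision over equally spaced recall positions."""
    total = 0.0
    for k in range(1, num_positions + 1):
        threshold = k / num_positions
        # Interpolated precision: best precision at recall >= threshold.
        reachable = [p for r, p in zip(recalls, precisions)
                     if r >= threshold - 1e-9]
        total += max(reachable) if reachable else 0.0
    return total / num_positions


if __name__ == "__main__":
    recalls = [0.25, 0.5, 0.75, 1.0]
    precisions = [1.0, 0.8, 0.6, 0.4]
    print(average_precision_r40(recalls, precisions))
```

Recall positions that the detector never reaches contribute zero precision, so a method that misses many objects is penalized even if its retained detections are accurate.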
Experiment details. The physical detection range is limited within minimum and maximum (64, 32, 2). The size of a voxel is
, thus the resolution of BEV feature is
. In the encoder and decoder, scale set
and
{1, 2, 4}. Besides, feature dimension is q = 64 for
and
for
. Anchor size is designed as [0.8, 0.8, 1.7] for Pedestrian, [0.8, 1.8, 1.5] for Cyclist and [1.7, 3.5, 1.56], [2.0, 6.0, 1.56] for Car. Each class has the same anchor orientation in
. In the training phase, we choose anchors whose RIoU with the ground truth is larger than [0.35, 0.35, 0.5] for Pedestrian, Cyclist and Car respectively as positive samples, and those lower than [0.25, 0.25, 0.35] as negative samples. In the test phase, the prediction score threshold is set to 0.2, and the rotated NMS [18] threshold is set to [0.02, 0.02, 0.4]. In the loss design,
and
for focal loss are set to [0.75, 0.75, 0.25] and [2., 2., 2.] respectively. Three loss weights are
and
. HVNet is trained for 70 epochs with the Adam [8] optimizer; the initial learning rate is
with weight decay
. We first employ a warmup strategy [7] with 300 warmup iterations and a warmup ratio of 1/3. In addition, the learning rate decays by a factor of 0.1 at the 40th and 60th epochs.
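The learning rate schedule described above can be sketched as a function of the epoch and the global iteration. Several points are assumptions of this sketch rather than statements from the paper: the warmup is taken to be linear, epochs are counted from 0, and base_lr is a hypothetical argument because the initial learning rate is not legible in this copy.

```python
def learning_rate(base_lr: float, epoch: int, iteration: int,
                  warmup_iters: int = 300, warmup_ratio: float = 1.0 / 3.0,
                  decay_epochs=(40, 60), decay_factor: float = 0.1) -> float:
    """Step decay at the given epochs, with a linear warmup at the start.

    iteration is the global iteration count since the start of training.
    """
    # Step decay: multiply by decay_factor once for every milestone reached.
    milestones_passed = sum(1 for e in decay_epochs if epoch >= e)
    lr = base_lr * decay_factor ** milestones_passed
    # Linear warmup from warmup_ratio * lr up to lr over warmup_iters steps.
    if iteration < warmup_iters:
        progress = iteration / warmup_iters
        lr *= warmup_ratio + (1.0 - warmup_ratio) * progress
    return lr


if __name__ == "__main__":
    for epoch, iteration in [(0, 0), (0, 150), (0, 300), (40, 10**5), (60, 10**6)]:
        print(epoch, iteration, learning_rate(1.0, epoch, iteration))
```

Passing base_lr = 1.0 as in the usage lines prints the multiplier applied to whatever initial learning rate is chosen.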
3D data augmentation. Global random flipping, rotation, scaling and translation are first applied to the whole point cloud, where the flipping probability is set to 0.5, the rotation angle follows a uniform distribution from , the scaling ratio is between [0.95, 1.05] and the location translation obeys a normal distribution with mean 0 and standard deviation [0.2, 0.2, 0.2] for (x, y, z). Following SECOND [28], several new boxes from the ground truth and their corresponding points in other frames (8 for Cyclist, 8 for Pedestrian and 15 for Car) are placed into the current training frame, except boxes that physically collide with boxes in the current frame.
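A minimal sketch of the global augmentation is given below. The flip axis, the rotation about the vertical axis, the order of the operations and the symmetric angle range are assumptions of this sketch; max_angle is a hypothetical argument because the rotation range is not legible in this copy. Ground-truth boxes would have to be transformed with the same parameters, which is not shown.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def augment_points(points: List[Point], max_angle: float, rng: random.Random,
                   flip_prob: float = 0.5,
                   scale_range: Sequence[float] = (0.95, 1.05),
                   translation_std: Sequence[float] = (0.2, 0.2, 0.2)) -> List[Point]:
    """Apply one random global flip, rotation, scaling and translation."""
    flip = rng.random() < flip_prob
    angle = rng.uniform(-max_angle, max_angle)
    scale = rng.uniform(scale_range[0], scale_range[1])
    shift = [rng.gauss(0.0, std) for std in translation_std]
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    augmented = []
    for x, y, z in points:
        if flip:
            y = -y  # mirror across the x axis (assumed flip direction)
        # Rotate around the vertical (z) axis.
        x, y = cos_a * x - sin_a * y, sin_a * x + cos_a * y
        augmented.append((scale * x + shift[0],
                          scale * y + shift[1],
                          scale * z + shift[2]))
    return augmented


if __name__ == "__main__":
    cloud = [(10.0, 2.0, -1.0), (20.0, -5.0, 0.5)]
    print(augment_points(cloud, math.pi / 4, random.Random(0)))
```

The same parameters are drawn once per frame and applied to every point, so the relative geometry of the scene is preserved up to the global scale.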
4.2. Experimental Results
Quantitative Analysis. We compare with 3D object detectors of three types: a) LiDAR & image based approaches;
Figure 6. Qualitative analysis on KITTI validation set with Kitti Viewer [28]. We show 3D bounding boxes on the point cloud along with projected 2D bounding boxes on the image. In each image, blue boxes indicate the ground truth and red boxes indicate detections by HVNet.
Table 2. Performance (AP) of BEV and 3D on KITTI validation set for Car. Our method achieves the state of the art in most cases.
Table 3. Performance on KITTI validation set for Pedestrian, Cyclist in 2D, BEV and 3D tasks; and Car in 2D task.
b) LiDAR-only two-stage approaches and c) one-stage approaches, shown in three columns respectively in Tab. 1. Most methods in a) and b) are relatively slow at inference. From the table we see that HVNet outperforms all other approaches in mAP and on Cyclist. HVNet also achieves attractive performance for Car and Pedestrian at a real-time runtime, even when compared with two-stage approaches. Among one-stage approaches, HVNet achieves the state of the art on Car and Cyclist, leading the second best, HRIVoxelFPN [24] and PointPillars [10], by over 1.61% and 8.44% respectively in Moderate. More details of our test results are available on the KITTI leaderboard. Furthermore, we plot performance vs. speed in Fig. 1 according to the KITTI leaderboard for a more intuitive comparison.
Only a few methods report results on the validation set. The comparison results for Car are reported in Tab. 2. Among methods that report results, our approach achieves the best performance in both BEV and 3D tasks. As almost no currently published method presents validation results for Pedestrian and Cyclist, we show only our validation results for these two classes in Tab. 3 in all three tasks: 2D, BEV and 3D. Overall, evaluation on both the test and validation sets shows that our approach can produce high-accuracy detections with a fast inference speed. Besides that, we conduct experiments on our Hybrid Voxel Feature Extractor architecture at different settings, compared with PointPillars at different grid sizes in Fig. 7. Compared with the best result of PointPillars at grid size 0.16m, our model shows clear advantages in both mAP and inference speed. Even at a coarse feature projection scale of grid size 0.4m, our model achieves results comparable to PointPillars at grid size 0.24m while saving considerable runtime.
Figure 7. Voxel scale study on BEV of KITTI validation set. For PointPillars we use our own implementation in PyTorch. Blue circles show the results of PointPillars. We choose voxel scale at . Red rectangles illustrate results from the Hybrid Voxel Feature Extractor. From left to right, we use voxel scales at {0.1, 0.2}m, {0.2, 0.3}m, {0.2, 0.4}m, and feature projection scales at 0.2m, 0.3m, 0.4m respectively.
Qualitative Analysis. We present several 3D detections on the KITTI validation set along with projected 2D bounding boxes on the 2D image in Fig. 6. HVNet can produce high-quality 3D detections for all classes via a single pass of the network. Moreover, HVNet also gives good detections for scenes with point cloud occlusion or densely packed objects. Generally, these qualitative results demonstrate the effectiveness of our approach.
4.3. Ablation Studies
Multiple components study. To analyze how different components contribute to the final performance, we conduct an ablation study on the KITTI validation set. We use the BEV mean AP of three categories (Car, Pedestrian and Cyclist) as the evaluation metric, shown in Tab. 4. Our baseline model, shown in the first line of Tab. 4, is a detector with a single-scale feature extractor and an SSD [17] like
Table 4. Ablation study on KITTI validation set for the mAP. 0.5S means the scale of mAP denotes the change in Moderate mAP compared with the corresponding controlled experiment. The maximum improvement is achieved by increasing
Figure 8. Ablation study on KITTI validation set for hybrid scale. The jacinth line means changes from 5 to 1 when
The blue line means
changes from 5 to 1 when
backbone. Besides, the VFE layers [35] replace the AVFE and AVFEO layers. Adding attention to the VFE layer brings a 2.06 mAP gain in BEV Moderate compared with the baseline. FPN is effective with a 0.58 gain, but the proposed FFPN brings a larger improvement of 1.42. We also adopt focal loss for classification. However, the default results in confidence score degradation for Pedestrian and Cyclist. Therefore, we change
by experiments. As to the Hybrid Voxel Feature Extractor, increasing the voxel/feature projection scale number to 2 gives the maximum performance boost of 2.17 in mAP. Furthermore, going up to 3 scales gives another 0.71 gain.
Hybrid Voxel Feature Extractor. Given that the Hybrid Voxel strategy plays an important role, it is important to make and
enough for feature encoding while not consuming much computation. Thus, we conduct a series of experiments with various scale numbers. Note that as the scale number increases, the number of blocks in the backbone in Fig. 5 increases as well. As shown in Fig. 8, there is a good tradeoff between speed and performance when
, demonstrating the effectiveness of scale projection between
and
. Furthermore, we visualize the
with/without attention in Fig. 9, which shows that the activation of the target region is greater with the attention strategy.
Table 5. Inference speed for HVNet with different modules.
Inference speed. The inference time of HVNet is 32ms
Figure 9. Multi-scale features with/without attention. With attention mechanism, our output feature map can suppress the background area and enhance the shape feature of objects
for a single class on average on a 2080Ti GPU, where the Hybrid Voxel Feature Extractor takes 12ms, the backbone network takes 11ms and the head with NMS takes the remaining 9ms. The time required by each module changes with the number of input points. In our approach, HSV and the index-based implementation are proposed to accelerate feature encoding; their effectiveness is shown in Tab. 5. We employ the VFE layer in PointPillars [10] as the baseline. Utilizing HSV and the index-based implementation saves 2ms on average. Furthermore, the head in our model takes only an extra 3ms to extend to multi-class detection in one forward pass.
5. Conclusion
In this work, we propose HVNet, a novel one-stage 3D object detector. HVNet aggregates hybrid scale voxel grids into unified point-wise features, and then projects them into pseudo-image features of different scales under the guidance of attention knowledge. The key to HVNet is that it decouples the feature extraction scales from the pseudo-image projection scales. Furthermore, a backbone with a feature fusion pyramid network takes the pseudo-images and fuses features to generate compact representations for different categories. Experimental studies show that our method achieves state-of-the-art mAP at real-time speed.
[1] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2, 7
[2] Yilun Chen, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Fast Point R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 1, 3, 6
[3] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1355–1361, 2017. 2
[4] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010. 5, 6
[5] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012. 1, 2, 6
[6] Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 3, 6
[8] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 6
[9] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander. Joint 3D Proposal Generation and Object Detection from View Aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8, 2018. 1, 2, 6
[10] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast Encoders for Object Detection From Point Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12697–12705, 2019. 1, 2, 3, 4, 5, 6, 7, 8
[11] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo R-CNN Based 3D Object Detection for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7644–7652, 2019. 1
[12] Zuoxin Li and Fuqiang Zhou. Fssd: feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960, 2017. 5
[13] Ming Liang, Bin Yang, Yun Chen, Rui Hu, and Raquel Urtasun. Multi-Task Multi-Sensor Fusion for 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7345–7353, 2019. 1, 2, 3, 6
[14] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep Continuous Fusion for Multi-Sensor 3D Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 641–656, 2018. 3
[15] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017. 3, 5
[16] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017. 6
[17] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016. 5, 7
[18] Alexander Neubeck and Luc Van Gool. Efficient nonmaximum suppression. In 18th International Conference on Pattern Recognition (ICPR’06), volume 3, pages 850–855. IEEE, 2006. 6
[19] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. 5
[20] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum PointNets for 3D Object Detection From RGB-D Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018. 1, 2, 6, 7
[21] Zengyi Qin, Jinglu Wang, and Yan Lu. Triangulation Learning Network: From Monocular to Stereo 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7615–7623, 2019. 1
[22] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–779, 2019. 1, 3, 6
[23] Andrea Simonelli, Samuel Rota Bulò, Lorenzo Porzi, Manuel López-Antequera, and Peter Kontschieder. Disentangling Monocular 3D Object Detection. arXiv:1905.12365 [cs], 2019. 1
[24] Bei Wang, Jianping An, and Jiayan Cao. Voxel-FPN: Multi-scale voxel feature aggregation in 3D object detection from point clouds. arXiv:1907.05286 [cs, stat], 2019. 2, 3, 6, 7
[25] Dominic Zeng Wang and Ingmar Posner. Voting for voting in online point cloud object detection. In Robotics: Science and Systems, volume 1, pages 10–15607, 2015. 2
[26] Zhixin Wang and Kui Jia. Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection. arXiv:1903.01864 [cs], 2019. 1, 3, 6
[27] Bin Xu and Zhenzhong Chen. Multi-Level Fusion Based 3D Object Detection From Monocular Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2345–2353, 2018. 1
[28] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018. 1, 2, 6, 7, 12
[29] Bin Yang, Ming Liang, and Raquel Urtasun. HDNET: Exploiting HD Maps for 3D Object Detection. In Conference on Robot Learning, pages 146–155, 2018. 1, 6
[30] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2
[31] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. IPOD: intensive point-based object detector for point cloud. CoRR, abs/1812.05276, 2018. 1, 3, 6
[32] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. STD: Sparse-to-Dense 3D Object Detector for Point Cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 1, 3, 6
[33] Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. IoU Loss for 2D/3D Object Detection. arXiv:1908.03851 [cs], 2019. 5
[34] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3d object detection in lidar point clouds. arXiv preprint arXiv:1910.06528, 2019. 3
[35] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018. 1, 2, 3, 4, 6, 7, 8
1. More About the Attention Mechanism
The attention layer can be thought of as a kernel method for the input feature: , where G represents the attention feature described in Eq. 2 of the paper, W and
are linear weights. Since G contains a part of X in the first layer, the attention mechanism can be regarded as a second-order kernel for input X. In the experiment of Fig. 10 we show that with attention the output feature map suppresses the background and enhances the shape features of objects.
2. Ablation Studies
2.1. Corner Loss
Most existing methods regress x, y, z, l, w, h, theta of the 3D bounding box of the object, which we name pose loss. We first analyze the difference between corner loss and pose loss. For simplicity, we use our best model on KITTI validation. From Tab. 1 we find that our corner loss performs better than the original pose loss. According to our experiments, learning theta directly is somewhat difficult for pure point cloud data. Also, there is a discontinuity for angle 0 and . We could add some normalization to alleviate this problem. However, there are still some corner cases such as the gap between 0 and
, where
being a small number.
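The angle discontinuity can be made concrete with a small numerical example. The sketch below assumes that the missing symbol is π and compares two BEV boxes whose headings are a small angle and π minus that angle: the regression targets for theta differ by almost π, while the box footprints, compared as unordered corner sets, nearly coincide. The function names are hypothetical, and comparing corners as an unordered set is an assumption of this illustration rather than the corner loss used in the paper.

```python
import math
from typing import List, Tuple

Corner = Tuple[float, float]


def bev_corners(cx: float, cy: float, length: float, width: float,
                theta: float) -> List[Corner]:
    """Four BEV corners of a box centred at (cx, cy) with heading theta."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in ((0.5, 0.5), (0.5, -0.5), (-0.5, -0.5), (-0.5, 0.5)):
        lx, ly = dx * length, dy * width
        corners.append((cx + cos_t * lx - sin_t * ly,
                        cy + sin_t * lx + cos_t * ly))
    return corners


def corner_set_distance(a: List[Corner], b: List[Corner]) -> float:
    """Largest distance from a corner of a to its nearest corner of b."""
    return max(min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in b)
               for p in a)


if __name__ == "__main__":
    eps = 0.01
    box_a = bev_corners(0.0, 0.0, 3.5, 1.7, eps)
    box_b = bev_corners(0.0, 0.0, 3.5, 1.7, math.pi - eps)
    print("angle gap:", (math.pi - eps) - eps)
    print("corner gap:", corner_set_distance(box_a, box_b))
```

A loss defined on theta sees the two boxes as far apart, whereas a loss defined on corner locations can treat them as nearly identical.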
2.2. Loss Variants
We also tried the IoU branch and the IoU/GIoU loss in our HVNet. For the IoU branch, we add a parallel convolution layer in the detection head to predict the IoU of each anchor, and then multiply the IoU prediction with the cls score at inference time. This strategy performs well on the validation set but fails on the test set. The IoU loss uses the IoU directly as the loss, which fails on both the val and test sets.
Besides, we also tried the uncertainty loss to learn the weights of different loss items, such as corner loss, elevation loss and cls loss. However, it does not improve the results either.
3. Quantitative Analysis
We have provided all results on 2D/BEV/3D on the KITTI validation set, but only BEV results on the test set due to the space limit. The overall quantitative results on the test set are added in Tab. 2.
4. Qualitative Analysis
In this section, we provide more qualitative visual results of our method on KITTI validation/test set.
4.1. Qualitative Results
We provide more visual results on the KITTI validation set in Fig. 1. Besides, the results on the KITTI test set are shown in Fig. 2. Generally, these qualitative results demonstrate the effectiveness of our approach.
4.2. Failure Cases
There are two main types of failure cases:
• Missing cases in the ground truth of the KITTI dataset, such as the missing pedestrian and cyclist in the second example of Fig. 1 and 2. HVNet surprisingly detects some of the objects in these cases.
• Some actual failure cases of HVNet, visualized in Fig. 4. In the first example, a pickup truck is detected as two separate cars, caused by the lack of points in the middle of the car. The waste bin and grass in the second example are regarded as pedestrian and cyclist respectively, both of which are very confusing when using LiDAR only.
4.3. Feature Learning of
We visualize the features of scale
in Fig. 3. Each point cloud has features of three scales as a counterpart. We choose the first channel of each feature and plot it.
Figure 1. Qualitative analysis on KITTI validation set with Kitti Viewer [28]. We show 3D bounding boxes on the point cloud along with projected 2D bounding boxes on the image. In each image, blue boxes indicate the ground truth and red boxes indicate detections by HVNet.
Figure 2. Qualitative analysis on KITTI test set with Kitti Viewer. We show 3D bounding boxes on the point cloud along with projected 2D bounding boxes on the image. In each image, red boxes indicate detections by HVNet.