3D object detection has seen quick progress thanks to advances in deep learning on point clouds. A few recent works have even shown state-of-the-art performance with just point clouds input (e.g. VOTENET). However, point cloud data have inherent limitations. They are sparse, lack color information and often suffer from sensor noise. Images, on the other hand, have high resolution and rich texture. Thus they can complement the 3D geometry provided by point clouds. Yet how to effectively use image information to assist point cloud based detection is still an open question. In this work, we build on top of VOTENET and propose a 3D detection architecture called IMVOTENET specialized for RGB-D scenes. IMVOTENET is based on fusing 2D votes in images and 3D votes in point clouds. Compared to prior work on multi-modal detection, we explicitly extract both geometric and semantic features from the 2D images. We leverage camera parameters to lift these features to 3D. To improve the synergy of 2D-3D feature fusion, we also propose a multi-tower training scheme. We validate our model on the challenging SUN RGB-D dataset, advancing state-of-the-art results by 5.7 mAP. We also provide rich ablation studies to analyze the contribution of each design choice.
Recognition and localization of objects in a 3D environment is an important first step towards full scene understanding. Even such low dimensional scene representation can serve applications like autonomous navigation and augmented reality. Recently, with advances in deep networks for point cloud data, several works [31, 52, 39] have shown state-of-the-art 3D detection results with point cloud as the only input. Among them, the recently proposed VOTENET  work by Qi et al., taking 3D geometry input only, showed remarkable improvement for indoor object recognition compared with previous works that exploit
Figure 1. Voting using both an image and a point cloud from an indoor scene. The 2D vote reduces the search space of the 3D object center to a ray while the color texture in image provides a strong semantic prior. Motivated by the observation, our model lifts the 2D vote to 3D to boost 3D detection performance.
all RGB-D channels. This leads to an interesting research question: Is 3D geometry data (point clouds) sufficient for 3D detection, or is there any way RGB images can further boost current detectors?
By examining the properties of point cloud data and RGB image data (see for example Fig. 1), we believe the answer is clear: RGB images have value in 3D object detection. In fact, images and point clouds provide complementary information. RGB images have higher resolution than depth images or LiDAR point clouds and contain rich textures that are not available in the point domain. Additionally, images can cover “blind regions” of active depth sensors which often occur due to reflective surfaces. On the other hand, images are limited in the 3D detection task as they lack absolute measures of object depth and scale, which are exactly what 3D point clouds can provide. These observations, strengthen our intuition that images can help point cloud-based 3D detection.
However, how to make effective use of 2D images in a 3D detection pipeline is still an open problem. A na¨ıve way is to directly append raw RGB values to the point clouds – since the point-pixel correspondence can be established through projection. But since 3D points are much sparser, in doing so we will lose the dense patterns from the image domain. In light of this, more advanced ways to fuse 2D and 3D data have been proposed recently. One line of work [32, 48, 17] uses mature 2D detectors to provide initial proposals in the form of frustums. This limits the 3D search space for estimating 3D bounding boxes. However, due to its cascaded design, it does not leverage 3D point clouds in the initial detection. In particular, if an object is missed in 2D, it will be missed in 3D as well. Another line of work [43, 16, 45, 11] takes a more 3D-focused way to concatenate intermediate ConvNet features from 2D images to 3D voxels or points to enrich 3D features, before they are used for object proposal and box regression. The downside of such systems is that they do not use 2D images directly for localization, which can provide helpful guidance for detection objects in 3D.
In our work, we build upon the successful VOTENET architecture  and design a joint 2D-3D voting scheme for 3D object detection named IMVOTENET. It takes advantage of the more mature 2D detectors  but at the same time still reserves the ability to propose objects from the full point cloud itself – combining the best of both lines of work while avoiding the drawbacks of each. A key motivation for our design is to leverage both geometric and semantic/texture cues in 2D images (Fig. 1). The geometric cues come from accurate 2D bounding boxes in images, such as the output by a 2D detector. Instead of solely relying on the 2D detection for object proposal , we defer the proposal process to 3D. Given a 2D box, we generate 2D votes on the image space, where each vote connects from the object pixel to the 2D amodal box center. To pass the 2D votes to 3D, we lift them by applying geometric transformations based on the camera intrinsic and pixel depth, so as to generate “pseudo” 3D votes. These pseudo 3D votes become extra features appended to seed points in 3D for object proposals. Besides geometric cues from the 2D votes, each pixel also passes semantic and texture cues to the 3D points, as either features extracted per-region, or ones extracted per-pixel.
After lifting and passing all the features from the images to 3D, we concatenate them with the 3D point features from a point cloud backbone network [33, 34]. Next, following the VOTENET pipeline, those points with the fused 2D and 3D features generate 3D Hough votes  – not limited by 2D boxes – toward object centers and aggregate the votes to produce the final object detections in 3D. As the seed features have both 2D and 3D information, they are intuitively more informative for recovering heavily truncated objects or objects with few points, as well as more confident in distinguishing geometrically similar objects.
In addition, we recognize that when fusing 2D and 3D sources, one has to carefully balance the information from two modalities to avoid one being dominated by the other. To this end, we further introduce a multi-towered network structure with gradient blending  to ensure our network makes the best use of both the 2D and 3D features. During testing, only the main tower that operates on the joint 2D-
3D features are used, minimizing the sacrifice on efficiency. We evaluate IMVOTENET on the challenging SUN RGB-D dataset . Our model achieves the state-of-the-art results while showing a significant improvement (+5.7 mAP) over the 3D geometry only VOTENET, validating the usefulness of image votes and 2D features. We also provide extensive ablation studies to demonstrate the importance of each individual component. Finally, we also explore the potential of using color to compensate for sparsity in depth points, especially for the case of lower quality depth sensors or for cases where depth is estimated from a moving monocular camera (SLAM), showing potential of our method to more broader use cases. To summarize, the contributions of our work are:
1. A geometrically principled way to fuse 2D object detection cues into a point cloud based 3D detection pipeline.
2. The designed deep network IMVOTENET achieves state-of-the-art 3D object detection performance on SUN RGB-D.
3. Extensive analysis and visualization to understand various design choices of the system.
Advances in 3D sensing devices have led to a surge of methods designed to identify and localize objects in a 3D scene. The most relevant lines of work are detection with point clouds and detection with full RGB-D data. We also briefly discuss a few additional relevant works in the area of multi-modal data fusion.
3D object detection with point clouds. To locate objects using purely geometric information, one popular line of methods is based on template matching using a collection of clean CAD models either directly [19, 26, 24], or through extracted features [42, 2]. More recent methods are based on point cloud deep nets [33, 52, 18, 39, 31]. In the context of 3D scene understanding, there have also been promising results on semantic and instance segmentation [49, 4, 9]. Most relevant to our work are PointRCNN  and Deep Hough Voting (VOTENET)  which demonstrated state-of-the-art 3D object detection in outdoor and indoor scenes, respectively. Notably, these results are achieved without using the RGB input. To leverage this additional information, we propose a way to further boost detection performance in this work.
3D object detection with RGB-D data. Depth and color channels both contain useful information that can be useful for 3D object detection. Prior methods for fusing those two modalities broadly fall into three categories: 2D-driven, 3D-driven, and feature concatenation. The first type of
Figure 2. 3D object detection pipeline for IMVOTENET. Given RGB-D input (with the depth image converted to a point cloud), the model initially have two separate branches: one for 2D object detection on the image and the other for point cloud feature extraction (with a PointNet++  backbone) on the point clouds. Then we lift 2D image votes as well as semantic and texture cues to the 3D seed points (the fusion part). The seed points with concatenated image and point cloud features then generate votes towards 3D object centers and also propose 3D bounding boxes with its features (the joint tower). To push for more effective multi-modality fusion, we have two other towers that take image features only (image tower) and point cloud features only (point tower) for voting and box proposals.
method [17, 32, 6, 48] starts with object detecions in the 2D image, which are then used to guide the search space in 3D. By 3D-driven, we refer to methods that first generate region proposals in 3D and then utilize 2D features to make a prediction, such as the Deep Sliding Shapes . Recently more works focus on fusing 2D and 3D features earlier in the process such as Multi-modal Voxelnet , AVOD , multi-sensor  and 3D-SIS . However, all these mostly perform fusion through concatenation of 2D features to 3D features. Our proposed method is more closely related to the third type, but differs from it in two important aspects. First, we propose to make explicit use of geometric cues from the 2D detector and lift them to 3D in the form of pseudo 3D votes. Second, we use a multi-tower architecture  to balance features from both modalities, instead of simply training on the concatenated features.
Multi-modal fusion in learning. How to fuse signals from multiple modalities is an open research problem in other areas than 3D object detection. For example, the main focus of vision and language research is on developing more effective ways to jointly reason over visual data and texts [7, 30, 50] for tasks like visual question answering [1, 14]. Another active area of research is video+sound [28, 8], where the additional sound track can either provide supervision signal , or propose interesting tasks to test joint understanding of both streams . Targeting at all such tasks, a recent gradient blending approach  is proposed to make the multi-modal network more robust (to over-fitting and different convergence rates), which is adopted in our approach too.
We design a 3D object detection solution suited for RGB-D scenes, based on the recently proposed deep Hough voting framework (VOTENET ) by passing geometric and semantic/texture cues from 2D images to the voting process (as illustrated in Fig. 2). In this section, after a short summary of the original VOTENET pipeline, we describe how to build ‘2D votes’ with the assistance of 2D detectors on RGB and explain how the 2D information is lifted to 3D and passed to the point cloud to improve the 3D voting and proposal. Finally, we describe our multi-tower architecture for fusing 2D and 3D detection with gradient blending . More implementation details are provided in supplement.
3.1. Deep Hough Voting
VOTENET  is a feed-forward network that consumes a 3D point cloud and outputs object proposals for 3D object detection. Inspired by the seminal work on the generalized Hough transform , VOTENET proposes an adaptation of the voting mechanism for object detection to a deep learning framework that is fully differentiable.
Specifically, it is comprised of a point cloud feature extraction module that enriches a subsampled set of scene points (called seeds) with high-dimensional features (bottom of Fig. 2 from input points to seeds). These features are then pushed through a Multi-Layer-Perceptron (MLP) to generate votes. Every vote is both a point in the 3D space with its Euclidean coordinates (3-dim) supervised to be close to the object center, and a feature vector learned for the final detection task (F-dim). The votes form a clustered point cloud near object centers and are then processed by another point cloud network to generate object proposals and classification scores. This process is equivalent to the pipeline in Fig. 2 with just the point tower and without the image detection and fusion.
VOTENET recently achieved state-of-the-art results on indoor 3D object detection in RGB-D . Yet, it is solely based on point cloud inputs and neglects the image channels which, as we show in this work, are a very useful source of information. In IMVOTENET, we leverage the additional image information and propose a lifting module from 2D votes to 3D that improves detection performance. Next, we explain how to get 2D votes in images and how we lift its geometric cues to 3D together with semantic/texture cues.
3.2. Image Votes from 2D Detection
We generate image votes based on a set of candidate boxes from 2D detectors. An image vote, in its geometric part, is simply a vector connecting an image pixel and the center of the 2D object bounding box that pixel belongs to (see Fig. 1). Each image vote is also augmented with its semantic and texture cues from the features of its source pixel, such that each image vote has dimension in total as in the fusion block in Fig. 2.
To form the set of boxes given an RGB image, we apply an off-the-shelf 2D detector (e.g. Faster R-CNN ) pre-trained on color channels of the RGB-D dataset. The detector outputs the M most confident bounding boxes and their corresponding classes. We assign each pixel within a detected box a vote to the box center. Pixels inside multiple boxes are given multiple votes (corresponding 3D seed points are duplicated for each of them), and those outside of any box are padded with zeros. Next we go to details on how we derive geometric, semantic and texture cues.
Geometric cues: lifting image votes to 3D The translational 2D votes provide useful geometric cues for 3D object localization. Given the camera matrix, the 2D object center in the image plane becomes a ray in 3D space connecting the 3D object center and the camera optical center (Fig. 1). Adding this information to a seed point can effectively narrow down the 3D search space of the object center to 1D.
In details, as shown in Fig. 3, given an object in 3D with its detected 2D bounding box in the image plane, we denote the 3D object center as C and its projection onto the image as c. A point P on the object surface is associated with its projected point p in the image place, hence knowing the 2D vote to the 2D object center c, we can reduces the search space for the 3D center to a 1D position on the ray OC. We now derive the computation we follow to pass the ray information to the a 3D seed point. Defining in the camera coordinate, and in the image plane coordinate, we seek to recover the 3D object
Figure 3. Illustration of the pseudo 3D vote. In the figure, P is a surface point in 3D, C is the unknown object center while p and c are their projections on the image plane respectively. pseudo 3D center and the vector is the pseudo 3D vote.
center (voting target for the 3D point P). The true 3D vote from P to C is:
The 2D vote, assuming a simple pin-hole camera 1 with focal length f, can be written as:
We further assume the depth of the surface point P is similar to the center point C. This is a reasonable assumption for most objects when they are not too close to the camera. Then, given , we compute ,
which we refer to as a pseudo 3D vote, as lies on the ray OC and is in the proximity of C. This pseudo 3D vote provides information about where the 3D center is relative to the point surface point P.
To compensate for the error caused by the depth approximation (), we pass the ray direction as extra information to the 3D surface point. The error (along the X-axis) caused by the approximated depth, after some derivation, can be expressed by
Hence, if we input the direction of the ray : , the network should have more information to estimate the true 3D vote by estimate the depth different . As we do not know the true 3D object center C, we can use the ray direction of which aligns with
OC after all, where
Normalizing and concatenating with the pseudo vote, the image geometric features we pass to the seed point P are:
Semantic cues On top of the geometric features just discussed that just use the spatial coordinates of the bounding boxes, an important type of information RGB can provide is features that convey a semantic understanding of what’s inside the box. This information often complements what can be learned from 3D point clouds and can help to distinguish between classes that are geometrically very similar (such as table vs. desk or nightstand vs. dresser).
In light of this, we provide additional region-level features extracted per bounding box as semantic cues for 3D points. For all the 3D seed points that are projected within the 2D box, we pass a vector representing that box to the point. If a 3D seed point falls into more than one 2D boxes (i.e., when they overlap), we duplicate the seed point for each of the overlapping 2D regions (up to a maximum number of K). If a seed point is not projected to any 2D box, we simply pass an all-zero feature vector for padding.
It is important to note that the ‘region features’ here include but are not limited to features extracted from RoI pooling operations . In fact, we find representing each box with a simple one-hot class vector (with a confidence score for that class) is already sufficient to cover the semantic information needed for disambiguation in 3D. It not only gives a light-weight input (e.g. 10-dim  vs. 1024-dim ) that performs well, but also generalizes to all other competitive (e.g. faster) 2D detectors [35, 25, 22] that do not explicitly use RoI but directly outputs classification scores. Therefore, we use this semantic cue by default.
Texture cues Different from the depth information that spreads sparsely in the 3D space, RGB images can capture high-resolution signals at a dense, per-pixel level in 2D. While region features can offer a high-level, semantic-rich representation per bounding box, it is complementary and equally important to use the low-level, texture-rich representations as another type of cues. Such cues can be passed to the 3D seed points via a simple mapping: a seed point gets pixel features from the corresponding pixel of its 2D
Although any learned, convolutional feature maps with spatial dimensions (height and width) can serve our purpose, by default we still use the simplest texture feature by feeding in the raw RGB pixel-values directly. Again, this choice is not only light-weight, but also makes our pipeline independent of 2D networks.
Experimentally, we show that even with such minimalist choice of both our semantic and texture cues, significant performance boost over geometric-only VOTENET can be achieved with our multi-tower training paradigm, which we discuss next.
3.3. Feature Fusion and Multi-tower Training
With lifted image votes and its corresponding semantic and texture cues (in the fusion block in Fig. 2) as well as the point cloud features with the seed points , each seed point can generate 3D votes and aggregate them to propose 3D bounding boxes (through a voting and proposal module similar to that in ). Yet it takes extra care to optimize the deep network to fully utilize cues from all modalities. As a recent paper  mentions, without a careful strategy, multi-modal training can actually result in degraded performance as compared to a single modality training. The reason is that different modalities may learn to solve the task at different rates so, without attention, certain features may dominate the learning and result in over-fitting. In this work, we follow the gradient blending strategy introduced in  to weight the gradient for different modality towers (by weighting the loss functions).
In our multi-tower formulation, as shown in Fig. 2, we have three towers taking seed points with three sets of features: point cloud features only, image features only and joint features. Each tower has the same target task of detecting 3D objects – but they each have their separate 3D voting and box proposal network parameters as well as their separate losses. The final training loss is the weighted sum of three detection losses:
Within the image tower, while image features alone cannot localize 3D objects, we have leveraged surface point geometry and camera intrinsic to have pseudo 3D votes that are good approximations to the true 3D votes. So combining this image geometric cue with other semantic/texture cues we can still localize objects in 3D with image features only.
Note that, although the multi-tower structure introduces extra parameters, at inference time we no longer need to compute for the point cloud only and the image only towers – therefore there is minimal computation overhead.
Table 1. 3D object detection results on SUN RGB-D v1 val set. Evaluation metric is average precision with 3D IoU threshold 0.25 as proposed by . Note that both COG  and 2D-driven  use room layout context to boost performance. The evaluation is on the SUN RGB-D v1 data for fair comparisons.
In this section, we first compare our model with previous state-of-the-art methods on the challenging SUN RGB-D dataset (Sec. 4.1). Next, we provide visualizations of detection results showing how image information helps boost the 3D recognition (Sec. 4.2). Then, we present an extensive set of analytical experiments to validate our design choices (Sec. 4.3). Finally, we test our method in the conditions of very sparse depth, and demonstrate its robustness (Sec. 4.4) in such scenarios.
4.1. Comparing with State-of-the-art Methods
Benchmark dataset. We use SUN RGB-D [40, 13, 47, 41] as our benchmark for evaluation, which is a single-view 3 RGB-D dataset for 3D scene understanding. It consists of 10K RGB-D images, with 5K for training. Each image is annotated with amodal oriented 3D bounding boxes. In total, 37 object categories are annotated. Following standard evaluation protocol , we only train and report results on the 10 most common categories. To feed the data to the point cloud backbone network, we convert the depth images to point clouds using the provided camera parameters. The RGB image is aligned to the depth channel and is used to query corresponding image regions from scene 3D points.
Methods in comparison. We compare IMVOTENET with previous methods that use both geometry and RGB. Moreover, since previous state-of-the-art (VOTENET ) used only geometric information, to better appreciate the improvement due to our proposed fusion and gradient blending modules we add two more strong baselines by extending the basic VOTENET with additional features from image.
Among the previous methods designed for RGB-D, 2D-driven , PointFusion  and F-PointNet  are all cascaded systems that rely on 2D detectors to provide proposals for 3D. Deep Sliding Shapes  designs a Faster R-CNN  style 3D CNN network to generate 3D proposals from voxel input and then combines 3D and 2D RoI features for box regression and classification. COG  is a sliding shape based detector using 3D HoG like feature extracted from RGB-D data.
As for the variations of VOTENET , the first one, ‘+RGB’, directly appends the the RGB values as a three-dimensional vector to the point cloud features (of the seed points). For the second one (‘+region feature’), we use the same pre-trained Faster R-CNN (as in our model) to obtain the region-level one-hot class confidence feature, and concatenate it to the seed points inside that 2D box frustum. These two variations can also be viewed as ablated versions of our method.
Results. Table 1 shows the per-class 3D object detection results on SUN RGB-D. We can see that our model outperforms all previous methods by large margins. Especially, it improves upon the previously best model VOTENET by 5.7 mAP, showing effectiveness of the lifted 2D image votes. It gets better results on nearly all categories and has the biggest improvements on object categories that are often occluded (+12.5 AP for bookshelves) or geometrically similar to the others (+11.6 AP for dressers and +7.7 AP for nightstands).
Compared to the variations of the VOTENET that also uses RGB data, our method also shows significant advantages. Actually we find that naively appending RGB values to the point features resulted in worse performance, likely due to the over-fitting on RGB values. Adding region features as a one-hot score vector helps a bit but is still inferior compared to our method that more systematically leverage image votes.
4.2. Qualitative Results and Discussion
In Fig. 4, we highlight detection results of both the original VOTENET  (with only point cloud input) and our IMVOTENET with point cloud plus image input, to show how image information can help 3D detection in various ways. The first example shows how 2D object localization
Figure 4. Qualitative results showing how image information helps. First row: the bookshelf is detected by IMVOTENET thanks to the cues from the 2D detector; Second row: the black sofa has barely any depth points due to its material, but leveraging images, we can detect it; Third row: with 2D localization cues and semantics, we detect the desk and chairs in the back which are even missed by ground truth annotations. Best viewed in color with zoom in.
and semantic help. We see a cluttered bookshelf that was missed by the VOTENET but thanks to the 2D detection in the images, we have enough confidence to recognize it in our network. The image semantics also help our network to avoid the false positive chair as that in the VOTENET output (coffee table and candles confused the network there). The second example shows how images can compensate depth sensor limitations. Due to the color and material of the black sofa, there is barely any depth point captured for it. While VOTENET completely misses the sofa, our network is able to pick it up. The third example shows how image cues can push the limit of 3D detection performance, by recovering far away objects (the desk and chairs in the back) that are even missed in the ground truth annotations.
4.3. Analysis Experiments
In this subsection, we show extensive ablation studies on our design choices and discuss how different modules affect the model performance. For all experiments we report mAP@0.25 on SUN RGB-D as before.
Analysis on geometric cues. To validate that geometric cues lifted from 2D votes help, we ablate geometric features (as in Eq. 6) passed to the 3D seed points in Table 2a. We see that from row 1 to row 3, not using any 2D geometric cue results in a 2.2 point drop. On the other hand, not using the ray angle resulted in a 1.2 point drop, indicating the ray angle helps provide corrective cue to the pseudo 3D votes.
Analysis on semantic cues. Table 2b shows how different types of region features from the 2D images affect 3D detection performance. We see that the one-hot class score vector (probability score for the detected class, other classes set to 0), though simple, leads to the best result. Directly using the 1024-dim RoI features from the Faster R-CNN network actually got the worst number likely due to the optimization challenge to fuse this high-dim feature with the rest point features. Reducing the 1024-dim feature to 64-dim helps but is still inferior to the simple one-hot score feature.
Analysis on texture cues. Table 2c shows how different low-level image features (texture features) affect the end detection performance. It is clear that the raw RGB features are already effective while the more sophisticated per-pixel CNN features (from feature pyramids  of the Faster RCNN detector) actually hurts probably due to over-fitting. More details are in the supplementary material.
Table 2. Ablation analysis on 2D cues. We provide detailed analysis on all types of features from 2D (see Sec. 3.2 for detailed descriptions).
Table 3. Analysis on multi-tower training. In the first block we show performance without blending in gray. Then we show the setting where each of the tower dominates (0.8) the overall training. Finally we show our default setting where weights are more balanced.
Gradient blending. Table 3 studies how tower weights affect the gradient blending training. We ablate with a few sets of representative weights ranging from single tower training (the first row), dominating weights for each of the tower (2nd to 4th rows) and our best set up. It is interesting to note that even with just the image features (the 1st row, 4th column) i.e. the pseudo votes and semantic/texture cues from the images, we can already outperform several previous methods (see Table 1), showing the power of our fusion and voting design.
4.4. Detection with Sparse Point Clouds
While depth images provide dense point clouds for a scene (usually 10k to 100k points), there are other scenarios that only sparse points are available. One example is when the point cloud is computed through visual odometry  or Structure from Motion (SfM)  where 3D point positions are triangulated by estimating poses of a monocular camera in multiple views. With such sparse data, it is valuable to have a system that can still achieve decent detection performance.
To analyze the potential of our model with sparse point clouds, we simulate scans with much less points through two types of point sub-sampling: uniformly random sub-sampling (remove existing points with a uniform distribution) and ORB  key-point based sub-sampling (sam-
Table 4. Sparse point cloud experiment, where we sub-sample the number of points in the cloud either via random uniform sampling or with ORB key points . In such cases, our IMVOTENET significantly outperforms purely geometry based VOTENET.
ple ORB key points on the image and only keep 3D points that project close to those 2D key points). In Table 5, we present detection results with different distribution and density of point cloud input. We see that in the column of “point cloud”, with decreased number of points, 3D detection performance quickly drops. On the other hand, we see including image cues significantly improves performance. This improvement is most significant when the sampled points are from ORB key points that are more non-uniformly distributed.
In this work we have explored how image data can assist a voting-based 3D detection pipeline. The VOTENET detector we build upon relies on a voting mechanism to effectively aggregate geometric information in point clouds. We have demonstrated that our new network, IMVOTENET, can leverage extant image detectors to provide both geometric and semantic/texture information about an object in a format that can be integrated into the 3D voting pipeline. Specifically, we have shown how to lift 2D geometric information to 3D, using knowledge of the camera parameters and pixel depth. IMVOTENET significantly boosts 3D object detection performance exploiting multi-modal training with gradient blending, especially in settings when the point cloud is sparse or unfavorably distributed.
 Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015. 3
 Armen Avetisyan, Manuel Dahnert, Angela Dai, Manolis Savva, Angel X. Chang, and Matthias Nießner. Scan2cad: Learning cad model alignment in rgb-d scans. In CVPR, 2019. 2
 Dana H Ballard. Generalizing the hough transform to detect arbitrary shapes. Pattern recognition, 13(2):111–122, 1981. 3
 Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. arXiv preprint arXiv:1904.08755, 2019. 2
 Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 6
 Zhuo Deng and Longin Jan Latecki. Amodal detection of 3d objects: Inferring 3d bounding boxes from 2d ones in rgbdepth images. In CVPR, volume 2, 2017. 3
 Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016. 3
 Ruohan Gao and Kristen Grauman. 2.5 d visual sound. In CVPR, 2019. 3
 Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018. 2
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 11
 Ji Hou, Angela Dai, and Matthias Nießner. 3d-sis: 3d seman- tic instance segmentation of rgb-d scans. In CVPR, 2019. 2, 3
 Paul VC Hough. Machine analysis of bubble chamber pic- tures. In Conf. Proc., 1959. 2
 Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T Barron, Mario Fritz, Kate Saenko, and Trevor Darrell. A category-level 3d object dataset: Putting the kinect to work. In Consumer depth cameras for computer vision. 2013. 6
 Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017. 3
 Jan J Koenderink and Andrea J Van Doorn. Affine structure from motion. JOSA A, 8(2):377–385, 1991. 8
 Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In IROS. IEEE, 2018. 2, 3
 Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019. 2
 Yangyan Li, Angela Dai, Leonidas Guibas, and Matthias Nießner. Database-assisted object retrieval for real-time 3d reconstruction. In Computer Graphics Forum, volume 34. Wiley Online Library, 2015. 2
 Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In ECCV, pages 641–656, 2018. 3
 Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. In ICCV, 2017. 5
 Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 11
 Or Litany, Tal Remez, Daniel Freedman, Lior Shapira, Alex Bronstein, and Ran Gal. Asist: automatic semantically invariant scene transformation. CVIU, 157:284–299, 2017. 2
 Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37. Springer, 2016. 5
 Liangliang Nan, Ke Xie, and Andrei Sharf. A search-classify approach for cluttered indoor scene understanding. ACM TOG, 31(6), 2012. 2
 David Nist´er, Oleg Naroditsky, and James Bergen. Visual odometry. In CVPR, 2004. 8
 Andrew Owens, Phillip Isola, Josh McDermott, Antonio Tor- ralba, Edward H Adelson, and William T Freeman. Visually indicated sounds. In CVPR, 2016. 3
 Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In ECCV. Springer, 2016. 3
 Ethan Perez, Florian Strub, Harm De Vries, Vincent Du- moulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018. 3
 Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. CVPR, 2017. 2
 Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Point- net++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017. 2, 3, 11
 Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016. 5
 Zhile Ren and Erik B Sudderth. Three-dimensional object detection and layout prediction using clouds of oriented gradients. In CVPR, 2016. 6
 Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointr- cnn: 3d object proposal generation and detection from point cloud. arXiv preprint arXiv:1812.04244, 2018. 1, 2
 Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. ECCV, pages 746–760, 2012. 6
 Shuran Song and Jianxiong Xiao. Sliding shapes for 3d ob- ject detection in depth images. In ECCV. Springer, 2014. 2
 Minhyuk Sung, Vladimir G Kim, Roland Angst, and Leonidas Guibas. Data-driven structural priors for shape completion. ACM TOG, 2015. 5
 Chen Wang, Danfei Xu, Yuke Zhu, Roberto Mart´ın-Mart´ın, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In CVPR, pages 3343–3352, 2019. 2, 3
 Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal networks hard? arXiv preprint arXiv:1905.12681, 2019. 2, 3, 5
 Jianxiong Xiao, Andrew Owens, and Antonio Torralba. Sun3d: A database of big spaces reconstructed using sfm and object labels. In ICCV. IEEE, 2013. 6
 Li Yi, Wang Zhao, He Wang, Minhyuk Sung, and Leonidas Guibas. Gspn: Generative shape proposal network for 3d instance segmentation in point cloud. arXiv preprint arXiv:1812.03320, 2018. 2
 Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. TNN, 2018. 3
 Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Von- drick, Josh McDermott, and Antonio Torralba. The sound of pixels. In ECCV, 2018. 3
 Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In CVPR, 2018. 1, 2
In this supplementary, we provide more details on the IMVOTENET architecture in Sec. B, including point cloud network architecture, 2D detector, 2D votes and image votes lifting. We also show visualizations of the sparse point clouds in Sec. C.
In this section, we explain the details in the IMVOTENET architecture. Sec. B.1 provides details in the point cloud deep net as well as the training procedure. Further details on the 2D detector and 2D votes are described in Sec. B.2 while details on lifting 2D votes with general camera parameters are described in Sec. B.3.
B.1. Point Cloud Network
Input and data augmentation. The point cloud backbone network takes a randomly sampled point cloud of a SUN RGB-D  depth image with 20k points. Each point has its XY Z coordinate as well as its height (distance to floor). The floor height is estimated as the 1% percentile of heights of the all points. Similar to , we augment the input point cloud by randomly sub-sampling the points from the depth image points on-the-fly. Points are also randomly flipped in both horizontal directions and randomly rotated along the up-axis by Uniform[-30,30] degrees. Points are also randomly scaled by Uniform[-.85, 1.15]. Note that the point height and the camera extrinsic are updated accordingly with the augmentation.
Network architecture. We adopt the same PointNet++  backbone network as that in  with four set abstraction (SA) layers and two feature propagation/upsamplng (FP) layers. With input of where N=20k, the output of the backbone network is a set of seed points of where K=1024 and C=256.
As for voting, different from VOTENET that directly predicts votes from the seed points, here we fuse lifted image votes and the seed points before voting. As each seed point can fall into multiple 2D detection boxes, we duplicate a seed point q times if it falls in q overlapping boxes. Each duplicated seed point has its feature augmented with a concatenation of the following image vote features: 5-dim lifted geometric cues (2 for the vote and 3 for the ray angle), 10-dim (per-class) semantic cues and 3-dim texture cues. In the end the fused seed point has 3-dim XY Z coordinate and a 274-dim feature vector.
The voting layer takes the seed point and maps its features to votes through a multi-layer perceptron (MLP) with FC output sizes of 256, 256 and 259, where the last FC layer outputs XYZ offset and feature residuals (with regard to the 256-dim seed feature) for the votes. As in , the proposal module is another set abstraction layer that takes in the generated votes and generate proposals of shape where is the number of total duplicated seed points and the output dimension consists of 2 objectness scores, 3 center regression values, 2NH numbers for heading regression (NH heading bins) and 4NS numbers for box size regression (NS box anchors) and NC numbers for semantic classification.
Training procedure. We pre-train the 2D detector as described more in Sec. B.2 and use the extracted image votes as extra input to the point cloud network. We train the point cloud deep net with the Adam optimizer with batch size 8 and an initial learning rate of 0.001. The learning rate is decayed by after 80 epochs and then decayed by another after 120 epochs. Finally, the training stops at 140 epochs as we find further training does not improve performance.
B.2. 2D Detector and 2D Cues
2D detector training. While IMVOTENET can work with any 2D detector, in this paper we choose Faster RCNN , which is the current dominant framework for bounding box detection in RGB. The detector we used has a basic ResNet-50  backbone with Feature Pyramid Networks (FPN)  constructed as . It is pre-trained on the COCO train2017 dataset  achieving a val2017 AP of 41.0. To adapt the COCO detector to the specific dataset for 2D detection, we further fine-tune the model using ground truth 2D boxes from the training set of SUN-RGBD before applying the model only using the color channels. The fine-tuning lasts for 4K iterations, with the learning rate reduced by 10at 3K-th iteration starting from 0.01. The batch size, weight decay, and momentum are set as 8, 1e-4, and 0.9, respectively. Two data augmentation techniques are used: 1) standard left-right flipping; and 2) scale augmentation by randomly sample the shorter side of the input image from [480,600]. The resulting detector achieves a mAP (at overlap 0.5) of 58.5 on val set.
Note that we specifically choose not to use the most advanced 2D detectors (e.g. based on ResNet-152 ) just for the sake of performance improvement. As our experimental results shown in the main paper, even with this simple baseline Faster R-CNN, we can already see significant boost thanks to the design of IMVOTENET.
2D boxes. To infer 2D boxes using the detector, we first resize the input image to a shorter side of 600 before feeding into the model. Then top 100 detection boxes across all classes for an image is aggregated. We further reduce the number of 2D boxes per image by filtering out any detection with a confidence score below 0.1. Two things to note about the 2D boxes used while training IMVOTENET: 1) we could also train with ground truth 2D boxes, however we empirically found that including them for training hurts performance, likely due to the different detection statistics at test time; 2) as the pre-training for the 2D detector is also performed on the same training set, it generally gives better detection results on SUN RGB-D train set images, to reduce the effect of over-fitting, we randomly dropped 2D boxes with a probability of 0.5.
Alternative semantic cues. Other than the default semantic cue to represent each 2D box region as the one-hot classi-fication score vector (the detected class has the value of the confidence score from the detector, all other locations have zeros), we further experimented with dense RoI features extracted from that region. Two variants are reported in the paper, with the 1024-dim one being the output from the last FC layer before region classification and regression. For the 64-dim one, we insert an additional FC layer before the fi-nal output layers so that region information is compressed into the 64-dim vector. The added layer is pre-trained with the 2D detector (resulting in a val mAP of 57.9) and fixed when training IMVOTENET.
Alternative texture cues. The default texture cue is the raw RGB values (normalized to [-1,1]). Besides this simple texture cue, we also experimented more advanced per-pixel features. One handy feature that preserves such spatial information is the feature maps from FPN that fuse topdown and lateral connections . Here k is the index to the layers in the feature pyramid, which also designates the feature strides and size. For example, has a stride of 2for both height and width; and a spatial size of roughly 1/16 of the input image4. For the strides are 2maps have a channel size of 256, which becomes the input dimension when used as texture cues for IMVOTENET.
B.3. Image Votes Lifting
In the main paper we derived the lifting process to transform a 2D image vote to a 3D pseudo vote without considering the camera extrinsic. As the point cloud sampled from depth image points is transformed to the upright coordinate before feeding to the point cloud network (through camera extrinsic R as a rotational matrix), the 3D pseudo vote also needs to be transformed to the same coordinate.
Fig. 5 shows the surface point P, object center C and the end point of the pseudo vote . Since the point cloud is in the upright coordinate, the point cloud deep net can only estimate the depth displacement of P and C along the direction (it cannot estimate the depth displacement along the direction as the rotational angles from camera to upright coordinate are unknown to the network). Therefore,
Figure 5. Image vote lifting with camera extrinsic. Here we show surface point P and object center C in two coordinate systems: camera coordinate and upright coordinate (OY is along gravitational direction). is the true 3D vote. is the pseudo vote as calculated in the main paper and is the transformed pseudo vote finally used in feature fusion.
we need to calculate a new pseudo vote where is on the ray OC and is perpendicular to the .
To calculate the we need to firstly transform P and to the upright coordinate. Then assume and in the upright coordinate, we can compute:
In the Sec. 4.4 of the main paper we showed how image information and IMVOTENET model can be specially helpful in detections with sparse point clouds. Here in Table. 5 we visualize the sampled sparse point clouds on three example SUN RGB-D images. We project the sampled points to the RGB images to show their distribution and density. We see that the 20k points in the first row have a dense and uniform coverage of the entire scene. After randomly subsampling the points to 5k and 1k points in the second and third rows respectively, we see the coverage is much more sparse but still uniform. In contrast the ORB key point based sampling (the last two rows) results in very uneven distribution of points where they are clustered around corners and edges. The non-uniformity and low coverage of ORB key points makes it especially difficult to recognize objects in point cloud only. That’s also where our IMVOTENET model showed the most significant improvement upon VOTENET.
sampling method # points 1 2 3
Table 5. Sparse point cloud visualization. We show projected point clouds on three SUN RGB-D images and compare point density and distribution among random sampling (to 20k, 5k and 1k points) and ORB key points based sampling (to 5k and 1k points). For ORB key point sampling, we firstly detect ORB key points in the RGB images with an ORB key point detector and then keep 3D points that are projected near those key points. Best viewed in color with zoom in.