RGB-D object detection attempts to localize and classify objects within an image with depth information. It is one of the core technologies in robotics applications and can be beneficial to many intelligent tasks, including pose estimation [1], [2], content-based image retrieval [3] and robot task planning [4]. In recent years, the successful application of deep convolutional neural networks has pushed this research into a new phase and achieved very good results. Most CNN-based RGB-D object detection frameworks are extended from RCNN-based object detectors [5]–[7] for RGB images. R-CNN-Depth [8] is the first deep learning framework for RGB-D object detection that extends the R-CNN system [5] to take advantage of depth information by incorporating two parallel network streams for both RGB and depth modalities. This two-stream pipeline later became the basis for many visual perception tasks in RGB-D images [9]–[12]. In this framework, the features from the RGB and depth modalities are computed independently and concatenated after applying fully connected layers for final proposal classification. However, this pipeline has its own limitations: (1) Independent feature computation and simple feature concatenation ignore the correlation between the two modalities. (2) Only information inside the object proposal is used for object classification, which neglects the auxiliary role of context information outside the bounding box in object classification.
The first two authors contributed equally to this paper. The corresponding author is Liang Lin. G. Li, Y. Gan, H. Wu, N. Xiao and L. Lin are with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China.
In this paper, we propose a Cross Modal Attentional Context (CMAC) learning framework for RGB-D object detection that incorporates the consistency and complementary information between two diverse modalities (RGB and depth), as well as an attentional model for global context mining and discriminative object part discovery. To exploit the correlation between RGB and depth modalities, the CMAC model employs a cross-modal feature fusion component to fuse the features extracted from the output feature maps of the two fully convolutional networks (with different input sources). Instead of directly applying fused features to classification and object location refinement, our proposed CMAC model further learns attentional context and explores discriminative object parts based on the fused features. We believe that both the attentional global context and the discriminative parts attended inside each possible object region (object proposal) are crucial for accurate RGB-D object detection.
To capture the global context, our model employs a recurrent attention model that consists of multiple stacked Long Short-Term Memory (LSTM) units. The recurrent neural network is optimized to infer relevant regions for each given region proposal. As shown in Figure 1, the regions that are considered helpful for classification of the object proposal are highlighted. As can be seen, our proposed CMAC model can identify an adaptive global context for different object proposals. For example, when the input region proposal contains a monitor, the regions of the keyboard, parts of the table around the target monitor and the other monitor are highlighted; when the input region proposal contains a chair, the regions including parts of the table and other chairs are assigned higher weights in the final classification. Moreover, inspired by the fact that humans tend to quickly capture distinguishable parts for more accurate object classification when observing objects with occluded regions, we propose to further incorporate a fine-grained object part attention module in our network framework. Considering the flexible attention mechanism and the excellent spatial manipulation ability of Spatial Transformer Networks (STNs), we adopt multiple STNs in parallel to examine the discriminative parts located inside a specific object proposal for capturing local context. As illustrated in Figure 1, the CMAC model is able to successfully locate the most discriminative locations that can differentiate an object's category (e.g., the main screen and the base of the monitors, as well as the back and legs of the chairs). Acquiring such fine-grained object parts provides enhanced feature representations for region proposals.
Fig. 1: Example visualization results for global context and object part attention generated by our proposed CMAC model. For global context, information from relevant regions (the highlighted regions) of the object proposals is obtained through a recurrent attentional model. For local context, multiple parallel spatial transformers are utilized to exploit information from the discriminative parts (green rectangles) of the object proposals. Red rectangles indicate the object proposals.
In summary, the main contributions of the proposed CMAC model can be listed as follows:
- We propose a novel Cross Modal Attentional Context (CMAC) deep learning framework that effectively incorporates the correlated information between different modalities and successfully identifies useful contextual information both locally and globally for RGB-D object detection.
- An attention-based global context module, based on an LSTM network, is utilized to recurrently generate contextual information from a global view for each object proposal.
- Multiple spatial transformer networks are adopted in parallel to localize discriminative object parts for accurate object recognition.
- Extensive experiments on the SUNRGBD and NYUv2 datasets demonstrate the effectiveness of the proposed CMAC model, which outperforms the state-of-the-art method [10] by 3.7% and 3.2%, respectively, in terms of mAP.
A. Object Detection in RGB-D Images
Object detection in RGB-D images has attracted increased attention because of the rapid development of affordable depth sensors and their diverse application scenarios. Many successful algorithms have been proposed to effectively exploit information from RGB-D data. [13] and [14] took advantage of hand-designed features such as SIFT and multiple shape features in the depth channel for RGB-D object recognition. Schwarz et al. [11] utilized two-stream CNNs pre-trained on ImageNet to extract features from RGB-D images. While most work mainly focuses on the RGB modality, some recent work has been dedicated to improving the object detection performance by taking depth information into consideration. Gupta et al. [8] proposed a geocentric embedding to convert each single-channel depth map into a three-channel depth image (HHA image), in which they encoded each pixel with three channels of information, i.e., the height above the ground, the horizontal disparity and the angle with respect to gravity. They also introduced a generalized method for the R-CNN detector that can be applied to RGB-D images; they used large CNNs pre-trained on RGB images to extract features from HHA data. To learn rich representations for the depth modality, [10] transferred supervisions from labeled RGB images to unlabeled depth images. In this paper, we follow [8] and encode depth information into HHA images for improved feature learning and take the model in [10] as our compared baseline model.
Another core issue of RGB-D object detection is how to merge the features from different sources. Existing fusion strategies can be divided into two streams: (1) Early fusion [13], [15], [16], in which the depth channel is treated as an extra channel of the RGB image and is concatenated with the RGB channels for feature extraction. (2) Late fusion [8]–[10], [17], where features are separately learned for each modality and are concatenated at later stages for object classification. Our model is similar to the late fusion approach, but instead of directly concatenating features for classification, we apply the attention model to the fused features to learn a better global context and discriminative object parts to achieve more accurate object recognition.
Fig. 2: The network architecture of our proposed cross-modal attentional context (CMAC) learning framework. The input consists of one RGB image and one HHA image (geocentric encoding of the depth image). Our network framework is composed of four components: convolutional feature extraction, cross-modal feature fusion, attention-based global context modeling and fine-grained object part attention.
B. Context Information in Object Detection
Context information has been applied in many methods to enhance the performance of object detection [18]–[24]. For instance, [24] exploited context from information about the entire scene for object detection and localization. [21] explored contextual relationships between regions in an unsupervised manner, where objects are detected using a discriminative approach. Spatial support and geographic information are used as context clues in [20]. Context models have also been applied to deep-learning-based object detectors. [25] proposed a group recursive learning approach to refine object proposals by incorporating semantic and spatial layout correlations of surrounding proposals. Chu et al. [26] formulated a fully connected conditional random field (CRF) to incorporate the local appearance and the contextual information in terms of relationships among objects and the global scene based on contextual features generated by a convolutional neural network. Inside-Outside net (ION) [27] introduced spatial recurrent networks (RNNs) to integrate the contextual information outside the region of interest while utilizing skip pooling to extract fine-grained information from multiple low-level convolutional layers. Although our proposed model also explores global contextual information through recurrent networks, it explicitly learns to attend to the regions most relevant to the object proposal by generating a weight map for each proposal. The weight map reveals the contextual region that contributes to the final classification result. On the other hand, instead of directly extracting local features from the whole object bounding box, our model can achieve better object feature representations by discovering the most discriminative object parts inside the object proposal and performing part-level feature fusion.
C. Recurrent Attention Models
Recurrent attentional models have been widely incorporated in deep-learning-based computer vision tasks [28]–[31] to achieve better performance. Bahdanau et al. [28] introduced recurrent attention to neural machine translation, which allows the model to adaptively attend to the most relevant part of a sentence. [30] adopted visual attention to dynamically select a sequence of regions and only processed the selected regions for efficient computation. A recent work in [31] used an LSTM-based attention model to learn a description of static images. More recently, attention mechanisms have also been applied to vision tasks for videos. For instance, [32] extended an attention model for video description and employed a temporal attention mechanism to model the dynamic temporal structure of videos. [33] optimized the attention model to attend to the relevant parts within a single frame and attached higher importance to them while performing action recognition.
The work that is most relevant to our proposed method is the attentive context proposed in [29], which also incorporated a recurrent attention model to exploit global contextual information. However, the attention model used in [29] generated a static attentive location map for all object proposals. Instead of utilizing a fixed attentive context, our model generates an attentional context feature adaptive to the input region proposals. Furthermore, we employ a fine-grained object part attention module to harness multiple discriminative object parts inside each object proposal for achieving a superior local feature representation. Experimental analysis in Sec. IV-C demonstrates that our method is more robust to background and inter-class noise.
An overview of our framework is illustrated in Fig. 2. Our RGB-D object detection system, which is based on cross modal attentional context learning, is composed of four components, including fully convolutional networks based feature extraction, cross-modal feature fusion, attention-based global context modeling and fine-grained object part attention. We term this network Cross-Modal Attentional Context (CMAC) network. Specifically, given an RGB-D image, we first employ Multiscale Combinatorial Grouping (MCG) [34] to generate a number of object proposals from RGB information and encode the original depth value to the three-channel HHA representation, as proposed in [8]. Following the benchmark object detection framework of Fast R-CNN [6], our CMAC model takes as input an RGB image, an HHA image and corresponding object proposals to generate class labels as well as a refined bounding box for each object proposal.
As shown in Fig. 2, the feature extraction module is built on two separate fully convolutional sub-networks, including the VGG16 model [35] for the RGB modality and the AlexNet model [36] for the depth modality. The output of the last convolutional layer, which consists of D convolutional maps, is treated as our initial feature for object detection. The two fully convolutional sub-networks take as input the RGB image and the HHA image to generate the corresponding feature cubes. Region-of-Interest (RoI) pooling operations are performed on the two feature cubes to obtain both global (whole image) and local (object proposal) features of the two modalities before being fed to a cross-modal feature fusion module. Moreover, both the fused global feature and the fused local feature are fed to a global context modeling module to obtain an attentional global context feature for the corresponding object proposal, while the fused local feature itself is also treated as an input for the fine-grained object part attention, which generates an embedded local feature. Finally, the concatenation of the global context feature and the embedded local feature is employed for final object classification, while only the embedded local feature is applied for bounding box regression.
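The RoI pooling operation mentioned above can be illustrated with a small pure-Python sketch. It max-pools one channel of a feature map over a rectangular region into a fixed grid, in the spirit of Fast R-CNN [6]. The function name `roi_max_pool`, the inclusive cell-coordinate convention and the bin-edge rounding are assumptions made for illustration and are not taken from the authors' implementation. Passing the whole map as the region gives a fixed-size global feature.

```python
def roi_max_pool(feature, roi, out_size=7):
    """Max-pool one channel feature[row][col] over roi into out_size x out_size bins.

    roi = (r1, c1, r2, c2), inclusive, in feature-map cells (assumed convention).
    """
    r1, c1, r2, c2 = roi
    height, width = r2 - r1 + 1, c2 - c1 + 1

    def bin_edges(index, length):
        start = (index * length) // out_size              # floor
        end = -(-((index + 1) * length) // out_size)      # ceiling division
        return start, end

    pooled = []
    for i in range(out_size):
        rs, re = bin_edges(i, height)
        row = []
        for j in range(out_size):
            cs, ce = bin_edges(j, width)
            row.append(max(feature[r1 + r][c1 + c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


# Toy 4x4 map whose value encodes its position.
toy_map = [[r * 4 + c for c in range(4)] for r in range(4)]
global_pooled = roi_max_pool(toy_map, (0, 0, 3, 3), out_size=2)  # whole map
local_pooled = roi_max_pool(toy_map, (1, 1, 2, 2), out_size=2)   # one proposal
```

Because every bin spans at least one cell, the sketch also handles regions smaller than the output grid, in which case neighboring bins simply overlap.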
A. Cross-Modal Feature Fusion
It has been widely verified that the RGB modality and depth modality are complementary, and combining them can help to boost RGB-D object detection performance [8], [10]. In this paper, we exploit the features extracted from the two modalities for both global context modeling and local proposal feature description. Specifically, we design a simple yet effective sub-network to fuse features extracted from both modalities. For each object proposal, we extract a fixed-size feature representation using RoI pooling [6] in both modalities, denoted as $f_l^{rgb}$ and $f_l^{d}$. We also apply a pooling operation to the output feature map of the last convolutional layer of the two fully convolutional networks to generate fixed-size feature cubes, denoted as $f_g^{rgb}$ and $f_g^{d}$, respectively. The feature fusion between the RGB and depth modalities can be represented by
$$F_g = \mathrm{concat}(f_g^{rgb}, f_g^{d}), \qquad F_l = \mathrm{concat}(f_l^{rgb}, f_l^{d}),$$
where $F_g$ and $F_l$ are the global context feature and local object proposal feature after fusion, respectively, and $\mathrm{concat}(\cdot)$ indicates the concatenation operation of feature representations along the channel axis.
In contrast to [8], [10], [37], which apply two independent CNNs to separately extract features from both modalities and directly perform simple concatenation for final classification, our cross-modal feature fusion operation is treated as a feature generation step for further global context modeling and local feature embedding before final classification. In the experiment section, we verify that our proposed cross-modal feature representation can help to produce more effective local and global context information, greatly improving the performance of the final classification.
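As a minimal illustration of the fusion step, the sketch below concatenates two feature cubes along the channel axis. The cubes are stored as nested lists in `[channel][row][col]` order, and the names `concat_channels` and `constant_cube` as well as the toy channel counts are hypothetical; the actual channel counts of the two sub-networks are not assumed here.

```python
def concat_channels(rgb_feat, depth_feat):
    """Concatenate two feature cubes laid out as [channel][row][col]."""
    rgb_size = (len(rgb_feat[0]), len(rgb_feat[0][0]))
    depth_size = (len(depth_feat[0]), len(depth_feat[0][0]))
    if rgb_size != depth_size:
        raise ValueError("spatial sizes must match before fusion")
    # Channel-axis concatenation is list concatenation in this layout.
    return rgb_feat + depth_feat


def constant_cube(channels, height, width, value):
    """Hypothetical helper that builds a toy feature cube."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


rgb_local = constant_cube(3, 7, 7, 1.0)    # stand-in for the RGB RoI feature
depth_local = constant_cube(2, 7, 7, 2.0)  # stand-in for the HHA RoI feature
fused_local = concat_channels(rgb_local, depth_local)
```

The same function applies unchanged to the two global feature cubes, since fusion only requires that both modalities share the same spatial size.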
B. Attention-based Global Context Modeling
It is well known that contextual representation is crucial for accurate visual recognition [27], [29], [38]–[41]. Instead of directly obtaining fixed context information to assist in object detection [29], [39], we focus on exploiting adaptive context information for each object proposal. Specifically, we design a soft attention model based on multi-layered RNNs with LSTM units to spatially weight the features and generate an adaptive global context feature for each object proposal. Average pooling and max pooling operations over the feature map of the whole image can be considered as special cases of our method.
The attentional context model takes as input the fused global feature cube and the fused local feature cube, each of which is fed to a convolutional layer for feature embedding. The spatial dimensions of the embedded global and local features are denoted as $K \times K$ ($20 \times 20$ in our experiments) and $K' \times K'$ ($7 \times 7$ in our experiments), respectively. Based on these embedded feature cubes, the RNN model learns an attentional map of size $K \times K$ to determine the effectiveness of the contextual region that may be beneficial to the object detection.
Inspired by the LSTM-based soft attention model proposed in [33], we apply an LSTM network to generate a contextual attention map at every time step conditioned on the previous hidden state, the globally embedded feature vector and the local feature. Specifically, at each time step $t$, we extract $K^2$ $D$-dimensional global feature vectors as well as $K'^2$ local object proposal feature vectors. As in [33], we refer to these feature vectors as global feature slices and local feature slices, respectively, denoted as
$$G = [G_1, \ldots, G_{K^2}], \quad G_i \in \mathbb{R}^{D}, \qquad L = [L_1, \ldots, L_{K'^2}].$$
Each vertical column of $G$ and $L$ denotes the feature representation of one receptive field in the input image. We follow the implementation of the LSTM network in [42], which is formulated as
$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T \begin{pmatrix} h_{t-1} \\ x_t \\ \bar{l} \end{pmatrix}, \qquad c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t),$$
where $i_t$, $f_t$, $c_t$, $o_t$ and $h_t$ are the input gate, forget gate, cell state, output gate and hidden state of the LSTM, respectively; $g_t$ is the candidate cell update; $x_t$ is the global context feature vector input to the LSTM at time step $t$; the vector $\bar{l}$ is the local feature embedding of the object proposal with the global average pooling operation; $T$ denotes a simple affine transformation with trainable parameters, where $d$ is the dimensionality of $i_t$, $f_t$, $o_t$, $g_t$, $c_t$ and $h_t$; and $\sigma$ and $\odot$ denote the logistic sigmoid activation and element-wise multiplication, respectively.
At each time step $t$, our LSTM model learns to predict a weight map $\alpha_t$ of size $K \times K$, where each value corresponds to the spatial attention that should be paid when performing proposal classification. The weight map $\alpha_t$ is computed by a multilayer perceptron $f_{att}$ conditioned on the previous hidden state $h_{t-1}$. The spatial weight of $\alpha_t$ at location $i$ can thus be computed as follows:
$$\alpha_{t,i} = \frac{\exp\big(f_{att}(h_{t-1})_i\big)}{\sum_{j=1}^{K^2} \exp\big(f_{att}(h_{t-1})_j\big)}, \qquad i = 1, \ldots, K^2.$$
TABLE I: Detection results from different methods on SUNRGBD and NYUv2. AC-CNN* indicates our implementation of the RGB-D version of AC-CNN [29]. G and L denote our proposed model incorporated with a single LSTM module (G) or STN module (L), respectively. (w/o fusion) and (w/ fusion) denote without and with multi-modal context fusion, respectively.
TABLE II: Comparison of exploiting global context using different methods on SUNRGBD and NYUv2
Fig. 3: Illustration of the STN module. The STN module takes the feature of the object proposal as input and attends to the most discriminative parts. The feature from these parts will subsequently serve as an enhanced local feature in object classification and bounding box regression.
Based on the weight map, the global context feature vector $x_t$ at time step $t$ is computed as an average of the feature slices weighted according to the attention weights $\alpha_t$, formulated as
$$x_t = \sum_{i=1}^{K^2} \alpha_{t,i}\, G_i,$$
where $\alpha_{t,i}$ is the attention weight at location $i$ and $G_i$ is the $i$-th global feature slice. Because the relevant regions are given higher weights, the global feature $x_t$ will be dominated by features from these regions and hence provide more useful contextual information for more accurate object detection.
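The weighting step can be sketched in a few lines of plain Python. The example normalizes a vector of per-location scores with a softmax and uses the result to form a weighted sum of feature slices. The names `softmax` and `attend` and the toy values are illustrative assumptions; in the model the scores come from a learned network rather than being given by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(slices, scores):
    """Return (weights, context): softmax weights and the weighted sum of slices."""
    weights = softmax(scores)
    dim = len(slices[0])
    context = [sum(w * s[j] for w, s in zip(weights, slices)) for j in range(dim)]
    return weights, context


# Four toy feature slices of dimension 2 (a 2x2 spatial grid, flattened).
slices = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
uniform_w, uniform_ctx = attend(slices, [0.0, 0.0, 0.0, 0.0])
peaked_w, peaked_ctx = attend(slices, [0.0, 0.0, 10.0, 0.0])
```

With equal scores the weights are uniform and the context reduces to average pooling, while a single dominant score makes the context approach the feature of that one location, which is the max-pooling-like extreme mentioned earlier.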
During the initialization stage, we follow the same strategy proposed in [43] for faster convergence. Specifically, we initialize the cell state $c_0$ and the hidden state $h_0$ of the LSTM network as
$$c_0 = f_{init,c}\Big(\frac{1}{K^2}\sum_{i=1}^{K^2} G_i\Big), \qquad h_0 = f_{init,h}\Big(\frac{1}{K^2}\sum_{i=1}^{K^2} G_i\Big),$$
where $f_{init,c}$ and $f_{init,h}$ are two multilayer perceptrons. The two initial values are applied to infer the initial weights for the initialization of the global context feature $x_1$.
As shown in Fig. 2, the output of our LSTM model is a D-dimensional global context feature, which is further fed to two fully connected layers to produce the final feature representation, denoted as $\hat{F}_g$.
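The recurrent procedure described in this subsection can be put together as a rough pure-Python sketch: initialize the states from the mean feature slice, then repeatedly predict an attention map from the hidden state, pool the slices with it, and update the LSTM. Everything concrete here is an assumption for illustration: the single-layer stand-ins for the multilayer perceptrons, the `tanh` on the initial states, the random weights, the tiny dimensions, and the helper names (`make_layer`, `lstm_step`, `recurrent_context`). It is not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weight, bias, vec):
    """Return weight @ vec + bias for list-based matrices."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def make_layer(rows, cols, rng):
    """Hypothetical helper: small random weights and zero biases."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return weight, [0.0] * rows


def lstm_step(layer_t, h_prev, c_prev, x, local):
    """One LSTM update on the concatenation [h_prev, x, local]."""
    d = len(h_prev)
    z = affine(layer_t[0], layer_t[1], h_prev + x + local)
    i = [sigmoid(v) for v in z[:d]]
    f = [sigmoid(v) for v in z[d:2 * d]]
    o = [sigmoid(v) for v in z[2 * d:3 * d]]
    g = [math.tanh(v) for v in z[3 * d:]]
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


def recurrent_context(slices, local, layers, steps=4):
    """Run `steps` attention iterations with shared parameters."""
    dim = len(slices[0])
    mean = [sum(s[j] for s in slices) / len(slices) for j in range(dim)]
    c = [math.tanh(v) for v in affine(*layers["init_c"], mean)]
    h = [math.tanh(v) for v in affine(*layers["init_h"], mean)]
    maps = []
    for _ in range(steps):
        weights = softmax(affine(*layers["att"], h))  # attention from h_{t-1}
        x = [sum(w * s[j] for w, s in zip(weights, slices)) for j in range(dim)]
        h, c = lstm_step(layers["T"], h, c, x, local)
        maps.append(weights)
    return h, maps


rng = random.Random(0)
D, d, K2 = 3, 4, 9  # toy sizes: feature dim, hidden dim, number of locations
slices = [[rng.random() for _ in range(D)] for _ in range(K2)]
local = [rng.random() for _ in range(D)]  # pooled local feature (assumed dim D)
layers = {
    "T": make_layer(4 * d, d + 2 * D, rng),
    "att": make_layer(K2, d, rng),
    "init_c": make_layer(d, D, rng),
    "init_h": make_layer(d, D, rng),
}
h_final, attention_maps = recurrent_context(slices, local, layers)
```

Reusing the same `layers` in every iteration mirrors the idea of stacked units with shared parameters; each iteration produces one attention map, so inspecting `attention_maps` shows how the focus changes across steps.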
C. Fine-grained object part attention
Because the local salient parts inside a specific object proposal play an important guiding role in classifying an object (especially for partially occluded objects), we further propose to employ multiple STNs [44] in parallel to infer discriminative object parts for each object proposal. The spatial transformer is a differentiable module that learns to spatially transform the input feature maps $U$ to the output feature maps $V$. A spatial transformer is applied in the following three steps. First, a localization network is employed to predict the affine transformation matrix $A_\theta$ to be applied to the input feature map. Second, $A_\theta$ is applied to create a sampling grid in $U$ by the grid generator. Finally, a sampler is adopted to produce the output maps sampled from the regions of the input maps at the sampling grid. As shown in Figure 3, we train each transformer to automatically attend to discriminative object parts inside an object proposal. During training, we fix the scaling factor to 0.5 and allow only scaling and translation in each spatial transformer. Thus, $A_\theta$ is given by
$$A_\theta = \begin{bmatrix} 0.5 & 0 & t_x \\ 0 & 0.5 & t_y \end{bmatrix},$$
where $t_x$ and $t_y$ are the translation parameters that are predicted by the localization network.
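The grid generator and sampler can be sketched for a single-channel map as follows. The sketch assumes the usual spatial-transformer convention of normalized coordinates in $[-1, 1]$, bilinear interpolation and zero padding outside the map; the function names `bilinear` and `crop_part` and the toy ramp input are illustrative. The translation is passed in directly, whereas in the model it would be predicted by the localization network.

```python
import math


def linspace(n):
    """n evenly spaced normalized coordinates covering [-1, 1]."""
    if n == 1:
        return [0.0]
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]


def bilinear(feature, x, y):
    """Sample feature[row][col] at normalized (x, y); zero outside the map."""
    h, w = len(feature), len(feature[0])
    col = (x + 1.0) * (w - 1) / 2.0
    row = (y + 1.0) * (h - 1) / 2.0
    c0, r0 = math.floor(col), math.floor(row)
    value = 0.0
    for r in (r0, r0 + 1):
        for c in (c0, c0 + 1):
            if 0 <= r < h and 0 <= c < w:
                weight = (1.0 - abs(row - r)) * (1.0 - abs(col - c))
                value += weight * feature[r][c]
    return value


def crop_part(feature, tx, ty, scale=0.5):
    """Apply the transform [[scale, 0, tx], [0, scale, ty]] and resample."""
    h, w = len(feature), len(feature[0])
    return [[bilinear(feature, scale * x + tx, scale * y + ty)
             for x in linspace(w)]
            for y in linspace(h)]


# Toy 5x5 map whose value equals its column index (a horizontal ramp).
ramp = [[float(c) for c in range(5)] for _ in range(5)]
center_part = crop_part(ramp, tx=0.0, ty=0.0)  # zoom into the central half
right_part = crop_part(ramp, tx=0.5, ty=0.0)   # zoom into the right half
```

With the scale fixed at 0.5, each transformer always looks at a window half as wide and half as tall as the proposal, so the translation alone decides which part is attended.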
Taking the local context feature map $F_l$ (the fused local feature) as input, each transformer in our object part attention module transforms and samples the input map to the output map $V_i$. After normalization, the outputs of each transformer are concatenated with the local context feature to form a mid-level feature representation for an object proposal, defined as
$$F_m = \mathrm{concat}(F_l, V_1, \ldots, V_N),$$
where $V_i$ is the output of the $i$-th transformer and $N$ is the number of spatial transformers.
As shown in Figure 2, we use a convolution layer after re-scaling to reduce the dimensionality of the mid-level feature $F_m$, which is then fed to two fully connected layers to infer the final feature representation for the object proposal, denoted as $\hat{F}_l$.
D. Training Objective
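Denote $p = (p_0, \ldots, p_C)$ as the predicted discrete probability distribution (per RoI) over $C + 1$ categories and $t$ as the predicted bounding-box regression offsets. Given the obtained local and global context features $\hat{F}_l$ and $\hat{F}_g$, $p$ and $t$ can be computed as follows:
$$p = \phi\big(W_c(\mathrm{concat}(\hat{F}_g, \hat{F}_l))\big), \qquad t = W_r(\hat{F}_l),$$
where $\phi$ indicates the softmax operation and $W_c$ and $W_r$ are two fully connected layers with $C + 1$ units for classification and four regression outputs per class, respectively.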
Note that we only incorporate local contextual information for bounding-box regression. Finally, we minimize an objective function following the multi-task loss given in Fast R-CNN [6], which is defined as
$$L(p, u, t^u, v) = L_{cls}(p, u) + [u \geq 1]\, L_{loc}(t^u, v),$$
where $u$ is the ground-truth label, $v$ is the regression target, $t^u$ denotes the predicted offsets for class $u$, $L_{cls}(p, u) = -\log p_u$ is the log loss for ground-truth class $u$, and $L_{loc}$ is the smooth $L_1$ loss proposed in [6]. The indicator $[u \geq 1]$ evaluates to 1 when $u \geq 1$ and 0 otherwise. By convention, the background class is labeled as $u = 0$.
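The multi-task objective can be written out for a single RoI as a short sketch. It combines the log loss on the softmax output with a smooth L1 penalty on the four box offsets, and the box term is skipped for background RoIs. The function names are illustrative, the two terms are summed with equal weight (an assumption), and `box_pred` is taken to be the offsets already selected for the ground-truth class.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(class_scores, label, box_pred, box_target):
    """Classification log loss plus box loss for foreground RoIs (label >= 1)."""
    probs = softmax(class_scores)
    cls_loss = -math.log(probs[label])
    loc_loss = 0.0
    if label >= 1:  # background (label 0) has no box target
        loc_loss = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    return cls_loss + loc_loss


background_loss = multitask_loss([0.0, 0.0, 0.0], 0, [5.0] * 4, [0.0] * 4)
foreground_loss = multitask_loss([0.0, 0.0, 0.0], 1, [0.5, 2.0, 0.0, 0.0], [0.0] * 4)
```

In the background case the large box error is ignored entirely, while in the foreground case the small offset error (0.5) is penalized quadratically and the large one (2.0) only linearly, which keeps outliers from dominating the gradient.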
A. Experimental Settings
Datasets and Evaluation Metrics: We evaluate our model on two RGB-D datasets: SUNRGBD [45] and NYUv2 [46]. The SUNRGBD and NYUv2 datasets contain 10335 and 1449 RGB-D images, respectively, and are divided into train and test subsets. We adopt Average Precision (AP) and mean of Average Precision (mAP) following the PASCAL challenge protocols as our evaluation metrics.
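For reference, a per-class AP in the PASCAL style can be sketched as below. The sketch uses the 11-point interpolated variant associated with VOC 2007; whether the reported numbers use this variant or the later area-under-curve variant is not stated here, so this choice is an assumption. The input format (a list of `(score, is_true_positive)` pairs with matching to ground truth already done) and the function name are also illustrative. mAP is then the mean of the per-class AP values.

```python
def average_precision_11pt(scored_hits, num_gt):
    """11-point interpolated AP.

    scored_hits: list of (confidence, is_true_positive) for one class.
    num_gt: number of ground-truth objects of that class.
    """
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    true_positives = 0
    points = []  # (recall, precision) after each detection
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            true_positives += 1
        points.append((true_positives / num_gt, true_positives / rank))
    total = 0.0
    for i in range(11):
        threshold = i / 10
        # Interpolated precision: best precision at recall >= threshold.
        total += max((p for r, p in points if r >= threshold), default=0.0)
    return total / 11


toy_ap = average_precision_11pt([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
perfect_ap = average_precision_11pt([(0.9, True), (0.8, True)], num_gt=2)
```

In the toy case the false positive at rank 2 lowers the precision available at high recall, so the AP falls below 1 even though both objects are eventually found.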
Implementation Details: In our experiments, we implement our model based on Fast R-CNN [6], an open-source framework for traditional RGB object detection built on the Caffe platform [47]. We utilize the network architecture from Gupta et al. [10] as our basic CNN network structure for convolutional feature map extraction. All the newly added fully connected and convolutional layers are randomly initialized with a zero-mean Gaussian distribution with standard deviations of 0.01 and 0.001. The recurrent attention model consists of 4 stacked LSTM units with shared parameters. All the parameters of the LSTM units are initialized based on the Xavier algorithm [48].
We apply Stochastic Gradient Descent (SGD) to fine-tune our model. Each SGD mini-batch is composed of 128 randomly sampled object proposals from 2 randomly chosen images. In each mini-batch, we select 25% of the RoIs as foreground from object proposals that have an intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5) and act as background with ground-truth label u = 0. During training, images are horizontally flipped with a probability of 0.5 for data augmentation, and no other augmentation is used. We run SGD for approximately 10 epochs on the training set to fine-tune the network parameters. The momentum is set to 0.9, and the learning rate is initialized to 0.001 and decreased by a factor of 10 every 4 epochs. It takes approximately 1.5 days to train our model on a single NVIDIA GeForce GTX TITAN X GPU with 12 GB of memory.
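The sampling rule and the learning-rate schedule above can be sketched as follows. Boxes are `(x1, y1, x2, y2)` tuples with continuous coordinates, and the 128 RoIs are assumed to be split evenly as 64 per image, following the Fast R-CNN convention; both are assumptions, as are the function names. When an image has too few foreground proposals the sketch simply takes more background ones.

```python
import random


def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sample_rois(proposals, gt_boxes, rois_per_image=64, fg_fraction=0.25, rng=None):
    """Split proposals by best IoU and sample foreground/background RoIs."""
    rng = rng or random.Random(0)
    foreground, background = [], []
    for box in proposals:
        best = max((iou(box, gt) for gt in gt_boxes), default=0.0)
        if best >= 0.5:
            foreground.append(box)
        elif 0.1 <= best < 0.5:
            background.append(box)
        # proposals with best IoU below 0.1 are not used
    n_fg = min(len(foreground), int(round(fg_fraction * rois_per_image)))
    n_bg = min(len(background), rois_per_image - n_fg)
    return rng.sample(foreground, n_fg), rng.sample(background, n_bg)


def learning_rate(epoch, base=0.001, step=4, gamma=0.1):
    """Step schedule: multiply by gamma every `step` epochs."""
    return base * gamma ** (epoch // step)


gt_boxes = [(0, 0, 10, 10)]
proposals = [(0, 0, 10, 10), (0, 0, 10, 5), (0, 0, 10, 2), (20, 20, 30, 30)]
fg_rois, bg_rois = sample_rois(proposals, gt_boxes)
```

In this toy example the first two proposals reach an IoU of at least 0.5 and become foreground, the third (IoU 0.2) becomes background, and the last one does not overlap the ground truth and is discarded.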
It costs approximately 10 GB of GPU memory to train our model. The average training time for each iteration is approximately 1.23 seconds. However, the testing process is particularly efficient and takes approximately 0.58 seconds (excluding object proposal extraction) to process one image.
B. Performance Comparisons
RGB-D Datasets: We compare our proposed method against recent state-of-the-art RGB-D object detection methods, including rich image and depth feature-based RGB-D object detection [8] and the supervision-transfer-based model [10]. Moreover, to better validate the superiority of the attention-based global context and fine-grained object part attention on RGB-D datasets, we also implement an RGB-D version (denoted as AC-CNN*) of the AC-CNN model proposed in [29] for comparison. AC-CNN follows a similar idea to our proposed method but incorporates fixed global and local attentive contexts to assist in improving the object detection performance. In the implementation, we apply the Fast R-CNN [6] framework based on AlexNet [36] to the depth modality for proposal classification and bounding-box position regression. The final results are obtained by averaging the results from the RGB modality and depth modality. For fair comparison, we also apply the same depth modality processing as in AC-CNN* to our model; we call this custom model RGB-D detection without cross-modal fusion (denoted as w/o fusion).
Fig. 4: Illustration of the attentional weight maps generated by the attention-based global context modeling module. The top rows are the input images and region proposals. The middle and bottom rows are the attentional weight maps generated by our model without context fusion and those with context fusion, respectively. The bottom two rows show that our model can perceive the most relevant regions to the given object proposal and that more useful regions can be acquired through context fusion. A detailed discussion can be found in section IV-D.
TABLE III: Detection results on SUNRGBD. AC-CNN* indicates our implementation of the RGB-D version of AC-CNN [29]. (w/o fusion) and (w/ fusion) denote without and with multi-modal context fusion, respectively.
TABLE IV: Detection results on NYUv2. AC-CNN* indicates our implementation of the RGB-D version of AC-CNN [29]. (w/o fusion) and (w/ fusion) denote without and with multi-modal context fusion, respectively.
Table III and Table IV show the object detection results of our model, AC-CNN*, and the other two state-of-the-art RGB-D object detection models on the SUNRGBD and NYUv2 datasets. As shown in the tables, our proposed method obtains state-of-the-art mAP scores of 47.5% and 52.3% on SUNRGBD and NYUv2, outperforming the ST model [10] by 3.7% and 3.2%, respectively. The improvements validate the effectiveness of our model in RGB-D object detection by incorporating the proposed attention-based global context and fine-grained attentional object parts learned from the fused cross-modal context. Furthermore, our model (Ours (w/o fusion)) gains 1.5% and 1.7% improvements in mAP scores over AC-CNN* on the SUNRGBD and NYUv2 datasets, respectively, and achieves better detection results on most of the categories.
RGB Dataset: To compare our model with the AC-CNN model [29] in a more equitable way, we remove the depth modality from our model and perform an extra evaluation on PASCAL VOC 2007, which contains 9963 RGB images. Specifically, we implement a variant of our model (denoted as Ours*) that performs global context modeling and object part attention only on the RGB modality without incorporating information from the depth modality. As shown in Table VII, our model outperforms the baseline FRCN [6] and AC-CNN [29] by 3.6% and 1.2% in terms of mAP scores, respectively. The improvement on the RGB dataset as well as the favorable results achieved for RGB-D object detection demonstrate the superiority of the proposed attention-based global context and fine-grained object part attention over the fixed global context and multi-scale local context proposed in [29]. Table VIII provides the comparisons of the proposed method with several state-of-the-art methods [27], [49]–[52] on PASCAL VOC 2012. It can be observed that our model obtains an mAP score of 76.7%, which outperforms the baseline model by 2.9%. Our model also achieves competitive results compared with the state-of-the-art methods, which validates the effectiveness of the proposed method.
TABLE V: Comparison of different LSTM settings utilized in the attention-based global context sub-module. The experiments are conducted on SUNRGBD. (2 LSTM) denotes that there are 2 stacked LSTM units in the global contextualized sub-network.
TABLE VI: Comparison of different STN settings utilized in fine-grained object part attention sub-module. The experiments are conducted on SUNRGBD. (2 STN) indicates that there are 2 parallel spatial transformers in the local contextualize sub-network.
C. Ablation Studies
In this subsection, we show the effectiveness and necessity of each component in our proposed model and also demonstrate the effectiveness of the network design.
Contribution of Each Component in CMAC model: As described in Section III, our proposed CMAC model consists of three newly added sub-networks on the top of deep feature representation, including cross-modal feature fusion, attention-based global context modeling and fine-grained object part attention, which are employed to incorporate the strong correlation between different modalities and capture the global and local contextual information, respectively. We investigate the contributions of each component by gradually applying each sub-network to the object detection. Table I shows that 2.5% and 1.8% improvements in mAP scores over the baseline model are obtained using only fine-grained object part attention. Similar improvements of 2.4% and 2.2% on SUNRGBD and NYUv2 can be observed when only incorporating attention-based global context modeling. The better performance achieved by exploiting both global context features and discriminative object parts evidences the complementarity of the two sub-networks. Furthermore, incorporating cross-modal feature fusion into our detection framework brings an extra performance increase of 0.6% and 0.4% on SUNRGBD and NYUv2, respectively. The above experimental results and analysis well demonstrate the effectiveness of each component in our proposed CMAC framework.
Comparison of Diverse Global Context Modeling: To validate the effectiveness of our attention-based global context, which is generated based on a recurrent model, we compare our model with two variants: the global average pooling method, in which the global contextual information is produced by applying the average pooling operation to the extracted feature map, and AC-CNN, which utilizes an attention-based recurrent model to generate the fixed global context. We conduct experiments on the SUNRGBD dataset, and the results are listed in Table II. No local context is used during these experiments. It can be observed that our model outperforms the global average pooling method and AC-CNN by 1.9% and 1.5%, respectively. Simply averaging the features of all regions may introduce both background and inter-class noise, which may deteriorate the object detection performance. Although background noise can be overcome by AC-CNN, which generates a fixed attention map for global context feature extraction and benefits the proposal classification, AC-CNN still suffers from a decreased performance caused by inter-class noise (e.g., regions that are beneficial for desk classification might provide noisy information to garbage bin classification). Note that our attention map for global context weighting is generated according to the diverse contents of each RoI feature and can be optimized to attend to the most effective regions related to the input content. The results shown in Table II verify that our model performs better in mitigating both background and inter-class noise when incorporating global context and thus greatly enhances the accuracy of object detection.
Effectiveness of LSTM Settings: In our proposed CMAC model, we employ a recurrent model to exploit the attentional global context, in which multiple stacked LSTM units generate the attentional weight map in an iterative manner. To investigate the effectiveness of different LSTM settings, we implement several variants in which the recurrent model is constructed with different numbers (2 to 5) of LSTM units. The experimental results are listed in Table V. As shown in the table, the mAP increases by 0.6% and 0.8% when the number of stacked LSTM units is increased from 2 to 3 and 4, respectively. When this number reaches or exceeds 5, no significant performance boost is achieved, indicating that our model obtains better context information through recurrent iterations and converges quickly. We believe that good performance can be obtained on complicated images through more recurrent iterations.
Fig. 5: Comparison of detection results produced by ST [10] (top row), AC-CNN [29] (middle row) and our model (bottom row). The red and green rectangles indicate the ground-truth bounding box and the predicted results, respectively.
TABLE VII: Detection results on VOC 2007. Ours* denotes a variant of our model in which we incorporate only RGB information for object detection
Effectiveness of STN Settings: In the proposed method, we adopt multiple parallel spatial transformer networks (STNs) to attend to discriminative object parts inside an object proposal. To investigate the most effective STN setting, we implement several variants in which the fine-grained object parts are inferred from different numbers (2 to 4) of spatial transformers. As shown in Table VI, the detection performance increases from 43.8% (baseline) to 45.7% and 46.3% with 1 and 2 spatial transformers, respectively, which indicates that STNs are able to mine discriminative object parts to enhance the local feature representation. However, increasing the number of spatial transformers does not always bring better performance. We observe a 3% decrease in mAP when increasing the number of spatial transformers from 2 to 3, indicating that the STNs may start to include confusing object parts after most of the discriminative parts have been detected.
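As a rough illustration of what a single spatial transformer does when it attends to an object part, the following sketch applies an affine sampling grid with bilinear interpolation to a small feature map. It is a simplified, pure-Python sketch under stated assumptions: the affine parameters `theta` are fixed by hand here, whereas in an STN they are predicted by a learned localization network, and the function names and the toy 4x4 feature map are hypothetical.

```python
import math

def affine_grid(theta, out_h, out_w):
    """For each output pixel, compute source coordinates in [-1, 1]
    using a 2x3 affine matrix theta."""
    grid = []
    for i in range(out_h):
        y = (-1.0 + 2.0 * i / (out_h - 1)) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x = (-1.0 + 2.0 * j / (out_w - 1)) if out_w > 1 else 0.0
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid

def bilinear_sample(feature, xs, ys):
    """Bilinearly interpolate feature at normalized coordinates (xs, ys).
    Samples that fall outside the map contribute zero."""
    h, w = len(feature), len(feature[0])
    px = (xs + 1.0) * (w - 1) / 2.0  # normalized -> pixel column
    py = (ys + 1.0) * (h - 1) / 2.0  # normalized -> pixel row
    x0, y0 = math.floor(px), math.floor(py)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1.0 - abs(px - xx)) * (1.0 - abs(py - yy))
                value += weight * feature[yy][xx]
    return value

def spatial_transform(feature, theta, out_h, out_w):
    """Resample feature through the affine grid defined by theta."""
    grid = affine_grid(theta, out_h, out_w)
    return [[bilinear_sample(feature, xs, ys) for xs, ys in row] for row in grid]

# Toy 4x4 single-channel feature map with values 0..15 (hypothetical).
feature = [[float(4 * i + j) for j in range(4)] for i in range(4)]

# Two hand-set transforms acting in parallel, each zooming into one part.
theta_top_left = [[0.5, 0.0, -0.5], [0.0, 0.5, -0.5]]
theta_bottom_right = [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
parts = [spatial_transform(feature, t, 2, 2)
         for t in (theta_top_left, theta_bottom_right)]

# The identity transform reproduces the input map.
identity = spatial_transform(feature, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], 4, 4)
```

Each additional transformer adds one more resampled part to the local representation, which is consistent with the observation above that extra transformers help only while they land on discriminative parts.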
TABLE VIII: PASCAL VOC 2012 test detection results. 07+12+S: 07 trainval + 12 trainval + segmentation labels, 07++12: 07 trainval + 07 test + 12 trainval
D. Visualization
In this subsection, we present visual comparisons of the RGB-D object detection results as well as visualizations of the attentional weight maps generated by our global context modeling component. Figure 5 shows detection results of the ST [10] model, the AC-CNN [29] model and our model. It can be observed that our model performs best in detecting small and occluded objects (e.g., monitor, box, garbage bin and the occluded chair). Furthermore, as shown in the third column, our proposed method is also more robust to instances with similar appearance because of the fusion of the geometric context (e.g., the pillow with a texture similar to that of the bed). Figure 4 shows the attentional weight maps generated by our model without (middle row) and with (bottom row) context fusion. Our attentional model is able to perceive the regions most relevant to the specific object proposal; e.g., a lamp is likely to be placed on top of a night stand near a bed, and a night stand is likely to be placed on the floor near a bed and often co-occurs with a lamp. Moreover, our model obtains more accurate attentional weight maps by fusing information from both RGB and depth modalities, since the depth image provides geometric information. For example, our model is capable of attending to the chairs near the target chair, as they share similar geometric structures. The last column in Fig. 4 shows that our model attends to the background regions when the proposal does not contain objects, which helps in making correct classifications.
In this paper, we have introduced an approach to effectively learn the cross-modal attentive context for RGB-D object detection. In our model, the contextual representations from different sources (i.e., RGB and depth modalities) are fused in the cross-modal feature fusion module. Based on the fused local and global features, a recurrent attention model comprising several stacked LSTM units is employed to capture a global context that is closely related to the object proposal. Furthermore, our model adopts several parallel spatial transformers, which learn to attend to discriminative parts inside each object proposal, to generate enhanced local context information. Extensive experiments and state-of-the-art detection results on SUNRGBD and NYUv2 demonstrate the effectiveness of our model in exploiting contextual information.
[1] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,” in Asian conference on computer vision. Springer, 2012, pp. 548–562.
[2] K. Wang, S. Zhai, H. Cheng, X. Liang, and L. Lin, “Human pose estimation from depth images via inference embedded multi-task learning,” in Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016, pp. 1227–1236.
[3] C. Wu, I. Lenz, and A. Saxena, “Hierarchical semantic labeling for task-relevant rgb-d perception,” in Robotics: science and systems, 2014.
[4] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning, “Generating semantically precise scene graphs from textual descriptions for improved image retrieval,” in Proceedings of the Fourth Workshop on Vision and Language, 2015, pp. 70–80.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[6] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[7] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
[8] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from rgb-d images for object detection and segmentation,” in European Conference on Computer Vision. Springer, 2014, pp. 345–360.
[9] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, “Multimodal deep learning for robust rgb-d object recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 681–687.
[10] S. Gupta, J. Hoffman, and J. Malik, “Cross modal distillation for supervision transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2827–2836.
[11] M. Schwarz, H. Schulz, and S. Behnke, “Rgb-d object recognition and pose estimation based on pre-trained convolutional neural network features,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 1329–1335.
[12] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin, “Lstm-cf: Unifying context modeling and fusion with lstms for rgb-d scene labeling,” in European Conference on Computer Vision. Springer, 2016, pp. 541–557.
[13] L. Bo, X. Ren, and D. Fox, “Depth kernel descriptors for object recognition,” in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011, pp. 821–826.
[14] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d object dataset,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 1817–1824.
[15] M. Blum, J. T. Springenberg, J. Wülfing, and M. Riedmiller, “A learned feature descriptor for object recognition in rgb-d data,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 1298–1303.
[16] L. Bo, K. Lai, X. Ren, and D. Fox, “Object recognition with hierarchical kernel descriptors,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1729–1736.
[17] L. Spinello and K. O. Arras, “Leveraging rgb-d data: Adaptive fusion and domain adaptation for object detection,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 4469–4474.
[18] P. Carbonetto, N. De Freitas, and K. Barnard, “A statistical model for general contextual object recognition,” Computer Vision-ECCV 2004, pp. 350–362, 2004.
[19] G. Li and Y. Yu, “Visual saliency detection based on multiscale deep cnn features,” IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5012–5024, 2016.
[20] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert, “An empirical study of context in object detection,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1271–1278.
[21] G. Heitz and D. Koller, “Learning spatial context: Using stuff to find things,” Computer Vision–ECCV 2008, pp. 30–43, 2008.
[22] D. Hoiem, A. A. Efros, and M. Hebert, “Geometric context from a single image,” in Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, vol. 1. IEEE, 2005, pp. 654–661.
[23] G. Li and Y. Yu, “Contrast-oriented deep neural networks for salient object detection,” IEEE Transactions on Neural Networks and Learning Systems, 2018.
[24] A. Torralba, K. P. Murphy, and W. T. Freeman, “Using the forest to see the trees: exploiting context for visual object detection and localization,” Communications of the ACM, vol. 53, no. 3, pp. 107–114, 2010.
[25] J. Li, X. Liang, J. Li, Y. Wei, T. Xu, J. Feng, and S. Yan, “Multi-stage object detection with group recursive learning,” IEEE Transactions on Multimedia, 2017.
[26] W. Chu and D. Cai, “Deep feature based contextual model for object detection,” arXiv preprint arXiv:1604.04048, 2016.
[27] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick, “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2874–2883.
[28] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[29] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan, “Attentive contexts for object detection,” IEEE Transactions on Multimedia, 2016.
[30] V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual attention,” in Advances in neural information processing systems, 2014, pp. 2204–2212.
[31] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, vol. 14, 2015, pp. 77–81.
[32] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4507–4515.
[33] S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using visual attention,” arXiv preprint arXiv:1511.04119, 2015.
[34] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 328–335.
[35] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[37] S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object detection in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 808–816.
[38] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5455–5463.
[39] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 891–898.
[40] H. Li, G. Li, L. Lin, H. Yu, and Y. Yu, “Context-aware semantic inpainting,” IEEE Transactions on Cybernetics, 2018.
[41] G. Li, Y. Xie, L. Lin, and Y. Yu, “Instance-level salient object segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 247–256.
[42] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[43] K. Xu et al., “Show, attend and tell: Neural image caption generation with visual attention,” arXiv preprint, 2015.
[44] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
[45] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 567–576.
[46] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” Computer Vision–ECCV 2012, pp. 746–760, 2012.
[47] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 675–678.
[48] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Aistats, vol. 9, 2010, pp. 249–256.
[49] S. Ravishankar, A. Jain, and A. Mittal, “Multi-stage contour based detection of deformable objects,” in European conference on computer vision. Springer, 2008, pp. 483–496.
[50] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
[51] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue, “Dsod: Learning deeply supervised object detectors from scratch,” in The IEEE International Conference on Computer Vision (ICCV), vol. 3, no. 6, 2017, p. 7.
[52] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional networks,” in Advances in neural information processing systems, 2016, pp. 379–387.