In recent years, the use of deep convolutional neural networks (DCNNs) (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015) has significantly improved object detection. The region-based convolutional neural network (R-CNN) framework (Girshick et al., 2014; Girshick, 2015; Ren et al., 2015) gave a tremendous performance gain over previous state-of-the-art methods such as the deformable part models (DPMs) (Felzenszwalb et al., 2010; Zhu et al., 2010). Despite much recent progress in object detection, it remains an open question how well DCNN-based methods perform on more complicated vision tasks. In this paper, instead of simply detecting objects, we are interested in a more challenging task: symbiotic object detection and semantic part localization. This task requires detecting the objects and localizing their corresponding semantic parts (if visible) in a unified manner. Note that it differs from detecting objects and parts individually, which does not provide the correspondence between object instances and their parts.
Semantic object parts (e.g., person head, sofa cushion) are of great significance to many vision tasks and deliver important cues for reasoning about object pose, viewpoint, occlusion and other fine-grained properties. Previous studies involving semantic parts either leverage them to provide more supervision for object detection (Azizpour & Laptev, 2012; Chen et al., 2014) or assume that the objects have already been detected (Yang & Ramanan, 2011; Chen & Yuille, 2014). In addition, current annotations of semantic parts either cover only a limited number of articulated object classes (e.g., person, animals) (Johnson & Everingham, 2010; Azizpour & Laptev, 2012), or are not very suitable for detection tasks (Chen et al., 2014). To enable a systematic study of our task, we define and annotate the semantic parts for all 20 object classes of the PASCAL VOC 2012 dataset.
As shown in Fig. 1, building on the recently successful R-CNN framework, we explore two different directions of representation learning for our task. First, we present an end-to-end Object-Part (OP) R-CNN (see Fig. 1 (a)), which learns an implicit deep feature representation that facilitates the mapping from an image ROI to a joint prediction of object and part bounding boxes. Second, we propose a deep part-based model (named DeePM in this paper, see Fig. 1 (b)), which combines the Faster R-CNN (Ren et al., 2015) with a part-based graphical model. It learns an explicit representation of the object-part configuration.
Figure 1: Illustration of the architecture of the proposed models (best viewed in color). (a) The OP R-CNN; (b) the DeePM model.
In OP R-CNN, we add two new output layers connected to the last fully-connected layer, and use the corresponding losses for the part visibility classification and bounding-box regression tasks, respectively. Then, as in Fast (Girshick, 2015) and Faster (Ren et al., 2015) R-CNNs, we employ a multi-task loss to train the network for joint classification and bounding-box regression on both object and part classes. DeePM, unlike OP R-CNN, does not directly predict the part location based on the deep feature extracted from the object bounding box. It adopts a deep CNN with two separate streams, which share the convolutional layers at the early stages and then are dedicated to object and part classes for extracting their appearance features, respectively. At the same time, as in (Ren et al., 2015), a region proposal network (RPN) is incorporated in each stream to generate object or part proposals in a learning-based manner. After that, a part-based graphical model is built to combine the deep appearance features with geometry and co-occurrence constraints between the object and parts. This enables us to flexibly share part types (learned by unsupervised clustering) and model parameters for learning a compact representation of the object-part configuration.
Using our semantic part annotations on the PASCAL VOC 2012 dataset, we evaluate both the object and part detection performance of the proposed methods (OP R-CNN and DeePM), and compare them with the state-of-the-art R-CNN methods on object detection. DeePM consistently outperforms OP R-CNN in detecting objects and parts (by 0.3% and 2.9% in mAP, respectively), and obtains superior object detection performance to Fast (Girshick, 2015) and Faster (Ren et al., 2015) R-CNNs. In addition, we propose a new performance evaluation criterion (named “(1+k)” AP), which considers the detection of an object and its parts jointly, for the task of symbiotic object detection and semantic part localization. DeePM shows consistently superior performance w.r.t. OP R-CNN in the “(1+k)” AP (when k > 0), indicating that our flexible graphical model does help with this more complicated detection task.
The rest of this paper is organized as follows: In Sec. 2 we discuss the related work of this paper. Sec. 3 describes our semantic part annotation on the PASCAL VOC object classes. In Sec. 4 we present the OP R-CNN model. We elaborate on the DeePM model in Sec. 5, and then describe its inference and learning methods in Sec. 6. We show the experimental results in Sec. 7, and conclude this paper in Sec. 8.
Part-based models have a long and important history in computer vision. The pioneering pictorial structure work (Fischler & Elschlager, 1973) provided an inspiring framework for representing visual objects with a spring-like graph of parts. Following this direction, continuous efforts on part-based models have been made for a wide range of computer vision tasks including object detection (Felzenszwalb et al., 2010; Chen et al., 2014), pose estimation (Yang & Ramanan, 2011; Chen & Yuille, 2014), semantic segmentation (Long et al., 2015; Chen et al., 2015) and action recognition (Yang et al., 2010; Zhu et al., 2013). In particular, the DPMs (Felzenszwalb et al., 2010; Zhu et al., 2010), which are built on the basis of the HOG features (Dalal & Triggs, 2005), have been a milestone in object detection over the past few years. Different from the “latent” parts used in (Felzenszwalb et al., 2010), some recent works (Azizpour & Laptev, 2012; Chen et al., 2014) have explored the use of semantic parts for improving object detection and localization. However, semantic part detection has not yet been systematically investigated in the literature, and it is the main interest of this paper. Our DeePM model differs from previous part-based object detection methods in its flexible type sharing: like DPMs (Felzenszwalb et al., 2010), the graphical models used in (Azizpour & Laptev, 2012; Chen et al., 2014) are view-based, where the type of the part nodes is tied to the object type. Compared with other part sharing work (Ott & Everingham, 2011), our model enables more flexible sharing between different configurations in the sense that a pairwise edge can be defined on an arbitrary pair of object mixture component and part type.
In addition, although part localization has been successfully addressed in some other tasks such as human pose estimation (Yang & Ramanan, 2011; Chen & Yuille, 2014), those tasks differ from ours in that the objects are assumed to be detected beforehand.
Recent advances in visual object recognition are driven by the renaissance of deep convolutional neural networks (LeCun et al., 1998; Krizhevsky et al., 2012), which has led to rapid progress in many important recognition tasks such as image classification (Krizhevsky et al., 2012), object detection (Girshick et al., 2014) and semantic segmentation (Long et al., 2015; Chen et al., 2015). In particular, the line of R-CNN studies has dramatically improved on the performance of previous DPMs and become the current state of the art in object detection. Girshick et al. (2014) proposed the R-CNN, which combines region proposals with DCNN features for the object detection task. He et al. (2014) presented SPP-net, which employs a spatial pyramid pooling layer to efficiently extract the region CNN features. Very recently, Girshick presented Fast R-CNN (Girshick, 2015), which adopts a multi-task loss to enable the joint training of networks for object region classification and bounding-box regression. In the Faster R-CNN work, Ren et al. (2015) presented a dedicated RPN that generates object bounding-box proposals efficiently and improves both the runtime and performance by sharing the convolutional layers with Fast R-CNN. To enable symbiotic object detection and part localization, the proposed OP R-CNN extends the multi-task loss of Fast/Faster R-CNNs with two additional losses which are responsible for part visibility classification and bounding-box regression. In our DeePM, the two-stream DCNN, in which the low-level convolutional layers are shared between object and part classes and the later mid/high-level layers are separate for objects and parts, is specifically designed for our task and connects naturally to the graphical model. In addition, we employ two separate RPNs to generate the region proposals of object and part classes, respectively.
In addition, several researchers (Savalle et al., 2014; Wan et al., 2015; Girshick et al., 2015) have married deep CNNs with part-based deformable models, where the parts are not semantic and the part locations are hidden/latent variables. These models are interesting, but their performance is lower than that of recent R-CNN methods which do not use parts. Zhang et al. (2014) presented a part-based R-CNN that combines semantic part localization with R-CNNs for fine-grained category recognition on birds, which does not address our task on more general objects. Ouyang et al. (2015) proposed DeepID-Net, which utilizes a deformation constrained pooling layer to model the deformation of object parts with geometric constraints and penalties in deep convolutional neural networks, but it still uses “latent” parts and does not address our task. Zhu et al. (2015) proposed the segDeepM model, which incorporates an MRF with the R-CNN to exploit semantic segmentation cues for improving the accuracy of object detection. Although the name of their model is similar to our DeePM, it does not utilize any part annotations and does not involve the task of semantic part detection.
On semantic object part annotation, Azizpour and Laptev (2012) labelled the part bounding boxes for 6 out of 20 object classes (all are animals) in the PASCAL VOC datasets. Chen et al. (2014) provide pixel-level semantic part annotations for a portion of the object classes in PASCAL VOC 2012, but the part definitions of some classes are not suitable for detection tasks (e.g., very small parts like eyes and nose). Different from these previous works (Azizpour & Laptev, 2012; Chen et al., 2014), we defined and annotated semantically meaningful parts, tailored to the task of symbiotic object detection and part localization, for all 20 object classes in the PASCAL VOC 2012 dataset. Each object category has 1 to 7 parts, resulting in 83 part classes in total. The definition of parts is based on the body structure (e.g., person, animals), viewpoint (e.g., bus, car) or functionality (e.g., chair, sofa). Fig. 3 illustrates our part annotations for the 20 PASCAL VOC object classes.
Our annotation effort took two months of intensive labeling, performed by five labelers trained by us. This results in much more accurate annotation than using crowdsourcing systems such as MTurk. For each object instance, the labelers were asked to annotate its visible parts with tight rectangular bounding boxes, as used in the original object-level annotations in PASCAL VOC. We annotated the parts for all the object instances except some very small or visually difficult ones. In Table 4, we give a full list of our part definitions for each object class. These annotations will be released soon.
In this section, we apply the Faster R-CNN framework (Ren et al., 2015) to the task of symbiotic object detection and part localization. Besides the original classification and bounding-box regression losses on object classes, we add two new losses which are responsible for the part visibility classification and bounding-box regression tasks, respectively. Accordingly, there are two additional output layers connected to the last fully-connected (FC) layer, which is shared by all four sibling output layers for the different tasks (see Fig. 1(a)). We name this model OP R-CNN in this paper.
Suppose there are a total of $J$ object classes and $P_j$ part classes for object class $j$. As the ground-truth annotation used for training our OP R-CNN, each image ROI example is labelled by the following four variables: an object class label $u \in \{0, 1, \dots, J\}$ indicating that it belongs to the $u$-th object class ($u = 0$ corresponds to the background class), a tuple of object bounding-box regression targets $t^*_j$ for object class $j$ if it is not background, a binary part visibility indicator vector $v_j = (v_{j,1}, \dots, v_{j,P_j})$, where $v_{j,i} \in \{0, 1\}$ and $v_{j,i} = 1$ indicates the visibility of the $i$-th part for the object class $j$, and a set of part bounding-box regression targets $\{t^*_{j,i}\}$ for all visible parts.
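As in (Girshick, 2015; Ren et al., 2015), we adopt a multi-task loss $L$ to train the OP R-CNN for the aforementioned four tasks jointly:
$$L = L_{\mathrm{cls}}(p, u) + \lambda_{\mathrm{loc}} [u \geq 1] L_{\mathrm{loc}}(t_u, t^*_u) + [u \geq 1] \sum_{i=1}^{P_u} \Big( \lambda_{\mathrm{vis}} L_{\mathrm{vis}}(q_{u,i}, v_{u,i}) + \lambda_{\mathrm{loc}} [v_{u,i} = 1] L_{\mathrm{loc}}(t_{u,i}, t^*_{u,i}) \Big). \quad (1)$$
In Equ. (1), $p = (p_0, \dots, p_J)$ is the predicted object class probability vector, which is computed by the softmax operation over the $(J + 1)$ confidence values from the object classification output layer. $q_{j,i}$ is the predicted probability of the presence of the $i$-th part of the object class $j$. $L_{\mathrm{cls}}(p, u) = -\log p_u$ is a log loss for the multi-class object classification task, while $L_{\mathrm{vis}}(q, v) = -v \log q - (1 - v) \log(1 - q)$ is a binary cross-entropy loss for the part visibility classification task. Following (Girshick, 2015), $L_{\mathrm{loc}}(t, t^*) = \sum_{c \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_c - t^*_c)$ (here we omit the class index subscript $j$ or $j, i$ for notational brevity) is a smooth $L_1$ loss for the bounding-box regression task. $\lambda_{\mathrm{loc}}$ and $\lambda_{\mathrm{vis}}$ are loss balance hyper-parameters. $[\cdot]$ is the Iverson bracket indicator function, which outputs 1 when the enclosed statement is true and 0 otherwise. It implies that we only use the positive examples (i.e., those for which a ground-truth bounding box is available) for training the bounding-box regressors.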
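To make the four-task loss concrete, the following is a minimal NumPy sketch for a single ROI. It is an illustration under stated assumptions rather than the training code used in the paper: the function names, the argument layout and the balance weights `lam_loc` and `lam_vis` (set to 1.0 as placeholders) are hypothetical, and the exact weighting of the terms is assumed.

```python
import numpy as np

def smooth_l1(x):
    """Element-wise smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def loc_loss(t_pred, t_true):
    """Smooth L1 loss summed over the four box coordinates."""
    diff = np.asarray(t_pred, float) - np.asarray(t_true, float)
    return float(np.sum(smooth_l1(diff)))

def op_rcnn_loss(p, u, t_obj, t_obj_gt, q, v, t_part, t_part_gt,
                 lam_loc=1.0, lam_vis=1.0):
    """Multi-task loss for one ROI (hypothetical argument layout).

    p: (J+1,) softmax probabilities; u: ground-truth class (0 = background).
    t_obj, t_obj_gt: (4,) predicted / target object box offsets for class u.
    q: (P,) predicted part visibility probabilities; v: (P,) 0/1 visibility.
    t_part, t_part_gt: (P, 4) predicted / target part box offsets.
    lam_loc, lam_vis: assumed balance weights (placeholder values).
    """
    loss = -np.log(p[u])                       # log loss, object classification
    if u >= 1:                                 # Iverson bracket [u >= 1]
        loss += lam_loc * loc_loss(t_obj, t_obj_gt)
        q = np.clip(np.asarray(q, float), 1e-12, 1 - 1e-12)
        v = np.asarray(v, float)
        bce = -(v * np.log(q) + (1 - v) * np.log(1 - q))
        loss += lam_vis * float(np.sum(bce))   # part visibility classification
        for i in np.flatnonzero(v == 1):       # Iverson bracket [v_i = 1]
            loss += lam_loc * loc_loss(t_part[i], t_part_gt[i])
    return float(loss)

# A background ROI only contributes the classification term.
loss_bg = op_rcnn_loss([0.5, 0.5], 0, None, None, [], [], [], [])
# A foreground ROI with two parts, of which only the first is visible.
loss_fg = op_rcnn_loss([0.2, 0.8], 1, [0, 0, 0, 0], [0.5, 0, 0, 2],
                       [0.9, 0.1], [1, 0],
                       [[0, 0, 0, 0]] * 2, [[0, 0, 0, 0]] * 2)
```

Only visible parts enter the part regression term, which mirrors the role of the Iverson brackets: invisible parts and background ROIs have no regression target.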
As shown in Fig. 1(b), our DeePM is a detection pipeline composed of a two-stream DCNN and a latent graphical model. The DCNN is specifically designed for our task (symbiotic object detection and part localization). It is responsible for generating both the object and part proposals as well as the corresponding deep features, which are used in the appearance terms of the graphical model. The graphical model combines the deep features with object-part geometry and co-occurrence constraints, which depend on the object and part types in order to capture typical configurations. We describe the details of the DCNN and the graphical model in Sec. 5.1 and 5.2, respectively.
5.1 A TWO-STREAM DCNN FOR OBJECTS AND PARTS
We propose a two-stream DCNN to generate detection proposals and appearance features for both the object and semantic parts based on the Faster R-CNN framework. Its architecture is illustrated in Fig. 1 (b). The convolutional layers are shared in the early stages, and then split into two separate streams which correspond to object and part classes, respectively. This is desirable because the low-level visual representations (e.g., oriented edges, color blobs) are commonly shared among different classes but the mid-level ones should be class-specific. After that, all the subsequent layers are designed for objects and parts in separate streams. For either the object-level or part-level stream, we adopt a RPN (Ren et al., 2015) to generate the object or part bounding-box proposals correspondingly. The RPNs, which share the convolutional layers with the detection networks, can efficiently generate the object and part proposals in a learning-based manner. It is more desirable (especially for parts) than traditional region proposal methods (e.g., Selective Search (Uijlings et al., 2013)) which are usually based on low-level or middle-level visual cues.
In each stream, the rectangular detection proposals are processed by an ROI pooling layer (He et al., 2014; Girshick, 2015), which generates a fixed-dimensional representation based on the feature activations of the last convolutional layer. The pooled features of the detection proposals are then fed into the fully-connected (FC) layers, followed by a bounding-box regression layer for improving the localization accuracy. We use the last FC layer’s output activations as the feature representation of the appearance terms in the subsequent graphical model (see Fig. 1(b)).
In training the part-level stream of our DCNN, we combine some part classes together into one category. For instance, some parts, which are defined in the sense of spatial positions w.r.t. the object center, have indistinguishable visual appearance (e.g., front and back bicycle wheels) so they should be merged into one category for training the deep network.
5.2 A PART-BASED GRAPHICAL MODEL WITH FLEXIBLE SHARING
For an object class, we define a constellation model with $(P + 1)$ nodes, where $P$ is the number of semantic parts involved in this object. Let $i = 0$ denote the object node and $i \in \{1, \dots, P\}$ index the node of the $i$-th part. For each node $i$, we parameterize its location in the image $I$ by a bounding box $l_i$. To account for the visual variations of an object or a part, we introduce a “type” variable $z_i$ for each node $i$. Let $\mathcal{T}_0$ and $\mathcal{T}_i$ ($i = 1, \dots, P$) denote the candidate type sets of the object and the $i$-th part, respectively. In particular, we use the type value $z_i = 0$ to represent the invisibility of the part $i$. Then we define the configuration of an object associated with its parts by $(L, Z)$, where $L = (l_0, l_1, \dots, l_P)$ and $Z = (z_0, z_1, \dots, z_P)$.
In our graphical model there are three kinds of terms: the appearance term $S^a$, the geometry compatibility term $S^g$ and the bias term $S^b$. We define the scoring function of a configuration of the object and its parts as Equ. (2):
$$S(I, L, Z) = \sum_{i=0}^{P} S^a(I, l_i, z_i) + \sum_{i=1}^{P} S^g(l_0, l_i, z_0, z_i) + S^b(Z). \quad (2)$$
Figure 2: Illustration on flexible type sharing in DeePM. The red solid circle represents the object node, and the blue solid circles indicate the part nodes. The red and blue hollow marks correspond to different types of object and part nodes, respectively. In our graphical model, one part type (i.e., a sideview horse head) could be shared by different object types (i.e., a fully-visible sideview horse, a highly occluded horse with only head and neck). In the inference step, only one type-specific pairwise relation (i.e., the bold edge) would be selected from the candidate type pair set for each occurrence or geometric term (best viewed in color).
The appearance term: The appearance term of each node is defined as a linear model on the last FC layer’s feature. For a bounding box $l$ on $I$, let $\phi^o(I, l)$ and $\phi^p(I, l)$ denote the last FC layer’s features of our DCNN’s object and part streams, respectively. Formally, the appearance term is defined by $S^a(I, l_0, z_0) = w^a_{0, z_0} \cdot \phi^o(I, l_0)$ (for the object node) or $S^a(I, l_i, z_i) = w^a_{i, z_i} \cdot \phi^p(I, l_i)$ (for the part node $i$). We can see that the model parameter $w^a_{i, z_i}$ depends on the object or part type $z_i$, which accounts for typical part configurations or visual variations correspondingly.
The geometry compatibility term: Similar to (Chen et al., 2014), we use a vector $d_i$ to represent the spatial deformation of the part $i$ w.r.t. the object; its entries are the normalized spatial displacements and the normalized scales of $l_i$ relative to $l_0$. Furthermore, we define a type-specific prototype parameter of the geometry term by $\mu_{i, z_0, z_i}$, which specifies the “anchor” point in the geometry feature space. Thus the geometry term $S^g(l_0, l_i, z_0, z_i) = w^g_{i, z_0, z_i} \cdot \psi(d_i; \mu_{i, z_0, z_i})$ measures the geometry compatibility between the object and the part $i$, where $w^g_{i, z_0, z_i}$ is a type-specific weight vector. Meanwhile, $\psi(d_i; \mu_{i, z_0, z_i})$ is a feature vector linearizing the quadratic deformation penalty of $d_i$ w.r.t. $\mu_{i, z_0, z_i}$, i.e., $\psi(d; \mu) = -(d - \mu) \odot (d - \mu)$, where $\odot$ denotes the element-wise product.
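As an illustration of the geometry term, the sketch below computes a deformation vector and its linearized quadratic penalty. The concrete parameterization is an assumption: a 4-dimensional vector of center displacements and width/height ratios, both normalized by the object box size, and a penalty feature equal to the negated element-wise squared deviation from the prototype. The function names are hypothetical.

```python
import numpy as np

def deformation_vector(obj_box, part_box):
    """Assumed 4-d deformation of a part w.r.t. its object.

    Boxes are (x1, y1, x2, y2). Returns (dx, dy, sx, sy): center
    displacement and size ratio, normalized by the object width/height.
    """
    ox1, oy1, ox2, oy2 = obj_box
    px1, py1, px2, py2 = part_box
    ow, oh = ox2 - ox1, oy2 - oy1
    dx = ((px1 + px2) - (ox1 + ox2)) / (2.0 * ow)
    dy = ((py1 + py2) - (oy1 + oy2)) / (2.0 * oh)
    sx = (px2 - px1) / ow
    sy = (py2 - py1) / oh
    return np.array([dx, dy, sx, sy])

def geometry_feature(d, mu):
    """Negated element-wise squared deviation from the prototype mu."""
    return -(np.asarray(d, float) - np.asarray(mu, float)) ** 2

def geometry_score(w_g, d, mu):
    """Linear geometry compatibility score w_g . psi(d; mu)."""
    return float(np.dot(w_g, geometry_feature(d, mu)))

d = deformation_vector((0, 0, 100, 200), (25, 50, 75, 100))
score_at_anchor = geometry_score(np.ones(4), d, d)
score_off_anchor = geometry_score(np.ones(4), d, [0.0, 0.0, 0.5, 0.25])
```

With non-negative weights, the score is highest (zero) when the part sits exactly at the type-specific anchor and decreases quadratically as it moves away.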
The bias term: We define bias terms to model the prior belief of different object-part configurations and type co-occurrence. Specifically, the unary bias term $b_{z_0}$ favors particular type assignments for the object node, while the pairwise bias term $b_{i, z_0, z_i}$ favors particular type co-occurrence patterns between the object and the part $i$. Together they form the bias term $S^b(Z) = b_{z_0} + \sum_{i=1}^{P} b_{i, z_0, z_i}$.
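The three kinds of terms can be combined into a single configuration score as sketched below. This is a minimal sketch under assumptions: the parameter container `params` and its keys are hypothetical, the geometry feature is taken to be the negated squared deviation from a type-specific prototype, and an invisible part (type 0) is assumed to contribute only its pairwise bias.

```python
import numpy as np

def config_score(phi_obj, phi_parts, d, z, params):
    """Score one object-part configuration.

    phi_obj: object-stream feature of the object box.
    phi_parts[i-1], d[i-1]: part-stream feature and deformation vector of part i.
    z: node types (z[0] for the object; z[i] == 0 marks part i as invisible).
    params: dict of type-indexed parameters (hypothetical layout).
    """
    z0 = z[0]
    score = np.dot(params["w_obj"][z0], phi_obj) + params["b_obj"][z0]
    for i in range(1, len(z)):
        zi = z[i]
        score += params["b_pair"][(i, z0, zi)]      # type co-occurrence bias
        if zi == 0:                                 # invisible part: assumed to
            continue                                # contribute only its bias
        score += np.dot(params["w_part"][(i, zi)], phi_parts[i - 1])
        psi = -(d[i - 1] - params["mu"][(i, z0, zi)]) ** 2
        score += np.dot(params["w_geo"][(i, z0, zi)], psi)
    return float(score)

# Toy model: one object type, one part with a single visible type.
params = {
    "w_obj": {1: np.array([1.0, 0.0])},
    "b_obj": {1: 0.5},
    "w_part": {(1, 1): np.array([2.0])},
    "mu": {(1, 1, 1): np.zeros(4)},
    "w_geo": {(1, 1, 1): np.ones(4)},
    "b_pair": {(1, 1, 1): 0.1, (1, 1, 0): -0.2},
}
phi_obj = np.array([3.0, 4.0])
s_visible = config_score(phi_obj, [np.array([1.5])],
                         [np.array([0.5, 0.0, 0.0, 0.0])], [1, 1], params)
s_invisible = config_score(phi_obj, [None], [None], [1, 0], params)
```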
Flexible part type sharing: In this paper, the object type represents global object-part configuration (e.g., a particular object pose or viewpoint), while the part type corresponds to a typical visual appearance component in a mixture distribution. In our DeePM model, the pairwise edges allow the connections on arbitrary object-part type pairs for each part i. This enables flexible part type sharing among different object configurations, which is different from the tied object-part types in previous DPM models (Felzenszwalb et al., 2010; Azizpour & Laptev, 2012; Chen et al., 2014) on object detection. For example, as illustrated in Fig. 2, one part type (i.e., a sideview horse head) could be shared by different horse object types (i.e., a fully-visible sideview horse, a highly occluded horse with only head and neck). This is desirable in the sense of compactness and efficiency on representation.
Flexible model parameter sharing: As mentioned in Sec. 5.1, some part classes are merged into one category in training the DCNN, making the appearance features non-discriminative among these parts. Accordingly, the model parameters of their appearance terms should be shared. E.g., as shown in Fig. 2, the appearance model of the sideview horse head is invariant to the horizontal orientation (i.e., facing left or right).
In this paper, DeePM is formulated in a more flexible and modular manner than previous DPM models. For this purpose we introduce a dictionary of part appearance models denoted by $\mathcal{D}^a = \{w^a_k\}$, where each $w^a_k$ ($k = 1, \dots, K^a$) is an elemental appearance model which could be shared by different parts. Here we adopt the same notation $w^a$ as before to indicate the appearance model parameter, but use a subscript of linear index $k$ instead. Then we define an index mapping $\rho^a$ to transform the two-tuple subscript $(i, z_i)$ to $k = \rho^a(i, z_i)$. $\rho^a$ is generally a many-to-one mapping function which enables flexible sharing of the appearance models between different parts (sharing could even be allowed for parts from different object classes). Likewise, we also define a dictionary of geometry models $\mathcal{D}^g = \{(w^g_k, \mu_k)\}$ and the corresponding index mapping function $\rho^g$, where $\mu_k$ ($k = 1, \dots, K^g$) is the geometry prototype associated with the $k$-th element $w^g_k$ in the dictionary.
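The dictionary and index mapping can be pictured as a lookup table. The sketch below is illustrative only: the part indices, the names `appearance_dict`, `rho_a` and `appearance_weight`, and the weight values are hypothetical. It shows a many-to-one mapping in which the front and back bicycle wheels (merged into one category in Sec. 5.1) share a single appearance model.

```python
import numpy as np

# Dictionary of elemental part appearance models, indexed by a linear index k.
appearance_dict = {
    0: np.array([1.0, 0.0]),   # shared "wheel" appearance model
    1: np.array([0.0, 1.0]),   # appearance model of some other part
}

# Many-to-one index mapping (part index i, part type z_i) -> dictionary index k.
rho_a = {
    (1, 1): 0,   # front wheel, type 1
    (2, 1): 0,   # back wheel, type 1: reuses the wheel model
    (3, 1): 1,   # a third part, type 1
}

def appearance_weight(i, z_i):
    """Look up the (possibly shared) appearance model of part i with type z_i."""
    return appearance_dict[rho_a[(i, z_i)]]

shared = appearance_weight(1, 1) is appearance_weight(2, 1)
```

Because both wheel entries resolve to the same array object, any update to the shared model during learning affects both parts at once, which is what keeps the representation compact.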
Now we rewrite the model scoring function (Equ. (2)) as below:
$$S(I, L, Z) = w^a_{0, z_0} \cdot \phi^o(I, l_0) + \sum_{i=1}^{P} w^a_{\rho^a(i, z_i)} \cdot \phi^p(I, l_i) + \sum_{i=1}^{P} w^g_{\rho^g(i, z_0, z_i)} \cdot \psi(d_i; \mu_{\rho^g(i, z_0, z_i)}) + S^b(Z). \quad (3)$$
Generally the bias parameters should not be shared between different types, because sharing may make the type prior non-informative. In practice, we manually specify $\rho^a$ based on the appearance similarity of the parts, which is consistent with the merged part category definition used in training the DCNN. For $\rho^g$, we do not impose model parameter sharing on the geometry terms. Note that one advantage of our formulation is that the sharing of model parameters can be specified at the part type level and even be learned automatically (i.e., learning the mapping functions on the fly). We will explore this direction in future work.
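As mentioned in Sec. 5.1, we generate the object and part detection proposals via the corresponding object-level and part-level RPNs, respectively. Let $\mathcal{O}$ denote an object proposal set and $\mathcal{Q}$ be a part proposal set for $I$. For any object bounding box $l_0 \in \mathcal{O}$, we define the set of its candidate part windows by $\mathcal{C}(l_0) = \{l \mid l \in \mathcal{Q}$ and $r(l, l_0) \geq \tau\}$, where $r(l, l_0)$ is the inside rate measuring the area fraction of a window $l$ inside $l_0$ and $\tau$ is a threshold (fixed in this paper). Thus given $l_0$, the best configuration $(\hat{L}, \hat{Z})$ can be inferred by maximizing the scoring function:
$$(\hat{L}, \hat{Z}) = \arg\max_{Z,\ l_i \in \mathcal{C}(l_0),\ i = 1, \dots, P} S(I, L, Z). \quad (4)$$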
We use dynamic programming to maximize Equ. (4) in inference. In the rest of this section, we elaborate on the learning method of our DeePM.
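Since every part node connects only to the object node, the maximization decomposes: for a fixed object box and object type, each part can be optimized independently. The sketch below illustrates this together with the inside-rate filtering of part proposals. It is an assumption-laden illustration: the callables `unary_obj`, `part_score` and `invisible_score` are hypothetical stand-ins for the model terms, and the threshold `tau=0.8` in the usage example is a placeholder, not the value used in the paper.

```python
import numpy as np

def inside_rate(window, obj_box):
    """Fraction of the window's area lying inside obj_box; boxes are (x1, y1, x2, y2)."""
    x1, y1 = max(window[0], obj_box[0]), max(window[1], obj_box[1])
    x2, y2 = min(window[2], obj_box[2]), min(window[3], obj_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = (window[2] - window[0]) * (window[3] - window[1])
    return inter / area if area > 0 else 0.0

def candidate_parts(obj_box, part_proposals, tau):
    """Keep the part proposals that lie mostly inside the object box."""
    return [l for l in part_proposals if inside_rate(l, obj_box) >= tau]

def infer(obj_types, part_types, candidates, unary_obj, part_score, invisible_score):
    """Exact maximization for a star-shaped model with a fixed object box.

    part_types: dict part index -> list of visible (positive) type values.
    unary_obj(z0): object appearance plus unary bias.
    part_score(i, z0, l, zi): appearance + geometry + pairwise bias of a visible part.
    invisible_score(i, z0): pairwise bias when part i takes the invisible type 0.
    Returns (best score, object type, [(part box or None, part type), ...]).
    """
    best = (-np.inf, None, None)
    for z0 in obj_types:
        total, parts = unary_obj(z0), []
        for i in sorted(part_types):
            # Parts are independent given the object node, so maximize each alone.
            options = [(invisible_score(i, z0), None, 0)]
            options += [(part_score(i, z0, l, zi), l, zi)
                        for l in candidates for zi in part_types[i]]
            s, l, zi = max(options, key=lambda o: o[0])
            total += s
            parts.append((l, zi))
        if total > best[0]:
            best = (total, z0, parts)
    return best

cands = candidate_parts((0, 0, 20, 20), [(1, 1, 5, 5), (15, 15, 30, 30)], tau=0.8)
best = infer(
    obj_types=[1, 2], part_types={1: [1]}, candidates=cands,
    unary_obj=lambda z0: {1: 1.0, 2: 0.5}[z0],
    part_score=lambda i, z0, l, zi: 2.0 if z0 == 2 else -1.0,
    invisible_score=lambda i, z0: 0.0,
)
```

In the toy example, object type 2 wins even though its unary score is lower, because its part evidence is stronger; under object type 1 the part would be declared invisible.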
Learning the DCNN: Generally we use similar criteria to train the DCNN as in (Ren et al., 2015). The shared convolutional layers are directly inherited from a pre-trained network. The fine-tuning procedure starts from the separate convolutional layers and runs through all subsequent layers in both the object-level and part-level streams. To enable the sharing of convolutional layers between RPN and Fast R-CNN, we follow the four-step stage-wise training procedure of (Ren et al., 2015).
Learning object and part types: In this paper, the object types are defined to capture typical configurations, while the part types account for different components in a mixture of visual appearances. Thus we learn the types by using two different criteria for objects and parts, respectively.
Similar to (Azizpour & Laptev, 2012), we adopt a pose-based global configuration feature to learn the object types. Concretely, for each positive ground-truth example we concatenate the geometry features of all visible parts as well as the binary part visibility indicators into a single vector, and then use it as the global configuration feature in a modified K-means clustering algorithm (Azizpour & Laptev, 2012) which can handle missing data (i.e., some parts may be absent). After that, we obtain several clusters as object types, which potentially correspond to different typical configurations. Fig. 8 and 9 visualize the learned types for horse and person, respectively.
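One generic way to cluster configuration features with absent parts is to restrict distances and centroid updates to the observed entries. The sketch below illustrates this idea; it is not necessarily the modified K-means of (Azizpour & Laptev, 2012), and the function name `masked_kmeans`, the mask convention and the toy data are assumptions made for illustration.

```python
import numpy as np

def masked_kmeans(X, M, k, iters=20, seed=0):
    """K-means where M[n, d] = 1 marks an observed entry of X[n, d].

    Distances and centroid updates use observed entries only. This is a
    generic missing-data variant, used here as a stand-in for the cited method.
    """
    rng = np.random.default_rng(seed)
    M = np.asarray(M, float)
    X = np.where(M > 0, np.asarray(X, float), 0.0)   # zero out missing entries
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Mean squared distance over the observed dimensions of each example.
        diff2 = (X[:, None, :] - centers[None, :, :]) ** 2 * M[:, None, :]
        dist = diff2.sum(axis=2) / np.maximum(M.sum(axis=1), 1)[:, None]
        labels = dist.argmin(axis=1)
        for c in range(k):
            sel = labels == c
            count = M[sel].sum(axis=0)
            observed = count > 0
            # Update only the dimensions observed by at least one member.
            centers[c, observed] = X[sel][:, observed].sum(axis=0) / count[observed]
    return labels, centers

# Toy data: two groups; the second feature is missing for examples 1 and 3.
X = [[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.2, 0.0]]
M = [[1, 1], [1, 0], [1, 1], [1, 0]]
labels, centers = masked_kmeans(X, M, k=2)
```

In the DeePM setting, each row would hold the concatenated part geometry features, with the mask derived from the binary part visibility indicators.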
For part classes, we adopt the feature activities after the ROI pooling layer to learn the types. Specifically such features from the DCNN’s part-level stream are used to perform K-means clustering on the ground-truth part data, and the resultant clusters work as part types.
Learning the graphical model with latent SVM: To facilitate the introduction of learning our graphical model, we gather all the model parameters $w$ and $b$ into one single vector $W = (w^a, w^g, b)$. Given labelled positive training examples $\{(I_n, y_n = +1, L^*_n)\}$ and negative examples $\{(I_n, y_n = -1)\}$, where $L^*_n$ involves all the visible ground-truth part bounding boxes for the example $n$ (the negative examples do not have parts), we learn the model parameters $W$ via the latent SVM framework (Felzenszwalb et al., 2010):
$$\min_{W} \ \frac{1}{2} \|W\|^2 + C \sum_{n} \max\Big(0,\ 1 - y_n \max_{(L, Z) \in H_n} S(I_n, L, Z; W)\Big), \quad (5)$$
where $C$ is a predefined hyper-parameter on model regularization and $H_n$ represents a feasible set of latent configurations for example $n$. In this paper, the definition of $H_n$ differs between positive and negative examples. For positive examples, we constrain the search space $H_n$ with the ground-truth bounding boxes of the parts which are visible (i.e., the candidate part locations $L$ should be consistent with $L^*_n$, and the candidate type values of $z_i$ should be larger than 0 for any visible part $i$). This constraint encourages the correct configurations of positive examples to be scored higher than a positive margin value (i.e., $+1$). For negative examples, in contrast, we do not impose any restrictions on $H_n$, implying that the scores should be less than a negative margin (i.e., $-1$) for all possible configurations of part locations and types.
Due to the existence of latent variables for positive examples, the problem of Equ. (5) is not convex, and thus we employ the CCCP algorithm (Yuille & Rangarajan, 2002) to minimize the loss iteratively. At first, we initialize the object and part type values of positive examples according to the assignments from the aforementioned type clustering stage. Then we iteratively optimize Equ. (5) by alternating between two steps: (1) Given the type value assignments of positive examples, Equ. (5) becomes convex and we use a dual coordinate quadratic program solver (Ramanan, 2013) to minimize the hinge loss. (2) Given the current model W, we search for the best type assignments of the object and visible parts for positive examples. By iterating these two steps, the loss of Equ. (5) decreases monotonically until convergence.
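The alternation can be sketched on toy data as follows. This sketch assumes that the score of a configuration is linear in W, so each feasible latent configuration is represented by one joint feature vector. For brevity, plain subgradient descent replaces the dual coordinate quadratic program solver of step (1); the function names, step size and iteration counts are hypothetical.

```python
import numpy as np

def best_latent(W, feats):
    """Index of the highest-scoring latent configuration.
    feats: (H, D) array, one joint feature vector per feasible configuration."""
    return int(np.argmax(feats @ W))

def lsvm_objective(W, pos, neg, C):
    """0.5 * ||W||^2 + C * sum of hinge losses, maximizing over latent configurations."""
    loss = 0.0
    for feats in pos:
        loss += max(0.0, 1.0 - np.max(feats @ W))
    for feats in neg:
        loss += max(0.0, 1.0 + np.max(feats @ W))
    return 0.5 * float(W @ W) + C * float(loss)

def cccp(W, pos, neg, C, lr=0.01, outer=5, inner=200):
    """Alternate between fixing the latent choices of the positives and
    minimizing the resulting convex objective (by subgradient descent here)."""
    for _ in range(outer):
        # Step (2): best latent configuration of each positive under current W.
        fixed = [feats[best_latent(W, feats)] for feats in pos]
        # Step (1): convex problem with the positive latent variables fixed.
        for _ in range(inner):
            grad = W.copy()
            for f in fixed:
                if 1.0 - f @ W > 0:
                    grad -= C * f
            for feats in neg:
                h = best_latent(W, feats)
                if 1.0 + feats[h] @ W > 0:
                    grad += C * feats[h]
            W = W - lr * grad
    return W

pos = [np.array([[2.0, 0.0], [0.0, 0.0]])]   # one positive, two latent configurations
neg = [np.array([[-2.0, 0.0]])]              # one negative, one configuration
W0 = np.zeros(2)
obj_before = lsvm_objective(W0, pos, neg, C=1.0)
W1 = cccp(W0, pos, neg, C=1.0)
obj_after = lsvm_objective(W1, pos, neg, C=1.0)
```

With a fixed step size the subgradient iterates only hover around the optimum, so this sketch does not reproduce the monotone decrease that an exact solver for step (1) provides.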
7.1 THE VISUAL TASK AND EVALUATION CRITERION
We evaluate the proposed OP R-CNN and DeePM models on the task of symbiotic object detection and semantic part localization. It requires the model to output all detected object bounding boxes together with the corresponding part bounding boxes (if visible). Each object bounding box is associated with a prediction score or probability indicating the confidence that an object of the class of interest is present. Likewise, all the output part bounding boxes are associated with the visibility confidence values of the corresponding part classes. Note that our task is different from, and more challenging than, independent object or part detection in the sense that it asks for the correspondence between the output object and part detections. E.g., among several overlapping person detections, it requires knowing which head bounding box corresponds to which person bounding box.
In this paper, we first use the average precision (AP) (Everingham et al., 2010), which is a standard evaluation criterion in the object detection literature, as the performance evaluation criterion for both the object and part detection tasks. In particular, it is a stricter measure of part localization performance than the percentage of correct parts (PCP) criterion which is commonly used in the pose estimation literature. PCP only considers the part detections involved in true positive object bounding boxes, making it uninformative about false positive detections. Because we cannot assume that the object bounding boxes are perfectly detected in advance of part localization in our task, AP is a more suitable performance evaluation criterion than PCP for part detection.
However, because it calculates the object and part detection performance separately, the standard AP criterion is not well suited to the task of symbiotic object detection and semantic part localization. In this paper, we propose a new performance evaluation criterion (named “(1+k)” AP) for this task. Specifically, the “(1+k)” AP is defined as the average precision in the sense that both the object and at least k of its parts are correctly detected (i.e., the IoU overlap w.r.t. the ground-truth object/part bounding box is larger than 0.5, or no bounding box is output for an invisible part). For instance, the “(1+2)” AP means that only the detections in which both the object and no fewer than 2 parts are predicted correctly are regarded as true positive examples. As in the standard PASCAL VOC AP criterion (Everingham et al., 2010), duplicate detections are counted as false positives. Thus, the proposed “(1+k)” AP criterion produces a set of AP numbers, each of which corresponds to a different required number of correctly detected parts. For k = 0, it does not require the parts to be detected correctly, and the AP number corresponds to the performance of detecting objects alone. When k is equal to the number of all possible parts for an object class, the “(1+k)” AP number is the strictest one, because only perfect joint object-part detections (i.e., both the object and all the corresponding parts are correctly detected) are counted as true positive examples.
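The criterion can be sketched for a single image and object class as follows. The data layout (a dict from part name to box, with `None` for a part that is invisible or not output), the greedy matching to the highest-overlap ground truth, and the use of the area under the interpolated precision-recall curve are assumptions made for illustration; a detection that fails the part requirement is counted as a false positive without consuming its ground-truth object.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def num_correct_parts(det_parts, gt_parts, thr=0.5):
    """Count parts that are localized correctly or correctly left out."""
    correct = 0
    for name, gt in gt_parts.items():
        det = det_parts.get(name)
        if gt is None:
            correct += int(det is None)        # invisible part, nothing output
        elif det is not None and iou(det, gt) > thr:
            correct += 1
    return correct

def one_plus_k_ap(detections, gts, k, thr=0.5):
    """detections: list of (score, obj_box, det_parts); gts: list of (obj_box, gt_parts)."""
    matched, tp = set(), []
    for score, box, parts in sorted(detections, key=lambda det: -det[0]):
        overlaps = [iou(box, g[0]) for g in gts]
        j = int(np.argmax(overlaps)) if overlaps else -1
        ok = (j >= 0 and overlaps[j] > thr and j not in matched
              and num_correct_parts(parts, gts[j][1], thr) >= k)
        if ok:
            matched.add(j)                     # later duplicates become false positives
        tp.append(1.0 if ok else 0.0)
    cum_tp = np.cumsum(np.array(tp))
    recall = cum_tp / max(len(gts), 1)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    # Area under the monotonically decreasing precision envelope.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

gts = [((0, 0, 10, 10), {"head": (0, 0, 4, 4), "tail": None})]
# The object box is correct, the head is misplaced, the invisible tail is omitted.
dets = [(0.9, (0, 0, 10, 10), {"head": (6, 6, 10, 10), "tail": None})]
ap_by_k = [one_plus_k_ap(dets, gts, k) for k in (0, 1, 2)]
```

In the toy example the same detection is a true positive for k = 0 and k = 1 but a false positive for k = 2, which is how the criterion yields one AP number per value of k.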
7.2 EXPERIMENTAL SETTINGS
Because we only annotate the semantic parts on the trainval images of PASCAL VOC 2012, we perform the diagnostic experiments and the part detection evaluation using the train set for training and the val set for testing. In addition, we test our methods on the object detection task with the VOC 2012 test set, and compare the object detection performance with Faster R-CNN (Ren et al., 2015). The AP number is evaluated for each (either object or part) class individually and the mean AP (mAP) is reported.
We construct our OP R-CNN based on the Faster R-CNN architecture (Ren et al., 2015), in which it uses the RPN to generate region proposals. Similar to (Girshick, 2015; Ren et al., 2015), we set the loss balance hyper-parameters by and
. All other parameters in training the OP R-CNN follow the settings in (Ren et al., 2015). All the DCNNs in our experiments are fine-tuned from a pretrained VGG-16 net (Simonyan & Zisserman, 2015).
In training the DCNN for our DeePM model, we follow the parameter settings in (Ren et al., 2015). The convolutional layers from conv1_1 to conv2_2 are shared in the DCNN, and the separate streams start from conv3_1. We use the feature activations of the last FC layer of the VGG-16 net (i.e., fc7) as the appearance feature in the graphical model. We also normalize the fc7 features as in (Girshick et al., 2014), and set C = 0.001. Similar to (Girshick et al., 2014), we use hard negative example mining over all images, with an IoU overlap threshold of 0.3 (the object proposal windows with overlap less than 0.3 w.r.t. the ground-truth boxes are used as negative examples). We use 5 types for each object class and 3 types for each part class. In the testing stage, we use the same inference heuristics as in (Ren et al., 2015). The bounding-box regressors, which are learned from the object-level and part-level streams of the DCNN, are used in the non-maximum suppression (NMS) operations of object and part detections, respectively. For comparison, we obtain the results of Fast R-CNN and Faster R-CNN by using the code released by the authors of (Ren et al., 2015). For Fast R-CNN (Girshick, 2015), we use the selective search approach (Uijlings et al., 2013) to generate object region proposals.
7.3 EXPERIMENT RESULTS AND ANALYSIS
We first perform diagnostic experiments on OP R-CNN and DeePM, and then compare them with other object detection methods.
Table 1 shows the detection performance for object classes. In this experiment, we build a baseline method that uses the same fc7 features of the DCNN in DeePM to train SVM classifiers for object detection. This enables a direct comparison with our DeePM, in order to investigate the significance of the geometry and co-occurrence cues in the graphical model. We use ‘fc7+svm’ to denote this baseline method in table 1. DeePM outperforms the fc7+svm baseline by 1.4% in mAP, showing the significance of the geometry and co-occurrence constraints in the explicit representation learned with semantic parts. Moreover, its performance is comparable with OP R-CNN and superior to Fast R-CNN and Faster R-CNN on object detection.
For part detection, as shown in table 2, the DeePM model outperforms OP R-CNN by 2.9% in mAP. In particular, the improvement tends to be relatively large for small parts (e.g., animal tails and heads). OP R-CNN learns an implicit feature representation extracted from the object bounding box to regress the part location, which may make it difficult to predict the highly variable positions of small parts. By contrast, DeePM employs the part-level stream of the DCNN to generate part bounding-box proposals, and then explicitly leverages useful geometry and co-occurrence cues to localize the parts in a symbiotic manner.
Moreover, we compare DeePM with a deformable part-based baseline model that has a single object type and the same set of semantic parts, each with a single type. This baseline is essentially a ‘DPM-v1’ counterpart (Felzenszwalb et al., 2010) built on top of the learned deep features and RPN proposals. In figure 4, we compare DeePM (5 types for the object node, 3 types for each part node) with the ‘DPM-v1’-like single-type baseline on four object classes, i.e., bicycle, boat, horse and sofa. With only a single type per node, the geometric and co-occurrence terms are not informative, which leads to poor performance in detecting the object and its parts. As shown in figure 4, DeePM generally outperforms the single-type ‘DPM-v1’ baseline for both object and part detection, indicating the significance of type-specific geometric and co-occurrence cues in the graphical model. In addition, following the strategy proposed in (Hoiem et al., 2012), we give a detailed analysis of the performance w.r.t. object/part size in figure 5. Extremely small (‘XS’) and small (‘S’) instances are very difficult to detect, and the performance on extremely large (‘XL’) objects or parts is also relatively low because highly truncated or occluded examples often occur at this size level.
The experiments above evaluate object and part detection performance separately, which enables us to compare our methods with previous object detection approaches and to show the detection performance for each individual part class. However, as mentioned in Sec. 7.1, the standard evaluation criterion (i.e., the PASCAL VOC AP criterion) is not well suited to the main task of this paper (i.e., symbiotic object detection and semantic part localization). For this purpose, we also test the joint object-part detection performance for these four object classes using the proposed “(1 + k)” AP criterion. As shown in figure 6, the performance decreases dramatically as the required number of correctly detected parts (i.e., k) becomes larger, which implies that the corresponding detection task is more difficult. When k equals the number of all possible parts, the “(1 + k)” AP is extraordinarily low (usually less than 1%) because of the extremely strict definition of true positive examples. DeePM consistently outperforms OP R-CNN in this “(1 + k)” AP (when k > 0), which indicates that our flexible graphical model does help in a more complicated detection task such as the joint detection of objects and semantic parts.
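The true-positive test behind the “(1 + k)” AP can be sketched as follows. This is only an illustration under stated assumptions: the precise definition is given in Sec. 7.1, and here we assume that a detection counts as a true positive when its object box matches the ground truth and at least k of the annotated parts are also correctly localized, with an IoU threshold of 0.5 for both objects and parts and with parts matched by part-class name. The function names and the dictionary layout are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_one_plus_k_true_positive(pred_object, pred_parts, gt_object, gt_parts,
                                k, iou_threshold=0.5):
    """Assumed (1 + k) test: object correct AND at least k parts correct.

    pred_parts / gt_parts map a part-class name to a box; a part that is
    not predicted (or not annotated) simply cannot be counted as correct.
    """
    if iou(pred_object, gt_object) < iou_threshold:
        return False
    correct_parts = 0
    for name, gt_box in gt_parts.items():
        pred_box = pred_parts.get(name)
        if pred_box is not None and iou(pred_box, gt_box) >= iou_threshold:
            correct_parts += 1
    return correct_parts >= k
```

Under this reading, k = 0 reduces to the usual object-level test, and raising k makes the criterion strictly harder to satisfy, which is consistent with the drop in performance observed as k grows.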
Finally, we test the object detection performance of our methods on the VOC 2012 test set. In this experiment, the models are trained on the VOC 2012 trainval set, and the parameter settings are consistent with the diagnostic experiments above. As shown in table 3, OP R-CNN and DeePM obtain slightly superior performance w.r.t. Fast/Faster R-CNN. In figure 7, we visualize some examples of DeePM’s detection results.
In this paper, we study learning part-based representations for symbiotic object detection and semantic part localization. For this purpose, we annotate semantic parts for all the 20 object classes on the PASCAL VOC 2012 dataset, which provide information for reasoning about object pose, occlusion, viewpoint and functionality. To address this task, we propose both implicit (OP R-CNN) and explicit (DeePM) solutions. We evaluate our methods for both object and part detection on PASCAL VOC 2012, and show that DeePM consistently outperforms OP R-CNN (especially by a relatively large margin on part detection), implying the importance of using learning-based part proposals and explicit geometry cues for part localization. In addition, we propose a new “(1 + k)” AP performance criterion for evaluating the task of symbiotic object detection and semantic part localization. DeePM consistently outperforms OP R-CNN in the “(1 + k)” AP (when k > 0), indicating that our flexible graphical model does help in this more complicated detection task.
Azizpour, Hossein and Laptev, Ivan. Object detection using strongly-supervised deformable part models. In European Conference on Computer Vision (ECCV), 2012.
Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected crfs. In International Conference on Learning Representations (ICLR), 2015.
Chen, Xianjie and Yuille, Alan L. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems (NIPS), 2014.
Chen, Xianjie, Mottaghi, Roozbeh, Liu, Xiaobai, Fidler, Sanja, Urtasun, Raquel, and Yuille, Alan. Detect what you can: Detecting and representing objects using holistic models and body parts. In Computer Vision and Pattern Recognition (CVPR), 2014.
Dalal, Navneet and Triggs, Bill. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), 2005.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
Felzenszwalb, Pedro F, Girshick, Ross B, McAllester, David, and Ramanan, Deva. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2010.
Fischler, Martin A and Elschlager, Robert A. The representation and matching of pictorial structures. IEEE Transactions on computers, 1973.
Girshick, Ross. Fast r-cnn. In International Conference on Computer Vision (ICCV), 2015.
Girshick, Ross, Donahue, Jeff, Darrell, Trevor, and Malik, Jitendra. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014.
Girshick, Ross, Iandola, Forrest, Darrell, Trevor, and Malik, Jitendra. Deformable part models are convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2015.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision (ECCV), 2014.
Hoiem, Derek, Chodpathumwan, Yodsawalai, and Dai, Qieyun. Diagnosing error in object detectors. In European Conference on Computer Vision (ECCV), 2012.
Johnson, Sam and Everingham, Mark. Clustered pose and nonlinear appearance models for human pose estimation. In British Machine Vision Conference (BMVC), 2010.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2015.
Ott, Patrick and Everingham, Mark. Shared parts for deformable part-based models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
Ouyang, Wanli, Wang, Xiaogang, Zeng, Xingyu, Qiu, Shi, Luo, Ping, Tian, Yonglong, Li, Hongsheng, Yang, Shuo, Wang, Zhe, Loy, Chen-Change, and Tang, Xiaoou. Deepid-net: Deformable deep convolutional neural networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Ramanan, Deva. Dual coordinate solvers for large-scale structural svms. http://arxiv.org/abs/1312.1743, 2013.
Ren, Shaoqing, He, Kaiming, Girshick, Ross, and Sun, Jian. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
Savalle, Pierre-André, Tsogkas, Stavros, Papandreou, George, and Kokkinos, Iasonas. Deformable part models with cnn features. In European Conference on Computer Vision (ECCV), Parts and Attributes Workshop, 2014.
Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
Uijlings, Jasper R. R., van de Sande, Koen E. A., Gevers, Theo, and Smeulders, Arnold W. M. Selective search for object recognition. International journal of computer vision (IJCV), 2013.
Wan, Li, Eigen, David, and Fergus, Rob. End-to-end integration of a convolutional network, deformable parts model and non-maximum suppression. In Computer Vision and Pattern Recognition (CVPR), 2015.
Yang, Weilong, Wang, Yang, and Mori, Greg. Recognizing human actions from still images with latent poses. In Computer Vision and Pattern Recognition (CVPR), 2010.
Yang, Yi and Ramanan, Deva. Articulated pose estimation with flexible mixtures-of-parts. In Computer Vision and Pattern Recognition (CVPR), 2011.
Yuille, Alan L. and Rangarajan, Anand. The concave-convex procedure (cccp). In Advances in Neural Information Processing Systems (NIPS), 2002.
Zhang, Ning, Donahue, Jeff, Girshick, Ross, and Darrell, Trevor. Part-based r-cnns for fine-grained category detection. In European Conference on Computer Vision (ECCV), 2014.
Zhu, Jun, Wang, Baoyuan, Yang, Xiaokang, Zhang, Wenjun, and Tu, Zhuowen. Action recognition with actons. In International Conference on Computer Vision (ICCV), 2013.
Zhu, Long, Chen, Yuanhao, Yuille, Alan L., and Freeman, William. Latent hierarchical structural learning for object detection. In Computer Vision and Pattern Recognition (CVPR), 2010.
Zhu, Yukun, Urtasun, Raquel, Salakhutdinov, Ruslan, and Fidler, Sanja. segdeepm: Exploiting segmentation and context in deep neural networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Table 1: Object detection average precision (%) on the PASCAL VOC 2012 val set.
Table 2: Part detection average precision (%) on the PASCAL VOC 2012 val set. Please refer to Table 4 for the full name of each object semantic part.
Table 3: Object detection average precision (%) on the PASCAL VOC 2012 test set.
Table 4: The full list of our part definitions for the 20 PASCAL VOC object classes. The abbreviation for each part is given in parentheses.
Figure 3: Illustration of our annotations of semantic object parts for the PASCAL VOC 2012 dataset. In this paper, we define 83 semantic part classes for all the 20 PASCAL VOC object classes. For clarity, we only visualize the part annotations for one object instance in each class. There may be multiple instances of the same object class in one image (e.g., the pictures of sheep and cow), and we have labelled the parts for each object instance (best viewed in color).
Figure 4: Performance (AP) comparison between DeePM (5 types used for the object node, 3 types used for each part node) and the ‘DPM-v1’-like single type baseline model. The blue and red color bars correspond to the baseline model and DeePM, respectively. (a) object class bicycle and its parts; (b) object class boat and its parts; (c) object class horse and its parts; (d) object class sofa and its parts. Please refer to table 4 for the full names of object parts (best viewed in color).
Figure 5: Performance (AP) w.r.t. different object/part size levels. ‘All’: all examples; ‘XS’: extremely small size; ‘S’: small size; ‘M’: medium size; ‘L’: large size; ‘XL’: extremely large size. (a) object class bicycle and its parts; (b) object class boat and its parts; (c) object class horse and its parts; (d) object class sofa and its parts. Please refer to table 4 for the full names of object parts (best viewed in color).
Figure 6: The “(1 + k)” average precision (%) comparison for DeePM and OP R-CNN. The red and green color bars correspond to DeePM and OP R-CNN, respectively. (a) bicycle; (b) boat; (c) horse; (d) sofa (best viewed in color).
Figure 7: Visualization of some examples of the object-part configurations detected by DeePM (best viewed in color).
Figure 8: Visualization of the learned object types for the horse class. The first column is the average image over the examples of different types. The second column shows the anchor part bounding boxes (i.e., the mean bounding boxes over the corresponding part instances) within the object bounding box. The remaining columns visualize the normalized center locations of part instances w.r.t. the object bounding box for each of the part classes (best viewed in color).
Figure 9: Visualization of the learned object types for the person class. The first column is the average image over the examples of different types. The second column shows the anchor part bounding boxes (i.e., the mean bounding boxes over the corresponding part instances) within the object bounding box. The remaining columns visualize the normalized center locations of part instances w.r.t. the object bounding box for each of the part classes (best viewed in color).