Rapid developments in object detection [1] and semantic segmentation [2] have made it possible to accurately understand region-level or pixel-level semantics of images, and an increasing number of researchers focus on the real-time capability of models [3]. Rather than performing object detection and semantic segmentation independently, integrating both into one framework is preferable. This leads to the more challenging instance segmentation task, which requires not only the class label of every object but also a pixel-level prediction that separates each object from the background. Addressing this task well would greatly benefit many applications, including autonomous driving, robotics and video surveillance.
Most successful instance segmentation approaches are derived from object detection models by adding a new branch on top of the regional features from a Region Proposal Network (RPN) [4] to predict the objects’ corresponding masks [5], [6], [7]. Such models can be referred to as two-stage instance segmentation models, and their efficiency is often limited by the detection-based scheme. A well-known example is the Mask R-CNN [5] framework built on the Faster R-CNN [4] backbone, in which the efficiency is limited by the two-stage mechanism with a heavy head and multiple Regions of Interest (RoIs) [4].
Apart from the two-stage scheme, the one-stage scheme has also been applied to object detection, and state-of-the-art one-stage detectors have reported accuracy similar to that of two-stage detectors at faster speed. With carefully designed architectures [8], [9], [10] and loss functions [11], one-stage detectors are able to run in real time. This gives rise to a natural question: can we design a one-stage instance segmentation model that enjoys the dual benefits of efficiency and accuracy?
The main problem in one-stage instance segmentation is how to distinguish instances between and within classes simultaneously from a convolutional feature map without the help of pre-generated RoIs. To address this problem, we propose a new framework named the Single Pixel Reconstruction Net (SPRNet), described in Figure 1. By using only a single pixel to reconstruct the mask of an instance, our method outperforms R-CNN based frameworks, where the prediction of an instance mask starts from each RoI, in that we can predict the instance mask faster with less memory consumption and allow more instances to be predicted simultaneously on an image. We use a group of convolutions with various dilations to gather enough information into a single pixel. To construct the mask of one instance, we sample a pixel and then use consecutive deconvolutions to construct a 32×32 score map denoting the final mask prediction.
Fig. 1: The SPRNet framework for instance segmentation. Classification, regression and mask branches are processed in parallel. We generate each instance mask from a single pixel, and resize the mask to fit the corresponding box to get the final instance-level prediction.
We also modify the backbone of our model to enhance the information carried by each pixel, which facilitates the implementation of SPRNet. Specifically, we borrow the principle of the Feature Pyramid Network (FPN) [10], where features from different levels are added to construct a bottom-up and a top-down path. In that implementation, a naive element-wise summation is used to fuse multi-level features. However, the FPN framework has two problems. First, the summation may introduce unexpected information flow in feature learning, in that severe spatial shift from higher levels could damage the already well-preserved spatial information in lower ones. Second, since the derivative of a summation is always a constant, summation causes gradient propagation between different levels, implicitly weakening the effect of the fused features. This matters because the motivation for designing FPN is to detect objects of different scales on different levels, which means that the gradient from one level should not easily interfere with that of another. Therefore, we propose an improved Gate-FPN (GFPN) by explicitly introducing a simple gating mechanism before feature fusion. This step improves the quality of feature fusion and restricts gradient propagation between different levels, leading to better detection and segmentation.
The proposed SPRNet has an encoder-decoder structure. In the encoding part, following a common backbone, Gate-FPN enhances the semantic and spatial information carried by each pixel. In the decoding part, each pixel acts as an instance carrier for generating an instance mask, and consecutive deconvolutions are applied to it to obtain the final prediction. To summarize, SPRNet extends the state-of-the-art one-stage detector RetinaNet by adding a parallel branch for predicting an object mask. Note that SPRNet is a general framework, so the backbone network can be replaced with other one-stage detectors without damaging the integrity and feasibility of instance mask generation. Our main contributions are as follows:
(i) To the best of our knowledge, we propose the first one-stage instance segmentation model. SPRNet achieves mask AP comparable to that of Mask R-CNN while delivering faster inference;
(ii) By introducing the Gate-FPN architecture, we bring all-round AP improvements in both detection and segmentation.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the proposed SPRNet model with Gate-FPN in detail. Experimental results are presented and analyzed in Sections 4 and 5, and Section 6 concludes the paper.
Object Detection: Prevalent object detection approaches can be categorized into two classes: the two-stage framework based on region proposals [4], [12], and the one-stage framework based on convolutional feature maps [8], [11], [13]. As pioneered in the R-CNN work [4], the first stage generates a set of candidate region proposals to recall as many objects as possible, and the second stage uses a deep network to classify the proposals. R-CNN underpinned many follow-up improvements such as Faster R-CNN [4] and R-FCN [14], which are the current leading frameworks for object detection. An unavoidable step of the two-stage methods is the per-proposal prediction, which becomes a speed bottleneck when the number of proposals is large. In contrast, the one-stage methods do not use region proposals and thus have high computational efficiency. Commonly adopted one-stage detectors include OverFeat [15], SSD [8], YOLO [13], DSSD [16] and RetinaNet [11]. OverFeat is one of the first modern one-stage object detectors based on deep networks. SSD [8] and YOLO [13] are carefully designed to speed up inference at the cost of some accuracy. Recently, DSSD [16] and RetinaNet [11] have renewed interest in one-stage methods, achieving accuracy that rivals that of two-stage detectors while running at much higher speeds.
Instance Segmentation: Instance segmentation requires a pixel-level prediction between and within classes [17]. Current instance segmentation methods can be categorized into two classes, i.e., detection-dependent and detection-free methods.
Combining the object detection task with the semantic segmentation task results in a more challenging instance segmentation task. DeepMask [18] and subsequent works first generate candidate segment proposals, and then classify them by Fast R-CNN [12]. As the segmentation stage is time-consuming, these methods are usually slow. Fully Convolutional Instance Segmentation (FCIS) [6] embeds the segment proposal stage into an object detection framework. However, FCIS produces systematic errors for overlapping instances and creates spurious edges, demonstrating that it is vulnerable to the inherent difficulties in segmenting instances. Mask R-CNN extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI) [5]. All the above instance segmentation approaches are two-stage methods that need to generate region (or segment) proposals first, which limits their running speed. One-stage object detectors work well because they only need to predict rectangular boxes for describing objects. However, generating promising instance-level segmentation masks is beyond the capability of existing one-stage detectors.
Detection-free methods are more straightforward. Since common semantic segmentation can be solved well, and is relatively easier than classic proposal-based modeling, these methods extend common semantic segmentation models such as [19], [20] with hand-designed clustering algorithms. Once all pixels in an image are classified, clustering is used to group them into different object instances.
Multi-Scale Feature Learning: In both detection and segmentation, detecting small objects remains an open problem. On the bottom layers of a deep network, objects have rich spatial information but lack clear semantics, while on the top layers the situation is reversed. To tackle this problem, the Feature Pyramid Network (FPN) was proposed to aggregate multi-level features [21] in a pyramidal hierarchy. Beyond the traditional bottom-up pathway in deep networks, FPN additionally establishes a top-down path to deliver the rich semantic information of the top layers to the bottom layers [10]. In that implementation, each higher-level feature map is 2× up-sampled and then merged with the lower-level feature map using element-wise summation. By doing so, the semantic information of the top layers can be passed to the bottom layers successfully. However, their inaccurate spatial information is passed to the bottom layers at the same time, which counteracts the effectiveness to some extent.
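The top-down merge described above can be sketched in a few lines. This is only an illustration under stated assumptions: feature maps are plain 2-D lists with a single channel, up-sampling is nearest-neighbour, and the lateral convolutions of a real FPN are omitted. The function names `upsample2x_nearest` and `fpn_merge` are hypothetical.

```python
def upsample2x_nearest(fmap):
    """Nearest-neighbour 2x up-sampling of a 2-D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def fpn_merge(lower, higher):
    """Plain FPN-style fusion: up-sample the higher level, then add element-wise."""
    up = upsample2x_nearest(higher)
    assert len(up) == len(lower) and len(up[0]) == len(lower[0])
    return [[l + u for l, u in zip(lrow, urow)] for lrow, urow in zip(lower, up)]


higher = [[1.0, 2.0]]                  # coarse 1x2 map (assumed toy values)
lower = [[0.1, 0.2, 0.3, 0.4],
         [0.5, 0.6, 0.7, 0.8]]         # fine 2x4 map (assumed toy values)
merged = fpn_merge(lower, higher)
```

Every value of the coarse map is spread over a 2×2 block of the fine map, which is one way to see why coarse localisation errors can leak into the lower level.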
SPRNet shares similarities with detection-based instance segmentation frameworks. However, unlike Mask R-CNN or any other existing two-stage method, SPRNet is free of RoI Pooling, RoI Align or any similar mechanism. The whole instance segmentation process in SPRNet is seamlessly integrated, so SPRNet can be viewed as a detection-based one-stage framework for instance segmentation.
To achieve one-stage instance segmentation, two core problems must be resolved: 1) how to encode sufficient spatial and semantic information into each single pixel of the feature maps; 2) how to decode the instance mask with respect to a single pixel from the feature maps. To address these two problems, we first introduce the network architecture of SPRNet, which is inspired by RetinaNet [11]. Motivated by the deconvolution used in semantic segmentation models [19], [20], we then propose a mask branch that consists of a cascade of deconvolution layers to reconstruct the complete mask from a single pixel. The detailed architecture of SPRNet is shown in Figure 2, where the left panel corresponds to problem 1) and the right panel corresponds to problem 2).
3.1 Network Architecture
SPRNet uses ResNet, or any other deep feature extractor, as the backbone. We tackle the first problem, i.e., spatial and semantic information encoding, through multi-scale feature learning, in which feature maps at multiple scales, e.g., those produced by the third, fourth and fifth convolutional layers (C3, C4 and C5), are used for multi-scale detection. The network architecture of SPRNet is similar to that of RetinaNet [11], a one-stage object detector with ResNet and FPN as its backbone. However, SPRNet is modified to overcome two primary problems of RetinaNet, so that it works better in detection and is capable of instance segmentation.
Fig. 2: SPRNet. A denotes the number of anchors and k denotes the number of classes. (3.1) is the detailed architecture. The backbone network is ResNet-50, where the 4×, 8×, 16×, 32× and 64× downsampled feature maps are used to construct the FPN. The 32× downsampled feature map is the pooling of the 16× downsampled one, the 64× downsampled feature map is the result of a stride-2 convolution on the 32× downsampled feature map, and G denotes the gate mechanism. The gated FPN is followed by three parallel branches for predicting class, box and mask respectively. (3.2) expands the mask branch, which includes multi-scale fusion, channel-wise concatenation, positive pixel sampling and consecutive deconvolutions for instance mask generation. Note that each deconvolution results in a 2× wider feature map, so that 5 deconvolution layers decode a 1×1 pixel into a 32×32 mask.
First, higher-level feature maps are up-sampled and directly added to the lower-level ones to form the feature maps of FPN. The basic idea is to take advantage of both the spatial information contained in the lower-level features and the semantic information contained in the higher-level features to achieve multi-scale object detection. Typically, a lower-level feature map has twice the length and width of the higher-level feature map. However, in ResNet [22], [23], there can be dozens of layers between such a pair of higher-level and lower-level feature maps, which means that the structural information contained in the higher-level feature maps is highly compact. Therefore, up-sampling the higher-level feature maps leads to ambiguity in object locations. We name this phenomenon spatial shift; it has an unexpected influence when the up-sampled feature maps are added to the lower-level ones.
Second, the addition operation may cause unexpected gradient propagation. The derivative of ‘a+b’ with respect to ‘a’ is the constant 1, and the same holds with respect to ‘b’. If we perform ‘+’ between two levels of features in the backbone network, then during gradient back-propagation [24] the loss from the lower-level summed features is passed directly to the higher-level features (please refer to formulas (4) through (10)). We nevertheless want each level of the feature pyramid to concentrate on detecting objects of the corresponding scale, so we do not want the loss of the lower-level summed features to interfere with the higher-level feature maps, which introduce spatial shift into the lower level. A typical situation is shown in Figure 3, where the gradients propagated from P3 (the level-3 feature map of FPN) to C4 (the level-4 feature map of the backbone network) completely overwhelm those propagated to C3, such that C4 plays the dominant role in the learning of P3, which is unwanted.
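As a worked illustration of the summation argument, the chain rule can be written out for one pyramid level. The symbols below are assumptions introduced here for illustration and are not the paper's own notation from formulas (4) through (10): $\mathcal{L}$ is the training loss, $U(\cdot)$ is 2× up-sampling, and any lateral convolution is omitted.

```latex
P_3 = C_3 + U(C_4)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial C_3} = \frac{\partial \mathcal{L}}{\partial P_3},
\qquad
\frac{\partial \mathcal{L}}{\partial U(C_4)} = \frac{\partial \mathcal{L}}{\partial P_3}.
```

Because both partial derivatives of the sum equal 1, the gradient arriving at $P_3$ is copied unchanged to both inputs, so the level-3 loss reaches $C_4$ with no attenuation.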
To this end, we propose a flexible mechanism that enables the network to adaptively determine which backbone feature map should be the primary contributor in each level’s feature learning, so as to avoid spatial shift and unexpected gradient propagation. Specifically, both backbone feature maps pass through a shared separable convolution layer, and each output is fed to a sigmoid activation to form a score map. Each score map corresponds element-wise to its feature map, with values between zero and one. A score map can therefore be regarded as a gate controlling the information flow between features, and we name the proposed mechanism Gate-FPN (GFPN). The gating mechanism is somewhat similar to the weighting method [25], [26]. We then multiply each score map element-wise with its corresponding backbone feature map and add the two products together to form a feature map in FPN. The calculation is given by formulas (1) through (3), which are expressed in terms of the two backbone feature maps and the shared weights and biases of the convolution. Since the score maps are adaptively learned, we believe they play an important role in preventing spatial shift and unexpected gradient propagation. The GFPN mechanism is denoted as module ‘G’ in Figure 2 and detailed in Figure 4. Note that we share the same convolution in score map learning because we want to score each feature map under the same metric, and we prefer separable convolution to normal convolution because it reduces computation and performs better. We use sigmoid activation because it works better than other commonly used activation functions. The AP obtained with different activation functions is reported in Table 1.
Fig. 3: Unconstrained gradients result in an unexpected and unstable training process of FPN. As shown in the FPN panel, the gradients propagated from P3 (the level-3 feature map of FPN) to C4 (the level-4 feature map of the backbone network) are much larger than those propagated to C3 (the level-3 feature map of the backbone network), such that C3 has only a subtle influence on P3 compared with C4 during optimization. This incurs a serious gradient bias in that C3 is not well exploited during optimization. In the Gate-FPN panel, by contrast, the gradients passing to both sides are tuned first such that C3 is the dominant contributor. Note that C3 is supposed to be the primary contributor because we do not encourage gradient propagation from P3 to C4, in order to avoid the spatial shift introduced by C4 and to force P3 to focus on objects of a specific size.
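A minimal sketch of the gating idea follows, under simplifying assumptions that are not taken from the paper: feature maps are single-channel 2-D lists, the higher-level map is assumed to be already up-sampled to the lower-level size, and a single shared scalar affine map (`w`, `b`) stands in for the shared separable convolution. The name `gate_fuse` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_fuse(c_low, c_high_up, w, b):
    """Gated fusion of two same-sized feature maps (lists of rows).

    The shared affine map (w, b) is a stand-in for the shared separable
    convolution; it is a simplification, not the layer used in the paper.
    """
    fused = []
    for low_row, high_row in zip(c_low, c_high_up):
        row = []
        for x_low, x_high in zip(low_row, high_row):
            g_low = sigmoid(w * x_low + b)     # score (gate) for the lower level
            g_high = sigmoid(w * x_high + b)   # score (gate) for the higher level
            row.append(g_low * x_low + g_high * x_high)
        fused.append(row)
    return fused


c3 = [[2.0, -1.0]]      # assumed toy values
c4_up = [[0.0, 3.0]]    # assumed toy values, already up-sampled
p3 = gate_fuse(c3, c4_up, w=1.0, b=0.0)
```

Since every gate lies strictly between zero and one, each input contributes at most its own magnitude to the fused value, and the gradient flowing back to each input is scaled by its gate rather than copied unchanged.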
Each level of the GFPN feature maps is sent to three branches: classification, regression and segmentation, represented by ‘class branch’, ‘box branch’ and ‘mask branch’ in Figure 2 (3.1), respectively. Since the regression and classification branches are the same as those of RetinaNet, we omit their details for simplicity and discuss the mask branch in detail in the next subsection. The three branches are processed in parallel. Specifically, we perform the same operations with shared parameters at each GFPN level, and the output feature maps are used together to predict instances. The overall losses consist of three parts: classification, regression and mask, which are computed through the focal loss [11], the box regression loss of [4] and the cross-entropy loss [5], respectively.
Fig. 4: Gate mechanism in GFPN. Note that the separable convolution is shared by the two backbone feature maps.
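To make the loss composition concrete, here is a small per-sample sketch. It assumes the standard focal loss form with the commonly used defaults alpha = 0.25 and gamma = 2, and it assumes the three terms are summed without weights; the text does not state how the terms are weighted, so `total_loss` is hypothetical.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one prediction p in (0, 1) with binary label y."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def binary_cross_entropy(p, y):
    """Binary cross-entropy for one mask pixel, p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def total_loss(cls_loss, box_loss, mask_loss):
    """Assumed unweighted sum of the three branch losses."""
    return cls_loss + box_loss + mask_loss


easy = focal_loss(0.9, 1)   # well-classified positive: strongly down-weighted
hard = focal_loss(0.1, 1)   # misclassified positive: keeps most of its loss
```

With gamma set to 0 and alpha set to 0.5, the focal loss reduces to half the binary cross-entropy, which is a quick sanity check on the implementation.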
TABLE 1: Comparison of different activation functions on Retina-FPN-50-400px
3.2 Mask Branch
To address the two problems raised at the beginning of Section 3, we propose an encoder-decoder-like structure. The encoder part mainly refers to GFPN learning; in addition, we propose a multi-scale fusion scheme to enhance the encoder so that details of an object’s morphological information are embedded into each pixel. After that, we sample the pixels that are most likely to be located at the centers of instances and perform a decoder-like [27] reconstruction process. We call these pixels positive pixels; each of them is gradually reconstructed into a 32×32 score map that denotes the mask of its corresponding instance. See Figure 2 (3.2) for more details.
Multi-scale Fusion: To endow a single pixel with enough information, we apply four additional convolution operations to the feature maps before mask estimation: a 3×3 convolution with 256 channels and three 3×3 convolutions with dilation rates of [2, 4, 6], each with 128 channels (see Figure 5 for more details). After channel-wise concatenation of these feature maps, each single pixel carries morphological information across various scales and has a large receptive field. The dilated convolutions are necessary because the sampled pixel may not be located at the very center of an instance, so it has to see ‘wider’ to capture information about the entire instance.
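The effect of the dilation rates can be checked with a little arithmetic. The sketch below assumes all four branches use 3×3 kernels and that the 256-channel branch has dilation 1, as suggested by the C33-1,2,4,6 setting in the ablation; the helper name `effective_kernel` is hypothetical.

```python
def effective_kernel(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


# (channels, dilation) for each parallel branch; kernel size 3 is assumed
branches = [(256, 1), (128, 2), (128, 4), (128, 6)]

total_channels = sum(c for c, _ in branches)              # after concatenation
extents = [effective_kernel(3, d) for _, d in branches]   # per-branch extent
```

Under these assumptions the concatenated feature has 640 channels, and the branches cover 3, 5, 9 and 13 pixels per side, so a pixel that is off-centre still sees a wide neighbourhood.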
In order to retrieve the positive pixels, we first generate 9 anchor boxes at each pixel. If any one of the 9 boxes has an overlap larger than 0.7 with any instance (higher than 0.5, the positive threshold for box training), we set this pixel to be a positive sample for training, and its label is used to obtain the instance mask. Hence, the pixels near the same instance have exactly the same training targets. More details are given in Figure 6. Clearly, the higher the overlap threshold, the more likely the sampled pixel is at the center of the instance, and vice versa, so the threshold must be chosen with care. A low threshold usually results in more sampled pixels lying outside the instances, which makes it very hard to generate an accurate mask and makes training difficult. In contrast, a high threshold makes training relatively easy but makes the network unstable at inference, because we then sample the pixels with the highest 100 classification scores and cannot guarantee that these pixels are at the very center of the instances. Therefore, setting the overlap threshold appropriately is vital for training the mask branch. In our ablation studies, we compare how different thresholds affect the final mAP. According to our results, a threshold of 0.7 achieves robustness while keeping training easy.
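The positive-pixel rule can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples and overlap is assumed to mean intersection over union; the helper names `iou` and `is_positive_pixel` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive_pixel(anchors, instances, threshold=0.7):
    """A pixel is positive if any of its anchors overlaps any instance enough."""
    return any(iou(a, g) > threshold for a in anchors for g in instances)


anchors_at_pixel = [(0, 0, 10, 10)]   # one anchor instead of nine, for brevity
near = [(0, 0, 10, 9)]                # IoU 0.9 with the anchor
far = [(5, 5, 15, 15)]                # IoU 1/7 with the anchor
```

Raising `threshold` shrinks the set of positive pixels toward instance centres, which is the trade-off discussed above.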
Single Pixel Reconstruction: A shared decoder [27] is used to compute accurate masks; it applies a series of operations to gradually reconstruct the segmentation masks from individual positive pixels. Specifically, we first use three consecutive deconvolution layers with no activation to gradually generate the instance mask from a positive pixel, and then apply deconvolution with ReLU activation twice to construct the final mask. We also add a nearest-neighbor interpolation shortcut from the intermediate 8×8 mask to the final classification layer. Note that we do not use any activation such as ReLU after the first three deconvolutions because each feature map is quite small, and ReLU may set many neurons to zero, destroying important information contained in the feature maps. After recovering the 32×32-pixel instance mask, we convolve the mask and sample the highest channel value at each pixel location from the resulting mask map according to the classification branch score to obtain the instance mask. In the loss, we use the binary cross-entropy function and only calculate the on-class mask loss among all K classes.
Fig. 5: Multi-scale convolutions can properly capture local and global information of instances. Kernels are designed to have the same size as any possible valid anchor box, so that the positive pixel carries sufficient information.
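The growth from a single pixel to a 32×32 mask follows from the usual output-size formula of a transposed convolution. The kernel size, stride and padding below are assumptions chosen so that each layer doubles the spatial size; the text does not state the actual hyperparameters, and `deconv_out` and `mask_sizes` are hypothetical helper names.

```python
def deconv_out(size, kernel=2, stride=2, padding=0):
    """Output size of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel


def mask_sizes(n_layers=5, start=1):
    """Spatial size after each of n_layers doubling deconvolutions."""
    sizes = [start]
    for _ in range(n_layers):
        sizes.append(deconv_out(sizes[-1]))
    return sizes


sizes = mask_sizes()   # sizes from the sampled pixel to the final mask
```

Three layers take the pixel to 8×8 and two more reach 32×32. A kernel of 4 with stride 2 and padding 1 would also double the size at every step.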
3.3 Implementation details
One-stage instance segmentation is a brand new approach, and we illustrate its important details as follows.
Label Preparation: We generate 9 anchor boxes with 3 different sizes and 3 different ratios on every pixel at each level of the feature maps. In box labeling, we set anchors that have overlaps larger than 0.5 with any instance label as the positive anchors, and set those anchors with overlaps less than 0.4 as the negative ones. In mask generation, we set the positive threshold to 0.7. The positive and negative anchors can be found in Figure 6.
When preparing the training labels for mask prediction, we first find the anchors at each pixel that have overlaps larger than 0.7 with any target box, and then select the anchor with the highest overlap at each pixel location for training. We use 300 pixels corresponding to the 300 anchors, and each pixel is responsible for generating its corresponding instance.
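The label preparation steps can be sketched as below. The anchor scales and aspect ratios are assumptions borrowed from common RetinaNet-style settings, since the text only states that 3 sizes and 3 ratios are used, and treating anchors with overlaps between 0.4 and 0.5 as ignored is likewise an assumption. All function names are hypothetical.

```python
import math


def make_anchors(cx, cy, base,
                 scales=(1.0, 2 ** (1 / 3), 2 ** (2 / 3)),
                 ratios=(0.5, 1.0, 2.0)):
    """Nine (x1, y1, x2, y2) anchors centred at (cx, cy); ratio = height / width."""
    anchors = []
    for s in scales:
        area = (base * s) ** 2
        for r in ratios:
            w = math.sqrt(area / r)
            h = w * r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def box_label(best_iou, pos_thr=0.5, neg_thr=0.4):
    """1 = positive, 0 = negative, -1 = ignored (assumed) for box training."""
    if best_iou > pos_thr:
        return 1
    if best_iou < neg_thr:
        return 0
    return -1


def mask_target_anchor(ious, thr=0.7):
    """Index of the highest-overlap anchor at a pixel, or None if below thr."""
    best = max(range(len(ious)), key=lambda i: ious[i])
    return best if ious[best] > thr else None


anchors = make_anchors(0.0, 0.0, 32.0)
```

For example, `mask_target_anchor([0.2, 0.8, 0.75])` selects index 1, the anchor with the highest overlap above the 0.7 mask threshold.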
Fig. 6: The green box is the target. The red box has an overlap greater than 0.7 with it, so the pixel at the red dot is labeled as positive. The blue box barely overlaps the target, so the blue dot is labeled as negative.
Training: SPRNet is easy to train. We train for a total of 25 epochs on the MS-COCO 2017 train dataset using the Adam optimizer with a preset initial learning rate and gradient clipping. We use a ResNet backbone pretrained on ImageNet, and train the entire network end to end.
Inference: During inference, the only difference from training is that we no longer use anchor overlap as the criterion for sampling pixels, since there are no target labels. Instead, we use the pixels with the highest 100 scores output by the classification branch. After each sampled pixel is reconstructed into an instance mask, we use bilinear interpolation to resize the mask to the actual box size output by the regression branch.
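The two inference steps, score-based sampling and mask resizing, can be sketched in pure Python. The coordinate convention of the bilinear resize (pixel centres, with clamping at the borders) is an assumption, as the text does not specify one, and both function names are hypothetical.

```python
def top_k_pixels(scores, k=100):
    """Indices of the k highest classification scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def bilinear_resize(mask, out_h, out_w):
    """Resize a 2-D mask (list of rows) with bilinear interpolation."""
    in_h, in_w = len(mask), len(mask[0])
    out = []
    for i in range(out_h):
        # map the output pixel centre into input coordinates, then clamp
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = mask[y0][x0] * (1 - wx) + mask[y0][x1] * wx
            bottom = mask[y1][x0] * (1 - wx) + mask[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


picked = top_k_pixels([0.1, 0.9, 0.5], k=2)
stretched = bilinear_resize([[0.0, 1.0]], 1, 4)
```

In the full pipeline the 32×32 score map of each picked pixel would be resized to the height and width of its regressed box in this way.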
In this section, we provide comparative experiments with the current state-of-the-art methods, along with comprehensive ablation experiments. All the experiments are conducted on the MS-COCO dataset [30]. We report the standard MS-COCO metrics including AP, AP50, AP75, and APS, APM and APL (AP at different scales). We report the results in terms of box AP and mask AP jointly. Following the previous work in [5], we train using 115k training images (train), and report ablations on the 5k validation images (val).
4.1 Main Results
We compare SPRNet with the state-of-the-art instance segmentation methods in Table 2. All instantiations of our model have performance comparable to previous state-of-the-art models, including MNC [28], FCIS [6], Mask R-CNN [5], and the winners of the MS-COCO 2015, 2016 and 2017 segmentation challenges. Without embellishments or any data augmentation, the AP of SPRNet with the ResNet-50-GFPN backbone is 1.6 points lower than that of Mask R-CNN with ResNet-50-FPN, which also includes horizontal flip training and online hard example mining (OHEM) [29]. However, SPRNet with the ResNet-50-GFPN backbone has higher computational efficiency owing to the one-stage framework. In addition, we obtain higher detection recall than R-CNN based frameworks due to the natural advantage of one-stage detectors. More importantly, our method can robustly distinguish edges on overlapping objects. The overlapping issue is extremely severe in FCIS and was later solved by Mask R-CNN; our one-stage SPRNet also handles this issue well. We visualize some examples of SPRNet predictions in Figure 7. SPRNet achieves good results even under challenging conditions, including crowded objects, objects of extreme sizes and objects with uncommon morphology. Our method shows promising results in distinguishing borders and maintaining edge consistency.
4.2 Ablation Experiments
To better understand the reasons for SPRNet’s effectiveness, we run a number of ablations. Results are shown in Table 3 and discussed in detail below.
Fusion Paths: In Table 3(a), we show two alternative strategies to obtain one pixel with a large receptive field. C33x4 means four consecutive 3×3 convolutions, which is the same as in the classification or regression branch; C33-1,2,4,6 means four parallel 3×3 convolutions with different dilation rates [1, 2, 4, 6], which has a computation cost similar to that of C33x4 but is able to capture more morphological variation of an instance. Results show an improvement of 1.2 points in AP. Moreover, the performance gap widens as object size increases.
Fig. 7: Instance segmentation: SPRNet generates boxes and masks with high accuracy and recall. Each instance is drawn in a distinct color.
TABLE 2: APs on the MS-COCO 2017 val dataset, indicating that our model is not as powerful as Mask R-CNN but runs at a higher speed.
FPN vs. GFPN: Table 3(b) ablates the performance gap between FPN and the proposed GFPN. We provide comparative results for both mask mAP and box mAP. When evaluating box predictions, we remove the mask branch from SPRNet, which reduces it to a RetinaNet variant with GFPN. Results show that GFPN consistently outperforms FPN under all criteria.
Mask Generation with Shortcuts: Table 3(c) compares the methods with and without the shortcut connection. For the Deconv+shortcut model, we add an additional shortcut from the feature map (with 4× upsampling) to the final classification layer, which improves the AP score by 0.4 points.
Mask IoU Threshold: Table 3(d) compares models trained with different mask IoU thresholds when preparing training samples. A small threshold (e.g., 0.5) results in an inferior AP score. This is to be expected, because a small threshold makes it hard for training to converge.
4.3 Analysis
We compare SPRNet and Mask R-CNN in Figure 8, with each instance drawn in a distinct color. Both methods use the same ResNet-50 backbone. The samples include objects at various scales. Our method shows performance comparable to Mask R-CNN in most cases. Specifically, Mask R-CNN has higher accuracy (i.e., more accurate mask boundaries for the detected objects) while SPRNet has higher recall (i.e., more objects are detected). Mask R-CNN beats SPRNet in extreme situations with more detailed and better aligned predictions. We consider this performance difference predictable because Mask R-CNN uses its final box predictions to accurately sample RoIs, which turns mask prediction into a very easy binary semantic segmentation task. In contrast, one-stage frameworks must find a completely different yet plausible approach to generate accurate instance masks while improving computational speed. Despite using atrous convolution to compress spatial information into a single pixel, it is still challenging to recover a very detailed and accurate mask from pixels. Also, due to the one-stage framework, the regression values strongly influence the final accuracy of both box and mask prediction. For regression boxes that are seriously shifted from the ground truth boxes, the mask can easily be misaligned. However, SPRNet delivers a higher speed. Under the same experimental settings (i.e., GTX 1080Ti), Mask R-CNN using ResNet-50-FPN as the backbone runs at 7 fps, while our SPRNet runs at 9 fps, which is about 30% faster.
Fig. 8: Mask R-CNN vs. SPRNet. With the natural advantage of one-stage frameworks, SPRNet has a higher recall than Mask R-CNN (the first column), while Mask R-CNN has better performance in instance details.
TABLE 3: Ablation experiments on the three main parts of SPRNet and on the threshold used for mask prediction. (a) Fusion Paths: obtaining one pixel with a large receptive field (two strategies). (b) FPN vs. GFPN: structural feature learning (gated and non-gated). (c) Mask Generation with Shortcuts: the final mask generation with and without shortcuts. (d) Mask IoU Threshold: a proper threshold maximizes the performance.
Besides the mask AP evaluation metric, we provide two more sets of experiments, concerning the AP of bounding box (BBox) object detection and the Average Recall (AR) of both object detection and instance segmentation. In the first set of experiments, we verify the effect of GFPN on box detection and find that SPRNet can outperform RetinaNet, especially in object detection with more detailed categories. It is noteworthy that as the input size increases, RetinaNet shows a downward trend in large object detection, while SPRNet effectively alleviates this problem. This is because GFPN prevents the gradients from passing from lower levels to higher ones, making the FPN-like structure perform similarly to ordinary C4-based or C5-based frameworks in large object detection. In the second set of experiments, we compare the AR results of both box detection and mask prediction using RetinaNet, Mask R-CNN and SPRNet respectively, and find that SPRNet largely increases the overall number of detected objects of small, medium and large sizes.
TABLE 4: Object Detection: box AP on MS-COCO. The comparison between well-known one-stage and two-stage frameworks shows that, using the same backbone, SPRNet with GFPN surpasses every existing one-stage detector, with a remarkable improvement in small object detection accuracy.
TABLE 5: Object Detection: box AP of RetinaNet with FPN or GFPN. We use ResNet-50 as the common backbone, and we compare detailed AP under different input sizes. We report box AP without training or using the mask branch, which means we directly compare FPN and GFPN in terms of detection.
5.1 BBox Object Detection
We compare SPRNet with state-of-the-art bounding-box object detectors on MS-COCO in Table 4. With an input size of 500 pixels, we achieve an improvement of 1.1 percentage points in box AP over the best one-stage detector, RetinaNet.
To analyze the effect of the gating mechanism for FPN, we conduct experiments comparing RetinaNet with GFPN against RetinaNet with the traditional FPN; the results are reported in Table 5. By introducing the gate mechanism, RetinaNet with GFPN outperforms its FPN counterpart by 1.2, 1.6 and 3.1 percentage points in AP for small, medium and large object detection, respectively. The effectiveness of gradient blocking is also shown in Table 5, with an improvement of up to 4.3 percentage points in AP for large object detection. Similarly, the effectiveness of the gating mechanism is shown by the AP improvement of up to 1.8 percentage points for small objects.
Using ResNet-50 as the backbone, SPRNet outperforms all existing one-stage methods. Our gating mechanism has shown promising effects in solving the previously mentioned problems of the common ‘bottom up’ structure. By automatically discarding inferior information between features, we find a promising approach to improving the detection of small objects. By blocking gradient flows, the detection of larger objects becomes isolated from that of smaller ones, which yields an improvement of more than 2 percentage points in final AP. In particular, it is the first time that a one-stage method using ResNet-50 beats the best two-stage methods using ResNet-101 and RoIAlign, at 20.2 percent in .
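To illustrate what blocking gradient flows means in a top-down pyramid, the following toy sketch uses scalar "features" and hand-derived gradients. It is not the authors' GFPN implementation: the two-level model, the squared-error losses and all names (`grad_w_high`, `w_h`, `w_l`, `t_h`, `t_l`) are assumptions made only for illustration. In an automatic-differentiation framework, the same effect is typically obtained with a stop-gradient (detach) operation on the higher-level feature before it is merged into the lower level.

```python
def grad_w_high(x: float, w_h: float, w_l: float,
                t_h: float, t_l: float, block_gradient: bool) -> float:
    """Gradient of the total loss w.r.t. the high-level weight w_h.

    Toy model (all assumed):
      h = w_h * x              high-level feature (large objects)
      p = w_l * x + h          low-level feature after top-down merge
      loss_high = 0.5 * (h - t_h) ** 2
      loss_low  = 0.5 * (p - t_l) ** 2
    """
    h = w_h * x
    p = w_l * x + h
    # The high level always receives the gradient of its own loss.
    grad_from_high_loss = (h - t_h) * x
    # Without blocking, the low-level loss also reaches w_h through the merge.
    grad_from_low_loss = 0.0 if block_gradient else (p - t_l) * x
    return grad_from_high_loss + grad_from_low_loss


# The high level already fits its own target (h == t_h), so any non-zero
# gradient on w_h comes purely from the low-level loss.
args = dict(x=1.0, w_h=2.0, w_l=1.0, t_h=2.0, t_l=0.0)
print("unblocked:", grad_w_high(**args, block_gradient=False))
print("blocked:  ", grad_w_high(**args, block_gradient=True))
```

In this toy setting, the unblocked gradient pulls the high-level weight away from a value that already fits its own target, whereas the blocked variant leaves it untouched; this is the sense in which the higher level is isolated from the lower one.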
In Table 5, we isolate the effect of GFPN by removing the mask branch in both training and inference. It is noteworthy that RetinaNet’s detection accuracy on large objects is far lower than that of our method. As mentioned before, GFPN provides a better way to fuse features, which explains the improvement in small object detection. Meanwhile, GFPN restricts gradient propagation from lower levels into higher levels, which explains the improvement in large object detection. These improvements hold at every input scale. More importantly, on the metric of , GFPN improves AP by up to 1.1 percentage points, which means more objects are likely to be well detected. On the metric of , GFPN outperforms FPN by up to 1.1 percentage points in AP, which is very important for practical applications because our method delivers more accurate detection on objects that are most likely to be detected. We do not try input scales smaller than 400 or larger than 800 pixels, because the existing ResNet backbone may fail to extract spatial information when objects get too small, and convolution kernels may be limited by their receptive fields when objects get too large.
TABLE 6: (test on val) Box ARs of RetinaNet and Mask R-CNN with or without GFPN. Significant improvement on ARs can be observed.
TABLE 7: (test on val) Segmentation ARs of SPRNet and Mask R-CNN with or without GFPN. Significant improvement on ARs can be observed.
5.2 Recall on Objects
One of the most important metrics for evaluating detection performance is the Average Recall (AR). For a given image, the more ground-truth instances are detected, the higher the AR.
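As a concrete illustration of this metric, the sketch below computes recall for boxes at a single IoU threshold and then averages it over the thresholds 0.50, 0.55, ..., 0.95, which follows the common COCO-style definition of AR. It is a simplified single-image, single-class version and not the official evaluation code; the function names and the toy boxes are assumptions for illustration only.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at(gt: Sequence[Box], dets: Sequence[Box], thr: float) -> float:
    """Fraction of ground-truth boxes matched at IoU >= thr.

    Detections are assumed to be sorted by descending confidence; each one
    greedily claims the best still-unmatched ground-truth box.
    """
    if not gt:
        return 0.0
    matched = set()
    for d in dets:
        best_j, best_iou = -1, thr
        for j, g in enumerate(gt):
            if j in matched:
                continue
            v = iou(d, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
    return len(matched) / len(gt)


def average_recall(gt: Sequence[Box], dets: Sequence[Box],
                   thresholds: Optional[List[float]] = None) -> float:
    """Recall averaged over IoU thresholds (default 0.50:0.05:0.95)."""
    if thresholds is None:
        thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(recall_at(gt, dets, t) for t in thresholds) / len(thresholds)


# Toy example: one perfect detection and one shifted by a single pixel.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
det_boxes = [(0, 0, 10, 10), (21, 21, 31, 31)]
print(f"AR with both detections: {average_recall(gt_boxes, det_boxes):.2f}")
print(f"AR with one detection:   {average_recall(gt_boxes, det_boxes[:1]):.2f}")
```

The shifted box has an IoU of about 0.68, so it counts as a match only at the four lowest thresholds; missing a detection entirely lowers the recall at every threshold.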
For two-stage detectors, recall can be considered in two parts. The first is the RPN’s recall, where anchors are classified as positive or negative and only the positive boxes are sent to the following subnetworks, from which the final recall is derived. In one-stage frameworks, which have no proposal network, there is only a single recall.
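The relation between the two recalls of a two-stage detector can be sketched with sets of ground-truth indices. The function below is a hypothetical illustration (its name, arguments and the toy numbers are assumptions, not results from the experiments): since the second stage only sees regions that the proposal stage forwarded, the final recall can never exceed the proposal recall.

```python
from typing import Iterable, Tuple


def stage_recalls(gt_ids: Iterable[int],
                  proposal_hits: Iterable[int],
                  final_hits: Iterable[int]) -> Tuple[float, float]:
    """Return (proposal_recall, final_recall) for a two-stage detector.

    gt_ids:        ids of all ground-truth objects
    proposal_hits: ids covered by at least one positive proposal
    final_hits:    ids recovered by the final detections
    """
    gt = set(gt_ids)
    if not gt:
        return 0.0, 0.0
    proposed = set(proposal_hits) & gt
    # An object missed by the proposal stage cannot be recovered later.
    final = set(final_hits) & proposed
    return len(proposed) / len(gt), len(final) / len(gt)


# Toy numbers: 10 objects, 9 proposed, 7 of those kept by the second stage.
prop_recall, final_recall = stage_recalls(
    gt_ids=range(10),
    proposal_hits=range(9),
    final_hits=[0, 1, 2, 3, 4, 5, 6, 9],  # id 9 was never proposed
)
print(f"proposal recall: {prop_recall:.1f}, final recall: {final_recall:.1f}")
```

A one-stage detector has no intermediate filtering step, so only the second quantity exists.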
In Table 6 and Table 7, we compare Mask R-CNN, RetinaNet and SPRNet in both object detection and instance segmentation tasks.
When integrated into Mask R-CNN and RetinaNet, GFPN largely increases the AR under all metrics. Simply by incorporating GFPN, SPRNet not only surpasses Mask R-CNN in overall AR, but also achieves prominent improvements in small object detection. The results indicate that GFPN is a solid and flexible module for general use in different detection frameworks.
We present SPRNet as a one-stage approach to image instance segmentation that does not rely on region proposals. SPRNet achieves performance comparable to that of state-of-the-art two-stage models while running at a faster speed. By introducing GFPN, we bring one-stage detectors to a higher level, enabling them to deliver better detection than prevalent one- and two-stage detectors. This work represents a feasible solution for delivering accurate and fast instance-level recognition.
[1] Y. Zhou, S. Huo, W. Xiang, C. Hou, and S. Kung, “Semisupervised salient object detection using a linear feedback control system model,” IEEE Transactions on Cybernetics, vol. 49, no. 4, pp. 1173–1185, 2019.
[2] B. Wang, X. Yuan, X. Gao, X. Li, and D. Tao, “A hybrid level set with semantic shape constraint for object segmentation,” IEEE Transactions on Cybernetics, vol. 49, no. 5, pp. 1558–1569, 2019.
[3] J. Shen, Z. Liang, J. Liu, H. Sun, and D. Tao, “Multiobject tracking by submodular optimization,” IEEE Transactions on Cybernetics, vol. PP, no. 99, pp. 1–12, 2018.
[4] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in IEEE International Conference on Computer Vision. IEEE, 2017, pp. 2980–2988.
[6] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, “Fully convolutional instance-aware semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[7] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[9] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[10] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, “Feature pyramid networks for object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 4.
[11] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[12] R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[14] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” CoRR, vol. abs/1605.06409, 2016. [Online]. Available: http://arxiv.org/abs/1605.06409
[15] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” arXiv:1312.6229, 2013.
[16] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “DSSD: Deconvolutional single shot detector,” arXiv:1701.06659, 2017.
[17] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Simultaneous detection and segmentation,” in European Conference on Computer Vision. Springer, 2014, pp. 297–312.
[18] P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object candidates,” in Advances in Neural Information Processing Systems, 2015, pp. 1990–1998.
[19] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[20] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[21] X. Wang, Z. Hou, W. Yu, Z. Jin, Y. Zha, and X. Qin, “Online scale adaptive visual tracking based on multilayer convolutional features,” IEEE Transactions on Cybernetics, vol. 49, no. 1, pp. 146–158, 2019.
[22] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017, pp. 5987–5995.
[23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[24] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in Advances in Neural Information Processing Systems, 1990, pp. 396–404.
[25] Y. Li, C. Jia, X. Kong, L. Yang, and J. Yu, “Locally weighted fusion of structural and attribute information in graph clustering,” IEEE Transactions on Cybernetics, vol. 49, no. 1, pp. 247–260, 2019.
[26] Z. Chen, J. Lin, T. Zhou, and F. Wu, “Sequential gating ensemble network for noise robust multiscale face restoration,” IEEE Transactions on Cybernetics, 2019.
[27] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” CoRR, vol. abs/1802.02611, 2018. [Online]. Available: http://arxiv.org/abs/1802.02611
[28] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3150–3158.
[29] A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based object detectors with online hard example mining,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769.
[30] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[31] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama et al., “Speed/accuracy trade-offs for modern convolutional object detectors,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 4, 2017.
[32] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in AAAI, vol. 4, 2017, p. 12.
[33] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, “Beyond skip connections: Top-down modulation for object detection,” arXiv:1612.06851, 2016.
Jun Yu (M’13) received the B.Eng. and Ph.D. degrees from Zhejiang University, Zhejiang, China. He is currently a Professor with the School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China. He was an Associate Professor with the School of Information Science and Technology, Xiamen University, Xiamen, China. From 2009 to 2011, he was with Nanyang Technological University, Singapore. From 2012 to 2013, he was a Visiting Researcher at Microsoft Research Asia (MSRA). Over the past years, his research interests have included multimedia analysis, machine learning, and image processing. He has authored or coauthored more than 80 scientific articles. Prof. Yu serves as an associate editor for the IEEE Transactions on Circuits and Systems for Video Technology, Journal of Pattern Recognition, Journal of Information Sciences. He served as a program committee member or reviewer of top conferences and prestigious journals. He is a Professional Member of the Association for Computing Machinery (ACM) and the China Computer Federation (CCF).
Jinghan Yao is an undergraduate at Hangzhou Dianzi University, Zhejiang, China. He is currently a senior student in the School of Honor, majoring in Computer Science and Technology. Since 2015, he has been doing research on machine learning, image processing, computer vision and high performance computing.
Jian Zhang received the Ph.D. degree from Zhejiang University, Zhejiang, China. He is currently an Associate Professor with the School of Science and Technology, Zhejiang International Studies University, Hangzhou, China. From 2009 to 2011, he was with the Department of Mathematics of Zhejiang University as a Post-doctoral Research Fellow. In 2016, he conducted research on machine learning at Simon Fraser University (SFU) as a Visiting Scholar. His research interests include, but are not limited to, machine learning, computer animation and image processing. He was awarded certificates of outstanding contribution in reviewing by the Journal of Pattern Recognition and Neurocomputing.
Zhou Yu received the B.Eng. and Ph.D. degrees from Zhejiang University, Zhejiang, China, in 2010 and 2015, respectively. He is currently a Lecturer with the School of Computer Science and Technology, Hangzhou Dianzi University. His research interests include multimodal data analysis, computer vision, machine learning and deep learning.
Dacheng Tao (F’15) is Professor of Computer Science and ARC Laureate Fellow in the School of Computer Science and the Faculty of Engineering and Information Technologies, and the Inaugural Director of the UBTECH Sydney Artificial Intelligence Centre, at the University of Sydney. He mainly applies statistics and mathematics to Artificial Intelligence and Data Science. His research results have been expounded in one monograph and 200+ publications in prestigious journals and at prominent conferences, such as IEEE T-PAMI, T-IP, T-NNLS, T-CYB, IJCV, JMLR, NIPS, ICML, CVPR, ICCV, ECCV, ICDM, and ACM SIGKDD, with several best paper awards, such as the best theory/algorithm paper runner-up award at IEEE ICDM’07, the best student paper award at IEEE ICDM’13, the 2014 ICDM 10-year highest-impact paper award, the 2017 IEEE Signal Processing Society Best Paper Award, and the distinguished paper award at the 2018 IJCAI. He received the 2015 Australian Scopus-Eureka Prize and the 2018 IEEE ICDM Research Contributions Award. He is a Fellow of the Australian Academy of Science, AAAS, IEEE, IAPR, OSA and SPIE.