A New Local Transformation Module for Few-shot Segmentation

2019·arXiv

Abstract

1 Introduction

Image segmentation is a fundamental computer vision task [1]. In recent years, with the rapid development of deep learning, several convolutional neural network based methods, such as FCN [1] and DeepLab v3 [2], have greatly improved segmentation performance. However, these methods rely heavily on large amounts of annotated data. To overcome this shortcoming, the few-shot segmentation task [3] was proposed: a new class is segmented given only a few manual annotations, such as one annotation (1-shot segmentation) or five annotations (5-shot segmentation). Few-shot segmentation is challenging because of the asymmetry between the training data and the testing data.

In the few-shot segmentation task, the annotated and unlabeled images are called support images and query images respectively [3]. Existing models [3–7] usually consist of three components: 1) the support branch, which extracts features from the support images; 2) the query branch, which extracts features from the query images; and 3) the transformation module, which transfers features between the support branch and the query branch to facilitate the segmentation in the query branch. The essential component is the transformation module, and the main obstacle is designing one that is class-agnostic, so that it generalizes efficiently to new classes. Existing methods [5] [7] use global cues of the support image to model the transformation process, which ignores the geometric relationships of the local features. This paper demonstrates that these geometric relationships are very useful to the transformation module.

This paper proposes a new transformation module based on local cues, where the relationship between local features is used to accomplish the transformation. Our idea is to apply a linear transformation to the relationship matrix in a high-dimensional metric embedding space. To this end, we first map the local features into an embedding space, where cosine distance is used to obtain the relationship matrix of local features. Then, the relationship matrix is transformed linearly by the generalized inverse of the annotation matrix of the support image. The result of the linear transformation is regarded as an attention map containing high-level semantic information, on which we build a new attention transformation module. We verify the effectiveness of our transformation module on the Pascal VOC 2012 dataset [9]. The mIoU reaches 57.0% in 1-shot and 60.6% in 5-shot, which outperforms the state-of-the-art method by 1.6% and 3.5% respectively.

2 Proposed Method

2.1 Problem Definition

Few-shot segmentation uses a few annotations to segment unseen images of new classes. Let $S = \{(I_s^k, Y_s^k)\}_{k=1}^{K}$ be a set of support images and the corresponding manual annotations. Let $Q = \{I_q\}$ be the set of query images that need to be segmented. The images in $S$ and $Q$ belong to the same new class $c_{new}$. Let $D$ be the training dataset of known classes $C_{known}$ that already exist, and $c_{new} \notin C_{known}$. The goal of few-shot segmentation is to build a model $f(\cdot)$ from $D$ that outputs a binary mask for a query image based on $S$.

2.2 Overview

Similar to existing few-shot segmentation networks, the proposed framework includes a support branch, a query branch and a transformation module, as shown in Fig. 1. To make the network generalize better to unseen classes, our feature extraction backbone uses relatively shallow layers, namely the first three layers of ResNet50 [18]. In addition, the support branch and the query branch share the feature extraction backbone.

Fig. 1. The pipeline of the proposed method. The Feature Extractor extracts the features $F_s$ and $F_q$. The feature $F_s$ is weighted spatially by the groundtruth mask $Y_s$ to obtain the feature $F'_s$. The deep features $F'_s$ and $F_q$ are mapped into an embedding space by convolution layers to give $E_s$ and $E_q$ respectively. Simultaneously, the feature $F_q$ is further processed by convolution into $\hat{F}_q$. The proposed Transformer outputs the attention map A, which weights $\hat{F}_q$ spatially. Finally, the segmentation result is obtained through the Upsam module.

After obtaining the deep features $F_s$ and $F_q$ from the support image and the query image, $F_s$ is weighted spatially by the annotated mask $Y_s$ to get the features $F'_s$, which guarantees that the features only contain the corresponding foreground regions. This process can be represented as

$$F'_s(i, j) = F_s(i, j) \cdot Y_s(i, j), \tag{1}$$

where $(i, j)$ is a spatial location of the feature map $F_s$ or the annotated mask $Y_s$.
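The spatial weighting of Eq. (1) can be sketched with NumPy broadcasting. This is only an illustration under stated assumptions: the function name `mask_features`, the channel-first `(C, h, w)` layout, and a mask that has already been resized to the feature resolution are all hypothetical choices, not details given in the paper.

```python
import numpy as np

def mask_features(features, mask):
    """Keep foreground positions of a (C, h, w) feature map using an (h, w) binary mask."""
    features = np.asarray(features, dtype=float)
    mask = np.asarray(mask, dtype=float)
    if features.shape[1:] != mask.shape:
        raise ValueError("mask must match the spatial size of the features")
    # Broadcasting applies the same mask value to every channel at location (i, j).
    return features * mask[None, :, :]

# Toy example (assumed sizes): C = 2 channels on a 2 x 2 grid.
support_features = np.arange(8, dtype=float).reshape(2, 2, 2)
support_mask = np.array([[1, 0], [0, 1]])
masked = mask_features(support_features, support_mask)
```

The same element-wise product, with an attention map in place of the binary mask, is what the later filtering of the query features by A amounts to.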

Then, the features $F'_s$ and $F_q$ are mapped into a high-dimensional embedding space by further convolution operations to get the corresponding embedding features $E_s$ and $E_q$ respectively, so that cosine distance can be used to calculate the relationship between feature pixels in this space. Simultaneously, the feature $F_q$ is processed by a convolution operation to get $\hat{F}_q$.

Based on the embeddings $E_s$ and $E_q$ and the groundtruth mask $Y_s$ of the support images, the proposed Transformer applies a linear transformation to the relationship matrix to obtain the attention map A with high-level semantic cues. The detailed description is given in Section 2.3. The attention map A finally filters the deep features $\hat{F}_q$ to $\hat{F}'_q$ by

$$\hat{F}'_q(i, j) = \hat{F}_q(i, j) \cdot A(i, j), \tag{2}$$

where $(i, j)$ is a spatial location of the feature map $\hat{F}_q$ or the attention map A. The attention map A indicates a rough object area to be segmented. This coarse object area provides high-level semantic information for the subsequent Upsam sub-network.

We next use the Upsam sub-network to generate the segmentation mask from $\hat{F}'_q$. The Upsam sub-network is shown in Fig. 2. Specifically, to better handle scale changes of the object, we introduce the multi-scale feature fusion module ASPP [15] and residual connections [18] in Upsam. The output of Upsam is the probability map M with the same size as the query image, and the cross entropy loss in Eq. (3) is used to supervise the training of the model, i.e.,

$$L_{seg} = -\frac{1}{HW}\sum_{i,j}\Big[Y(i,j)\log M(i,j) + \big(1 - Y(i,j)\big)\log\big(1 - M(i,j)\big)\Big], \tag{3}$$

where Y is the groundtruth mask of the query image, M is the predicted probability map of our few-shot segmentation model, $(i, j)$ is a spatial location of Y or M, and $H \times W$ is the size of the query image.
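A minimal sketch of a pixel-wise binary cross entropy of this kind is given below. It assumes that M holds foreground probabilities and that the loss is averaged over pixels; the name `pixel_cross_entropy` and the clipping constant `eps` are illustrative choices rather than details from the paper.

```python
import numpy as np

def pixel_cross_entropy(prob_fg, target, eps=1e-7):
    """Mean binary cross entropy between a foreground probability map and a 0/1 mask."""
    prob_fg = np.clip(np.asarray(prob_fg, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    per_pixel = -(target * np.log(prob_fg) + (1.0 - target) * np.log(1.0 - prob_fg))
    return float(per_pixel.mean())

# An uninformative prediction (0.5 everywhere) costs log(2) per pixel.
uniform_loss = pixel_cross_entropy(np.full((2, 2), 0.5), [[1, 0], [0, 1]])
# A confident, correct prediction costs much less than a confident, wrong one.
good_loss = pixel_cross_entropy([[0.9, 0.1]], [[1, 0]])
bad_loss = pixel_cross_entropy([[0.1, 0.9]], [[1, 0]])
```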

Fig. 2. The structure of Upsam module. The residual connection [18] and multi-scale strategy ASPP [15] are implemented to handle the scale variations of object.

2.3 Transformation module

Existing methods often establish the transformation between the support image and the query image from the global features of the support image, and thus lose local geometric information that is also important to the transformation. The proposed transformation module is instead designed around the relationship between local pixels (represented by the relationship matrix of the Non-local model [8]), and more accurate segmentation can be realized by propagating local relationships.

The detailed steps are illustrated in Fig. 3. The transformation is achieved by a linear transformation of the relationship matrix. Specifically, the relationship matrix between $E_s$ (for the support image) and $E_q$ (for the query image) is linearly transformed by the generalized inverse matrix of $Y_s$ (the groundtruth mask matrix of the support image), which overcomes the difficulty of transforming a local relationship matrix into high-level semantic information. The result of the linear transformation is the attention map A of the query image.

Relationship Matrix. For the few-shot segmentation task, it is very important to model the relationship between each pair of deep local features of the support image and the query image. Because convolution is a local operation, relationships between distant pixels cannot be established directly. The Non-local [8] structure was proposed to address this: the feature tensor is reshaped into a matrix, and a relationship matrix is built by a matrix product. This relationship matrix contains the relationships between each pair of local deep features. We follow the relationship matrix of Non-local [8] to establish the relationship in few-shot segmentation.

In Non-local [8], the goal is only to establish long-distance constraints, and the matrix product is simply used to describe the relationship between two local features. For the few-shot segmentation task, this is a rough description of feature similarity. Therefore, in our proposed transformation module, the features are first mapped into an embedding space, in which the cosine distance can be used to calculate the relationship between local features. This process is represented as

$$R(i, j) = \frac{\langle E_q^i, E_s^j \rangle}{\lVert E_q^i \rVert \, \lVert E_s^j \rVert}, \tag{4}$$

where $E_q^i$ represents the ith local feature in the embedding $E_q$, and $E_s^j$ represents the jth local feature in the embedding $E_s$. This establishes a relationship matrix R between the query image and the support image.
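The cosine-similarity relationship matrix can be sketched as follows. The sketch assumes channel-first embeddings of shape `(C, h, w)`, rows indexed by query locations and columns by support locations; `relationship_matrix` and `eps` are hypothetical names, and the convolutional embedding itself is not modelled.

```python
import numpy as np

def relationship_matrix(query_embedding, support_embedding, eps=1e-8):
    """Cosine similarity between every query location and every support location."""
    q = np.asarray(query_embedding, dtype=float)
    s = np.asarray(support_embedding, dtype=float)
    # Reshape (C, h, w) -> (h*w, C): one row per local feature.
    q = q.reshape(q.shape[0], -1).T
    s = s.reshape(s.shape[0], -1).T
    # Normalising each local feature turns the matrix product into cosine similarity.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=1, keepdims=True) + eps)
    return q @ s.T  # shape (h*w, h*w)

rng = np.random.default_rng(0)
embedding = rng.normal(size=(4, 3, 3))  # toy embedding: C = 4, h = w = 3
R_self = relationship_matrix(embedding, embedding)
```

Unlike the plain matrix product, the result does not change when either embedding is rescaled, which is the property the embedding space is meant to exploit.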

Linear transformation based on generalized inverse matrix. With the above relationship matrix R, the next key point is how to convert R into high-level semantic information about the query image. Let $Y_s$ and $Y_q$ be the binary groundtruth masks of the support image and the query image respectively. Based on Eq. (4), the true relationship matrix between the query image and the support image can be simplified to the matrix product between $Y_q$ and $Y_s$, i.e.,

$$R^{*} = Y_q \otimes Y_s, \tag{5}$$

where $\otimes$ denotes the matrix product. The original size of $Y_q$ and $Y_s$ is $h \times w$. The reshaped sizes of $Y_q$ and $Y_s$ are $hw \times 1$ and $1 \times hw$ respectively. The size of $R^{*}$ is $hw \times hw$, and it contains the relationship information of each pair of local feature pixels of $Y_q$ and $Y_s$. Our target is to obtain $Y_q$ based on Eq. (5).
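As a small illustration of this product, the sketch below builds the mask-based relationship matrix from two toy masks. It assumes the query mask is flattened to a column and the support mask to a row, as the sizes in the text suggest; the function name `true_relationship` is hypothetical.

```python
import numpy as np

def true_relationship(query_mask, support_mask):
    """Outer product of the flattened query mask (column) and support mask (row)."""
    y_q = np.asarray(query_mask, dtype=float).reshape(-1, 1)    # (h*w, 1)
    y_s = np.asarray(support_mask, dtype=float).reshape(1, -1)  # (1, h*w)
    return y_q @ y_s                                            # (h*w, h*w)

query_mask = np.array([[1, 0], [0, 1]])
support_mask = np.array([[0, 1], [1, 1]])
R_true = true_relationship(query_mask, support_mask)
```

Entry (i, j) is 1 exactly when query location i and support location j are both foreground, so rows belonging to background query pixels are all zero.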

We suppose the matrix R is approximately equal to the true relationship matrix $R^{*}$ between the query image and the support image. Furthermore, we relax the binary groundtruth mask $Y_q$ to the soft attention map A, which provides high-level semantic information. Since $Y_s$ is known in the few-shot segmentation task, the problem becomes obtaining the attention map A from

$$R = A \otimes Y_s. \tag{6}$$

Fig. 3. The detailed information of the proposed transformation module. The embeddings $E_s$ and $E_q$ are reshaped to matrices. Then, the relationship matrix R is obtained from $E_s$ and $E_q$ by cosine similarity. Finally, the relationship matrix R is transformed linearly by the generalized inverse matrix of $Y_s$. After the reshape operator, the attention map A is obtained.

Moreover, since $Y_s$ is not a square matrix, its inverse does not exist. However, it can be regarded as a matrix with full row rank. According to generalized inverse matrix theory [19], the transformation problem can be represented by

$$A = R \otimes Y_s^{T}\left(Y_s Y_s^{T}\right)^{-1}, \tag{7}$$

where $\otimes$ denotes the matrix product and $Y_s^{T}\left(Y_s Y_s^{T}\right)^{-1}$ is the right inverse matrix (one type of generalized inverse matrix) of $Y_s$. This simply applies the generalized inverse matrix of $Y_s$ to linearly transform the relationship matrix R. The attention map can therefore be obtained directly from Eq. (7).
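The right-inverse step can be checked numerically. The sketch below assumes the support mask is flattened to a single row of length hw that contains at least one foreground pixel (so that it has full row rank); `right_inverse` and `attention_from_relationship` are illustrative names, not the authors' code.

```python
import numpy as np

def right_inverse(support_mask):
    """Right inverse Y^T (Y Y^T)^{-1} of the flattened support mask Y (shape 1 x hw)."""
    y_s = np.asarray(support_mask, dtype=float).reshape(1, -1)
    gram = y_s @ y_s.T  # 1 x 1; for a binary mask this is the foreground pixel count
    if gram[0, 0] == 0:
        raise ValueError("support mask has no foreground, so no right inverse exists")
    return y_s.T @ np.linalg.inv(gram)  # shape (hw, 1)

def attention_from_relationship(relationship, support_mask):
    """Linearly transform the relationship matrix into a flattened attention map."""
    return np.asarray(relationship, dtype=float) @ right_inverse(support_mask)

query_mask = np.array([[1, 0], [0, 1]], dtype=float)
support_mask = np.array([[0, 1], [1, 1]], dtype=float)
# With the ideal mask-based relationship matrix, the query mask is recovered exactly.
R_ideal = query_mask.reshape(-1, 1) @ support_mask.reshape(1, -1)
attention = attention_from_relationship(R_ideal, support_mask).reshape(2, 2)
```

Because the flattened mask is a single row, the transformation reduces to averaging each row of R over the foreground locations of the support image, which is why it needs no learned parameters.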

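To ensure that the learned relationship matrix R is consistent with the true relationship matrix $R^{*}$ during training, the mean square error loss is used to supervise it, i.e.,

$$L_{mse} = \frac{1}{(hw)^2}\sum_{i,j}\big(R(i,j) - R^{*}(i,j)\big)^2. \tag{8}$$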

Attention Map. By linearly transforming the relationship matrix R with Eq. (7), the attention map A of the query image is obtained and reshaped to size $h \times w$. Moreover, we normalize it to [0, 1] by

$$\hat{A} = \frac{A - \min(A)}{\max(A) - \min(A)}, \tag{9}$$

where $\hat{A}$ is the normalized counterpart of A. For convenience of expression, we do not distinguish between them.
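One common way to bring a map into [0, 1] is min-max scaling, sketched below. Whether the authors use exactly this form is an assumption; the function name `normalize_attention` and the small constant `eps`, which guards against a constant map, are illustrative.

```python
import numpy as np

def normalize_attention(attention, eps=1e-8):
    """Min-max scale an attention map so that its values lie in [0, 1]."""
    a = np.asarray(attention, dtype=float)
    return (a - a.min()) / (a.max() - a.min() + eps)

raw_attention = np.array([[-1.0, 0.0], [1.0, 3.0]])
scaled = normalize_attention(raw_attention)
flat = normalize_attention(np.full((2, 2), 0.7))  # constant map: no division by zero
```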

The deep feature $\hat{F}_q$ of the query image is filtered by the normalized attention map $\hat{A}$ to get $\hat{F}'_q$ by Eq. (2). Then $\hat{F}'_q$ is processed by Upsam (as shown in Fig. 2) to obtain the segmentation result.

To ensure the accuracy of the attention map A, we regard it as the foreground probability map of the segmentation result, and $1 - A$ as the background probability map. The two maps are concatenated and resized to the size of the original query image (by bilinear interpolation). The combined map can be regarded as a segmentation result and is supervised by the cross entropy loss

$$L_{att} = -\frac{1}{HW}\sum_{i,j}\Big[Y(i,j)\log M_A(i,j) + \big(1 - Y(i,j)\big)\log\big(1 - M_A(i,j)\big)\Big], \tag{10}$$

where Y is the groundtruth mask of the query image, $M_A$ is the probability map derived from the attention map A, $(i, j)$ is a spatial location of Y and $M_A$, and $H \times W$ is the size of the query image.

For 5-shot, there are five support images. To combine the attention maps $A_1, \dots, A_5$ provided by the five different support images, we simply average them:

$$A = \frac{1}{5}\sum_{k=1}^{5} A_k. \tag{11}$$
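The averaging step is simple enough to state in a few lines. The sketch assumes that every support image yields an attention map of the same spatial size for the same query; `fuse_attention_maps` is a hypothetical helper name.

```python
import numpy as np

def fuse_attention_maps(attention_maps):
    """Average the attention maps produced from several support images."""
    stacked = np.stack([np.asarray(a, dtype=float) for a in attention_maps], axis=0)
    return stacked.mean(axis=0)

# Five toy 2 x 2 maps with constant values 0, 1, 2, 3, 4.
five_maps = [np.full((2, 2), float(k)) for k in range(5)]
fused = fuse_attention_maps(five_maps)
```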

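In the training stage, we combine the three losses in Eq. (3), (10) and (8) to supervise the learning of our model, i.e.,

$$L = L_{seg} + \lambda_1 L_{att} + \lambda_2 L_{mse}, \tag{12}$$

where $L_{seg}$, $L_{att}$ and $L_{mse}$ are the losses in Eq. (3), (10) and (8), and $\lambda_1$ and $\lambda_2$ are the weights of the corresponding loss functions.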
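The combination of the three losses can be sketched as a weighted sum. The assignment of one weight to the attention loss and one to the relationship (mean square error) loss is an assumption, as are all names below; the default weights of 1 follow the setting reported in the ablation section.

```python
import numpy as np

def relationship_mse(relationship, true_relationship):
    """Mean square error between the learned and the mask-based relationship matrices."""
    diff = np.asarray(relationship, dtype=float) - np.asarray(true_relationship, dtype=float)
    return float(np.mean(diff ** 2))

def total_loss(seg_loss, att_loss, rel_loss, lambda_att=1.0, lambda_rel=1.0):
    """Segmentation loss plus weighted attention and relationship losses."""
    return seg_loss + lambda_att * att_loss + lambda_rel * rel_loss

rel = relationship_mse([[1.0, 0.0], [0.5, 0.5]], [[1.0, 0.0], [0.0, 1.0]])
combined = total_loss(seg_loss=0.5, att_loss=0.2, rel_loss=rel)
```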

3 Experiment

Our method is implemented with the PyTorch library. The Adam [16] optimizer is used to update the parameters, and all experiments are run on a machine equipped with a Titan XP GPU. We set the initial learning rate to 1e-4. We use the first three layers of ResNet50 [18] as our backbone; it is pretrained on the ImageNet [17] dataset, and the parameters of its earlier layers are frozen.

3.1 Detail of Implementation

We validate the proposed method on the Pascal VOC 2012 [9] dataset and its enhanced dataset SDS [12]. Similar to existing methods [3–7], we split the images of the 20 classes into four subsets, each of which contains the images of five classes; the detailed description can be found in Table 1. Of these four subsets, three are selected as the training set, and the remaining one is used as the test set to validate the effectiveness of the proposed method. In the training stage, we randomly select two images for each class, one as a support image and the other as a query image, until all images of the training classes have been selected. In the testing stage, to make a fair comparison with existing methods, we use the same random seed as the existing methods to sample the same 1000 pairs of images as the test data for each evaluation sub-dataset.

Table 1. The detailed setting for splitting the sub-dataset to evaluate the few-shot segmentation. There are 4 sub-datasets, and $\text{PASCAL-}5^i$ represents the ith subset, where i = {0, 1, 2, 3}. When the i-th sub-dataset is selected for evaluation, the remaining three sub-datasets are used for training.
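The split-and-sample protocol can be sketched as below. Two things are assumptions and not taken from Table 1: that fold i holds the consecutive class indices 5i+1 to 5i+5, and the data layout `images_by_class` (a mapping from class index to a list of image identifiers). The function names are hypothetical, and the sketch draws a single support/query pair instead of iterating until all training images are used.

```python
import random

def split_classes(fold, num_classes=20, num_folds=4):
    """Return (train_classes, test_classes) when `fold` is held out for evaluation."""
    if not 0 <= fold < num_folds:
        raise ValueError("fold must be in [0, num_folds)")
    per_fold = num_classes // num_folds
    # Assumed layout: fold i contains class indices 5i+1 ... 5i+5.
    test_classes = list(range(fold * per_fold + 1, (fold + 1) * per_fold + 1))
    train_classes = [c for c in range(1, num_classes + 1) if c not in test_classes]
    return train_classes, test_classes

def sample_episode(images_by_class, class_id, rng):
    """Draw two distinct images of one class: (support, query)."""
    support, query = rng.sample(images_by_class[class_id], 2)
    return support, query

train_classes, test_classes = split_classes(fold=0)
rng = random.Random(0)  # a fixed seed makes the sampled pairs reproducible
toy_images = {6: ["img_a", "img_b", "img_c"]}  # hypothetical identifiers
support_image, query_image = sample_episode(toy_images, 6, rng)
```

Fixing the seed of the sampler is what allows different methods to be evaluated on the same test pairs.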

Table 2. The comparison results (mIoU value) on four evaluation sub-datasets in 1-shot. The best results are in bold.

We use the mean intersection over union of foreground (mIoU) to measure the performance of our proposed method, which is widely used in few-shot segmentation. In addition, the FB-IoU proposed in co-FCN [4] is also considered, which includes mean intersection over union of foreground and background.
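Both metrics can be sketched from a single binary IoU helper. The exact accumulation protocol (per image or per class) is not spelled out here, so the sketch makes an assumption: foreground IoU values are averaged over classes for mIoU, and FB-IoU is the mean of the foreground and background IoU of one prediction. All function names are illustrative.

```python
import numpy as np

def binary_iou(pred, target):
    """Intersection over union of two binary masks (1.0 if both are empty)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

def fb_iou(pred, target):
    """Mean of the foreground IoU and the background IoU."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    return 0.5 * (binary_iou(pred, target) + binary_iou(~pred, ~target))

def mean_iou(iou_per_class):
    """Average foreground IoU over classes, given a {class: IoU} mapping."""
    return float(np.mean(list(iou_per_class.values())))

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
fg = binary_iou(pred, target)  # 1 shared pixel out of 2 in the union
fb = fb_iou(pred, target)      # background: 2 shared pixels out of 3
```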

3.2 Comparison with Benchmarks

To verify the effectiveness of our method, we compare it with existing methods in the 1-shot and 5-shot settings. We follow CANet [7] in adopting DenseCRF [14] and a multi-scale evaluation strategy to improve performance; both are commonly used in existing semantic segmentation methods [15]. The detailed results can be found in Table 2, Table 3 and Table 4. The mIoU of our method reaches 57.0% in 1-shot and 60.6% in 5-shot, which outperforms the state-of-the-art few-shot segmentation method CANet [7] by 1.6% and 3.5% respectively. The improvement in 5-shot indicates the superiority of our method when more annotations are available. In addition, the FB-IoU values in 1-shot and 5-shot reach 71.8% and 74.6% respectively, which also clearly outperforms the compared methods.

Table 3. The comparison results (mIoU value) on four evaluation sub-datasets in 5-shot. The best results are in bold.

Fig. 4. The subjective results of the proposed method. From left to right: query image, ground-truth mask of the query image, the support image, ground-truth mask of the support image and the segmentation result, respectively.

Table 4. The comparison results (FB-IoU value) on four evaluation sub-datasets in 1-shot and 5-shot. The best results are in bold.

Table 5. The ablation results of three loss functions. A tick indicates that the loss function is used.

Table 6. The effectiveness of Dense-CRF post-processing and Multi-scale strategy are demonstrated. The ticking indicates that the strategy is used.

3.3 Ablation

In the training stage, the loss weights $\lambda_1$ and $\lambda_2$ are set to 1. To validate the effectiveness of our three loss functions, an ablation experiment is conducted. The detailed results can be found in Table 5. The two auxiliary losses improve the performance by 1.1% and 1.0% respectively. In addition, we follow CANet [7] in employing DenseCRF [14] and multi-scale evaluation in our test stage. An ablation of these two strategies is also conducted, and the detailed results can be found in Table 6. DenseCRF [14] and the multi-scale evaluation strategy improve the mIoU value by 0.4% and 0.3% respectively.

3.4 Subjective Result

The subjective results of the proposed method are shown in Fig. 4. The support image, the ground-truth mask of the support image, the query image, the ground-truth mask of the query image and the segmentation result are displayed from left column to right column, respectively. It is seen that the proposed method segments objects from these images successfully.

4 Conclusion

This paper proposes a new transformation module for few-shot segmentation. Rather than focusing on global cues, the module uses the relationships of local features to form the transformation. These relationships are represented by a local feature relationship matrix computed with cosine similarity, and the relationship matrix is linearly transformed by the generalized inverse of the groundtruth mask matrix. We also map the features into a high-dimensional metric embedding space to enhance the generalization of the proposed module. Based on this transformation module, we propose a new few-shot segmentation network that obtains better results in terms of both mIoU and FB-IoU.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant 61871087, Grant 61502084, Grant 61831005, and Grant 61601102, and supported in part by Sichuan Science and Technology Program under Grant 2018JY0141.

References

1. Long, J., Shelhamer, E., Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440).

2. Chen, L. C., Papandreou, G., Schroff, F., Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.

3. Shaban, A., Bansal, S., Liu, Z., Essa, I., Boots, B. (2017). One-shot learning for semantic segmentation. In BMVC.

4. Rakelly, K., Shelhamer, E., Darrell, T., Efros, A. A., Levine, S. (2018). Conditional networks for few-shot semantic segmentation. In ICLR Workshop.

5. Zhang, X., Wei, Y., Yang, Y., Huang, T. (2018). SG-One: Similarity guidance network for one-shot semantic segmentation. arXiv preprint arXiv:1810.09091.

6. Hu, T., Yang, P., Zhang, C., Yu, G., Mu, Y., Snoek, C. G. (2019). Attention-based multi-context guiding for few-shot semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence.

7. Zhang, C., Lin, G., Liu, F., Yao, R., Shen, C. (2019). CANet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5217-5226).

8. Wang, X., Girshick, R., Gupta, A., He, K. (2018). Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7794-7803).

9. Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., Zisserman, A. (2015). The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98-136.

10. Caelles, S., Maninis, K. K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L. (2017). One-shot video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 221-230).

11. Dong, N., Xing, E. (2018). Few-shot semantic segmentation with prototype learning. In BMVC (Vol. 1, p. 6).

12. Hariharan, B., Arbelaez, P., Bourdev, L. D., Maji, S., Malik, J. (2011). Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011. IEEE.

13. Faktor, A., Irani, M. (2013). Co-segmentation by composition. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1297-1304).

14. Krähenbühl, P., Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems (pp. 109-117).

15. Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A. L. (2017). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834-848.

16. Kingma, D. P., Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

17. Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248-255). IEEE.

18. He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).

19. Rao, C. R., Mitra, S. K. (1972). Generalized inverse of a matrix and its applications. Icams Conference (Vol. 1).
