Since the deep learning approach of Krizhevsky et al. [9] was introduced in ILSVRC-2012 [19], deep learning-based image processing has made tremendous progress. Supervised learning in particular has received increasing attention, since network performance is strongly affected by both the quality and the quantity of the labeled data. However, supervised learning requires manually labeled data, which is time-consuming and expensive to produce. Semantic segmentation requires pixel-by-pixel annotation, which makes preparing the labeled data especially costly. Recently, various techniques such as active learning [15], interactive segmentation [16], and weakly-supervised learning [11] have been developed to reduce the labeling cost in semantic segmentation.
Semi-supervised learning has been introduced to address the ever-increasing size of modern datasets combined with the difficulty of obtaining labels. In particular, semi-supervised learning improves network performance using a relatively small amount of labeled data when a large amount of unlabeled data is available. Unlabeled data can be collected easily from various sources, for example via web crawling or vehicle logging. However, the manual labeling needed to make the collected data usable for training is particularly costly for semantic segmentation compared to other tasks. Semi-supervised learning is therefore well suited to semantic segmentation, where labeled data is expensive.
Prior studies [22, 24] employ semi-supervised learning techniques for image classification in various ways and show significant improvements in accuracy. However, semantic segmentation differs from classification: whereas classification predicts a single class vector, semantic segmentation performs structured prediction at every spatial location and predicts the structural characteristics of regions. Prior studies [13, 23] investigate inter-pixel relationships and verify that the structural relationship between pixels is important in semantic segmentation.
In this paper, we propose a semi-supervised learning method with a structured consistency loss for semantic segmentation. The proposed loss forces the predictions of the segmentation network to be consistent not only pixel-wise but also in their inter-pixel relationships, which helps the network generalize better by predicting in harmony with neighboring pixels. Furthermore, when cutmix [26] is incorporated, the network achieves better generalization with lower GPU memory utilization, because the structured consistency loss is calculated only within the cutmix boxes. Numerical results show that the proposed method ranks first with an mIoU of 83.84 in the Cityscapes benchmark pixel-level semantic labeling task [6]. We note that this is the first study to verify that semi-supervised learning can achieve state-of-the-art performance in semantic segmentation on the Cityscapes benchmark. In addition, the proposed technique can be combined with other research for further improvement, since our contribution lies not in the network architecture but in the learning technique.
Semantic Segmentation. Early semantic segmentation approaches are mostly based on the Fully Convolutional Network (FCN) [14, 20]. To improve on these earlier models, the loss of spatial information is mitigated by an encoder-decoder architecture [1, 17, 18] or by dilated convolution (a.k.a. atrous convolution), which expands the receptive field [3, 25]. To further improve localization, [4] applies Atrous Spatial Pyramid Pooling (ASPP) to semantic segmentation, and PSPNet [28] proposes a feature pyramid pooling module to gather global contextual information around objects and stuff. More recently, Chen et al. [5] propose a well-organized architecture that combines the encoder-decoder architecture with dilated convolution, which has been followed by many subsequent methods that achieve state-of-the-art performance [12, 21, 30]. Zhu et al. [29] present notable improvements through the labeling of video frames using temporal information, a boundary relaxation loss to address the boundary issue, and class uniform sampling for the class imbalance problem.
Semi-Supervised Learning. In recent years, semi-supervised learning has become one of the most prominent research themes, although earlier work applied it mainly to classification. A loss function computed on unlabeled data encourages the model to generalize better to unseen data in the same domain. Grandvalet and Bengio [8] propose the entropy minimization loss, which encourages the decision boundary to lie in a low-density region of the class distribution. Laine and Aila [10] suggest the consistency loss, which encourages the model to produce the same output distribution when its inputs are perturbed. The consistency regularization loss has played a key role in subsequent semi-supervised research. To employ the consistency loss efficiently, Tarvainen and Valpola [22] introduce the exponential-moving-average (EMA) technique, which builds the teacher network by accumulating the weights of the student network in order to generate a more accurate guessed label. MixMatch [2] combines entropy minimization, consistency loss, and MixUp regularization [27] to improve the generalization performance of the network. Additionally, Xie et al. [24] nearly match the performance of models trained on the full CIFAR-10 dataset while using only 10% of this dataset, thanks to a sophisticated augmentation method with realistic noise.
Structured Prediction. The prediction of a semantic segmentation network has the same spatial shape as the input image, and the prediction vectors of different pixels are strongly correlated, especially for nearby pixels. Xie et al. [23] use local pair-wise (8-neighbor) distillation for efficient feature distillation. Liu et al. [13] present impressive progress in feature distillation with pair-wise distillation, in which the similarities between all pixel pairs in a given feature map are calculated and the features of the teacher are distilled to the student.
Knowledge distillation, which is similar to the consistency loss in both purpose and implementation, trains the student network by forcing it to resemble the teacher network. Accordingly, both knowledge distillation and the consistency loss usually use a distance function as the loss. French et al. [7] introduce the consistency loss with cutmix [26] for semantic segmentation. The authors of [7] investigate the difference between classification and semantic segmentation in terms of the low-density region: it lies at the boundary of the class distribution in classification, whereas it lies at the regional boundaries of objects in semantic segmentation. They select cutmix instead of mixup to preserve the local boundaries of the image, but still use a conventional consistency loss.
Since semantic segmentation can be regarded as pixel-by-pixel classification, it is reasonable to apply the conventional consistency loss, which is useful for classification, directly to semantic segmentation. The recent study [7] has obtained improvements in this way. However, unlike classification, semantic segmentation is a structured prediction task in which the predictions of different pixels are correlated. Therefore, if the existing consistency loss is used as is, without considering this characteristic, it is difficult to obtain a large performance improvement. To boost the performance of semi-supervised semantic segmentation, in this section we introduce the structured consistency loss, which encourages the network to predict properly by focusing on the inter-pixel relationships within each cutmix box.
3.1 Overall Training
In our overall training, each batch contains labeled and unlabeled images simultaneously. The total loss is therefore a semi-supervised loss given as

$$\mathcal{L}_{total} = \mathcal{L}_{sup}(x) + \mathcal{L}_{unsup}(u), \tag{1}$$

where $\mathcal{L}_{unsup}$ is a loss on the unlabeled image $u$ and $\mathcal{L}_{sup}$ is a supervised cross entropy loss with boundary label relaxation [29] for semantic segmentation on the labeled image $x$, written as

$$\mathcal{L}_{sup} = -\log \sum_{C \in \mathcal{N}} P(C), \tag{2}$$

where $P$ and $\mathcal{N}$ are the softmax probability for each class and the set of classes within a $w \times w$ window in the boundary region, respectively. Note that $\mathcal{L}_{sup}$ reduces to the standard cross entropy loss with a one-hot label when $w = 1$.

Figure 1: Architecture of the unsupervised training. Two predictions, $p^s(u_m)$ from the student network and $p^t(\acute{u}_a, \acute{u}_b)$ from the teacher network, are used to calculate the consistency loss and the structured consistency loss.

Figure 2: Calculation of the consistency loss. The loss is calculated for each pair of corresponding pixels and then averaged. For simplicity, the image size is reduced to 10 x 10 and the image has three cutmix boxes, whose boundaries are drawn as red lines.
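The boundary-relaxed supervised loss of Section 3.1 can be illustrated for a single pixel. The following is a minimal sketch, assuming the loss is the negative log of the summed softmax probabilities over the classes found in the window, as in the label relaxation of [29]; the function name `relaxed_cross_entropy` and the toy probabilities are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def relaxed_cross_entropy(probs, class_set):
    """Loss for one pixel: -log of the summed probability of every class
    in `class_set` (the classes found in the w x w window)."""
    p_union = sum(probs[c] for c in class_set)
    return -np.log(p_union)

probs = np.array([0.6, 0.3, 0.1])                # softmax output for one pixel
one_hot = relaxed_cross_entropy(probs, {0})      # w = 1: only the labeled class
relaxed = relaxed_cross_entropy(probs, {0, 1})   # boundary pixel between classes 0 and 1
```

With a single class in the set the value equals the standard cross entropy, and adding a neighboring class can only lower the loss, which is the intended relaxation at object boundaries.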
3.2 Architecture for unsupervised training
The cutmix augmentation was originally designed for classification, where it mixes samples by cutting a box from one sample and pasting it into another. Following French et al. [7], we use cutmix for semantic segmentation by generating N boxes with random sizes and positions. The total area of the boxes is approximately half of the image area, to balance the contributions of the two images $u_a$ and $u_b$.
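The box generation described above can be sketched as follows. This is only an illustration under stated assumptions: each box is given an equal share of the half-image area, the aspect ratio is drawn from an arbitrary range, and later boxes overwrite earlier ones. The helper names `random_box_map` and `cutmix` are hypothetical.

```python
import numpy as np

def random_box_map(height, width, n_boxes, rng):
    """Return an int map: -1 where image a is kept, n where box n
    (taken from image b) is visible. Later boxes overwrite earlier ones."""
    box_map = np.full((height, width), -1, dtype=np.int64)
    target_area = height * width / (2 * n_boxes)    # assumed: equal share per box
    for n in range(n_boxes):
        aspect = np.exp(rng.uniform(-0.5, 0.5))      # assumed aspect-ratio range
        h = int(np.clip(round(np.sqrt(target_area * aspect)), 1, height))
        w = int(np.clip(round(np.sqrt(target_area / aspect)), 1, width))
        top = rng.integers(0, height - h + 1)
        left = rng.integers(0, width - w + 1)
        box_map[top:top + h, left:left + w] = n
    return box_map

def cutmix(a, b, box_map):
    """Paste the boxed regions of b onto a (arrays shaped H x W x C)."""
    return np.where(box_map[..., None] >= 0, b, a)

rng = np.random.default_rng(0)
box_map = random_box_map(64, 128, n_boxes=4, rng=rng)
u_a = np.zeros((64, 128, 3))
u_b = np.ones((64, 128, 3))
u_m = cutmix(u_a, u_b, box_map)
```

Keeping an integer box index per pixel, rather than a binary mask, makes it easy to later recover which part of each box is still visible after overlaps.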
Our main architecture for the unsupervised approach is shown in Fig.1. First, we generate randomly augmented input images $(\acute{u}_a, \acute{u}_b)$ from the original images $(u_a, u_b)$ using the random augmentation of Zhu et al. [29], excluding cutmix, and create a cutmix image $u_m$ from the augmented images. The teacher network produces the predictions $(p^t(\acute{u}_a), p^t(\acute{u}_b))$ from the input images $(\acute{u}_a, \acute{u}_b)$, and these predictions are combined by cutmix into $p^t(\acute{u}_a, \acute{u}_b)$. The student network uses the cutmix result $p^t(\acute{u}_a, \acute{u}_b)$ as a guessed label, as introduced by Berthelot et al. [2]. Taking the cutmix image $u_m$ as input, the student network produces the prediction $p^s(u_m)$. Finally, we calculate two consistency losses from the predictions of the student and teacher networks. Note that the teacher and student networks share the same architecture, but the weights of the teacher network are not updated by gradient descent; they are obtained from the student network by EMA. The guessed label can improve the generalization of the student network, because the label is built from the information of the two original images while the input to the student network is a composite image made of several pasted patches. The unlabeled loss consists of two parts, the consistency loss and the structured consistency loss:

$$\mathcal{L}_{unsup} = \lambda_c \mathcal{L}_c + \lambda_{sc} \mathcal{L}_{sc}, \tag{3}$$

$$\mathcal{L}_c = \frac{1}{|R|} \sum_{i \in R} \left\| p^s_i(u_m) - p^t_i(\acute{u}_a, \acute{u}_b) \right\|_2^2, \tag{4}$$

where $\lambda_c$ and $\lambda_{sc}$ are hyperparameters weighting the loss terms $\mathcal{L}_c$ and $\mathcal{L}_{sc}$, respectively, and $\mathcal{L}_{sc}$ is the structured consistency loss, which is described in more detail in the following subsection. $\mathcal{L}_c$ is the conventional consistency loss, the average of the pixel-wise squared L2 loss illustrated in Fig.2 and given in Eq.4. $p^s_i(u_m)$ represents the prediction of the student network for the ith pixel of the cutmix unlabeled image $u_m$, while $p^t_i(\acute{u}_a, \acute{u}_b)$ represents the cutmix of the teacher network predictions for the ith pixel of the unlabeled images $\acute{u}_a$ and $\acute{u}_b$. We denote the set of all pixel indices in the image as $R = \{1, 2, \ldots, H \times W\}$.
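As a small worked example of the conventional consistency loss, the sketch below mixes two teacher predictions with a box mask and compares the result with a student prediction. It assumes predictions are stored as arrays of shape (H, W, num_classes); the function name and the toy values are illustrative only.

```python
import numpy as np

def consistency_loss(p_student, p_teacher_mixed):
    """Mean over pixels of the squared L2 distance between the
    class-probability vectors; inputs are shaped (H, W, num_classes)."""
    sq_dist = np.sum((p_student - p_teacher_mixed) ** 2, axis=-1)
    return sq_dist.mean()

# Teacher predictions for the two augmented images (2 x 2 pixels, 2 classes).
p_t_a = np.tile(np.array([1.0, 0.0]), (2, 2, 1))
p_t_b = np.tile(np.array([0.0, 1.0]), (2, 2, 1))

# One pixel is taken from image b, the rest from image a.
mask = np.array([[True, False], [False, False]])
p_t_mixed = np.where(mask[..., None], p_t_b, p_t_a)

# A student that predicts class 0 everywhere disagrees on exactly one pixel.
p_s = np.tile(np.array([1.0, 0.0]), (2, 2, 1))
loss = consistency_loss(p_s, p_t_mixed)
```

Here one of four pixels has squared distance 2 and the others 0, so the averaged loss is 0.5.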
3.3 Structured consistency loss
As explained in Section 2, we adopt the concept of the pair-wise knowledge distillation technique of Liu et al. [13] for the structured consistency loss. Although knowledge distillation and the consistency loss differ in their underlying philosophy, the way pair-wise knowledge distillation accounts for inter-pixel relationships is of great help in extending the conventional consistency loss into a structured one. The proposed structured consistency loss can be written as

$$\mathcal{L}_{sc} = \frac{1}{|R|^2} \sum_{i \in R} \sum_{j \in R} \left( a^s_{ij} - a^t_{ij} \right)^2, \tag{5}$$

$$a_{ij} = \frac{p_i^\top p_j}{\|p_i\|_2 \, \|p_j\|_2}, \tag{6}$$

where $a^t_{ij}$ and $a^s_{ij}$ denote the similarity between the ith pixel and the jth pixel produced from the teacher network and the student network, respectively, and $p_i$ represents the prediction vector of the ith pixel. However, the structured consistency loss derived from Eq.6 is not efficient, because most pixel pairs are far from each other, have very low correlation, and therefore have little effect on performance. Moreover, Eq.6 requires an extremely high computation cost to evaluate all pixel pairs in the prediction map.
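The all-pairs form of the loss can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity between prediction vectors and a mean over all ordered pixel pairs, in the spirit of the pair-wise distillation of [13]; the function names are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(p, eps=1e-12):
    """p: (num_pixels, num_classes). Returns the (num_pixels, num_pixels)
    matrix of cosine similarities between prediction vectors."""
    unit = p / np.maximum(np.linalg.norm(p, axis=1, keepdims=True), eps)
    return unit @ unit.T

def structured_consistency_loss(p_student, p_teacher):
    """Mean squared difference between student and teacher similarities."""
    a_s = cosine_similarity_matrix(p_student)
    a_t = cosine_similarity_matrix(p_teacher)
    return np.mean((a_s - a_t) ** 2)

p_t = np.array([[1.0, 0.0], [0.0, 1.0]])   # teacher: two different classes
p_s = np.array([[1.0, 0.0], [1.0, 0.0]])   # student: same class for both pixels
loss = structured_consistency_loss(p_s, p_t)
```

The similarity matrix grows quadratically with the number of pixels: a full-resolution 1024 x 2048 Cityscapes image has about 2.1 million pixels, so the matrix would hold roughly 4.4e12 entries, which illustrates the cost problem noted above.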
To handle these problems, we employ the cutmix augmentation, which limits the pixel pairs to local regions and reduces the computing complexity dramatically. By calculating the inter-pixel cosine similarity within each of the N cutmix boxes separately, we reduce the computing complexity by a factor of approximately N. The detailed operation is illustrated in Fig.3. Moreover, by using cutmix, the network learns to make accurate predictions from limited local information, similar to the predictions it would make from the global area. The structured consistency loss with cutmix is represented as

$$\mathcal{L}_{sc} = \sum_{n=1}^{N} \frac{1}{|R^n|^2} \sum_{i \in R^n} \sum_{j \in R^n} \left( a^{s,n}_{ij} - a^{t,n}_{ij} \right)^2, \tag{7}$$

where $a^n_{ij}$ represents the similarity between the ith pixel and the jth pixel in the nth cutmix box, $\mathcal{L}_{sc}$ is the structured consistency loss summed over the N cutmix boxes, N is the number of cutmix boxes, and $R^n = \{1, 2, \ldots, H_n \times W_n\}$ denotes the set of all pixel indices in the nth cutmix box, excluding the region covered by other cutmix boxes.

Figure 3: Calculation of the structured consistency loss. The upper boxes are derived from the teacher network and the lower boxes from the student network. The inter-pixel cosine similarity is calculated within each cutmix box. The hatched area of the left cutmix box, which is covered by the other box, is excluded from the calculation. For simplicity, we set $N_{box} = 2$.
To make the structured consistency loss more efficient, we restrict the number of cutmix boxes used to $N_{box}$ and the maximum number of pairs calculated in each box to $N_{pair}$. As shown in Fig.3, we exclude the region covered by another cutmix box and randomly drop the similarity entries in excess of $N_{pair}$. We then obtain the efficient and effective structured consistency loss as follows:
where the customized function $\mathrm{Drop}(X, n)$ randomly removes elements of $X$ so that at most $n$ elements remain. When the cutmix boxes are pasted, a box pasted earlier is likely to be covered by boxes pasted later, so the structured consistency loss is calculated using the last $N_{box}$ boxes. In addition, the area covered by another cutmix box is masked out and does not contribute to the loss. In this way, only the pixels that are structurally relevant within a cutmix box learn their mutual relationships.
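The restricted computation can be sketched as below. This is an illustration under several assumptions that are not confirmed by the text: boxes are encoded as an integer map (-1 outside any box, otherwise the index of the box visible at that pixel), Drop is applied to the list of pixel pairs of each box, the squared differences are averaged within a box, and the per-box values are summed. All function names are hypothetical, and enumerating every pair as done here is only practical for small boxes.

```python
import numpy as np

def drop(items, n, rng):
    """Randomly remove rows of `items` so that at most n remain."""
    if len(items) <= n:
        return items
    keep = rng.choice(len(items), size=n, replace=False)
    return items[keep]

def cosine(u, v, eps=1e-12):
    """Row-wise cosine similarity between two (P, C) arrays."""
    num = np.sum(u * v, axis=-1)
    den = np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1)
    return num / np.maximum(den, eps)

def boxed_structured_loss(p_s, p_t, box_map, n_box, n_pair, rng):
    """p_s, p_t: (H, W, C) predictions; box_map: (H, W) box indices."""
    num_boxes = int(box_map.max()) + 1
    total = 0.0
    for n in range(max(0, num_boxes - n_box), num_boxes):  # last n_box boxes
        visible = box_map == n        # area covered by later boxes is excluded
        s, t = p_s[visible], p_t[visible]                  # (K, C) each
        k = len(s)
        if k < 2:
            continue
        idx = np.arange(k)
        pairs = np.stack(np.meshgrid(idx, idx, indexing="ij"), axis=-1)
        pairs = drop(pairs.reshape(-1, 2), n_pair, rng)    # at most n_pair pairs
        i, j = pairs[:, 0], pairs[:, 1]
        total += np.mean((cosine(s[i], s[j]) - cosine(t[i], t[j])) ** 2)
    return total

rng = np.random.default_rng(0)
p_t = rng.random((4, 4, 3))
p_s = rng.random((4, 4, 3))
box_map = np.full((4, 4), -1)
box_map[0:2, 0:3] = 0      # earlier box
box_map[1:4, 2:4] = 1      # later box, partly covering box 0
loss = boxed_structured_loss(p_s, p_t, box_map, n_box=2, n_pair=10, rng=rng)
```

Because the mask `box_map == n` only selects pixels where box n is still visible, the overlap handling follows directly from the box-index encoding.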
4.1 Implementation details
Network Structures. We adopt a state-of-the-art segmentation architecture, DeepLabv3+ with WideResNet38, from Zhu et al. [29]. The output strides of the last encoder layer and of the low-level feature passed to the decoder are 8 and 2, respectively. EMA [22] is used for the teacher network to generate the guessed label. Inference on the test and validation sets is also carried out with this model. Following [22], the EMA weight is set to 0.999 and the average is updated at every training step.
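The EMA teacher update can be sketched as follows, assuming the usual mean-teacher rule in which each teacher weight moves a small step toward the corresponding student weight after every training step. Weights are represented here as plain dictionaries of arrays purely for illustration.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """Blend the student weights into the teacher weights; both are dicts
    mapping parameter names to arrays. Returns the updated teacher dict."""
    for name, w_student in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * w_student
    return teacher

student = {"w": np.array([1.0, 2.0])}
teacher = {"w": np.array([0.0, 0.0])}
ema_update(teacher, student)   # teacher["w"] moves 0.1% of the way to the student
```

With a decay of 0.999 the teacher changes slowly, so it behaves like an average over roughly the last thousand student updates.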
Training Procedure. We use an SGD optimizer with the polynomial learning rate policy. Following the setup of Zhu et al. [29], the initial learning rate is 0.002, the power is 1.0, the weight decay is 0.001, and the momentum is 0.9. We use synchronized batch normalization with a batch size of 2 per GPU, one labeled image and one unlabeled image, on 8 V100 GPUs. Training runs for 175 epochs, counted over the labeled data, while cutmix augmentation is applied to the unlabeled data. Based on empirical experiments, we set the hyper-parameters $N$, $N_{box}$, $N_{pair}$, and the loss weights $\lambda_c$ and $\lambda_{sc}$ of the consistency and structured consistency terms to 32, 16, 9000, 20 and 3, respectively. We also use additional training techniques such as Mapillary pre-training, class uniform sampling, and boundary label relaxation [29] to obtain a strong baseline network.
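For reference, the polynomial learning rate policy is commonly defined as the base rate scaled by (1 - step / total_steps) raised to the power. The sketch below uses that common definition with the values stated above; the function name and the step counts in the example are illustrative.

```python
def poly_lr(step, total_steps, base_lr=0.002, power=1.0):
    """Polynomial decay from base_lr at step 0 to zero at total_steps."""
    return base_lr * (1.0 - step / total_steps) ** power

start = poly_lr(0, 1000)       # learning rate at the first step
middle = poly_lr(500, 1000)    # halfway through training
end = poly_lr(1000, 1000)      # final step
```

With a power of 1.0 the schedule is a straight line, so the rate halfway through training is half of the initial rate.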
Cityscapes. Cityscapes is a widely accepted benchmark in recent studies [7, 29] owing to its high-quality labeled image database. It offers pixel-level annotations for 5,000 images and coarse annotations for 20K images. The finely annotated images are split into training, validation, and test subsets of 2,975, 500, and 1,525 images, respectively, and are used to gauge the performance of our method. We also use the coarsely annotated images for class uniform sampling to overcome the imbalance between classes; in the unlabeled training phase, the same images are used as unlabeled images with their annotations ignored, for greater learning exposure.
Table 1: Comparison of per-class mIoU results of recent methods on Cityscapes with the proposed method, marked as "Ours" and "Ours+". Among both published and unpublished methods, the proposed method provides the best overall performance. Note that "Ours" is trained on the training data only, while "Ours+" is trained on the training and validation data.
4.2 Results
Figure 4: Qualitative results of our proposed method on the Cityscapes test set, where the pictures depict the predicted segmentation masks.
The experimental results of our semi-supervised learning technique with the structured consistency loss are summarized in Table 1. The experiments are conducted on the Cityscapes test images with a multi-scale strategy (0.5, 1.0 and 2.0), horizontal flipping, and overlapping tiles, following [29]. Our proposed method (marked as "Ours" and "Ours+") ranks first in the Cityscapes benchmark, including the results of unpublished studies. It is also notable that the model trained on the training set only outperforms all published methods. Qualitative results on the Cityscapes test set are presented in Fig.4.
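The multi-scale and horizontal-flip part of this evaluation protocol can be sketched as an average of predictions over scaled and mirrored copies of the input. The code below is a simplified illustration: `predict` is a hypothetical callable mapping an (H, W, C) image to (H, W, num_classes) probabilities, nearest-neighbour resizing stands in for the interpolation a real pipeline would use, and overlapping tiles are omitted.

```python
import numpy as np

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array."""
    rows = np.arange(out_h) * x.shape[0] // out_h
    cols = np.arange(out_w) * x.shape[1] // out_w
    return x[rows][:, cols]

def multi_scale_flip_predict(predict, image, scales=(0.5, 1.0, 2.0)):
    """Average class probabilities over all scales and both flip states."""
    h, w = image.shape[:2]
    total = 0.0
    for s in scales:
        scaled = resize_nearest(image, max(1, int(h * s)), max(1, int(w * s)))
        for flip in (False, True):
            inp = scaled[:, ::-1] if flip else scaled
            probs = predict(inp)
            if flip:
                probs = probs[:, ::-1]          # undo the horizontal flip
            total = total + resize_nearest(probs, h, w)
    return total / (2 * len(scales))

def toy_predict(x):
    """Stand-in model: two-class probabilities from the first channel."""
    return np.stack([x[..., 0], 1.0 - x[..., 0]], axis=-1)

image = np.full((4, 6, 3), 0.25)
out = multi_scale_flip_predict(toy_predict, image)
```

Flipping the prediction back before averaging keeps every copy aligned with the original pixel grid, which is what makes the average meaningful.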
We believe that the observed performance is attributable to (i) the effectiveness of the structured consistency loss added to semi-supervised learning for semantic segmentation and (ii) the use of coarsely labeled images in the unlabeled training phase. Existing studies of semantic segmentation use the coarsely labeled data only in the early training stage, or use it sporadically to aid the segmentation of images with rare classes. Our numerical results indicate that the coarsely labeled data can be exploited more fully, not only by using the coarse labels in such restricted ways but also through a semi-supervised learning approach.
4.3 Ablation Study
Effect of Semi-Supervised Losses. In this section, we conduct additional experiments to demonstrate the effectiveness of the two consistency losses. We first train the baseline network with the supervised loss only; its result is given in the fifth row of Table 2. Table 2 shows that semi-supervised learning is beneficial to semantic segmentation: adding the consistency loss to the baseline increases performance by 0.37%, and adding the structured consistency loss increases it further to 81.90%, which verifies the improvement obtained by our proposed method. Qualitative results are provided in Fig.5. Thanks to the inter-pixel consistency enforced by the structured consistency loss, the proposed method produces fewer errors in small and visually confusing regions.
Exponential-Moving-Average (EMA) Application. EMA is well known for its strong generalization ability [2, 22], and it can be applied both to build the teacher network and in the validation (or test) phase. In recent semi-supervised learning studies, however, not all works apply EMA to the weights of the teacher network or of the inference network, which implies that there is no rule of thumb. To examine how performance depends on EMA, we conduct an experiment whose results are summarized in Table 3. When the EMA weights are used in the teacher network, mIoU increases by 0.135% on average, while using the EMA weights in validation seems to have no significant effect on performance. Since using the EMA weights in both the teacher network and validation yields the greatest mIoU, we use the EMA weights in both.
Table 2: Comparison of mIoU results of recent methods, supervised-only (baseline), consistency loss, and structured consistency loss on Cityscapes validation images.
Table 3: Comparison of mIoU on Cityscapes images depending on whether the exponential-moving-average (EMA) weights are applied to the teacher network, to validation, or to both.
Figure 5: Qualitative results on the validation data of Cityscapes. From left to right: original images, supervised-only, with the consistency loss added, with the structured consistency loss added, and ground truth. The red boxes mark zoomed-in regions that are visually confusing.
In this paper, we propose the structured consistency loss for semi-supervised semantic segmentation. With the cutmix augmentation, the structured consistency loss exploits the relationships within local regions and enhances network generalization; it can also be applied to general networks. Numerical results verify that our proposed method achieves state-of-the-art performance in the Cityscapes benchmark suite, ranking first not only among published results but also among all results including unpublished ones. Additionally, the results verify that semi-supervised learning is highly effective for practical real-world problems with insufficient labeled data when it is accompanied by cutmix augmentation and the structured consistency loss.
[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
[2] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249, 2019.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062, 2014.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
[5] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
[6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[7] Geoff French, Timo Aila, Samuli Laine, Michal Mackiewicz, and Graham Finlayson. Consistency regularization and cutmix for semi-supervised semantic segmentation. arXiv preprint arXiv:1906.01916, 2019.
[8] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2005.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[10] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[11] Jungbeom Lee, Eunji Kim, Sungmin Lee, Jangho Lee, and Sungroh Yoon. Ficklenet: Weakly and semi-supervised semantic image segmentation using stochastic inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5267–5276, 2019.
[12] Xiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, and Yunhai Tong. Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229, 2019.
[13] Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2604–2613, 2019.
[14] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[15] Radek Mackowiak, Philip Lenz, Omair Ghori, Ferran Diego, Oliver Lange, and Carsten Rother. Cereals-cost-effective region-based active learning for semantic segmentation. arXiv preprint arXiv:1810.09726, 2018.
[16] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 616–625, 2018.
[17] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
[18] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
[20] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[21] Towaki Takikawa, David Acuna, Varun Jampani, and Sanja Fidler. Gated-scnn: Gated shape cnns for semantic segmentation. arXiv preprint arXiv:1907.05740, 2019.
[22] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
[23] Jiafeng Xie, Bing Shuai, Jian-Fang Hu, Jingyang Lin, and Wei-Shi Zheng. Improving fast segmentation with teacher-student learning. arXiv preprint arXiv:1810.08476, 2018.
[24] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848, 2019.
[25] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[26] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), 2019.
[27] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[28] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
[29] Yi Zhu, Karan Sapra, Fitsum A Reda, Kevin J Shih, Shawn Newsam, Andrew Tao, and Bryan Catanzaro. Improving semantic segmentation via video propagation and label relaxation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8856–8865, 2019.
[30] Yueqing Zhuang, Fan Yang, Li Tao, Cong Ma, Ziwei Zhang, Yuan Li, Huizhu Jia, Xiaodong Xie, and Wen Gao. Dense relation network: Learning consistent and context-aware representation for semantic image segmentation. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 3698–3702. IEEE, 2018.