Countering Noisy Labels By Learning From Auxiliary Clean Labels

2019·Arxiv

Abstract

We consider the problem of learning from noisy labels (NL), which emerges in many real-world applications. In addition to the widely studied synthetic noise in the NL literature, we also consider the pseudo labels in semi-supervised learning (Semi-SL) as a special case of NL. For both types of noise, we argue that the generalization performance of existing methods is highly coupled with the quality of the noisy labels. Therefore, we counter the problem from a novel and unified perspective: learning from auxiliary clean labels. Specifically, we propose the Rotational-Decoupling Consistency Regularization (RDCR) framework, which integrates consistency-based methods with the self-supervised rotation task to learn noise-tolerant representations. The experiments show that RDCR achieves performance comparable or superior to the state-of-the-art methods under small noise, while significantly outperforming existing methods under large noise.

1 Introduction

Deep neural networks (DNNs) have achieved considerable improvements in learning tasks with voluminous labeled data (He et al. 2016). Nevertheless, collecting high-quality labels is both expensive and time-consuming. The hunger for labels substantially restricts the prevalence of DNNs in many real-world applications where extensive data comes with scant labels. In general, there are two approaches to collecting an enormous amount of labels, which result in two types of noise in the learning from noisy labels (NL) literature (Frénay and Verleysen 2013). Firstly, crowdsourcing provides an inexpensive and efficient way to distribute tedious annotation tasks that require human intelligence to a pool of paid workers (Crowston 2012). However, the easily accessible label sets are often corrupted by malicious or careless workers. Such noise can be modeled as noisy at random (NAR) (Frénay and Verleysen 2013), which is widely studied in the form of synthetic symmetric and asymmetric noise (Han et al. 2018; Chen et al. 2019b). For clarity, we refer to the setting involving synthetic noise as "simplified-NL". Secondly, it is common practice in semi-supervised learning (Semi-SL) to assign pseudo labels to the unlabeled data based on the predictions of networks pre-trained on limited labeled data (Yarowsky 1995; Lee 2013). The pseudo labels are inevitably noisy as a natural consequence of the generalization error, and they belong to the family of noisy not at random (NNAR) (Frénay and Verleysen 2013). Although the pseudo labels in Semi-SL and the synthetic noise in simplified-NL are often studied independently, they are closely related from the view of noisy labels (Frénay and Verleysen 2013); therefore, we aim to combat the two types of noise in a unified manner.

Even with the recent advances in NL (Chen et al. 2019a; Arazo et al. 2019; Lee et al. 2019), the generalization performance is still highly coupled with the quality of the noisy labels. As the quantity of wrong labels increases, the model can hardly distinguish hard examples from mislabeled ones, while the mistakes can even reinforce themselves. According to the experimental results in the literature (Hataya and Nakayama 2019; Athiwaratkun et al. 2019; Han et al. 2018), we notice that having a small subset of clean labels together with vast unlabeled data is superior to having a large amount of noisy labels, given that the number of examples with true labels is the same. This clearly shows that having an identifiable set of clean data points provides a significant performance boost.

Based on this observation, we propose to counter the two types of noise from a novel perspective that has yet to be explored: learning from an auxiliary set of labels that emulates clean labels by leveraging self-supervised learning (Self-SL). This does not require extra assumptions on the noise. Regardless of the noise level, Self-SL guarantees a certain amount of supervision from all the inputs. By training on the auxiliary labels simultaneously, we can decouple the noisy labels and further enforce noise-tolerant feature representations.

Specifically, we propose Rotational-Decoupling Consistency Regularization (RDCR), a unified multi-task framework that integrates the rotation task (Gidaris, Singh, and Komodakis 2018) with the consistency-based methods (Tarvainen and Valpola 2017; Athiwaratkun et al. 2019; Damodaran et al. 2019). Consistency-based methods have recently achieved state-of-the-art results under the simplified-NL and Semi-SL settings; they improve robustness to noise by enforcing a flatter loss landscape (Laine and Aila 2016) and cleansing the noisy training labels (Damodaran et al. 2019; Chen et al. 2019a). We explore surrogate supervision by augmenting the label set with the easily accessible rotation degrees. The rotation labels serve as a strong noise regularizer that prevents the network from overfitting the noise. Our end-to-end formulation also outperforms its pre-trained counterpart (Gidaris, Singh, and Komodakis 2018). To encourage more noise-tolerant feature representations, we further investigate and incorporate group normalization (GN) (Wu and He 2018) and weight standardization (WS) (Qiao et al. 2019). We conduct extensive experiments to demonstrate the robustness of our RDCR on both the synthetic noise and the pseudo labels.

Overall, our contribution is threefold: (1) We provide a novel perspective that leverages auxiliary clean labels to counter the two types of noise in NL, including the symmetric and asymmetric noise in simplified-NL and the pseudo labels in Semi-SL. (2) We show that the self-supervised rotation task lets us exploit additional training signals from all the provided images and strengthen the data cleansing mechanism. (3) The proposed RDCR achieves results comparable or superior to the state-of-the-art methods on the two types of noise under different noise levels, while significantly outperforming these methods under large noise.

2 Label Noise Models

This section presents the basic notation and the two categories of noise. We consider the K-class image classification problem with N examples in the presence of label noise. For the n-th example, we distinguish the image, the true label, the observed label, and the pseudo label, where each label takes one of the K class values. The training set of N examples is denoted by D. Under the Semi-SL setting, we have the prior knowledge to identify a small set of L examples with ground-truth labels from D, which we refer to as the clean labeled subset. The other labels can be regarded as absent or useless.

We adopt a convolutional neural network (CNN) as the classifier. The CNN maps an image to a feature representation and is followed by a fully connected layer that maps this representation to the space of K categories; the parameters of each mapping are written as subscripts of the corresponding function. For simplicity, we use a shorthand for the CNN representation of the n-th sample.

Based on (Frénay and Verleysen 2013), we categorize label noise into two types according to the relationship among random variables. We treat the image x, the true label y, and the observed label as random variables. The binary variable E indicates whether the observed label is corrupted. Interested readers may refer to (Frénay and Verleysen 2013) for details. The two types of noise are:

Figure 1: Illustration of the RDCR: The model receives 4 copies of the same image in different orientations and passes them into the shared convolutional layers. The double-head architecture corresponds to 2 sets of predictions: the 1-out-of-K categories and the four rotation labels. Only the un-rotated images are used to calculate the supervised and unsupervised losses, whilst the rotation loss depends on the inputs in all four orientations.

1. Noisy at Random (NAR): In NAR, the variable E depends on the true label y while being independent of the image. In certain cases, some of the classes are confusingly similar and often lead to asymmetric noise. For simplicity, we consider the pairwise asymmetric noise (Han et al. 2018), i.e., we select a class and randomly annotate some of the images in that class as one other, designated class.

The widely studied symmetric noise (Han et al. 2018; Chen et al. 2019b) can be regarded as a special case of NAR where E is independent of both the true label y and the underlying image x. The generation process of symmetric noise is as follows: a biased coin is flipped to decide whether to change the observed label, and a die is thrown to assign the wrong label (Frénay and Verleysen 2013). This models situations in which spammers on crowdsourcing platforms intentionally assign random labels.

2. Noisy Not at Random (NNAR): NNAR is the most general and complicated noise type, in which some classes and images are more prone to being mislabeled than others. The labeling error now depends on the underlying image, and visually similar images from different classes can be mistaken for each other. It models real-world label noise such as that in Clothing1M (Xiao et al. 2015). Another, less obvious case is the pseudo labels in Semi-SL. They encode the knowledge learned by the model; however, the samples in low-density regions or around the decision boundaries are subject to label error. In this paper, we especially focus on the noise in the pseudo labels of consistency-based methods.

For NAR, the noise can be characterized by a K-by-K noise transition matrix, where the (i, j)-th entry denotes the probability of a sample from class i being labeled as class j. In this paper, we adopt two noise transition matrices that correspond to symmetric and asymmetric noise, following (Han et al. 2018; Chen et al. 2019b).
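
As an illustration of how such transition matrices can be built and used to corrupt labels, the following is a minimal sketch rather than the exact protocol of the cited works. The function names are hypothetical, and two conventions are assumptions: the symmetric matrix spreads the noise rate uniformly over the other K - 1 classes, and the pairwise matrix flips class i to class (i + 1) mod K.

```python
import random


def symmetric_transition(num_classes, noise_rate):
    """Row i keeps the true label with probability 1 - noise_rate and
    spreads noise_rate uniformly over the other classes (assumed convention)."""
    off_diagonal = noise_rate / (num_classes - 1)
    return [[1.0 - noise_rate if i == j else off_diagonal
             for j in range(num_classes)]
            for i in range(num_classes)]


def pair_transition(num_classes, noise_rate):
    """Pairwise asymmetric noise: class i is flipped only to class
    (i + 1) % num_classes (assumed pairing)."""
    matrix = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        matrix[i][i] = 1.0 - noise_rate
        matrix[i][(i + 1) % num_classes] = noise_rate
    return matrix


def corrupt(true_labels, transition, seed=0):
    """Sample one observed label per example from the row of its true class."""
    rng = random.Random(seed)
    classes = list(range(len(transition)))
    return [rng.choices(classes, weights=transition[y])[0] for y in true_labels]
```

Because the sampled label depends only on the true class and never on the image, this procedure produces NAR noise by construction.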

3 Rotational-Decoupling Consistency Regularization (RDCR)

Before presenting the framework, we would like to emphasize the three motivations for our Rotational-Decoupling Consistency Regularization (RDCR). First of all, to combat the pseudo labels and the synthetic noise with a unified approach, we propose a generic formulation of the consistency-based methods that encompasses some recent works in both the NL and Semi-SL literature (Athiwaratkun et al. 2019; Tarvainen and Valpola 2017; Chen et al. 2019a; Damodaran et al. 2019). We select the consistency-based methods as the backbone of RDCR since they improve robustness to noise by smoothing the loss landscape (Laine and Aila 2016; Athiwaratkun et al. 2019) and correcting the noisy training labels (Chen et al. 2019a). Next, since the generalization performance of these methods still relies heavily on noisy labels, we introduce auxiliary clean rotation targets to avoid overfitting the noise while exploiting additional training signal. Thirdly, to encourage more noise-tolerant feature representations, we investigate the normalization methods and select an effective combination. In the following, we begin with the consistency-based methods and subsequently describe the proposed RDCR framework together with the normalization methods.

3.1 Consistency-based methods

Consistency-based methods have achieved state-of-the-art results under the simplified-NL (Chen et al. 2019a; Damodaran et al. 2019) and Semi-SL settings (Oliver et al. 2018; Athiwaratkun et al. 2019); however, they are often studied independently. We provide a unified formulation of these approaches. The objective function (Eq. 1) is a weighted sum of two loss terms: a supervised loss and an unsupervised consistency loss. The two weights determine the strength of the two losses, and the weight of the consistency loss often follows the cosine ramp-up schedule (Tarvainen and Valpola 2017; Laine and Aila 2016). The consistency loss can be either the Kullback-Leibler (KL) divergence (Miyato et al. 2018) or the mean squared error (MSE) (Laine and Aila 2016). The supervised loss computes the cross-entropy between the predictions and the observed labels on a selected subset of size S. For the simplified-NL and the Semi-SL settings, we use D and the clean labeled subset, respectively, as the selected subset (Athiwaratkun et al. 2019; Chen et al. 2019a). Regarding the type of noise one may encounter, NNAR is unavoidable in both settings due to the consistency term, while the simplified-NL setting may face additional NAR from the noisy dataset D.
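
To make the structure of the two-term objective concrete, the sketch below computes it for a small batch. It is an illustration under stated assumptions, not the paper's implementation: the function and argument names are hypothetical, the consistency term uses MSE between softmax outputs, and both terms are averaged over their respective sets.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    return -math.log(probs[label])


def mean_squared_error(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def consistency_objective(student_logits, teacher_logits, observed_labels,
                          selected, sup_weight, cons_weight):
    """Weighted sum of a supervised loss on the `selected` indices and a
    consistency loss between student and teacher predictions on all inputs."""
    student = [softmax(z) for z in student_logits]
    teacher = [softmax(z) for z in teacher_logits]
    supervised = sum(cross_entropy(student[i], observed_labels[i])
                     for i in selected) / len(selected)
    consistency = sum(mean_squared_error(s, t)
                      for s, t in zip(student, teacher)) / len(student)
    return sup_weight * supervised + cons_weight * consistency
```

In this sketch, passing every index as `selected` corresponds to the simplified-NL setting, and passing only the indices of the clean labeled examples corresponds to the Semi-SL setting.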

This class of methods deals with noisy supervision by encouraging a flatter loss landscape and cleansing the noisy training data. Predictions under different model-space and input-space perturbations are required to be consistent (Bachman, Alsharif, and Precup 2014). Each input is evaluated at least twice, which yields a student prediction and a teacher prediction for computing the consistency loss. The relatively stable teacher predictions, which are often cleaner than the observed labels, serve as pseudo/target labels for the student predictions.
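
One common way to obtain a stable teacher, used by the MT model, is to keep an exponential moving average (EMA) of the student weights. The sketch below shows that update on a flat list of parameters; the function name and the default decay of 0.99 are assumptions made for illustration.

```python
def ema_update(teacher_weights, student_weights, decay=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]
```

A decay close to 1 makes the teacher change slowly, which is what makes its predictions more stable than those of the student.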

3.2 Decoupling with rotation labels

Even though the existing consistency-based methods demonstrate a certain robustness to label corruption, we still observe significant performance degradation under extreme noise because of their heavy reliance on noisy labels. Motivated by the significance of clean labels, we extend the previous framework with an auxiliary self-supervised rotation task to extract additional supervision from all the provided images and to strengthen the data cleansing mechanism.

By applying four rotation transformations to the N examples, we obtain a set of 4N transformed images following (Gidaris, Singh, and Komodakis 2018). Each transformed image, indexed from 1 to 4N, carries a self-supervised label from one of four classes, which correspond to rotations of 0, 90, 180, and 270 degrees respectively.
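
The construction of the rotated set can be sketched as follows for images stored as nested lists. The helper names are hypothetical, and the counter-clockwise direction and the label encoding 0 to 3 are assumptions; any fixed mapping between angles and labels would serve the same purpose.

```python
def rotate90(image):
    """Rotate a 2-D grid (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def rotation_augment(images):
    """Return the 4N rotated copies and their rotation labels
    (0, 1, 2, 3 for 0, 90, 180, 270 degrees)."""
    rotated, labels = [], []
    for image in images:
        current = [list(row) for row in image]
        for k in range(4):
            rotated.append(current)
            labels.append(k)
            current = rotate90(current)
    return rotated, labels
```

The rotation labels come for free from the transformation itself, so they are available for every image regardless of how noisy its class label is.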

In order to achieve higher performance than models trained separately on each task (Athiwaratkun et al. 2019; Gidaris, Singh, and Komodakis 2018), we adopt the multi-task learning framework. The two tasks correspond to making predictions in the space of K object categories and in the space of the four orientations. In this way, the optimization objective becomes the combination of the supervised loss on the selected subset, the consistency loss on D, and the rotation loss on the set of rotated images, each of which sends gradients back to the joint CNN structure. The joint training scheme encourages more robust representations across tasks while calibrating discriminative representations for each specific task (Liu, Johns, and Davison 2018). The objective function of RDCR extends Eq. 1 with a weighted rotation loss, which is computed through an extra fully connected layer that maps the hidden features to the rotation targets. Following (Gidaris, Singh, and Komodakis 2018), we use the cross-entropy function for the rotation loss.
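
One way to write an objective of this form is sketched below. All symbols are placeholders chosen for illustration and are assumptions rather than the paper's notation: h is the shared CNN, f the classification head, g the rotation head, the three lambdas are the loss weights, x^r_m is the m-th rotated image with rotation label r_m, and the averaging over each set is likewise assumed.

```latex
\mathcal{L}
  = \lambda_{s}\,\frac{1}{S}\sum_{s=1}^{S}
      \ell_{\mathrm{CE}}\big(f(h(x_s)),\,\tilde{y}_s\big)
  + \lambda_{u}\,\frac{1}{N}\sum_{n=1}^{N}
      \ell_{\mathrm{cons}}\big(f(h(x_n)),\,\hat{y}_n\big)
  + \lambda_{r}\,\frac{1}{4N}\sum_{m=1}^{4N}
      \ell_{\mathrm{CE}}\big(g(h(x^{r}_{m})),\,r_m\big)
```

Here the first sum runs over the selected subset with observed labels, the second compares student predictions with teacher (pseudo) targets on D, and the third is the rotation loss; dropping the third term leaves the two-term consistency objective.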

We attribute the superiority of our method to two key properties of the self-supervised rotation labels, which supplement the capabilities of the supervised and consistency losses.

Firstly, we are able to accumulate a considerable amount of supervisory signal by leveraging the vast unlabeled or noisily labeled images, regardless of the noise level. In essence, the rotation loss forces the model to be rotation covariant, i.e., given rotated images, the model produces the corresponding labels according to the predefined mapping between the angles and the labels (Marcos et al. 2017). In order to identify the orientation, the model is required to perceive the objects that appear in the images.

Secondly, the rotation labels reinforce the data cleansing mechanism by enhancing the quality of pseudo labels. Without making extra assumptions on the noise, the noisy labels are assessed and benchmarked against the rotation labels. One may argue that the rotation labels are also corrupted as some objects are rotation agnostic (Feng, Xu, and Tao 2019), but the noise is actually negligible under the multi-task framework. We specifically conduct an experiment to verify our claim in Sec. 4.1.

Overall, we can decouple the noise presented in the observed and pseudo labels in a unified manner, while including the rotation loss provides strong regularization against incorrectly labeled data.

3.3 Inducing Noise-tolerant Representations

We adopt two normalization methods, weight standardization (WS) (Qiao et al. 2019) and group normalization (GN) (Wu and He 2018), to induce more robust representations, although they were originally proposed for neither rotated inputs nor noisy labels.

On the one hand, WS is introduced to supplement the consistency loss, as it can further smooth the loss landscape by decreasing the Lipschitz constants of the loss (Qiao et al. 2019). Considering that incorrect labels tend to distort and sharpen the decision boundary, WS can help to enforce more consistent local neighborhoods and thus merge images into coherent clusters (Laine and Aila 2016).

On the other hand, given that the batch size is quadrupled by the rotation transformations, we conjecture that it is unnecessary or even harmful to normalize over all the rotated inputs using batch normalization (BN) (Ioffe and Szegedy 2015). GN bypasses the issue by computing the normalization statistics within each group of channels, which is independent of the batch size. We empirically show that GN generalizes better than BN when the rotation loss is included.

With GN and WS, we improve the robustness and generalization performance with little computational overhead. The full comparisons among the normalization methods are in Sec. 4.2.
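
The following minimal sketch illustrates the two operations on plain Python lists. It is a simplification under stated assumptions: the data layout (a sample as a list of channels, a layer as a list of flattened filters) and the function names are hypothetical, and the learnable scale and shift parameters of both methods are omitted.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample, given as a list of channels (each a flat list of
    activations), using statistics computed within each group of channels."""
    per_group = len(sample) // num_groups
    normalized = []
    for g in range(num_groups):
        group = sample[g * per_group:(g + 1) * per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.extend([[(v - mean) * scale for v in channel]
                           for channel in group])
    return normalized


def weight_standardize(filters, eps=1e-5):
    """Standardize each output filter (a flat list of weights) to zero mean
    and approximately unit variance."""
    standardized = []
    for weights in filters:
        mean = sum(weights) / len(weights)
        var = sum((w - mean) ** 2 for w in weights) / len(weights)
        scale = 1.0 / math.sqrt(var + eps)
        standardized.append([(w - mean) * scale for w in weights])
    return standardized
```

Since `group_norm` only ever sees a single sample, its output cannot depend on which other images, rotated or not, share the batch.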

4 Experiments

We compare our RDCR with the state-of-the-art models in both the NL and Semi-SL literature on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky and Hinton 2009). Both benchmarks consist of 32-by-32 RGB natural images. CIFAR-10 and CIFAR-100 each comprise 50,000 training images and 10,000 testing images, from 10 and 100 classes respectively. We adopt the MT (Tarvainen and Valpola 2017) and fast-SWA (Athiwaratkun et al. 2019) models for the consistency regularization part of our implementations, which leads to two instances of the RDCR framework: Rotational-Decoupling-MT (RD-MT) and Rotational-Decoupling-fast-SWA (RD-fast-SWA). All the MT-based models are trained for 180 epochs. For fair comparisons, we use the 13-layer CNN following the common practice (Tarvainen and Valpola 2017; Athiwaratkun et al. 2019), except that the BN layers are replaced by GN+WS layers in which the number of channels is set to 16, as suggested by (Wu and He 2018). Unless specified otherwise, the hyper-parameters of the consistency regularization part exactly follow those of the original MT or fast-SWA implementations. We give the specific configuration for each setting in the subsections.

Table 1: Test accuracy (%) on CIFAR-10 (top) and CIFAR-100 (bottom). The "Best" results show the highest value achieved during training, the "Last" results are the performances at the end of training, and "CE" represents vanilla training on the observed labels alone with the cross-entropy loss. The experiments consider NAR in the form of symmetric and asymmetric noise.

For all the experiments, we retain 1% of the training data for validation. It is common and realistic to have a tiny validation set (Ren et al. 2018; Oliver et al. 2018).

In addition, the complex interaction between the three loss functions triggers the exploding gradient problem described in (Bengio et al. 1994). We introduce the gradient norm clipping proposed by (Pascanu, Mikolov, and Bengio 2013) to scale down the gradients if their Euclidean norm exceeds a predefined threshold. To be specific, any gradient whose norm is greater than the threshold is rescaled so that its norm equals the threshold.
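
The clipping rule can be sketched as follows for a gradient stored as a flat list; the function name is hypothetical and no particular threshold value is implied.

```python
import math


def clip_gradient_norm(gradient, threshold):
    """Rescale the gradient to have Euclidean norm `threshold` if its norm
    exceeds the threshold; otherwise return it unchanged."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > threshold:
        return [g * threshold / norm for g in gradient]
    return list(gradient)
```

The rescaling keeps the direction of the gradient and only shortens its length, so the update still points the same way.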

Overall, the experiments show that:

• For the synthetic symmetric and asymmetric noise, our methods are superior or comparable to the existing NL methods under all noise levels. We attain significant improvements for noise levels greater than 60%.

• As RDCR deals with the pseudo labels, it is applicable to Semi-SL. In the Semi-SL setting, our methods consistently outperform the baseline methods, especially when fewer clean labels are available.

Table 2: Test accuracy (%) on CIFAR-10 (top) and CIFAR-100 (bottom) under four levels of symmetric noise. Note that each method may use different architectures and sizes of the validation set.

4.1 Synthetic noise (simplified-NL)

To verify the effectiveness of our methods, we conduct extensive experiments on the two datasets with symmetric and asymmetric noise under different noise levels. Although the pseudo labels are also involved due to the consistency term, they are not the main issue given that the synthetic noise is noisier. We train the MT and fast-SWA models for 180 and 360 epochs respectively.

To decouple the noisy labels gradually, we use a different scheduling scheme for each weight. The supervised weight follows the cosine ramp-down schedule (Loshchilov and Hutter 2016) to reduce the effect of the observed labels in the later stage. On CIFAR-100, we use a higher consistency weight to rely more on pseudo labels: it is set to 1000 for symmetric noise and for 40% asymmetric noise, while for 80% symmetric noise we set it to 10,000. Lastly, for noise levels smaller than or equal to 60%, the rotation weight follows a linear ramp-up schedule from 0 to 0.3; for 80% symmetric noise, it ramps up from 0.3 to 0.5. With the linear ramp-up, the model can gradually focus on the clean rotation labels at the later stage of training, which helps to retain the generalizability attained in the early stage from the noisy labels.
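
The two schedules can be sketched as simple functions of the epoch. The exact functional forms and the choice to ramp over the whole training run are assumptions made for illustration; the function names are hypothetical.

```python
import math


def cosine_rampdown(epoch, total_epochs):
    """Decay a multiplier from 1 at epoch 0 to 0 at total_epochs along a half cosine."""
    t = min(max(epoch, 0), total_epochs)
    return 0.5 * (math.cos(math.pi * t / total_epochs) + 1.0)


def linear_rampup(epoch, total_epochs, start, end):
    """Increase a weight linearly from `start` at epoch 0 to `end` at total_epochs."""
    t = min(max(epoch, 0), total_epochs) / total_epochs
    return start + (end - start) * t
```

For example, `linear_rampup(epoch, 180, 0.0, 0.3)` would give a rotation weight rising from 0 to 0.3 over a 180-epoch run, while `cosine_rampdown(epoch, 180)` would scale the supervised weight down toward 0 over the same run.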

Our methods consistently outperform the MT model, as shown in Tab. 1. The higher "Best" values indicate that our methods exploit more valuable information from the data. The gap between "Best" and "Last" can be viewed as a rough quantification of the negative effect of noise on generalization, and we obtain smaller gaps than the baselines.

In addition, Tab. 2 compares our methods with the state-of-the-art methods under four levels of symmetric noise. Note that these methods may use different architectures and different sizes of the clean validation set, so the results should be interpreted carefully. Nevertheless, we can see that when the noise is large, we obtain significantly higher performance. For noise levels smaller than 60%, the adverse effect of noise is barely apparent.

Figure 2: Verification of the data cleansing mechanism by comparing the confusion matrices of pseudo labels on the CIFAR-10 training data between the MT model and our RD-MT. We use 80% symmetric noise here, and both models are trained for 180 epochs. The x-axis is the predicted label, the y-axis is the true label, and the values represent the ratio (%). Higher ratios on the diagonal indicate better robustness and noise-correction capability.

Verification of the training data cleansing mechanism.

To verify that the rotation labels reinforce the cleansing process of the training data, we examine the quality of the pseudo labels used to substitute the observed labels. We consider a worst case of 80% symmetric noise. The MT model can hardly infer correct labels from the highly corrupted observed labels, which results in erroneous pseudo labels, as shown in Fig. 2 (a). Note that the confusion matrix is calculated on the pseudo labels of the training data. If no extra knowledge is given, the mistakes often reinforce each other and prevent the model from correcting itself. The error in Fig. 2 (a) resembles symmetric noise since MT memorizes most of the observed labels. In contrast, our method produces a concentrated confusion matrix in which most of the pseudo labels are correct, which restores the data cleansing mechanism MT strives for. This directly leads to lower generalization error, as shown in Tab. 1.
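
A confusion matrix of this kind can be computed as sketched below. The function name is hypothetical, and normalizing each row so that it sums to 100% is an assumption about how the ratios are defined.

```python
def confusion_percent(true_labels, pseudo_labels, num_classes):
    """Entry (i, j) is the percentage of examples with true class i whose
    pseudo label is class j; rows without examples are left at zero."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for true, pseudo in zip(true_labels, pseudo_labels):
        counts[true][pseudo] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([100.0 * c / total if total else 0.0 for c in row])
    return matrix
```

With this normalization, the diagonal entry of each row is the share of that class whose pseudo labels are correct.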

4.2 Pseudo Labels (Semi-SL)

In order to evaluate the robustness of our framework against the pseudo labels, we follow the regular Semi-SL setting, in which we have vast unlabeled data and a small set of correctly labeled data. For consistency-based methods, the pseudo labels are the only type of noise involved. The fewer the clean labels, the higher the noise in the pseudo labels. For CIFAR-10, we consider the cases where only 1000, 2000, and 4000 clean labels are given; for CIFAR-100, we test the 10,000 labels case. Note that most of the baselines in Tab. 3 use a similar 13-layer CNN; we intentionally select this architecture for fairness. We also include the self-supervised pre-trained and fine-tuned baselines. For these two baselines, the networks are pre-trained only with the rotation targets for 180 epochs, with the rotation weight fixed to 10/1 for CIFAR-10/100. "Rot + Linear" is the pure Self-SL baseline, where we train a fully connected layer on top while keeping the other layers fixed for another 180 epochs. For "Rot + Fine-tune", we follow the training protocol of fast-SWA.

Table 3: CIFAR-10/100 test error rates (%) with a 13-layer CNN under different numbers of labels. The number of labels decides the noise level of the pseudo labels, i.e., NNAR. The first sector presents the results from the literature, the second one includes the baseline models we implement by replacing BN with GN+WS, and the third one displays the self-supervised baselines following (Gidaris, Singh, and Komodakis 2018). The last sector shows the results of the proposed framework. Note that the choice of the normalization layer and the number of training epochs are shown in parentheses. For the last three sectors, the mean and standard deviation are computed over three runs.

With regard to the hyper-parameters, we obtain the fast-SWA model by applying another 20 learning rate cycles of 30 epochs each on top of the pre-trained MT, i.e., 180 + 20 × 30 = 780 epochs in total. The weight for the supervised loss is always set to 1. The consistency weight follows the same cosine ramp-up schedule as the original MT and fast-SWA. As for the rotation weight, it is held constant at 10 and 1 in all CIFAR-10 and CIFAR-100 experiments respectively.

The results are displayed in Tab.3. Our method outperforms the state-of-the-art results on the two benchmark datasets under the different number of clean labels. To be specific, for the fast-SWA model, we decrease the error rate from 9.05% to 6.80% and from 33.62% to 30.62% for CIFAR-10 and CIFAR-100 respectively.

Remarkably, the improvements are even larger when there are fewer labeled samples, i.e., higher pseudo label noise. With merely half or one-fourth of the labels, we are able to achieve better performance than the baselines using 4000 labels. The performance of the consistency-based models relies on the number of labels to a great extent. However, for RDCR, the quantity makes relatively little difference, as we have access to the informative self-supervised labels for all the data.

Table 4: Comparing the test error (%) of different normalization layers using a 13-layer CNN. We run the experiments on CIFAR-10 with 4000 labels for 180 training epochs based on the MT model.

Ablation Studies. We compare GN+WS against instance normalization (IN) (Ulyanov, Vedaldi, and Lempitsky 2016), layer normalization (LN) (Lei Ba, Kiros, and Hinton 2016), GN, and the originally used BN, as shown in Tab. 4. IN and LN can be viewed as special cases of GN, where the number of groups is set to the number of channels and to one, respectively.

Including the rotation loss consistently improves the accuracy regardless of the underlying normalization layers, while we gain the greatest improvement with GN+WS.

5 Related Work

We briefly discuss the papers in NL, Semi-SL, and Self-SL that are closely related to our model.

NL: It remains challenging to learn robustly against the ubiquitous noisy labels, as DNNs can memorize the noise with their high capacity (Zhang et al. 2016). One line of research focuses on designing criteria to select and reweight samples to avoid over-training on noisy labels, including but not limited to MentorNet (Jiang et al. 2017), Co-teaching (Han et al. 2018), and Reweighting (Ren et al. 2018). These methods suffer from a waste of samples (Chen et al. 2019a). In addition to the above methods, Lee et al. apply a generative classifier on a pre-trained network for robust inference, Tanaka et al. propose to alternately update the network parameters and labels, and Arazo et al. model the loss with a two-component mixture model. Another line of research extends consistency regularization to deal with label noise, where Damodaran et al. introduce a Wasserstein distance, and Chen et al. formulate a meta manifold regularizer. Similar to our setting, Hataya and Nakayama consider bi-quality data that includes a small set of clean labels and some noisily labeled data.

Semi-SL: There is an extensive set of methods in the Semi-SL literature (Chapelle, Scholkopf, and Zien 2009) that improve generalization with unlabeled data, including but not limited to self-training (Yarowsky 1995), generative (Chongxuan et al. 2017), and disagreement-based (Zhou and Li 2005) models. The consistency-based (Tarvainen and Valpola 2017; Luo et al. 2018) approaches aim to train classifiers that are robust to random perturbations (Bachman, Alsharif, and Precup 2014) by enforcing consistent local neighborhoods. TempEns (Laine and Aila 2016) and MT (Tarvainen and Valpola 2017) aggregate the past predictions and weights, respectively, by an exponential moving average (EMA). fast-SWA (Athiwaratkun et al. 2019) averages selected points traversed along the cyclical learning trajectories with equal weighting. Recently, the concurrent work (Zhai et al. 2019) also applies Self-SL to consistency-based methods. Their work is developed independently of ours and is limited to the Semi-SL setting. Note that the pseudo labels in self-training (Yarowsky 1995), pseudo-labeling (Lee 2013), and consistency-based methods belong to NNAR.

Self-SL: Self-SL research focuses on designing specific unsupervised pre-training objectives that are beneficial to downstream tasks, e.g., depth prediction, object detection, and image classification (Kolesnikov, Zhai, and Beyer 2019; Doersch and Zisserman 2017). For image classification, the learning task can be solving a jigsaw puzzle (Noroozi and Favaro 2016), counting visual primitives (Noroozi, Pirsiavash, and Favaro 2017), or colorizing gray-scale photos (Larsson, Maire, and Shakhnarovich 2017). The image rotation task has been applied to image generation (Lucic et al. 2019) and Semi-SL image classification (Zhai et al. 2019). Nevertheless, the paradigm has yet to be explored thoroughly in NL.

6 Conclusions and Future Work

In this paper, we propose a unified framework, RDCR, to deal with the synthetic noise in simplified-NL and the pseudo labels in Semi-SL. Our RDCR decouples the noisy labels to stimulate the data cleansing process and to exploit extra supervision from all the inputs. Extensive experiments show that RDCR achieves results superior or comparable to existing methods under different noise types and levels. In future work, we may incorporate consistency regularization in the space of rotation inputs to eliminate the possible noise in the rotation labels. In addition, our success hints at the potential of applying other Self-SL methods. We can include more auxiliary tasks and labels (Doersch and Zisserman 2017) to reduce the reliance on observed labels.

References

Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N. E.; and McGuinness, K. 2019. Unsupervised label noise modeling and loss correction. arXiv preprint arXiv:1904.11238.

Athiwaratkun, B.; Finzi, M.; Izmailov, P.; and Wilson, A. G. 2019. There are many consistent explanations of unlabeled data: Why you should average. ICLR.

Bachman, P.; Alsharif, O.; and Precup, D. 2014. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems, 3365–3373.

Bengio, Y.; Simard, P.; Frasconi, P.; et al. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5(2):157–166.

Chapelle, O.; Scholkopf, B.; and Zien, A. 2009. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks 20(3):542–542.

Chen, D.-D.; Wang, W.; Gao, W.; and Zhou, Z.-H. 2018. Tri-net for semi-supervised deep learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2014–2020. AAAI Press.

Chen, P.; Liao, B.; Chen, G.; and Zhang, S. 2019a. A meta approach to defend noisy labels by the manifold regularizer psdr. arXiv preprint arXiv:1906.05509.

Chen, P.; Liao, B.; Chen, G.; and Zhang, S. 2019b. Understanding and utilizing deep neural networks trained with noisy labels. arXiv preprint arXiv:1905.05040.

Chongxuan, L.; Kun, X.; Jun, Z.; and Bo, Z. 2017. Triple generative adversarial nets. In Advances in neural information processing systems, 4088–4098.

Crowston, K. 2012. Amazon Mechanical Turk: A research tool for organizations and information systems scholars. In Shaping the Future of ICT Research. Methods and Approaches. Springer. 210–221.

Damodaran, B. B.; Fatras, K.; Lobry, S.; Flamary, R.; Tuia, D.; and Courty, N. 2019. Pushing the right boundaries matters! Wasserstein adversarial training for label noise. arXiv preprint arXiv:1904.03936.

Doersch, C., and Zisserman, A. 2017. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, 2051–2060.

Feng, Z.; Xu, C.; and Tao, D. 2019. Self-supervised representation learning by rotation feature decoupling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 10364–10374.

Frénay, B., and Verleysen, M. 2013. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems 25(5):845–869.

Gidaris, S.; Singh, P.; and Komodakis, N. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.

Han, B.; Yao, Q.; Yu, X.; Niu, G.; Xu, M.; Hu, W.; Tsang, I.; and Sugiyama, M. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, 8527–8537.

Hataya, R., and Nakayama, H. 2019. Unifying semi-supervised and robust learning by mixup.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.

Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Jiang, L.; Zhou, Z.; Leung, T.; Li, L.-J.; and Fei-Fei, L. 2017. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055.

Kolesnikov, A.; Zhai, X.; and Beyer, L. 2019. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005.

Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

Laine, S., and Aila, T. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.

Larsson, G.; Maire, M.; and Shakhnarovich, G. 2017. Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6874–6883.

Lee, K.; Yun, S.; Lee, K.; Lee, H.; Li, B.; and Shin, J. 2019. Robust inference via generative classifiers for handling noisy labels. arXiv preprint arXiv:1901.11300.

Lee, D.-H. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, 2.

Lei Ba, J.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Liu, S.; Johns, E.; and Davison, A. J. 2018. End-to-end multi-task learning with attention. arXiv preprint arXiv:1803.10704.

Loshchilov, I., and Hutter, F. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

Lucic, M.; Tschannen, M.; Ritter, M.; Zhai, X.; Bachem, O.; and Gelly, S. 2019. High-fidelity image generation with fewer labels. arXiv preprint arXiv:1903.02271.

Luo, Y.; Zhu, J.; Li, M.; Ren, Y.; and Zhang, B. 2018. Smooth neighbors on teacher graphs for semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8896–8905.

Ma, X.; Wang, Y.; Houle, M. E.; Zhou, S.; Erfani, S. M.; Xia, S.-T.; Wijewickrema, S.; and Bailey, J. 2018. Dimensionality-driven learning with noisy labels. arXiv preprint arXiv:1806.02612.

Marcos, D.; Volpi, M.; Komodakis, N.; and Tuia, D. 2017. Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, 5048–5057.

Miyato, T.; Maeda, S.-i.; Ishii, S.; and Koyama, M. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence.

Noroozi, M., and Favaro, P. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, 69–84. Springer.

Noroozi, M.; Pirsiavash, H.; and Favaro, P. 2017. Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision, 5898–5906.

Oliver, A.; Odena, A.; Raffel, C. A.; Cubuk, E. D.; and Goodfellow, I. 2018. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, 3235–3246.

Park, S.; Park, J.; Shin, S.-J.; and Moon, I.-C. 2018. Adversarial dropout for supervised and semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.

Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In International conference on machine learning, 1310–1318.

Qiao, S.; Shen, W.; Zhang, Z.; Wang, B.; and Yuille, A. 2018. Deep co-training for semi-supervised image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), 135–152.

Qiao, S.; Wang, H.; Liu, C.; Shen, W.; and Yuille, A. 2019. Weight standardization. arXiv preprint arXiv:1903.10520.

Ren, M.; Zeng, W.; Yang, B.; and Urtasun, R. 2018. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050.

Tanaka, D.; Ikami, D.; Yamasaki, T.; and Aizawa, K. 2018. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5552–5560.

Tarvainen, A., and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, 1195–1204.

Ulyanov, D.; Vedaldi, A.; and Lempitsky, V. 2016. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.

Wang, Y.; Liu, W.; Ma, X.; Bailey, J.; Zha, H.; Song, L.; and Xia, S.-T. 2018. Iterative learning with open-set noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8688–8696.

Wu, Y., and He, K. 2018. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), 3–19.

Xiao, T.; Xia, T.; Yang, Y.; Huang, C.; and Wang, X. 2015. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2691–2699.

Yarowsky, D. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd annual meeting of the association for computational linguistics.

Zhai, X.; Oliver, A.; Kolesnikov, A.; and Beyer, L. 2019. Self-supervised semi-supervised learning. arXiv preprint arXiv:1905.03670.

Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

Zhou, Z.-H., and Li, M. 2005. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge & Data Engineering (11):1529–1541.
