Multi-Class Classification from Noisy-Similarity-Labeled Data

2020·Arxiv

Abstract

Abstract

A similarity label indicates whether two instances belong to the same class while a class label shows the class of the instance. Without class labels, a multi-class classifer could be learned from similarity-labeled pairwise data by meta classifcation learning [Hsu et al., 2019]. However, since the similarity label is less informative than the class label, it is more likely to be noisy. Deep neural networks can easily remember noisy data, leading to overfting in classifcation. In this paper, we propose a method for learning from only noisy-similarity-labeled data. Specifcally, to model the noise, we employ a noise transition matrix to bridge the class-posterior probability between clean and noisy data. We further estimate the transition matrix from only noisy data and build a novel learning system to learn a classifer which can assign noisefree class labels for instances. Moreover, we theoretically justify how our proposed method generalizes for learning classifers. Experimental results demonstrate the superiority of the proposed method over the state-of-the-art method on benchmark-simulated and real-world noisy-label datasets.

1 Introduction

Supervised classifcation crucially relies on the amount of data and the accuracy of corresponding labels. Since the data volume grows very quickly while supervision information cannot catch up with its growth, weakly supervised learning (WSL) is becoming more and more prominent [Zhou, 2017, Han et al., 2019, Wang et al., 2019, Li et al., 2017, 2018, Krause et al., 2016, Khetan et al., 2017, Hu et al., 2019a]. Among WSL, similarity-based learning is one of the hotest emerging problems [Bao et al., 2018, Hsu et al., 2019]. Compared with class labels, similarity labels are usually easier to obtain [Bao et al., 2018], especially when we encounter some sensitive issues, e.g., religion and politics. Take an illustrative example from Bao et al. [Bao et al., 2018]: for sensitive maters, people ofen hesitate to directly answer “What is your opinion on issue A?”; while they are more likely to answer “With whom do you share the same opinion on issue A?”. Intuitively, similarity information can not only alleviate embarrassment but also protect personal privacy to some degree.

Existing methods for similarity-based learning can be divided into two categories generally: semi-supervised clustering [Wagstaf et al., 2001, Xing et al., 2003] and weaklysupervised classifcation [Bao et al., 2018, Shimada et al., 2019]. Te frst category utilizes pairwise similarity and dissimilarity data for clustering. For example, pairwise links were used as constraints on clustering [Li and Liu, 2009]; Similar and dissimilar data pairs were used for metric learning, which learns a distance function over instances and can easily convert to clustering tasks [Niu et al., 2014]. Te second category aims at classifcation, which not only separates diferent clusters but also identifes which class each cluster belongs to. For example, similarity and unlabeled (SU) learning proposed an unbiased estimator for binary classifcation [Bao et al., 2018]; Meta classifcation learning (MCL) showed a method to learn a multi-class classifer from only similarity data [Hsu et al., 2019].

All existing methods are based on the strong assumption that similarity labels are entirely accurate. However, similarity labels are hard to be fully accurate for many applications. For example, for some sensitive maters, people may not be willing to provide their real thoughts even when facing easy questions. It is commonly known that deep networks can memorize all the training data even there is noisy supervision, which tends to lead to the overfting problem [Zhang et al., 2016, Zhong et al., 2019a, Li et al., 2019, Yi and Wu, 2019, Zhang et al., 2019, Tanno et al., 2019, Zhang et al., 2018]. Tus, if we directly employ the existing deep learning algorithms to deal with noisy similarity-based supervision, the test performance will inevitably degenerate because of overfting. To the best of our knowledge, no pioneer work has been done to tackle the problem of binary classifcation with noisy similarity information, not to mention how to learn multi-class classifers with theoretical guarantees.

In this paper, we study the problem of how to learn a Multi-class classifer from Noisy-Similarity-labeled data, which is called MNS classifcation. Specifcally, we assume that latent clean class labels Y fip into latent noisy labels , leading to noisy similarity labels . Te corresponding graphical model, representing the interactions among variables, is

Figure 1: Te assumed graphical representation for the proposed Multi-class classifcation with Noisy-Similarity-labeled data (called MNS classifcation), where denotes the input instance; denotes the clean class label; denotes the noisy class label; pairwise similarity supervision between denotes the neural network parameters; T denotes the noise transition matrix. Te latant variables are denoted by white circles and the observable variables are presented by grey circles.

shown in Figure 1. Based on this, we could further model the noise in the problem by using a transition matrix, i.e., represents the probabilities that the clean class label i fips into the noisy class label . We will show that under a mild assumption that anchor points ( defned in 3.3) exist in the training data, we can estimate the transition matrix by only employing noisy-similarity-labeled data. Ten, we build a deep learning system for multi-class classifcation from only noisy-similarity-labeled data. Note that if a good classifer can be learned, the corresponding method can be easily extended to learn metrics or clusters, because accurate labels and similar and dissimilar pairs can be assigned by the good classifer. In other words, the proposed method can not only learn a classifer from noisy-similarity-labeled data but metrics and clusters. Te contributions of this paper are summarized as follows:

• We propose a deep learning system for multi-class classifcation to address the problem of how to learn from noisy-similarity-labeled data.

• We propose to model the noise by using the transition matrix based on a graphical model. We show that the transition matrix can be estimated from only noisy-similarity-labeled data. Te efectiveness will be verifed on both synthetic and real data.

• We theoretically establish a generalization error bound for the proposed MNS classifcation method, showing that the learned classifer will generalize well on unseen data.

• We empirically demonstrate that the proposed method can efectively reduce the side efect of noisy-similarity-labeled data. It signifcantly surpasses the baselines on many datasets with both synthetic noise and real-world noise 1.

Te rest of this paper is organized as follows. In Section 2, we formalize the MNS classifcation problem, and in Section 3, we propose the MNS learning and practical implementation. Generalization error bound is analysed in Section 4. Experimental results are discussed in Section 5. We conclude our paper in Section 6.

2 Framing the MNS classifcation Problem

Problem setup. Let D be the distribution of a pair of random variables where represents the dimension; Y = [C] is the label space and [C] = is the number of classes. Our goal is to predict a label for any given instance . Diferent from the traditional multi-class classifcation, in our seting, the class labels are not observable. Instead, we have noisy similarity labels similarity labels S indicate the similarities between examples, i.e., where denote the class labels for instances . For noisy similarity labels, some of them are identical to the clean similarity labels, but some are diferent and we do not know which of them are clean. To the best of our knowledge, no existing work has discussed how to learn with the noisy similarity labels. We would like to review how the state-of-the-art work learns a classifer from the clean similarity labels.

MCL classifcation [Hsu et al., 2019]. Meta classifcation learning (MCL) utilizes the following likelihood to explain the similrity-based data

By introducing an independence assumption: Appendix D], in other words, are independent to each other given and ; they can simplify the likelihood expression as

Ten taking a negative logarithm on Equation 2, the fnal loss function can be derived as

where , which can be learned from a neural network. However, class label noise is ubiquitous in our daily life [Kaneko et al., 2019, Hu et al.,

2019b, Zhong et al., 2019b, Acuna et al., 2019, Lee et al., 2018, Tanaka et al., 2018, Wang et al., 2018], not to mention the weaker supervision: similarity labels. Te performance of classifers will get worse if we still use the state-of-the-art methods designed for clean similarity labels. Tis motivates us to fnd a novel algorithm for learning from noisy-similarity-labeled data.

3 MNS Learning

In this section, we propose a method for multi-class classifcation from noisy-similarity-labeled data.

3.1 Modeling noise in the supervision

To learn from the noisy-similarity-labeled data, we should model the noise. To model the noise, we introduce a graphic model in Figure 1 to describe the interactions among variables, where only input instances X and noisy similarity labels are observed while both clean class labels Y and noisy class labels are latent. Rather than modeling the similarity-label noise directly, we assume that noise frst occurs on latent class labels and as a consequence, similarity labels turn to noisy ones, i.e., noisy similarity labels {0, 1} indicate the similarities between noisy examples, and assumption is reasonable. For example, in the sensitive maters, to hide one’s thought on the question “With whom do you share the same opinion on issue A?”, people would like to randomly choose a fake opinion about the issue and answer the question conditioned on the fake opinion. Specifcally, to precisely describe label noise, we utilize a Cheng et al., 2017]. Te transition matrix is generally dependent on instances, i.e., . Given only noisy examples, the instance-dependent transition matrix is non-identifable without any additional assumption [Xia et al., 2019]. In this paper, we assume that given is independent on instance . Tis assumption considers the situations where noise relies only on the classes, which has been widely adopted in the class-label-noise learning community [Han et al., 2018, Xia et al., 2019]. Empirical results on real-datasets verify the efciency of the assumptions.

We denote by the distribution of the noisy-similarity-labeled data and the classifer is supposed to be learned from a training sample drawn from

Figure 2: An overview of the proposed method. We add a noise transition matrix layer to model the noise. By minimizing the proposed loss , a classifer g can be learned for assigning clean labels. Te detailed structures of the Neural Network are provided in Section 5. Note that for the noisy similarity labels, some of them are correct and some are not. Te similarity label for dogs is correct and the similarity label for cats is incorrect.

3.2 Likelihood-based estimator

Intuitively, according to fgure 1, we can explain the noisy-similarity-based data by using the following likelihood model

In order to calculate the above likelihood, we have to marginalize the clean class label Y and noisy class label . Tanks to our proposed deep learning system (summarized in Figure 2), , modeled by a noise transition matrix T, could be learned only from noisy data (shown in Section 3.3). Terefore, we only need to marginalize noisy class label . With the independence assumption , we can calculate the likelihood with the following expression

where the proportion relationship holds because P(X) is constant for given X such that can be omited. Note that

Let g

Ten by taking a negative logarithm on Equation 5 and substituting we obtain the objective function of the proposed method, i.e.,

Let us look inside Equation 8. Intuitively, outputs the predicted noisy categorical distribution of instance is exactly the predicted noisy similarity, indicating the probability of data pairs belonging to the same noisy class. For clarity, we visualize the predicted noisy similarity in Figure 3. If are predicted belonging to the same class, i.e., dicted noisy similarity should be relatively high ( in Figure 3(a)). By contrast, if are predicted belonging to diferent classes, the predicted noisy similarity should be relatively low ( in Figure 3(b)).

Further, let , denoting the predicted noisy similarity. Substituting into Equation 8, can convert into a binary cross-entropy loss version, i.e.,

Let us look inside Equation 9. We could treat as the loss function denoting the loss of using to predict

Figure 3: Examples of predicted noisy similarity. Assume class number is are categorical distribution of instances respectively, which are shown above in the form of area charts. is the predicted similarity value between two instances, calculated by the inner product between two categorical distributions.

our problem can be formulated in the traditional risk minimization framework [Mohri et al., 2018]. Te expected and empirical risks of employing estimator f can be defned as

and

where n is training sample size of the noisy-similarity-labeled data.

Te whole pipeline is summarized in Figure 2. Te sofmax function outputs an estimator for the clean class posterior, i.e., denotes the estimated posterior. Afer the sofmax layer, a noise transition matrix layer is added. According to Equation 7, by pre-multiplying the transpose of the transition matrix, we can obtain a predictor for the noisy class posterior, which can be further used to compute the prediction of the noisy similarity label, i.e., . Terefore, by minimizing , as the training data goes to infnity, will converge to noisy similarity will converge to the optimal classifer for predicting noisy class labels. Meanwhile, given the true transition matrix, g(X) will converge to the optimal classifer for predicting clean class labels.

3.3 Estimate noise transition matrix T

However, the transition matrix is unknown. We will discuss how to estimate the transition matrix for the noisy-similarity-labeled data in this subsection.

Anchor points [Liu and Tao, 2015, Patrini et al., 2017, Yu et al., 2018] have been widely used to estimate the transition matrix for noisy-class-labeled data [Niu et al., 2018]. We illustrate that they can also be used to estimate the transition matrix for the noisy-similarity-labeled data. Specifcally, an anchor point x for class y is defned as P(Y = y|X = x) = 1 and be an anchor point for class i such that . Ten we have

Equation 12 shows that given anchor points for each class and the noisy class posterior distribution, the transition matrix can be estimated. Note that the noisy class posterior can be estimated by using the pipeline in Figure 2 without the transition matrix layer. However, it is a bit strong to have access to anchor points. Instead, we assume that anchor points exist in the training data but unknown to us. Empirically, we select examples with the highest as anchor points for the i-th class.

3.4 Implementation

Given the true transition matrix, we can directly build a neural network as shown in Figure 2 to learn a multi-class classifer only from the noisy-similarity-labeled data. When the true transition matrix is unknown, we estimate it with the method proposed in Section 3.3 and then we can train the whole network as normal. Te proposed algorithm is summarized in Algorithm 1.

4 Generalization error

In this section, we will theoretically analyze the generalization ability of the proposed method. Although it looks complex, we will show that it will generalize well.

Assume that the neural network has d layers with parameter matrices and the activation functions are Lipschitz continuous, satisfying We denote by the standard form of the neural network. Ten the output of the sofmax function is defned as is the output of the noise transition matrix layer. Let be the classifer learned from the hypothesis space F determined by the neural network, i.e., Note that the risks are defned in Section 3.2.

Teorem 1. Assume the parameter matrices have Frobenius norm at most , and the activation functions are 1-Lipschitz, positive-homogeneous, and applied element-wise (such as the ReLU). Assume the transition matrix is given, and the instances are upper bounded by for all X, and the loss function upper bounded by . Ten, for any , with probability at least

A detailed proof is provided in Appendix.

Teorem 1 implies that if the training error is small and the training sample size is large, the expected risk of the learned classifer for noisy classes will be small. If the transition matrix is well estimated, the learned classifer for the clean class will also have a small risk according to Equation 7. Tis theoretically justifes why the proposed method works well. In the experiment section, we will show that the transition matrices will be well estimated and that the proposed method will signifcantly outperform the baselines.

5 Experiments

In this section, we empirically investigate the performance of noise transition matrix estimation and the proposed method for MNS classifcation on three synthetic noisy datasets and two real-world noisy datasets.

5.1 Experiments on synthetic noisy datasets

Datasets. We synthesize noisy-similarity-labeled data by employing three widely used datasets, i.e., MNIST [LeCun, 1998], CIFAR-10, and CIFAR-100 [Krizhevsky et al., 2009]. grayscale images of 10 classes including 60,000 training images and 10,000 test images. color images including 50,000 training images and 10,000 test images. CIFAR-10 has 10 classes while CIFAR-100 has 100 classes. For all the three benchmark datasets, we leave out 10% of the training examples as a validation set, which is for model selection.

Noisy similarity labels generation. First, we artifcially corrupt the class labels of training and validation sets according to noise transition matrices. Specifcally, for each instance with clean label i, we replace its label by j with a probability of . Afer that, we assign data pairs noisy similarity labels and remove . In this paper, we consider the symmetric noisy seting defned in Appendix. Noise-0.5 generates severe noise which means almost half labels are corrupted while Noise-0.2 generates slight noise which means around 20% labels are corrupted.

Baselines. We compare our proposed method with state-of-the-art methods and conduct all the experiments with default parameters by PyTorch on NVIDIA Tesla V100. Specifcally, we compare with the following two algorithms:

• Meta Classifcation Likelihood (MCL) [Hsu et al., 2019], which is the state-of-the-art method for multi-classifcation from clean-similarity-labeled data.

• KLD-based Contrastive Loss (KCL) [Hsu and Kira, 2016], which is a strong baseline. It uses Kullback–Leibler divergence to mesure the distance between two distributions.

Network structure. For MNIST, we use LeNet. For CIFAR-10, we use pre-trained ResNet-32. For CIFAR-100, we use VGG8. For all networks, as shown in Figure 2, the output number of the last fully connected layer is set to be the number of classes. We add a noise transition matrix layer afer the sofmax. Since the loss functions of MNS, MCL and KCL are designed for instance pairs, a pairwise enumeration layer [Hsu et al., 2018] is adapted before calculating the loss.

Optimizer. We follow the optimization method in [Patrini et al., 2017] to learn the noise transition matrix , we use the Adam optimizer with initial learning rate 0.001. On MNIST, the batch size is 128 and the learning rate decays every 10 epochs by a factor of 0.1 with 30 epochs in total. On CIFAR-10, the batch size is also 128 and the learning rate decays every 40 epochs by a factor of 0.1 with 120 epochs in total. On CIFAR-100, the batch size is 1000 and the learning rate drops at epoch 80 and 160 by a factor of 0.1 with 200 epochs in total.

Results. Te results in Tables 1, 2, and 3 demonstrate the test accuracy and stability of four algorithms on three benchmark datasets. Overall, we can see that when similarity labels are corrupted, MNS( ) achieves the best performance among three similarity-based learning methods, approaching or even exceeding MNS(T) which is given the true noise transition matrix. Specifcally, On MNIST and CIFAR10, when the noise rates are high, MNS( ) performs beter than MNS(T). Tis should because that and the networks are learned jointly as shown in Algorithm 1.

Table 1: Average Means and Standard Deviations (Percentage) of Classifcation Accuracy over 5 trials on MNIST. KCL, MCL and MNS only have access to noisy similarity labels. Specifcally, MCL( ) denotes the method in which we estimate noise transition matrix frst and then use the estimated for training while MCL(T) skips the frst step and directly use the true noise transition matrix.

Table 2: Average Means and Standard Deviations (Percentage) of Classifcation Accuracy over 5 trials on CIFAR10.

Table 3: Average Means and Standard Deviations (Percentage) of Classifcation Accuracy over 5 trials on CIFAR100.

On MNIST, when the noise rate is relatively low (under 0.4), KCL has the highest accuracy; MCL and MNS also perform well. Intuitively, compared with inner product, Kullback-Leibler divergence measures the similarity between two distributions beter, but it may introduce bad local minima or small gradients for learning [Hsu et al., 2019] such that it has poor performances on more complex datasets or higher noise rate. For example, when the noise rate increases (beyond 0.3), the accuracy of KCL drops dramatically, falling form 99.06 at Noise-0.3 to 85.20 at Noise-0.6. By contrast, MNS and MCL are more robust to noise. Both methods decrease slightly as the noise rate rises while our method is always a litle beter than the state-of-the-art method MCL.

On CIFAR-10 and CIFAR-100, there is a signifcant decrease in the accuracy of all methods and our method achieves the best results across all noise rate, i.e., at Noise-0.6, MNS gives an accuracy uplif of about 6.5% and 10% on CIFAR-10 and CIFAR-100 respectively compared with the state-of-the-art method MCL.

5.2 Experiments on real-world noisy datasets

Datasets. We verify the efectiveness of the proposed method on two real-word datasets with noisy supervision, i.e., Clothing1M [Xiao et al., 2015] and Food-101 [Bossard et al., 2014]. Specifcally, Clothing1M has 1M images with real-world noisy labels and additional 50k, 14k, 10k images with clean labels for training, validation and testing. We only use noisy training set in training phase and leave out 10% as validation set for model selection and test our model on 10k testing set. Food-101 consists of 101 food categories, with 101,000 images. For each class, 250 manually reviewed clean test images are provided as well as 750 training images with real-world noise. For Food-101, we also leave out 10% for validation. In particular, we use Random Crop and Random Horizontal Flip for data augmentation. Since datasets contain some amount of class label noise already, we do not need to corrupt the labels artifcially. We generate noisy-similarity-labeled data by using the noisy-class-labeled data directly. Baselines. Te same as the synthetic experiment part. Network structure and optimizer. For all experiments, we use pre-trained ResNet-

50. On Clothing1M, the batch size is 256 and the learning rate drops every 5 epochs by a

factor of 0.1 with 10 epochs in total. On Food-101, the batch size is 1000 and the learning rate drops at epoch 80 and 160 by a factor of 0.1 with 200 epochs in total. Other setings are the same as the synthetic experiment part.

Table 4: Classifcation Accuracy on real-world noisy dataset Clothing1M.

Table 5: Classifcation Accuracy on real-world noisy dataset Food-101.

Results. From Table 4 and 5, We can see that on ) achieves the best accuracy. On ) also performs distinguishedly, uplifing about 23% in accuracy compared with MCL. Specifcally, the gap between MCL and MNS( ) is huge in Table 5 while is not in Table 4. Let us review the defnition of similarity-labeled data: if two instances belong to the same class, they will have similarity label S = 1, otherwise S = 0. Tat is to say, for a k-class dataset, only around of similarity-labeled data has similarity labels S = 1, and the rest has similarity labels S = 0. For Clothing1M (Table 4), the k = 14. For Food-101 (Table 5), the k = 101. Terefore, the generated similarity-labeled data from Food-101 is much more unbalanced than that from Clothing1M. As a result, the baseline performed badly on Food-101, making the gap huge in Table 5.

5.3 Noise transition matrix estimation

To estimate T, we frst learn the noisy predictor . For each dataset, the network and optimizer remain the same as above but the noise transition matrix layer is exclude. T is then estimated using the method proposed in Section 3.3.

Figure 4: True transition matrix T at Noise-0.6 and corresponding of two datasets with 10 classes: MNIST and CIFAR-10.

Here we only show the estimated transition matrices of three synthetic noisy datasets because we have the exact values of the true transition matrices such that we could assess the estimation accuracies. Estimated transition matrices of real-world noisy datasets are provided in Appendix. From Figure 4 and 5, we can see that transition matrices estimated with the proposed method are very close to the true one. By employing the calculation method of estimation error as achieve 0.0668, 0.1144 and 0.1055 in error respectively.

Figure 5: True transition matrix T at Noise-0.6 and corresponding that we only show the frst 10 rows and columns of the matrix.

6 Conclusion

Tis paper proposes a noisy-similarity-based multi-class classifcation algorithm (called MNS classifcation) by designing a novel deep learning system exploiting only noisy-similarity-labeled data. MNS classifcation provides an efective way for making predictions on sensitive maters where it is difcult to collect high-quality data such that similarities with noise could be all the information available. Te core idea is to model the noise in the latent noisy class labels by using a noise transition matrix while only noisy similarity labels are observed. By adding a noise transition matrix layer in the deep neural network, it turns to robust to similarity label noise. We also present that noise transition matrix can be estimated in this seting. Experiments are conducted on benchmark-simulated and real-world label-noise datasets, demonstrating our method can excellently solve the above weakly supervised problem. In future work, investigating diferent types of noise for diverse real-life scenarios might prove important.

References

David Acuna, Amlan Kar, and Sanja Fidler. Devil is in the edges: Learning semantic boundaries from noisy annotations. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 11075–11083, 2019.

Han Bao, Gang Niu, and Masashi Sugiyama. Classifcation from pairwise similarity and unlabeled data. In International Conference on Machine Learning, pages 461–470, 2018.

Peter L Bartlet and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Lukas Bossard, Mathieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.

Jiacheng Cheng, Tongliang Liu, Kotagiri Ramamohanarao, and Dacheng Tao. Learning with bounded instance-and label-dependent label noise. arXiv preprint arXiv:1709.03768, 2017.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample com- plexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.

Bo Han, Qanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, pages 8527– 8537, 2018.

Jiangfan Han, Ping Luo, and Xiaogang Wang. Deep self-learning from noisy labels. In Proceedings of the IEEE International Conference on Computer Vision, pages 5138–5147, 2019.

Yen-Chang Hsu and Zsolt Kira. Neural network-based clustering using pairwise con- straints. In ICLR workshop, 2016. URL https://arxiv.org/abs/1511. 06321.

Yen-Chang Hsu, Zhaoyang Lv, and Zsolt Kira. Learning to cluster in order to transfer across domains and tasks. In International Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id=ByRWCqvT-.

Yen-Chang Hsu, Zhaoyang Lv, Joel Schlosser, Phillip Odom, and Zsolt Kira. Multi-class classifcation without multi-class labels. arXiv preprint arXiv:1901.00544, 2019.

Mengying Hu, Hu Han, Shiguang Shan, and Xilin Chen. Weakly supervised image classi- fcation through noise regularization. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 11517–11525, 2019a.

Wei Hu, Yangyu Huang, Fan Zhang, and Ruirui Li. Noise-tolerant paradigm for training face recognition cnns. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 11887–11896, 2019b.

Takuhiro Kaneko, Yoshitaka Ushiku, and Tatsuya Harada. Label-noise robust generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 2467–2476, 2019.

Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly- labeled data. arXiv preprint arXiv:1712.04577, 2017.

Jonathan Krause, Benjamin Sapp, Andrew Howard, Howard Zhou, Alexander Toshev, Tom Duerig, James Philbin, and Li Fei-Fei. Te unreasonable efectiveness of noisy data for fne-grained recognition. In European Conference on Computer Vision, pages 301–320. Springer, 2016.

Alex Krizhevsky, Geofrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Yann LeCun. Te mnist database of handwriten digits. htp://yann. lecun. com/exdb/mnist/, 1998.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.

Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. Cleannet: Transfer learn- ing for scalable image classifer training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 5447–5456, 2018.

Chenglong Li, Chengli Zhu, Yan Huang, Jin Tang, and Liang Wang. Cross-modal ranking with sof consistency and noisy labels for robust rgb-t tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 808–823, 2018.

Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S Kankanhalli. Learning to learn from noisy labeled data. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 5051–5059, 2019.

Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1910–1918, 2017.

Zhenguo Li and Jianzhuang Liu. Constrained clustering by spectral kernel learning. In 2009 IEEE 12th International Conference on Computer Vision, pages 421–427. IEEE, 2009.

Tongliang Liu and Dacheng Tao. Classifcation with noisy labels by importance reweight- ing. IEEE Transactions on patern analysis and machine intelligence, 38(3):447–461, 2015.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

Gang Niu, Bo Dai, Makoto Yamada, and Masashi Sugiyama. Information-theoretic semi- supervised metric learning via entropy regularization. Neural computation, 26(8):1717– 1762, 2014.

Li Niu, Qingtao Tang, Ashok Veeraraghavan, and Ashutosh Sabharwal. Learning from noisy web data with category-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 7689–7698, 2018.

Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Q. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 1944–1952, 2017.

Takuya Shimada, Han Bao, Issei Sato, and Masashi Sugiyama. Classifcation from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. arXiv preprint arXiv:1904.11717, 2019.

Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 5552–5560, 2018.

Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C Alexander, and Nathan Silberman. Learning from noisy labels by regularized estimation of annotator confusion. arXiv preprint arXiv:1902.03680, 2019.

Kiri Wagstaf, Claire Cardie, Seth Rogers, Stefan Schr¨odl, et al. Constrained k-means clustering with background knowledge. In Icml, volume 1, pages 577–584, 2001.

Yisen Wang, Weiyang Liu, Xingjun Ma, James Bailey, Hongyuan Zha, Le Song, and Shu- Tao Xia. Iterative learning with open-set noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 8688–8696, 2018.

Yisen Wang, Xingjun Ma, Zaiyi Chen, Yuan Luo, Jinfeng Yi, and James Bailey. Symmetric cross entropy for robust learning with noisy labels. In Proceedings of the IEEE International Conference on Computer Vision, pages 322–330, 2019.

Xiaobo Xia, Tongliang Liu, Nannan Wang, Bo Han, Chen Gong, Gang Niu, and Masashi Sugiyama. Are anchor points really indispensable in label-noise learning? arXiv preprint arXiv:1906.00189, 2019.

Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classifcation. In Proceedings of the IEEE conference on computer vision and patern recognition, pages 2691–2699, 2015.

Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems, pages 521–528, 2003.

Kun Yi and Jianxin Wu. Probabilistic end-to-end noise correction for learning with noisy labels. arXiv preprint arXiv:1903.07788, 2019.

Xiyu Yu, Tongliang Liu, Mingming Gong, and Dacheng Tao. Learning with biased comple- mentary labels. In Proceedings of the European Conference on Computer Vision (ECCV), pages 68–83, 2018.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Jing Zhang, Tong Zhang, Yuchao Dai, Mehrtash Harandi, and Richard Hartley. Deep unsupervised saliency detection: A multiple noisy labeling perspective. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 9029–9038, 2018.

Weihe Zhang, Yali Wang, and Yu Qiao. Metacleaner: Learning to hallucinate clean rep- resentations for noisy-labeled visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 7373–7382, 2019.

Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Tomas H Li, and Ge Li. Graph convo- lutional label noise cleaner: Train a plug-and-play action classifer for anomaly detection. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 1237–1246, 2019a.

Yaoyao Zhong, Weihong Deng, Mei Wang, Jiani Hu, Jianteng Peng, Xunqiang Tao, and Yaohai Huang. Unequal-training for deep face recognition with long-tailed noisy data. In Proceedings of the IEEE Conference on Computer Vision and Patern Recognition, pages 7812–7821, 2019b.

Zhi-Hua Zhou. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2017.

Appendices A Proof of Teorem 1

We have defned

and

where n is training sample size of the noisy-similarity-labeled data. First we bound the generalization error with Rademacher complexity [Bartlet and Mendelson, 2002].

Teorem 2 ([Bartlet and Mendelson, 2002]). Let the loss function be upper bounded by M. Ten, for any , with the probability

where is the Rademacher complexity defned by

and are Rademacher variables uniformly distributed from

Before further upper bound the Rademacher complexity , we discuss the special loss function and its

Lemma 1. Given transition matrix T, loss function is 1-Lipschitz with respect to

Detailed proof of Lemma 1 can be found in Section A.1. Based on Lemma 1, we can further upper bound the Rademacher complexity by the following lemma.

Lemma 2. Given transition matrix T and assume loss function Lipschitz with respect to

where H is the function class induced by the deep neural network.

Detailed proof of Lemma 2 can be found in Section A.2. Te right hand part of the above inequality, indicating the hypothesis complexity of deep neural networks, can be bound by the following theorem.

Teorem 3 ([Golowich et al., 2017]). Assume the Frobenius norm of the weight matrices are at most . Let the activation functions be 1-Lipschitz, positive-homogeneous, and applied element-wise (such as the ReLU). Let X is upper bounded by B, i.e., for any

Combining Lemma 1,2, and Teorem 2, 3, Teorem 1 is proven.

A.1 Proof of Lemma 1

Recall that

where

Take the derivative of

where

Note that the derivative of the sofmax function has some properties, i.e., if

We denote by element in V ector for those complex vectors. Because

Since , similarly we have

A.2 Proof of Lemma 2

where the frst three equations hold because given same constraint on ; the sixth inequality holds because of the Lemma [Ledoux and Talagrand, 2013].

B Defnition of transition matrix

Symmetric noisy seting is defned as follows, where C is the number of classes.

C Estimation of transition matrix on real-world noisy datasets

Here we show the estimated transition matrices of Clothing1M and the frst ten classes of Food-101. For Clothing1M, we use additional 50k images with clean labels to learn the transition matrix such that the lef in Figure 1 is very close to the true one. Te right in Figure 1 was estimated only from noisy-similarity-labeled data, which learned most of the features of true transition matrix. For was estimated from noisy-labeled data. From Figure 2 we can see that the result close to the result which verifes the efectiveness of our method.

Figure 6: ; the one in the lef hand is estimated from class labels, the one in the right hand is estimated from noisy similarity labels.

Figure 7: of the frst ten classes of Food-101; the one in the lef hand is estimated from class labels, the one in the right hand is estimated from noisy similarity labels