In the most common problem setting in supervised learning, the class to which the training data belongs is provided as a label, namely an ordinary label. There is increasing interest in generalizing the concept of ordinary labels Chapelle et al. (2010); Luo and Orabona (2010); Cour et al. (2011); Natarajan et al. (2013); Ishida et al. (2018). In this paper, we consider a problem setting where each training data is provided with a label, that is, a complementary label, which specifies one class that the pattern does not belong to. The ordinary and complementary labels respectively reflect the information regarding a class to which the training data is likely and unlikely to belong.
The learning framework was first proposed by Ishida et al. (2017). Assuming that a single complementary label is provided to each training data, the corresponding loss function considering one-versus-all and pairwise classification was defined, and also the classification risk and error were analyzed. The classification risk studied by Ishida et al. (2019) depends on a a more generalized loss function where the classification strategy is not necessarily restricted to be one-versus-all or pairwise classification.
Although choosing the correct class of a pattern involves considerable work for the annotators, the use of complementary labels can alleviate the annotation cost. However, in this framework, loss of generalities can occur if annotators choose only one class as a
complementary label. In other words, it is reasonably assumed that multiple labels can be chosen as classes to which a pattern does not belong.
Cao and Xu (2020); Feng et al. (2020) analyzed a framework where multiple complementary labels are provided to each training data. Cao and Xu (2020) formulated the number of complementary labels as a constant value, while Feng et al. (2020) formulated that as a random variable. In these works, the existing framework in Ishida et al. (2019) where single complementary label is provided to each training data is directly applied. Therefore, there is no essential difference with the framework of Ishida et al. (2019) on the structure and properties of the loss functions or the derivation of classification risk and error.
In contrast, this paper focuses on equivalency of ordinary label and complementary label and provides essential generalization of learning framework from single ordinary/complementary label. The finding that loss functions for one-versus-all and pairwise classification satisfy certain additivity and duality plays a significant role in the generalization, and the classification risk and error bound with a new expression not found in Ishida et al. (2019); Cao and Xu (2020); Feng et al. (2020) are derived. This enables us to bridge ordinary-label learning and complementary-label learning and to understand them from a unified perspective. To be more specific, the introduced loss functions satisfying additivity and duality allow a straightforward comparison of the proposed approach and those shown in the existing literature. The properties also allow our classification risk to be in a more simple form than that of Ishida et al. (2017, 2019); Cao and Xu (2020); Feng et al. (2020). Further, the derived classification error monotonically decreases corresponding to the number of complementary labels. This property is also shown in the experiment, which supports that our framework bridges learning from ordinary/complementary labels.
The remaining paper is organized as follows. Section 2 provides a review of the existing work regarding the generalization of supervised learning from ordinary labels. Section 3 introduces several key formulations pertaining to ordinary-label and complementary-label learning. Section 4 presents the generation probability model of data provided with multiple complementary labels and the loss functions for the learning conducted from such data. Furthermore, this section describes the natural generalization of the proposed loss function from those for one-versus-all and pairwise classification defined by Ishida et al. (2017) and the evaluation of the classification risk. The error bound of the one-versus-all and pairwise classification is derived in Section 5. Section 6 describes the experimental investigation performed considering real-world data to validate the classification error, and Section 7 presents the conclusions.
This section provides a review of the existing work regarding the generalization of supervised learning from the ordinary labels provided to training data.
Semi-supervised learning, in which the classification algorithm is provided with some training data labeled but not necessarily for all, has been investigated Grandvalet and Bengio (2005); Mann and McCallum (2007); Chapelle et al. (2010); Niu et al. (2013); Kipf and Welling (2017); Laine and Aila (2017) Although the prediction accuracy of such learning is less than that of fully supervised learning, this technique requires less labeled training data, thereby reducing the annotation cost.
As another type of generalization, there exists a method of learning from training data, in which each data is provided with multiple labels, only one of which specifies the one true class. Such labels are generally known as partial labels (or candidate labels) Raykar et al. (2010); Luo and Orabona (2010); Cour et al. (2011); Cid-sueiro (2012). This approach can potentially be applied when the true class of a given training data is not clear or when the label information requires to be kept private.
Framework of learning from labels, namely complementary labels, specifying classes that the pattern does not belong to has been also formulated. Providing a complementary label to a training data is equivalent to considering the other labels as candidate labels. In other words, a set of candidate labels is a comprehensive concept involving both ordinary and complementary labels as special cases. Ishida et al. (2017) provided the complementarily labeled data-generation probability model and classification risk. In addition, when learning from noisy labels, which stochastically represent incorrect information, providing complementary label to prevent the specification of a noisy label as a true label has been noted to be effective Kim et al. (2019).
However, the problem setting in Ishida et al. (2017) encounters a limitation in cases in which only one complementary label is provided to each training data, leading to a loss of generality. To overcome the limitation, Cao and Xu (2020); Feng et al. (2020) considered a problem setting where each training data is provided to multiple complementary labels. The existing framework in Ishida et al. (2019) is directly applied in these works, and thus new insights into the relationship between ordinary-label learning and complementary-label learning or the properties satisfied by the loss functions are not obtained.
In contrast, we addressed the problem by using an approach different from that of Cao and Xu (2020); Feng et al. (2020). Focusing on equivalency of ordinary label and complementary label, we found that additivity and duality satisfied by loss functions plays a significant role in bridging learning from single ordinary/complementary labels; we then derived classification risk and error bound with a new expression not found in the existing literature. The structure and properties inherent in the loss functions enable us to naturally relate and discuss learning from ordinary and complementary labels in a unified manner.
This section presents the data-generation probability model and loss functions shown in the existing literature. All these formulations were on the assumption that a single ordinary label or complementary label is provided.
3.1. Supervised Learning from Ordinary Label
In the 2) classification problem, supervised learning concerns the efficiency of a classifier , which maps a pattern to the class to which it belongs, that is, Given the discriminant function represents the confidence on the 2 class classification about y, the classifier f can be defined as f(x) = arg max
Given a pair (x, y), in which x is the pattern and y is an ordinary label representing the belonging class of x, the prediction by f can be evaluated using a loss function R. For example, the loss functions for , respectively, can be defined as Zhang (2004):
Note that the function monotonically decreases corresponding to the input.
The 0-1 loss, 0), is a standard type of function indicator function. The 0-1 loss is unsuitable for loss optimization, as it is undifferentiable at z = 0, and its gradient is always 0 for all other inputs. Consequently, the 0-1 loss is usually surrogated by other functions, such as the sigmoid and ramp losses, all of which satisfy the following condition:
where is constant. In the remaining study, we assume to satisfy (3) but not its differentiability.
In addition, we assume that (x, y), involving the pattern x and its belonging class y, is generated from a probability distribution P (x, y). The classification risk R(f) of a classifier f can be defined as follows:
where represents the expectation with respect to P (x, y).
3.2. Supervised Learning from Complementary Label
Ishida et al. (2017) discussed the formulation considering a label specifying a class that a pattern does not belong to. Given a pair (x,y) of a pattern and a classwhich the pattern does not belong, the prediction by can be evaluated by the loss functionIshida et al. (2017) defined the loss functions for one-versus-all and pairwise classification as follows:
Compared to (1) and (2), the loss functions defined in (5) and (6) are in natural forms to deal with the training data provided a labely. The extended form of the loss functions in (5) and (6) is as follows Ishida et al. (2019):
Considering (3), the function in (7) involves both one-versus-all and pairwise classification losses. Given a pattern x and its complementary label, the generation probability
of the labeled data is as follows:
Here,
P (y|x) is proportional to the sum of probabilities regarding all the other labels. Subsequently, (x,y) is assumed to be stochastically generated from the distribution
such that
L (f(x), y).
The formulation in Ishida et al. (2017) assumes that a single complementary label is provided to the training data. We generalize this formulation, assuming that 1) complementary labels are provided to each training data. In this section, we generalize loss functions for one-versus-all and pairwise classification focusing on the fact that those functions satisfy additivity and duality. Further, we derive classification risk assuming the data-generation probability model defined in Ishida et al. (2017), and then show that additivity and duality play a significant role in the analysis.
The labels specifying the incorrect classes are equally informative as the remaining labels that specify the classes to which a pattern likely belongs. The latter labels are termed as candidate labels and the set of these labels is defined as the set of all the N-size subsets of Y. From now on the formulation in this paper is constructed based on the fact that providing ) complementary labels to a training data is equivalent to providing ) candidate labels.
4.1. Generalization of Loss Function
For one-versus-all and pairwise classification from a set of multiple candidate labels Y , the corresponding loss functions can be defined as:
Note that (10) and (11) are the same as (1) and (2), respectively, if N = 1. Similarly, the newly defined equations are the same as (5) and (6) if
As a generalization of (10) and (11), we introduce a loss function is defined as the additive form of loss functions for ordinary-label learning:
where are constants. Assuming and ) becomes the same expression as (10) if satisfies (3). Under the same condition on , by assuming (12) and (11) have the same expression (see supplementary materials for complete proofs).
function L. In contrast to the loss function defined in (7), which can be applied only if a single complementary label is provided, the loss function pertaining to (12) can be applied for any number of complementary (or candidate) labels. Consequently, L in (12) can be considered as a generalized form of (5) and (6).
where is a constant. Both satisfy this condition, where and , respectively. For any x and y, we then define a functionL, which satisfies the following condition:
where is a constant. This condition is satisfied by and also by and1) respectively. Now, the following equation holds:
where
expressed as ) express the duality of the loss function
To the best of our knowledge, we are the first to reveal additivity and duality commonly found in loss functions for one-versus-all and pairwise classification. This property is critical to study the classification risk and error, and it has not been discussed in any of the existing studies Ishida et al. (2019); Cao and Xu (2020).
4.2. Assumption of Generation Probability
As stated in Section 3, (8) represents the probability ofy not being a true label. This aspect can be interpreted in the context of a candidate label;P (y|x) represents the probability with which a true label is included in Y . In contrast to (8), which can only be applied when 1), our generalized data-generation probability model is defined as:
where ) represents the probability of the true label being included in Y . In the remaining work, we assume that Y is generated from ) independently; (14) is obviously the same as (8) if
This model represents the situation where the annotator is constrained to provide N labels consistently to any pattern x. This situation can be seen as synthetic; that is, the labelling is constrained to be independent from the annotator’s belief. In fact, this model does not consider the bias in labeling by the annotator because the information in y that satisfies is treated uniformly. However, this property does not always limit reality of (14). In the following explanation we show that the synthetic property of (14) can be reasonably interpreted in a context of privacy preserving.
Given a data provided with a set of candidate labels which is generated from (14), consider the information about the ground-truth label obtained from the given data. We define a function , which represents the degree of confidence where the one given the data believes that the ground-truth label for the pattern x is Now we assume that a set of candidate labels Y is provided to the pattern x by the annotator. Further, we assume that the one given the data does not have any information about the ground-truth label , except the given set of candidate labels Y . That is, we assume that the degree of confidence, where the ground-truth label pattern , depends on whether ) = 0 hold. Here, the following equation holds (see supplementary materials for complete proofs).
where
According to (15), the degree of confidence ) is a mixture of the distribution ) and the uniform distribution 1) represents the information about labels that is available to the one given the data, while ) represents the information about labels originally owned by the annotator. This equation indicates that receiving a data provided with a set of N candidate labels generated from (14) is equivalent to receiving a data provided with a ordinary label to which random noise is added. The level of random noise (1 ) increases as the number of candidate labels N increases. That is, privacy for the generation probability of the label for a given pattern, P(y|x), can be preserved by synthetically generating candidate labels according to (14).
Note that Cao and Xu (2020); Feng et al. (2020) defined data-generation probability model in the same expression assuming multiple complementary labels are provided. Here, Cao and Xu (2020) assumes the number of labels as a constant value while Feng et al. (2020) assumes that as a stochastic variable. Those probability models are mathematically equivalent to (14), which can be easily shown; assuming
4.3. Classification Risk with Multiple Candidate Labels
Some conditions can be used to simplify the expression in (16) for one-versus-all and pairwise classification, without losing generalities. Note that redefining (3) as does not affect the loss minimization when learning. If we shift to satisfy = 0 holds for both ; therefore, the first term in (17) can be eliminated.
Similarly, ) do not affect the loss minimization. Considering = 0, we can simplify (12) as follows without loss of generalities:
Consequently, by assuming that = 0, the following expression holds for the classification risk.
This section discusses the error bound for one-versus-all and pairwise classification. For simplicity, we assume that satisfies (3) with a = 0, and the conditions infand sup2. Further, we assume that is Lipschitz continuous. In addition, we assume = 0. For the rest of this paper, we define L and R(f) by using (18) and (19), respectively.
5.1. Notations
Let us consider that a set of n training data is given, and each training data is generated with a probability of ) independently. Based on (19), the empirical classification risk ˆR(f) for the set S is:
We define the ideal classifier that minimizes the generalization error (Bayes classifier) and the empirically ideal classifier as := arg min), respectively. We define the classification error for the classifier ˆf as follows.
In the literature, the Rademacher complexity for a set of discriminant functions G over the input space X is usually defined as follows:
where is a set of independent stochastic variables, which take one value of with the same probability. In addition, represent the expectation for each element of , respectively.
5.2. Evaluation of Error Bound
The following lemmas are introduced to derive the classification error bound.
Lemma 2 We express the supremum of the difference in loss in accordance with the change in a set of candidate labels, i.e., given any
Based on these lemmas, the error bounds for the one-versus-all and pairwise classification can be defined as follows. The complete proofs are provided in the supplementary materials.
Theorem 4 Assume that function satisfies the stated condition in the beginning of this section, and is Lipschitz continuous. Then, for any , the following equation for one-versus-all classification holds with a probability of at least 1
Similarly, for pairwise classification, the following equation holds with a probability of at least 1
Note that the upper-bound of increases monotonically corresponding to N in accordance with (21) and (22). This aspect is in agreement with the fact that a decrease in the number of candidate labels leads to less ambiguous supervision of the training data. The upper-bound in (21) breaks subject to 2; this property is due to the equivalency satisfied by ordinary labels and complementary labels, as discussed in Section 4.2. Note that taking the expectation of the error bounds in the theorem 4 assuming that N as a random variable, the derived formulations specify the error bounds under the problem setting where the number of candidate labels provided to each training data stochastically fluctuates.
Figure 1: Error for 10 classification for different numbers of complementary labels N. The red and blue closed plots represent the experimental results for the one-versus-all and pairwise classification, respectively. The red and blue open plots represent the theoretical error bounds for the one-versus-all and pairwise classification, respectively.
This section describes the evaluation of the accuracy of one-versus-all and pairwise classi-fication and the validation of the formulation discussed in Section 5.2. Understanding the exact behavior of the classification error only from the derived equations in Theorem 4 is dif-ficult. Therefore, we attempt to quantitively discuss the error in a real-world classification problem. The source code for the described experiment is available online
6.1. Dataset Generation
The generation probability model of the training data provided with multiple candidate labels is as defined in (14). We generate the experimental dataset according to this defi-nition, and the annotation of the training data is performed as follows. First, we prepare pretrained K class classifiers, namely, annotators, for the ordinarily labeled training data. The annotators are multi-layer perceptrons (MLP) with softmax as the output layer activation function. The annotators are trained on the MNIST, Fashion-MNIST, KuzushijiMNIST, and CIFAR-10datasets. Table 1 summarizes the datasets. Because all these
Table 1: Datasets used in the experiment.
cases correspond to the 10 class () classification problem, the data belonging to 9 is eliminated when we assume K = 5 in the following experiment. Because the discriminant functions of the annotators are normalized using softmax, each of the functions can be treated as having a data generation probability P(y|x). Therefore, assuming that the number of candidate labels is N, we generate Y in accordance with ), calculated using the discriminant functions. Similarly, for the test dataset, the true label is provided in accordance with of the annotators.
In the experiment, we pretrained the annotators for K = 10 and K = 5. When K = 10, the classification accuracies of the annotators for the MNIST, Fashion-MNIST, KuzushijiMNIST, and CIFAR-10 datasets are 99.17%, 92.67%, 95.1%, and 81.62%, respectively. For K = 5, the corresponding accuracies are 99.90%, 95.16%, 96.56%, and 85.60%. As all the accuracies are reasonably high, the generation probability P(y|x) for each class not uniform. As discussed in section 6.2, providing a set of candidate labels generated from (14) is equivalent to providing a single ordinary label generated from the true probability distribution P(y|x) with random noise from uniform distribution. Therefore, generating synthetic data as in the experiment is natural in the context of preserving privacy of training data.
Here, we show the data augmentation algorithm used in the experiment in Algorithm 1. Algorithm 1 requires as inputs and returns by providing a set of candidate labels to each of input patterns. Further, we define Annotator as a pretrained multi-layer perceptron which returns degree of classification confidence for each patterm as a stochastic function which returns a set of candidate labels based on (14).
6.2. Evaluation of Classification Error Bound
Experimental Setup: Assuming that the number of classes K is 10 or 5, the classification accuracy is compared under different numbers of candidate labels The loss functions are defined using (18), and the function is an originsymmetric sigmoid loss, i.e., 2. We set the number of training data for each class as 1, 000, and the data were randomly selected from the complete dataset. We set the batch size as 64 for the rest of the experiment. We conducted the experiment using MLP for the MNIST, Fashion-MNIST, and Kuzushiji-MNIST datasets. The number of epochs was 300, weight decay was 10, learning rate was 5 as the optimization algorithm Kingma and Ba (2015). For CIFAR-10, the experiment was performed using DenseNet Huang et al. (2016). The number of epochs was 300, weight decay was 5 , momentum was 0.9, and the optimization algorithm was the stochastic
Table 2: Experimental classification accuracies for 10 and 5 class classification (%). The experiments were performed 5 times for each case; the mean accuracy and standard deviation are presented by the upper and lower values, respectively. The highest accuracy is boldfaced.
K = 10 K = 5 N 1 2 3 4 5 6 7 8 9
gradient descent. The initial learning rate was set as 10, and it was halved every 30 epochs. The range of discriminant functions 2] for both MLP and DenseNet.
Error Computation: The classification error was calculated according to (20). Because the exact ˆf defined in Section 5.1 was not available, we surrogated it with a classifier that could minimize the loss of the training data in the experiment. Similarly, we substituted with a classifier that could minimize the loss of the test data.
Additionally, we computed the theoretical classification error bounds according to (21) and (22). We set 1 because Theorem 4 holds with a high probability if is relatively small. Furthermore, we set 0 because it is the minimum Lipschitz constant of
Table 3: Experimental classification errors for 10 and 5 class classification (experiments were performed 5 times for each case; the mean error and standard deviation are presented by the upper and lower values, respectively. The highest error is boldfaced.
K = 10 K = 5 N 1 2 3 4 5 6 7 8 9
the shifted sigmoid loss. Assuming that both MLP and DenseNet in the experiment had adequate capacity, we set
Results: We performed 5 trials for each experiment and observed the classification accuracy, that is, rate of correct classification for the test dataset estimated by a classifier which minimizes loss for training data. The mean and standard deviation for the accuracy are listed in Table 2. The results indicate that the mean of the accuracy tends to monotonically decrease corresponding to N. That is, increase in the number of complementary labels leads to a better performance than the case of
The mean and standard deviation of the experimental classification error are listed in Table 3. Furthermore, the mean of the experimental error and theoretical error bounds are shown in Figure 1 as a logarithmic graph. The results indicate that the theoretical error bound for one-versus-all and pairwise classification are about equally tight. The experimental error tends to monotonically increase corresponding to N, which is in accordance with the discussion in Section 5.2. The increase in theoretical error bound and experimental error are similar in shape, indicating that the bound reflects qualitative property found in the experimental error.
Focusing on the fact that ordinary label and complementary label are essentially equivalent, we naturally related learning framework from single ordinary/complementary label. We found that the loss functions for one-versus-all and pairwise classification satisfy certain additivity and duality which play a significant role in the analysis of classification risk and error bound. As a result, we made it possible to understand learning from ordinary/complementary label, which are studied independently in a existing literature, as a new expression form in a unified perspective.
The analysis in this paper is under the the assumption that each training data is provided from a data-generation probability model which satisfies a certain uniformity of the assigned labels. All the existing literature studying learning framework from multiple complementary labels are under the same assumption. This assumption is natural in a context of privacy preserving. As we described in this paper, the defined data-generation probability model indicates that providing multiple candidate labels generated from the model helps to prevent leakage of information about true distribution of labels for any patterns. However, it is also interesting to analyze under the problem setting where the assumption does not hold. Studies on the data-generation probability models to expand the range of application is the important future work.
Yuzhou Cao and Yitian Xu. Multi-complementary and unlabeled learning for arbitrary losses and models. ArXiv, abs/2001.04243, 2020.
Olivier Chapelle, Bernhard Schlkopf, and Alexander Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010.
Jes´us Cid-sueiro. Proper losses for learning from partial labels. In Advances in Neural Information Processing Systems 25, pages 1565–1573. 2012.
Timothee Cour, Ben Sapp, and Ben Taskar. Learning from partial labels. J. Mach. Learn. Res., pages 1501–1536, 2011.
Lei Feng, Takuo Kaneko, Bo Han, Gang Niu, Bo An, and Masashi Sugiyama. Learning from multiple complementary labels. ArXiv, abs/1912.12927, 2020.
Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems 17, pages 529–536. MIT Press, 2005.
Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional net- works. 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 2261– 2269, 2016.
Takashi Ishida, Gang Niu, Weihua Hu, and Masashi Sugiyama. Learning from complemen- tary labels. In Advances in Neural Information Processing Systems 30, pages 5639–5649. 2017.
Takashi Ishida, Gang Niu, and Masashi Sugiyama. Binary classification from positive-confidence data. In Advances in Neural Information Processing Systems 31, pages 5917– 5928. 2018.
Takashi Ishida, Gang Niu, Aditya Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In Proceedings of the 36th International Conference on Machine Learning, pages 2971–2980, 2019.
Youngdong Kim, Junho Yim, Juseung Yun, and Junmo Kim. Nlnl: Negative learning for noisy labels. In Proceedings of the IEEE International Conference on Computer Vision, pages 101–110, 2019.
Diederick P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In In Proceedings of the International Conference on Learning Representations, 2015.
Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolu- tional Networks. In Proceedings of the 5th International Conference on Learning Representations, 2017.
Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In Proceedings of the International Conference on Learning Representations, 2017.
M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Classics in Mathematics. Springer Berlin Heidelberg, 2013.
Jie Luo and Francesco Orabona. Learning from candidate labeling sets. In Advances in Neural Information Processing Systems 23, pages 1504–1512. 2010.
Gideon S. Mann and Andrew McCallum. Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the 24th International Conference on Machine Learning, page 593600, 2007.
Colin McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, London Mathematical Society Lecture Notes, pages 148–188. 1989.
Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learn- ing with noisy labels. In Advances in Neural Information Processing Systems 26, pages 1196–1204. 2013.
Gang Niu, Wittawat Jitkrittum, Bo Dai, Hirotaka Hachiya, and Masashi Sugiyama. Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning. In Proceedings of the 30th International Conference on Machine Learning, pages 10–18, 2013.
Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. J. Mach. Learn. Res., pages 1297– 1322, 2010.
Tong Zhang. Statistical analysis of some multi-category large margin classification methods. J. Mach. Learn. Res., pages 1225–1251, 2004.
Appendix A. Proofs of Loss Function Properties
As stated in Section 4, if we define a loss function L in an additive form by (12), there
exist constants which satisfy (10) and (11). We first describe proof of (10). The
following formulation holds in accordance with
Next, we describe proof of (11). Similar to the above formulation, in accordance with
=
The third and fourth equality hold due to
Appendix B. Property of the Data-Generation Probability Model
In this section, we describe proof of (15). To begin with, we introduce the following lemma.
Lemma 5 Let any finite sets be X with size of K, and any elements of the N-size power
set . Then, the following equation holds for any function f and g over X.
For the first term, any A can be chosen from patterns, and any ) can be chosen from patterns. Because ) can be chosen from patterns, the first term in (23) holds due to . Similar for the second term.
Here, we prove (15). Let ) be the probability where a set of candidate labels Y includes a label as an indicator function, then
The fourth equality holds due to lemma 5. In the fifth equality, the second term holds
because . Therefore,
The second equality holds due to the assumption where
) = 0 hold.
Appendix C. Proof of Theorem 1
We first derive the expectation of the sum of loss for all the ordinary labels.
The second equality holds due to the definition of ). The third equality holds due
to Lemma 5. Thus, the following formulation holds due to (12).
Appendix D. Proof of Lemma 2
According to the duality described in (12) and (13), loss function L can be formulated by
L (f(x), y) for N candidate labels
where respectively if 2, otherwise
Thus, the following formulation holds for any
The second inequality holds because supremum and infimum of
tively.
the following formulation holds.
Thus,
Appendix E. Proof of Lemma 3
We decribe proof for one-versus-all classification. Under the assumption that
equivalent to , the following formulation holds due to the definition of
Let ) be an indicator function and define 1, then for the first
term,
E
= E
≤
The second equality from the last holds because are drawn from the same
probabilistic distribution. For the second term,
Thus,
The second inequality holds due to ) according to Talagrand’s con-
traction lemma Ledoux and Talagrand (2013).
Note that is incorrectly assumed in Ishida et al. (2017), which causes miscalculation in proof of Lemma 3.
We further describe proof for pairwise classification. Under the assumption that h and are equivalent,
Let we define
The third equality holds because are drawn from the same probabilistic distri-
bution. Then,
Appendix F. Proof of Theorem 4
We only describe proof for one-versus-all classification; proof for pairwise classification is
similar. We substitute the with any data (), and define the
data set as . Let a set of empirical discrimination functions and empirical risk for
) respectively. Then the following formulation holds due to Lemma 2.
According to McDiarmid’s inequalityMcDiarmid (1989), for any integer 0 the following
formulation holds with a probability at least 1
Let be any dataset where each data is drawn from the data-generation
probability model
E
The second equality holds because )) are drawn from the
same probabilistic distribution; similar for )). The last inequality holds due to
Lemma 3. ˆ) holds according to (20), thus,