Alternate Estimation of a Classifier and the Class-Prior from Positive and Unlabeled Data

2018·Arxiv

Abstract

Abstract

We consider a problem of learning a binary classifier only from positive data and unlabeled data (PU learning) and estimating the class-prior in unlabeled data under the case-control scenario. Most of the recent methods of PU learning require an estimate of the class-prior probability in unlabeled data, and it is estimated in advance with another method. However, such a two-step approach which first estimates the class-prior and then trains a classifier may not be the optimal approach since the estimation error of the class-prior is not taken into account when a classifier is trained. In this paper, we propose a novel unified approach to estimating the class-prior and training a classifier alternately. Our proposed method is simple to implement and computationally efficient. Through experiments, we demonstrate the practical usefulness of the proposed method.

1 Introduction

We consider the problem of learning a binary classifier only from positive data and unlabeled data (PU learning). This problem arises in various practical situations, such as information retrieval and outlier detection (Elkan and Noto, 2008; Ward et al., 2009; Scott and Blanchard, 2009; Blanchard et al., 2010; Li et al., 2009; Nguyen et al., 2011). One of the theoretical milestones of PU learning is Elkan and Noto (2008) and there are subsequent researches called unbiased PU learning (du Plessis and Sugiyama, 2014; du Plessis et al., 2015), where the classification risk is estimated in an unbiased manner only from PU data.

We consider the case-control scenario (Ward et al., 2009; Elkan and Noto, 2008), where positive data are obtained separately from unlabeled data and unlabeled data is sampled from the whole population. Under this setting, the true class-prior in unlabeled data is needed for the formulation of unbiased PU learning. However, since is often unknown in practice, methods for class-prior estimation in PU learning have been developed.

There are several existing methods for class-prior estimation in PU learning. Elkan and Noto (2008) is also a milestone of class-prior estimation. Recently, du Plessis et al. (2016) proposed another method based on properly penalized Pearson divergences for class-prior estimation. Ramaswamy et al. (2016) gave a method for class-prior estimation based on kernel mean embedding. Furthermore, Jain et al. (2016) developed a method based on maximum likelihood estimation of the mixture

proportion. At present, the method proposed by Ramaswamy et al. (2016) is reported as the best method in the performance for class-prior estimation.

While many methods have been proposed in both fields of PU learning and class-prior estimation, they considered estimating the class-prior and learning a classifier separately. However, such a two-step approach of first estimating the class-prior and then training a classifier may not be the best approach—an estimation error of the class-prior in the first step is not taken into account when a classifier is trained later. In this paper, we propose a method of alternately estimating the class-prior and a classifier. The proposed algorithm is simple to implement, and it requires low computational costs. Experiments show that our proposed method is promising compared with existing approaches.

2 Formulation of PU Learning

In this section, we formulate the problem of PU learning and describe an unbiased PU learning method. Suppose that we have a positive dataset and an unlabeled dataset i.i.d. as

where p(x|y) is the class-conditional density, p(x) is the marginal density, and is the unknown class-prior for the positive class. We assume that and belong to a compact input space X. Let us define a classifier as , where f(x) is a score function, i.e., , and sign(y) denotes the sign of y. Our goal is to obtain a classifier h only from a positive dataset and an unlabeled dataset.

classification risk:

where is a loss function and E denotes the expectation over the unknown joint density p(x, y).

du Plessis et al. (2015) showed that the risk can be expressed only with the positive and unlabeled data as

where and are the expectations over p(x|y = +1) and p(x) respectively. From this expression, we can easily obtain an unbiased estimator of the classification risk from empirical data, by simply replacing the expectations by the corresponding sample averages.

We use the logarithmic loss function, i.e., and for . Then the above risk can be expressed as

We can derive some other loss functions from the logarithmic loss function and f(x). For example, if we use the sigmoid function as the score function, i.e., , where g is a function such that . In this case, the loss becomes the logistic loss: and .

3 Alternate Estimation

In this section, we propose our algorithm for learning a classifier and estimating the class-prior alternately only from PU data. Let us regard the class-prior as a parameter and denote it by . We define the risk as

and denote its empirical version by .

3.1 Algorithm in Population

In this subsection, we assume an access to infinite samples and propose an algorithm in population. After the theoretical analysis of the algorithm in population, we also define the empirical version of the algorithm. The algorithm trains a classifier f and estimates the class-prior alternately by iterating the following steps:

• Given estimated class-prior , train a classifier h by finding f that minimizes the empirical risk .

• Let us denote the minimizer as . Treat as an approximation of p(y = +1|x) and update by taking the expectation of over unlabeled data.

We denote the estimated class-prior after the kth iteration as . In the initial step, we set an initial prior such that . Then, we iterate the above alternate estimation until convergence. Intuitively, this is based on the following relationship for the true conditional probability p(y = +1|x).

In the rest of this section, we theoretically justify this estimation procedure.

3.2 Convergence to Non-negative Class-prior

In this subsection, we investigate theoretical properties of our algorithm in population for the true risk minimizer.

Assumptions: Firstly we put the following assumptions on the classification model f, p(y = +1|x), p(x) and p(x|y = +1). Let us denote the function space of f by F.

Assumption 1. f is a continuously differentiable function .

Assumption 2. Probability density function p(x) is -Lipchitz continuous, and p(x|y = +1) is -Lipchitz continuous. This means that the following inequalities hold for all :

where denotes the -norm.

We use Assumption 1 to derive a stationary point based on the Euler-Lagrange equation. Assumption 2 guarantees convergence to the true class-prior by assuming the smoothness of the probability distributions.

Then, we introduce the following quantity , which represents the largest possible class-prior given the positive and unlabeled data.

Definition 1 (Non-negative class-prior). Let us define as follows:

As Ward et al. (2009) proved, if we do not put any assumptions, class-prior estimation in the case-control scenario is an intractable task. Hence, we also put the following assumption on the true class-prior.

Assumption 3.

Assumption 3 makes us possible to identify the class-prior. The idea of estimating shared with other existing methods such as du Plessis et al. (2016).

Next, we prove that our algorithm defined with population converges to when is large enough.

3.2.1 Theoretical Convergence Guarantee:

Here, we prove our main theorem which guarantees the convergence of our class-prior estimator to the true class-prior. Given an estimator of the class-prior at the kth step, the optimization problem for learning a classifier is written as follows:

Let us denote the solution of problem (2) as . Then, we obtain a new estimator of class-prior as follows:

In order to prove the theorem, we prove the following two lemmas. The proofs of the lemmas are shown in Appendix.

Lemma 1. Suppose that Assumption 1 holds. The stationary point of the risk minimization problem is given as follows:

Lemma 2. If , then we have

Lemma 1 implies that the function given as a stationary point approaches to ,

but it is truncated by 1 if . This lemma explicates the behavior of a classifier trained under an incorrectly estimated class-prior. Lemma 2 shows the difference between an estimated class-prior used for training a classifier and another class-prior estimated by the trained classifier.

We prove the following theorem, which shows that, in population, the convergence of our class-prior estimator to the true class-prior. The theorem follows directly from the lemmas and the assumptions mentioned above. This proof is also shown in Appendix.

Theorem 1. For the initial class-prior such that in Algorithm 1 converges to as .

Necessity of : Readers might feel that we do not have to consider ; rather we might simply consider . However, it causes a problem because, in that case, there is no stationary point in our optimization problem due to the existence of in (1).

3.3 Algorithm

Here, we define an empirical version of our algorithm. Let us denote the the solution of the empirical risk minimization problem at the kth step as and the estimated class-prior after the kth iteration as . In the initial step, we set an initial prior such that . Then, we iterate the following alternate estimation procedure.

• At the kth step, given estimated class-prior , train a classifier h by finding f that minimizes the empirical risk , where is an empirical estimator of (1) with replaced by .

• Treat as an approximation of p(y = +1|x) and obtain by the average of over samples from unlabeled data.

However, unlike the property in population, choice of the initial class-prior has influence on the convergence. We observed that, if the initial class-prior is much larger than the true class-prior , tends to approach to 1 as . For example, we observed this phenomenon when the initial class-prior was , but the true class-prior was . This phenomenon is considered to be related with the estimation error and violation of the assumptions. In order to avoid this phenomenon, we introduce and as heuristics. should be a large value which is less than 1, but close to should be a small positive value. If for all x, the initial class-prior is wrong. Hence, we reset by when . Then, we restart our algorithm under a new estimator of the class-prior . A pseudo-code of our algorithm is shown in Algorithm 1.

Our algorithm is similar to expectation-maximization(EM) algorithm (Dempster et al., 1977). However, there is a crucial difference between our algorithm and EM algorithm. EM algorithm aims to maximize the likelihood, but our algorithm is based on the property of the stationary point to estimate the class-prior and do not consider the maximization of the likelihood through iterations. For this reason, although our algorithm is similar to EM algorithm, the mechanism of the estimation is quite different.

4 Experiments

In this section, we report experimental results which were conducted using numerical and real datasets1). In Comparison Tests and Benchmark Tests, we used 6 classification datasets from UCI repository2). The details of datasets are given in Table 1. In all experiments, we set and . In Comparison Tests and Benchmark Tests, we iterated 150 times in alternate estimation to make sure that it converges. In practice, it is not necessary to iterate as many steps as we did in the experiments.

We used the sigmoid function for representing the probabilistic model f:

where g(x) is a real-valued function. For g(x), we used two linear models. We denote the parameter of models as and the parameter space as . The first model is

where and denotes the transpose. This model has (d+1)-dimensional parameters when x is a d-dimensional vector. The second model is

Table 1: Specification of datasets

This model does not use the bias term. The second model might be unnatural, but empirically it performed well as we show below. We refer to the first model as AltEst1 and the second model as AltEst2 respectively. In the Numerical Tests and the Comparison Tests section, we used only AltEst1. In the Benchmark Tests section, we used both models. We assume the parameter space is a compact set. Because we also assume the input space X is a compact set, the models (3) and (4) satisfy Assumption 1.

4.1 Numerical Tests

In this subsection we numerically illustrate the convergence to the true class-prior of AltEst1. We used samples from a mixture distribution of the following two class-conditional distributions:

where denotes the univariate normal distribution with mean and variance .

Behavior of Classifiers: This experiment shows how to approach the true classifier through alternate estimation. We generated 100 positive samples and 10000 unlabeled samples. We made three datasets with different class-priors , and . We plotted the classifiers estimated from the initial class-prior 0.9 in each round (in total 10 rounds) in Fig. 1.

Behavior of Updated Class-Priors: This experiment shows how to move the estimated class-prior after the optimization under an inaccurate class-prior. We generated 100 positive samples and generated 10000 unlabeled samples with the different class-prior , and . In Fig 2, the horizontal axis represents the inserted class-prior and the vertical axis represents the updated class-prior after one iteration using the inserted class-prior. The blue line represents y = x and visualizes the fixed points. If the estimated class-prior is on the line in a round, the estimated class-prior will not change after one iteration.

In our two Gaussian datasets, we believe that the non-negative class-prior matched the true class-prior and the theoretical result explained the behavior of our algorithm.

Figure 1: Behavior of classifiers. The horizontal axis is value of x and the vertical axis is probabilistic density. The vertical line represents the classifier.

Figure 2: Behavior of updated class-priors. The horizontal axis is the initial class-prior and the vertical axis is the returned class-prior after one iteration.

4.2 Comparison Tests

To demonstrate the usefulnes of the proposed algorithm compared with two methods proposed by Ramaswamy et al. (2016), ”KM1” and ”KM2”, which are based on kernel mean embedding. We used the digits, usps, connect-4, mushroom, waveform, and spambase datasets from UCI repository. For the digits, usps, connect, and mushroom datasets, we projected data points onto the 50-dimensional space by principal component analysis (PCA). For each binary labeled dataset, we made 8 different pairs of positive and unlabeled data. Firstly, given the binary labeled dataset, we made 4 pairs of positive and unlabeled data with the different class-priors for the unlabeled dataset. The class-prior was chosen from {0.2, 0.4, 0.6, 0.8}. Next, we flipped the labels, i.e., we used negative data as positive data and positive data as negative data, and made another 4 datasets with the different class-priors. As a result, we obtained 8 pairs of positive and unlabeled data. We evaluated the performance by the absolute error defined as , where is an estimated class-prior. For each pair of 8 pairs made from 6 datasets, we estimated the class-prior 5 times and calculated the average absolute error. The number of positive data was fixed at 400. The numbers of unlabeled data were 400, 800, 1600, and 3200. This setting is almost all same as Ramaswamy et al. (2016). The result is shown in Fig 3. The horizontal axis is corresponding to the number of unlabeled data and the vertical axis is corresponding to the average absolute error . We set the initial class-prior of our algorithm as 0.9.

Our algorithm has preferable performance compared with the existing method in several

cases. The reason why our algorithm could not work well in some datasets such as the waveform might be due to the violation of assumptions.

Figure 3: Comparison: The horizontal axis is the number of unlabeled data and the vertical axis is the averaged absolute error . We set the initial class-prior of our algorithm as 0.9.

4.3 Benchmark Tests

In this subsection, we investigate the experimental performance in more detail. Unlike the previous experiment, we report estimators of the class-prior instead of the average absolute error rate. We used the digits, mushrooms, usps, waveform and spambase datasets. For digits, mushrooms and usps datasets, we reduced the dimension by PCA. For the digits dataset, we projected the data points onto the 100-dimensional space and the 200-dimensional space. For the mushrooms and usps datasets, we projected the data points onto the 100-dimensional data space. For each dataset, we drew 400 positive data and 1600 unlabeled data. After learning a classifier and estimating the class-prior, we also checked the accuracy of the classifier using test data. For the digits, mushrooms and usps datasets, we used 1000 test data. For the waveform and spambase datasets, we used 300 test data since the size of the original datasets were limited. The result is shown in Tables with the estimated class-prior and the error rate of classifiers. To evaluate the accuracy of a classifier, we compared classifiers trained by AltEst1 and AltEst2 with three classifiers trained by convex PU learning with logistic loss and the model (3) given the true class-prior and two estimated class-priors by KM1 and KM2. To evaluate the performance of class-prior estimation, we compared our algorithm with the methods proposed by Ramaswamy et al. (2016). We set the initial class-prior of our algorithm as 0.9. We ran the experiments 100 times and calculated the mean and standard deviation. We evaluated the performance of classifiers and, by reducing by PCA, we also checked the robustness of the algorithms to the dimension.

As shown in Tables , AltEst2 returned a preferable estimator of the class-prior. We can observe that our algorithms work stably in various settings, but the existing methods were easily influenced by the true class prior and the dimensionality of data. For the waveform and spambase datasets, our algorithms could not show good results. For these datasets, the accuracies of the classifier were low even given the true class-prior. We believe that the accuracy of the classifier affects the estimation of the class-prior, i.e., it is difficult to estimate the class-prior with data which is difficult to be classified. In practice, we need to specify the model more carefully to gain accuracy of a classifier.

5 Discussion and Conclusion

In this paper, we proposed a novel unified approach to estimating the class-prior and training a classifier alternately. Our method has benefits from the one-step estimation compared with other conventional two-step approaches which estimate the class-prior firstly and then train a classifier. We showed the theoretical guarantees in population and proposed an algorithm. In experiments, our method showed preferable performances. Moreover, we confirmed that, if we set the initial class-prior as the value that is close to the true class-prior, the behavior of our algorithm are improved in practice. An important future direction is to extend our method to adopt deep neural networks to gain more accuracy.

Acknowledgments

MS was supported by JST CREST JPMJCR1403.

References

Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

M. C. du Plessis and M. Sugiyama. Class prior estimation from positive and unlabeled data. IEICE Transactions on Information and Systems, E97-D(5):1358–1362, 2014.

M. C. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, pages 1386–1394, 2015.

M. C. du Plessis, G. Niu, and M. Sugiyama. Class-prior estimation for learning from positive and unlabeled data. In ACML, pages 221–236, 2016.

Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In ICDM, pages 213–220, 2008.

Izrail Moiseevitch Gelfand, Richard A Silverman, et al. Calculus of variations. Courier Corporation, 2000.

Shantanu Jain, Martha White, Michael W Trosset, and Predrag Radivojac. Nonparametric semi- supervised learning of class proportions. In NIPS, 2016.

Xiao-Li Li, Philip S Yu, Bing Liu, and See-Kiong Ng. Positive unlabeled learning for data stream classification. In ICDM, pages 259–270, 2009.

Minh Nhut Nguyen, Xiaoli-Li Li, and See-Kiong Ng. Positive unlabeled leaning for time series classification. In IJCAI, pages 1421–1426, 2011.

Harish Ramaswamy, Clayton Scott, and Ambuj Tewari. Mixture proportion estimation via kernel embeddings of distributions. In ICML, pages 2052–2060, 2016.

Ralph Tyrell Rockafellar. Convex analysis. Princeton university press, 2015.

Clayton Scott and Gilles Blanchard. Novelty detection: Unlabeled data definitely help. In AISTATS, pages 464–471, 2009.

Gill Ward, Trevor Hastie, Simon Barry, Jane Elith, and John R Leathwick. Presence-only data and the em algorithm. Biometrics, 65(2):554–563, 2009.

Table 2: digits 0 vs. 1, 2, 3, 4, 5, 6, 7, 8 with dimension 100 via PCA: The estimated class-prior (Prior: %) and the error rate of classification in test data (Error: %) are shown. Best and equivalent methods (under 5% t-test) are bold.

Table 3: digits 0 vs. 1, 2, 3, 4, 5, 6, 7, 8 with dimension 200 via PCA: The estimated class-prior (Prior: %) and the error rate of classification in test data (Error: %) are shown. Best and equivalent methods (under 5% t-test) are bold.

Table 4: mushrooms, usps, waveform and spambase (mushrooms and usps with dimension 100 via PCA): The estimated class-prior (Prior: %) and the error rate of classification in test data (Error: %) are shown. Best and equivalent methods (under 5% t-test) are bold.

A Proof of Lemma 1

To prove the lemma, the KKT condition is important. Here, we briefly explain the KKT condition. We consider the following optimization problem with continuous convex functions, a(x) and b(x):

Then we consider its Lagrange function:

where is the Lagrangian multiplier. If as . If , when as a result of minimization. As a result, we can derive the KKT condition as follows:

Proof. Let us consider maximizing J for . We introduce Lagrangian variables and , which are functions with range , and the Lagrange functional

Then we consider the maximization of L with respect to and , i.e.,

Now we obtain

Therefore,

Next, we consider the dual problem defined as follows:

Because of the convexity of for f, the following equality can be obtained Rockafellar (2015).

Hence, we discuss the solution of the dual problem. Given and , we apply the Euler-Lagrangeequation for calculating Gelfand et al. (2000). The maximizer satisfies the following equation,

Therefore,

As a result, we can derive the following KKT condition:

where and are the optimal functions, and the following results:

1. If , by (5), . This solution satisfies the constraint only when x is in the domain , where .

2. If and , by (5), . It is because as if holds only when x is in the domain , where , because

Otherwise, there is no feasible solution. As a result, the solution for the optimization problem is

B Proof of Lemma 2

Proof. By definition,

Then, we have

is strictly positive because it takes the integral on the

domain of x, which is .

C Proof of Theorem 1

Proof. Before showing the proof of the theorem, we prove the following lemma.

Lemma 3. For and , the following inequality holds.

Proof.

Then, we prove . This can be proved by mathematical induction. Let be a classifier learned at the k-th round, namely,

Then, if , we have

Since holds by the definition of , we have

Therefore, . Considering that , we have for all k > 0.

Based on this fact, we can prove the convergence. By the definition of , there exists such that

Hence, let be , and we have

Since the assumption of Lipchitz continuity of probability density functions, is -Lipchitz continuous, which implies

Let be and be . Then we have

where d is the dimension of x. Considering that , we have . Therefore, using Lemma 3, we have

Taking the sum of both sides for yields

Combined (5) with , we can conclude that as , which implies .

designed for accessibility and to further open science