b

DiscoverSearch
About
My stuff
Alternate Estimation of a Classifier and the Class-Prior from Positive and Unlabeled Data
2018·arXiv
Abstract
Abstract

We consider a problem of learning a binary classifier only from positive data and unlabeled data (PU learning) and estimating the class-prior in unlabeled data under the case-control scenario. Most of the recent methods of PU learning require an estimate of the class-prior probability in unlabeled data, and it is estimated in advance with another method. However, such a two-step approach which first estimates the class-prior and then trains a classifier may not be the optimal approach since the estimation error of the class-prior is not taken into account when a classifier is trained. In this paper, we propose a novel unified approach to estimating the class-prior and training a classifier alternately. Our proposed method is simple to implement and computationally efficient. Through experiments, we demonstrate the practical usefulness of the proposed method.

We consider the problem of learning a binary classifier only from positive data and unlabeled data (PU learning). This problem arises in various practical situations, such as information retrieval and outlier detection (Elkan and Noto, 2008; Ward et al., 2009; Scott and Blanchard, 2009; Blanchard et al., 2010; Li et al., 2009; Nguyen et al., 2011). One of the theoretical milestones of PU learning is Elkan and Noto (2008) and there are subsequent researches called unbiased PU learning (du Plessis and Sugiyama, 2014; du Plessis et al., 2015), where the classification risk is estimated in an unbiased manner only from PU data.

We consider the case-control scenario (Ward et al., 2009; Elkan and Noto, 2008), where positive data are obtained separately from unlabeled data and unlabeled data is sampled from the whole population. Under this setting, the true class-prior  π = p(y = +1)in unlabeled data is needed for the formulation of unbiased PU learning. However, since  πis often unknown in practice, methods for class-prior estimation in PU learning have been developed.

There are several existing methods for class-prior estimation in PU learning. Elkan and Noto (2008) is also a milestone of class-prior estimation. Recently, du Plessis et al. (2016) proposed another method based on properly penalized Pearson divergences for class-prior estimation. Ramaswamy et al. (2016) gave a method for class-prior estimation based on kernel mean embedding. Furthermore, Jain et al. (2016) developed a method based on maximum likelihood estimation of the mixture

proportion. At present, the method proposed by Ramaswamy et al. (2016) is reported as the best method in the performance for class-prior estimation.

While many methods have been proposed in both fields of PU learning and class-prior estimation, they considered estimating the class-prior and learning a classifier separately. However, such a two-step approach of first estimating the class-prior and then training a classifier may not be the best approach—an estimation error of the class-prior in the first step is not taken into account when a classifier is trained later. In this paper, we propose a method of alternately estimating the class-prior and a classifier. The proposed algorithm is simple to implement, and it requires low computational costs. Experiments show that our proposed method is promising compared with existing approaches.

In this section, we formulate the problem of PU learning and describe an unbiased PU learning method. Suppose that we have a positive dataset and an unlabeled dataset i.i.d. as

image

where p(x|y) is the class-conditional density, p(x) is the marginal density, and  π = p(y = +1)is the unknown class-prior for the positive class. We assume that  xiand  x′ibelong to a compact input space X. Let us define a classifier  h : X → {−1, 1}as  h(x) = sign(f(x) − 0.5), where f(x) is a score function, i.e.,  f : X → (0, 1), and sign(y) denotes the sign of y. Our goal is to obtain a classifier h only from a positive dataset and an unlabeled dataset.

image

classification risk:

image

where  ℓ : R×{±1} → Ris a loss function and E denotes the expectation over the unknown joint density p(x, y).

du Plessis et al. (2015) showed that the risk can be expressed only with the positive and unlabeled data as

image

where  E1and  EXare the expectations over p(x|y = +1) and p(x) respectively. From this expression, we can easily obtain an unbiased estimator of the classification risk from empirical data, by simply replacing the expectations by the corresponding sample averages.

We use the logarithmic loss function, i.e.,  ℓ(f(x), +1)) = − log(f(x))and  ℓ(f(x), −1) =− log(1 − f(x))for  ℓ. Then the above risk can be expressed as

image

We can derive some other loss functions from the logarithmic loss function and f(x). For example, if we use the sigmoid function as the score function, i.e.,  f(x) = 11+exp(−g(x)), where g is a function such that  g : X → R. In this case, the loss becomes the logistic loss: ℓ(f(x), +1)) = log(1 + exp(−g(x)))and  ℓ(f(x), −1)) = log(1 + exp(g(x))).

In this section, we propose our algorithm for learning a classifier and estimating the class-prior alternately only from PU data. Let us regard the class-prior as a parameter and denote it by  κ. We define the risk  Rκ(f)as

image

and denote its empirical version by ˆRκ(f).

3.1 Algorithm in Population

In this subsection, we assume an access to infinite samples and propose an algorithm in population. After the theoretical analysis of the algorithm in population, we also define the empirical version of the algorithm. The algorithm trains a classifier f and estimates the class-prior  πalternately by iterating the following steps:

Given estimated class-prior  π∗, train a classifier h by finding f that minimizes the empirical risk  Rπ∗(f).

Let us denote the minimizer as  f ∗. Treat  f ∗as an approximation of p(y = +1|x) and update  π∗by taking the expectation of  f ∗over unlabeled data.

We denote the estimated class-prior after the kth iteration as  π(k). In the initial step, we set an initial prior  π(0)such that  π(0) > π. Then, we iterate the above alternate estimation until convergence. Intuitively, this is based on the following relationship for the true conditional probability p(y = +1|x).

image

In the rest of this section, we theoretically justify this estimation procedure.

3.2 Convergence to Non-negative Class-prior

In this subsection, we investigate theoretical properties of our algorithm in population for the true risk minimizer.

Assumptions: Firstly we put the following assumptions on the classification model f, p(y = +1|x), p(x) and p(x|y = +1). Let us denote the function space of f by F.

Assumption 1. f is a continuously differentiable function  f : Rd → (0, 1 − ǫ].

Assumption 2. Probability density function p(x) is  L1-Lipchitz continuous, and p(x|y = +1) is  L2-Lipchitz continuous. This means that the following inequalities hold for all  x, x′ ∈ Rd:

image

where  ∥ · ∥1denotes the  l1-norm.

We use Assumption 1 to derive a stationary point based on the Euler-Lagrange equation. Assumption 2 guarantees convergence to the true class-prior by assuming the smoothness of the probability distributions.

Then, we introduce the following quantity  πmax, which represents the largest possible class-prior given the positive and unlabeled data.

Definition 1 (Non-negative class-prior). Let us define  πmaxas follows:

image

As Ward et al. (2009) proved, if we do not put any assumptions, class-prior estimation in the case-control scenario is an intractable task. Hence, we also put the following assumption on the true class-prior.

Assumption 3.

image

Assumption 3 makes us possible to identify the class-prior. The idea of estimating  πmaxshared with other existing methods such as du Plessis et al. (2016).

Next, we prove that our algorithm defined with population converges to  πmaxwhen  ˆπ(0)is large enough.

3.2.1 Theoretical Convergence Guarantee:

Here, we prove our main theorem which guarantees the convergence of our class-prior estimator to the true class-prior. Given an estimator of the class-prior  π(k)at the kth step, the optimization problem for learning a classifier is written as follows:

image

Let us denote the solution of problem (2) as  f (k+1)∗. Then, we obtain a new estimator of class-prior  π(k+1)as follows:

image

In order to prove the theorem, we prove the following two lemmas. The proofs of the lemmas are shown in Appendix.

Lemma 1. Suppose that Assumption 1 holds. The stationary point of the risk minimization problem is given as follows:

image

Lemma 2. If  π(k) > πmax, then we have

image

Lemma 1 implies that the function given as a stationary point approaches to π(k)p(x|y=+1)p(x),

but it is truncated by 1 if π(k)p(x|y=+1)p(x) ≥ 1. This lemma explicates the behavior of a classifier trained under an incorrectly estimated class-prior. Lemma 2 shows the difference between an estimated class-prior used for training a classifier and another class-prior estimated by the trained classifier.

We prove the following theorem, which shows that, in population, the convergence of our class-prior estimator to the true class-prior. The theorem follows directly from the lemmas and the assumptions mentioned above. This proof is also shown in Appendix.

Theorem 1. For the initial class-prior  π(0)such that  π(0) > π, π(k)in Algorithm 1 converges to πmaxas  k → ∞.

Necessity of  ǫ: Readers might feel that we do not have to consider  ǫ > 0; rather we might simply consider  f : Rd → [0, 1]. However, it causes a problem because, in that case, there is no stationary point in our optimization problem due to the existence of  log(1 − f(x))in (1).

3.3 Algorithm

Here, we define an empirical version of our algorithm. Let us denote the the solution of the empirical risk minimization problem at the kth step as ˆf (k+1)and the estimated class-prior after the kth iteration as  ˆπ(k+1). In the initial step, we set an initial prior  ˆπ(0)such that  ˆπ(0) > π. Then, we iterate the following alternate estimation procedure.

At the kth step, given estimated class-prior  ˆπ(k), train a classifier h by finding f that minimizes the empirical risk ˆRˆπ(k)(f), where ˆRˆπ(k)(f)is an empirical estimator of (1) with  πreplaced by  ˆπ(k).

Treat ˆf (k+1)as an approximation of p(y = +1|x) and obtain  ˆπ(k+1)by the average of ˆf (k+1)over samples from unlabeled data.

However, unlike the property in population, choice of the initial class-prior has influence on the convergence. We observed that, if the initial class-prior  ˆπ(0)is much larger than the true class-prior  π, ˆf (k)tends to approach to 1 as  k → +∞. For example, we observed this phenomenon when the initial class-prior was  ˆπ(0) = 0.9, but the true class-prior was  π = 0.2. This phenomenon is considered to be related with the estimation error and violation of the assumptions. In order to avoid this phenomenon, we introduce  δand  ξas heuristics.  δshould be a large value which is less than 1, but close to  1. ξshould be a small positive value. If ˆf (k)(x) = 1−ǫfor all x, the initial class-prior is wrong. Hence, we reset  ˆπ(0)by  ˆπ(0) − ξwhen  ˆπ(k) > δ. Then, we restart our algorithm under a new estimator of the class-prior  ˆπ(0). A pseudo-code of our algorithm is shown in Algorithm 1.

image

Our algorithm is similar to expectation-maximization(EM) algorithm (Dempster et al., 1977). However, there is a crucial difference between our algorithm and EM algorithm. EM algorithm aims to maximize the likelihood, but our algorithm is based on the property of the stationary point to estimate the class-prior and do not consider the maximization of the likelihood through iterations. For this reason, although our algorithm is similar to EM algorithm, the mechanism of the estimation is quite different.

In this section, we report experimental results which were conducted using numerical and real datasets1). In Comparison Tests and Benchmark Tests, we used 6 classification datasets from UCI repository2). The details of datasets are given in Table 1. In all experiments, we set  δ = 0.9and  ξ = 0.01. In Comparison Tests and Benchmark Tests, we iterated 150 times in alternate estimation to make sure that it converges. In practice, it is not necessary to iterate as many steps as we did in the experiments.

We used the sigmoid function for representing the probabilistic model f:

image

where g(x) is a real-valued function. For g(x), we used two linear models. We denote the parameter of models as  θand the parameter space as  Θ. The first model is

image

where  z = (1, x⊤)⊤and ⊤denotes the transpose. This model has (d+1)-dimensional parameters θwhen x is a d-dimensional vector. The second model is

image

Table 1: Specification of datasets

image

This model does not use the bias term. The second model might be unnatural, but empirically it performed well as we show below. We refer to the first model as AltEst1 and the second model as AltEst2 respectively. In the Numerical Tests and the Comparison Tests section, we used only AltEst1. In the Benchmark Tests section, we used both models. We assume the parameter space Θis a compact set. Because we also assume the input space X is a compact set, the models (3) and (4) satisfy Assumption 1.

4.1 Numerical Tests

In this subsection we numerically illustrate the convergence to the true class-prior of AltEst1. We used samples from a mixture distribution of the following two class-conditional distributions:

image

where  N(µ, σ2)denotes the univariate normal distribution with mean  µand variance  σ2.

Behavior of Classifiers: This experiment shows how to approach the true classifier through alternate estimation. We generated 100 positive samples and 10000 unlabeled samples. We made three datasets with different class-priors  π = 0.2, π = 0.5, and  π = 0.8. We plotted the classifiers estimated from the initial class-prior 0.9 in each round (in total 10 rounds) in Fig. 1.

Behavior of Updated Class-Priors: This experiment shows how to move the estimated class-prior after the optimization under an inaccurate class-prior. We generated 100 positive samples and generated 10000 unlabeled samples with the different class-prior  π = 0.2, π = 0.5, and π = 0.8. In Fig 2, the horizontal axis represents the inserted class-prior and the vertical axis represents the updated class-prior after one iteration using the inserted class-prior. The blue line represents y = x and visualizes the fixed points. If the estimated class-prior is on the line in a round, the estimated class-prior will not change after one iteration.

In our two Gaussian datasets, we believe that the non-negative class-prior matched the true class-prior and the theoretical result explained the behavior of our algorithm.

image

Figure 1: Behavior of classifiers. The horizontal axis is value of x and the vertical axis is probabilistic density. The vertical line represents the classifier.

image

Figure 2: Behavior of updated class-priors. The horizontal axis is the initial class-prior and the vertical axis is the returned class-prior after one iteration.

4.2 Comparison Tests

To demonstrate the usefulnes of the proposed algorithm compared with two methods proposed by Ramaswamy et al. (2016), ”KM1” and ”KM2”, which are based on kernel mean embedding. We used the digits, usps, connect-4, mushroom, waveform, and spambase datasets from UCI repository. For the digits, usps, connect, and mushroom datasets, we projected data points onto the 50-dimensional space by principal component analysis (PCA). For each binary labeled dataset, we made 8 different pairs of positive and unlabeled data. Firstly, given the binary labeled dataset, we made 4 pairs of positive and unlabeled data with the different class-priors for the unlabeled dataset. The class-prior was chosen from {0.2, 0.4, 0.6, 0.8}. Next, we flipped the labels, i.e., we used negative data as positive data and positive data as negative data, and made another 4 datasets with the different class-priors. As a result, we obtained 8 pairs of positive and unlabeled data. We evaluated the performance by the absolute error defined as  |ˆπ − π|, where ˆπis an estimated class-prior. For each pair of 8 pairs made from 6 datasets, we estimated the class-prior 5 times and calculated the average absolute error. The number of positive data was fixed at 400. The numbers of unlabeled data were 400, 800, 1600, and 3200. This setting is almost all same as Ramaswamy et al. (2016). The result is shown in Fig 3. The horizontal axis is corresponding to the number of unlabeled data and the vertical axis is corresponding to the average absolute error  |ˆπ − π|. We set the initial class-prior of our algorithm as 0.9.

Our algorithm has preferable performance compared with the existing method in several

cases. The reason why our algorithm could not work well in some datasets such as the waveform might be due to the violation of assumptions.

image

Figure 3: Comparison: The horizontal axis is the number of unlabeled data and the vertical axis is the averaged absolute error  |ˆπ − π|. We set the initial class-prior of our algorithm as 0.9.

4.3 Benchmark Tests

In this subsection, we investigate the experimental performance in more detail. Unlike the previous experiment, we report estimators of the class-prior instead of the average absolute error rate. We used the digits, mushrooms, usps, waveform and spambase datasets. For digits, mushrooms and usps datasets, we reduced the dimension by PCA. For the digits dataset, we projected the data points onto the 100-dimensional space and the 200-dimensional space. For the mushrooms and usps datasets, we projected the data points onto the 100-dimensional data space. For each dataset, we drew 400 positive data and 1600 unlabeled data. After learning a classifier and estimating the class-prior, we also checked the accuracy of the classifier using test data. For the digits, mushrooms and usps datasets, we used 1000 test data. For the waveform and spambase datasets, we used 300 test data since the size of the original datasets were limited. The result is shown in Tables  2 − 4with the estimated class-prior and the error rate of classifiers. To evaluate the accuracy of a classifier, we compared classifiers trained by AltEst1 and AltEst2 with three classifiers trained by convex PU learning with logistic loss and the model (3) given the true class-prior and two estimated class-priors by KM1 and KM2. To evaluate the performance of class-prior estimation, we compared our algorithm with the methods proposed by Ramaswamy et al. (2016). We set the initial class-prior of our algorithm as 0.9. We ran the experiments 100 times and calculated the mean and standard deviation. We evaluated the performance of classifiers and, by reducing by PCA, we also checked the robustness of the algorithms to the dimension.

As shown in Tables  2 − 4, AltEst2 returned a preferable estimator of the class-prior. We can observe that our algorithms work stably in various settings, but the existing methods were easily influenced by the true class prior and the dimensionality of data. For the waveform and spambase datasets, our algorithms could not show good results. For these datasets, the accuracies of the classifier were low even given the true class-prior. We believe that the accuracy of the classifier affects the estimation of the class-prior, i.e., it is difficult to estimate the class-prior with data which is difficult to be classified. In practice, we need to specify the model more carefully to gain accuracy of a classifier.

In this paper, we proposed a novel unified approach to estimating the class-prior and training a classifier alternately. Our method has benefits from the one-step estimation compared with other conventional two-step approaches which estimate the class-prior firstly and then train a classifier. We showed the theoretical guarantees in population and proposed an algorithm. In experiments, our method showed preferable performances. Moreover, we confirmed that, if we set the initial class-prior as the value that is close to the true class-prior, the behavior of our algorithm are improved in practice. An important future direction is to extend our method to adopt deep neural networks to gain more accuracy.

MS was supported by JST CREST JPMJCR1403.

Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

M. C. du Plessis and M. Sugiyama. Class prior estimation from positive and unlabeled data. IEICE Transactions on Information and Systems, E97-D(5):1358–1362, 2014.

M. C. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, pages 1386–1394, 2015.

M. C. du Plessis, G. Niu, and M. Sugiyama. Class-prior estimation for learning from positive and unlabeled data. In ACML, pages 221–236, 2016.

Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In ICDM, pages 213–220, 2008.

Izrail Moiseevitch Gelfand, Richard A Silverman, et al. Calculus of variations. Courier Corporation, 2000.

Shantanu Jain, Martha White, Michael W Trosset, and Predrag Radivojac. Nonparametric semi- supervised learning of class proportions. In NIPS, 2016.

Xiao-Li Li, Philip S Yu, Bing Liu, and See-Kiong Ng. Positive unlabeled learning for data stream classification. In ICDM, pages 259–270, 2009.

Minh Nhut Nguyen, Xiaoli-Li Li, and See-Kiong Ng. Positive unlabeled leaning for time series classification. In IJCAI, pages 1421–1426, 2011.

Harish Ramaswamy, Clayton Scott, and Ambuj Tewari. Mixture proportion estimation via kernel embeddings of distributions. In ICML, pages 2052–2060, 2016.

Ralph Tyrell Rockafellar. Convex analysis. Princeton university press, 2015.

Clayton Scott and Gilles Blanchard. Novelty detection: Unlabeled data definitely help. In AISTATS, pages 464–471, 2009.

Gill Ward, Trevor Hastie, Simon Barry, Jane Elith, and John R Leathwick. Presence-only data and the em algorithm. Biometrics, 65(2):554–563, 2009.

image

Table 2: digits 0 vs. 1, 2, 3, 4, 5, 6, 7, 8 with dimension 100 via PCA: The estimated class-prior (Prior: %) and the error rate of classification in test data (Error: %) are shown. Best and equivalent methods (under 5% t-test) are bold.

image

Table 3: digits 0 vs. 1, 2, 3, 4, 5, 6, 7, 8 with dimension 200 via PCA: The estimated class-prior (Prior: %) and the error rate of classification in test data (Error: %) are shown. Best and equivalent methods (under 5% t-test) are bold.

image

Table 4: mushrooms, usps, waveform and spambase (mushrooms and usps with dimension 100 via PCA): The estimated class-prior (Prior: %) and the error rate of classification in test data (Error: %) are shown. Best and equivalent methods (under 5% t-test) are bold.

To prove the lemma, the KKT condition is important. Here, we briefly explain the KKT condition. We consider the following optimization problem with continuous convex functions, a(x) and b(x):

image

Then we consider its Lagrange function:

image

where  λis the Lagrangian multiplier. If  b(x) > 0, λb(x) → +∞as  λ → +∞. If  b(x) ≤ 0, λb(x) = 0when  λ = 0as a result of minimization. As a result, we can derive the KKT condition as follows:

image

Proof. Let us consider maximizing J for  f ∈ (0, 1−ǫ]. We introduce Lagrangian variables  α(x)and  β(x), which are functions  Rd → Rwith range  (0, +∞), and the Lagrange functional

image

Then we consider the maximization of L with respect to  αand  β, i.e.,

image

Now we obtain

image

Therefore,

image

Next, we consider the dual problem defined as follows:

image

Because of the convexity of  Rπ(k)(f)for f, the following equality can be obtained Rockafellar (2015).

image

Hence, we discuss the solution of the dual problem. Given  αand  β, we apply the Euler-Lagrangeequation for calculating  inff L(f; α, β)Gelfand et al. (2000). The maximizer  f (k+1)∗(x)satisfies the following equation,

image

Therefore,

image

As a result, we can derive the following KKT condition:

image

where  f (k+1)∗(x), α(k+1)∗(x)and  β(k+1)∗(x)are the optimal functions, and the following results:

1. If  α(k+1)∗(x) = β(k+1)∗(x) = 0, by (5),  f (k+1)∗(x) = π(k)p(x|y=+1)p(x). This solution satisfies the constraint only when x is in the domain  D(k), where  D(k) = {x|π(k)p(x|y =1) ≤ (1 − ǫ)p(x)}.

2. If  α(k+1)∗(x) > 0and  β(k+1)∗(x) = 0, by (5),  f (k+1)∗(x) = 1 − ǫ. It is because inff L(f; α, β) → +∞as  α(k+1)∗(x) → +∞if  f (k+1)∗(x) < 1 − ǫ. α(k+1)∗(x) > 0holds only when x is in the domain  R\D(k), where  R\D(k) = {x|π(k)p(x|y = 1) <(1 − ǫ)p(x)}, because

image

image

Otherwise, there is no feasible solution. As a result, the solution for the optimization problem f (k+1)∗(x) = arg minf(x)∈(0,1−ǫ] Rπ(k)(f)is

image

Proof. By definition,

image

Then, we have

image

�R\D�π(k)p(x|y=+1)p(x) − (1 − ǫ)�p(x)dxis strictly positive because it takes the integral on the

domain of x, which is  {x|π(k)p(x|y = 1) − (1 − ǫ)p(x) > 0}.

Proof. Before showing the proof of the theorem, we prove the following lemma.

Lemma 3. For  a ≥ b > 0and  n ∈ {1, 2, . . .}, the following inequality holds.

image

Proof.

image

Then, we prove  π(k) ≥ πmax. This can be proved by mathematical induction. Let  f (k)(x)be a classifier learned at the k-th round, namely,

image

Then, if  π(k) ≥ πmax, we have

image

Since  1 − ǫ ≥ πmaxp(x|y=+1)p(x)holds by the definition of  πmax, we have

image

Therefore,  π(k) ≥ πmax ⇒ π(k+1) ≥ πmax. Considering that  π(0) ≥ πmax, we have  π(k) ≥πmaxfor all k > 0.

Based on this fact, we can prove the convergence. By the definition of  πmax, there exists x∗ ∈ Rdsuch that

image

Hence, let  g(k)(x)be  g(k)(x) = π(k)p(x|y = +1) − (1 − ǫ)p(x), and we have

image

Since the assumption of Lipchitz continuity of probability density functions,  g(k)(x)is  (π(k)L1+(1 − ǫ)L2)-Lipchitz continuous, which implies

image

Let  ykbe  yk = π(k) − πmaxand  p∗be  p∗ = p(x∗|y = +1). Then we have

image

where d is the dimension of x. Considering that  π(k) > πmax, we have  yk > yk+1 > 0. Therefore, using Lemma 3, we have

image

Taking the sum of both sides for  t = 1, . . . , T − 1yields

image

Combined (5) with  yT ≥ 0, we can conclude that  yT → 0as  T → +∞, which implies  π(k) →πmax.


Designed for Accessibility and to further Open Science