b

DiscoverSearch
About
My stuff
Online Passive-Aggressive Total-Error-Rate Minimization
2020·arXiv
Abstract
Abstract

We provide a new online learning algorithm which utilizes online passive-aggressive learning (PA) and total-error-rate minimization (TER) for binary classification. The PA learning establishes not only large margin training but also the capacity to handle non-separable data. The TER learning on the other hand minimizes an approximated classification error based objective function. We propose an online PATER algorithm which combines those useful properties. In addition, we also present a weighted PATER algorithm to improve the ability to cope with data imbalance problems. Experimental results demonstrate that the proposed PATER algorithms achieves better performances in terms of efficiency and effectiveness than the existing state-of-the-art online learning algorithms in real-world data sets.

Online learning has been widely studied and efficiently applied to sequentially arriving data problems that cannot be solved by batch learning (Cesa-Bianchi & Lugosi, 2006; Shalev-Shwartz, 2011; Hoi et al., 2018). In online binary classification, Perceptron (Rosenblatt, 1958) is one of the most popular algorithms based on the first-order information. The Perceptron adopts the following step loss function to minimize misclassification:

image

where www  ∈ Rdis a parameter vector, and xt  ∈ Rdis a sample vector at time t. In view of margin based learning, online passive-aggressive learning (PA) (Crammer et al., 2006) has been successfully developed as a notable online learning method. The PA updates their learning parameters with a large margin obtained by the following hinge loss

function:

image

where yt  ∈ {−1,1}is a target label. The PA also accommodates non-separable data problems by not assuming data separability. In the use of the second-order information, the second-order perceptron (Cesa-Bianchi et al., 2005), the confidence weighted learning (Dredze et al., 2008), the adaptive regularization of weights learning (Crammer et al., 2009) have received considerable attention. Although the second-order information ensures the faster optimization convergence, the use of the first-order information is still reasonably attractive due to its simplicity.

Different from the margin based learning, Total-Error-Rate minimization (TER) based learning (Toh, 2008) has been introduced with a classification error approximation for binary classification (see Section 2.2). The TER takes two different step loss functions for false negative and false positive involved in the confusion matrix of binary classifica-tion. In order to achieve a deterministic global solution and avoid local solutions, the TER objective function adopts a quadratic approximation to the step function for the desired convexity. However, the above TER based solution has constructed a batch learning solution that has poor scalability for large-scale and real-time applications.

In this paper, we introduce a new online learning algorithm which can put in use the properties of PA and TER and overcome their individual shortcomings by combining them in which we call PATER for online binary classifica-tion. In addition, motivated by a weighted accuracy scheme (Brodersen et al., 2010), we also propose a weighted PATER (wPATER) method for data imbalance problems. Our empirical results demonstrate that the proposed methods show promising performances on 31 real-world data sets in terms of efficiency and effectiveness.

2.1. Online Passive-Aggressive Learning (PA)

We introduce online Passive-Aggressive learning (PA) which is one of the popular online learning methods of linear classifiers for binary classification (Crammer et al.,

2006). The optimization problem of the PA is as follows:

image

By solving this optimization problem, the PA solution is given by:

image

where τt = ℓhinge(www)∥xt∥2is a Lagrange multiplier under the Karush-Khun-Tucker (KKT) condition (Boyd & Vandenberghe, 2004). The online PA solution is passively updated when the loss is close zero, whereas it is aggressively updated when it suffered from a significant loss.

2.2. Total-Error-Rate (TER) Minimization

In (Toh, 2008), a Total-Error-Rate (TER) has been utilized in an optimization problem for binary classification. The TER is defined as the sum of the False Positive Rate (FPR) and the False Negative Rate (FNR):

image

where

image

n−and n+is the numbers of negative and positive samples respectively  (n = n−+n+). The two superscripts,  −and +, are for indication of each class. The TER objective function has been presented based on the quadratic approximation to the two step loss functions of FP and FN as follows:

image

where

image

which yields a closed-form solution related to weighted least-squares as

image

where X  = [X−,X+]Tincludes two sample matrices for negative and positive classes. W = diag�� 1n− , ... , 1n− , 1n+ , ... , 1n+��is a weight matrix that includes two different weights for the negative and positive classes. y  =�−1, ..., −1, 1, ..., 1�is the target vector. The batch mode TER solution handles binary classification problems by approximately counting the misclassified samples using the two different weights, 1n−and 1n+.

In this section, based on the online PA learning, we present an online TER minimization algorithm (PATER). We start by addressing the two step loss functions  ℓFPstep(www) and ℓFNstep(www) of TER. Similar to the hinge loss function in (2), these  ℓFPstep(www) and  ℓFNstep(www) can be rewritten as:

image

Based on  ℓFPhinge(www) and  ℓFNhinge(www) in (10), a new TER loss function is given by:

image

Similar to the PA objective function in (3), a new TER objective function using (11) can be written as:

image

The Lagrangian of optimization problem for TER minimization can then be defined as follows:

image

Taking the derivative of L(www,τ)with respect to www and setting it to zero as:

image

image

the PATER solution www is given by:

image

In order to transform the two summation terms into vectorized recursive forms in (15), the PATER solution www can be rewritten as:

image

www = wwwt  − τt

image

where  λ −t = 1−yt2and  λ +t = 1+yt2are indicators to select one class between the two classes at time t. zt  = z+t − z−t, z−t =n−t−1n−tz−t−1 + λ −tn−txt, and z+t =n+t−1n+tz+t−1 + λ +tn+txt.

Substituting the parameter vector www with (16) in (13), we get

image

In (17), the linearly increasing computational complexity O(dnt) of the summation terms is observed and caused by the dot product of wwwt·x for each sample. To avoid this computation, two different versions, namely PATER-I and PATER-II, of (17) are given by:

image

PATER-II:  L(τ) = −12τ2 ∥zt∥2 + τ

�n−t−1n−t k−t−1 + λ −tn−t (1 +wwwt·xt)

Taking the derivative of  L(τ)in (18) and (19) with respect to  τand setting it zero as:

image

we get:

image

3.1. Summary

The pseudo code of the proposed PATER algorithm is given in Algorithm 1. The main difference from the PA learning is that the PATER solution is given by the newly derived objective function (12) utilizing the PA and the TER learning together. The main difference between PATER-I and PATER-II is that PATER-I includes an instantaneous loss while PATER-II uses an approximately accumulated loss in  τt.

In this section, we present a weighted PATER algorithm (wPATER) inspired by the concept of a balanced accuracy for performance evaluation in (Brodersen et al., 2010). In order to address imbalanced data sets, a balanced accuracy

image

Negative, and TP is the True Positive. In view of a generalized design, a weighted accuracy can be written as:

image

where bAcc is given by  α− = α+ = 0.5. In a similar manner to this weighted accuracy, a weighted TER (wTER) can be defined as

image

Based on the wTER definition, similar to (11), a new wTER loss function is given by:

image

Accordingly, the wTER objective function can be written as:

image

Similar to (13), the Lagrangian of the optimization problem can then be defined as:

image

Similarly, the wPATER solution www is given by:

image

image

In order to establish wPATER-I and wPATER-II, similar to (18) and (19), we get:

image

Taking the derivative of  L(τ)in (31) and (32) with respect to  τand setting it zero as:

image

Table 1. Summary of the 31 real-world data sets for binary classification.

image

Table 2. Comparison of average accuracies with its standard deviations and ranks.

image

we get:

image

image

4.1. Summary

The pseudo code of the proposed wPATER algorithm is given in Algorithm 2. The main difference from PATER is that wPATER has two weights,  α−and  α+, as hyperparameters in its loss function (26) to address the imbalance data issue.

image

Figure 1. The cumulative classification accuracies along the online learning process. The enlarged figure is also drawn for detailed comparison. The brackets include the average accuracies.

In this section, we evaluate the performance of the proposed PATER and wPATER for binary classification. In terms of efficiency and effectiveness, the performance comparison is presented based on 31 real-world data sets.

5.1. Data sets and experimental settings

In our evaluation, 31 real-world data sets are obtained from UCI machine learning repository1 (Lichman, 2013), LIBSVM2 (Chang & Lin, 2011) and 20 Newsgroups3 which are popular in the NLP community. The description of the 31 data sets is summarized in Table 1. Each data set is conducted in a z-score normalization. The data set size ranges from 122 to 245,057 samples. The data imbalance ratio is calculated by n+n−.

For comparison with competing state-of-the-arts, the following algorithms are used as baselines: the perceptron (PE) (Rosenblatt, 1958) and the online passive-aggressive algorithm (PA) (Crammer et al., 2006). The average accuracies and CPU times are recorded using 10 runs of 2-fold

cross-validation for each algorithm.

In wPATER, the two weights are varied as  α−,α+ ∈{0.01, 0.1, 0.3, 0.5, 0.9, 0.99}. When one weight is changed, another weight is fixed at 1 (i.e.,  α− = 0.1, α+ =1) to only address one side population of the imbalance binary classification problems.

5.2. Results and Discussion

Table 2 shows the average accuracies and its standard deviations on the 31 real-world data sets. Additionally, ranks for each algorithm are shown in brackets. In terms of the average accuracy, the proposed wPATER-I performs better accuracy performance than that of the other algorithms. Moreover, the proposed wPATER-II is also shown as the second best. In Table 2, both wPATER-I and wPATER-II achieve the best performances on the 29 data sets. In order to investigate the validity of the proposed wPATER for the data imbalance issue, in Table 3, we evaluate a best weight (BW), which gives the best accuracy, between  α−and  α+for each data set. We then present matching results between the best weights and needed weights (NW) which are given by the data imbalance ratios from Table 1. Consequently, Table 3 shows that our assumption of the proposed wPATER on addressing the data imbalance prob-

Table 3. Evaluation of the best weights between  α− and α+ of the proposed wPATER for the data imbalance problems.

image

NOTES: ‘N’ AND ‘P’ INDICATE THE NEGATIVE AND POSIVIE CLASSES RESPECTIVELY. IN NW, ‘N’ IS GIVEN IF n+n− ≥ 1; OTHERWISE ‘P’ IS GIVEN.n+n−IS THE DATA IMBALANCE RATIO SHOWN IN TABLE 1. IN BW, ‘N’ IS GIVEN IF THE BEST PERFORMANCE IS OBTAINED FROM THE  α− VARIATION; OTHERWISE ‘P’ IS GIVEN.

Table 4. Comparison of average CPU times in seconds.

image

lems properly works in the 25 data sets. Figure 1 presents cumulative classification accuracies along the online learning process on the last 6 data sets (the data sets 26-31) that have large training samples. Except the ‘Ijcnn1’ data set in Figure 1(e), the PATER based algorithms maintain the best performers in terms of the cumulative accuracies. For time complexity, Table 4 shows the average CPU times on the aforementioned 31 data sets. In terms of the average CPU time, the proposed wPATER-I shows better CPU performance than that of the other algorithms.

To evaluate the statistical significance on the reported accuracy and CPU time comparisons, Friedman tests (see (Demˇsar, 2006)) and then Nemenyi plots are performed as a post-hoc test to statistically group connected algorithms at a confidence threshold (e.g., p = 0.05). In terms of

image

Figure 2. The Nemenyi test to check statistical significance of the average (a) accuracies and (b) CPU times. The connected algorithms by the Critical Difference (CD) have no statistical signifi-cance.

the accuracy rank, Fig. 2(a) has three groups of algorithms, namely 1) wPATER-I—wPATER-II, 2) wPATERII—PA and 3) PA—PATER-I—PE—PATER-II. The proposed wPATER-I and wPATER-II are shown as the best group. In terms of the computational rank, Fig. 2(b) shows four groups of algorithms, namely 1) PE—PA—wPATERI, 2) wPATER-I—PATER-I, 3) PATER-I—wPATER-II and 4) wPATER-II—PATER-II. The proposed wPATER-I is included in the best group, although this is shown as the third best individually. In summary, the proposed wPATER-I is seen as the best algorithm in terms of efficiency and effectiveness.

We have presented a passive-aggressive total-error-rate (PATER) minimization for online learning. Consequently, the proposed PATER had the useful properties of PA and TER together. In addition, a weighted PATER method (wPATER) has also presented to address the data imbalance problems. The proposed wPATER effectively and effi-ciently outperformedthe state-of-the-art algorithms, including the proposed PATER on the real-world data sets. This work was based on the first-order information. We will improve this work to extend it to the second-order methods.

Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.

Brodersen, K. H., Ong, C. S., Stephan, K. E., and Buhmann, J. M. The balanced accuracy and its posterior distribution. In International Conference on Pattern Recognition, pp. 3121–3124. IEEE, 2010.

Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/˜cjlin/libsvm.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar):551– 585, 2006.

Crammer, K., Kulesza, A., and Dredze, M. Adaptive regularization of weight vectors. In Advances in neural information processing systems, pp. 414–422, 2009.

Demˇsar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine learning research, 7 (Jan):1–30, 2006.

Dredze, M., Crammer, K., and Pereira, F. Confidence-weighted linear classification. In International Conference on Machine Learning, pp. 264–271. ACM, 2008.

Hoi, S. C. H., Sahoo, D., Lu, J., and Zhao, P. Online learning: A comprehensive survey. arXiv preprint arXiv:1802.02871, 2018.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psy- chological review, 65(6):386, 1958.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

Toh, K.-A. Deterministic neural classification. Neural computation, 20(6):1565–1595, 2008.


Designed for Accessibility and to further Open Science