Online learning has been widely studied and efficiently applied to sequentially arriving data problems that cannot be solved by batch learning (Cesa-Bianchi & Lugosi, 2006; Shalev-Shwartz, 2011; Hoi et al., 2018). In online binary classification, Perceptron (Rosenblatt, 1958) is one of the most popular algorithms based on the first-order information. The Perceptron adopts the following step loss function to minimize misclassification:
where www is a parameter vector, and xt
is a sample vector at time t. In view of margin based learning, online passive-aggressive learning (PA) (Crammer et al., 2006) has been successfully developed as a notable online learning method. The PA updates their learning parameters with a large margin obtained by the following hinge loss
function:
where yt is a target label. The PA also accommodates non-separable data problems by not assuming data separability. In the use of the second-order information, the second-order perceptron (Cesa-Bianchi et al., 2005), the confidence weighted learning (Dredze et al., 2008), the adaptive regularization of weights learning (Crammer et al., 2009) have received considerable attention. Although the second-order information ensures the faster optimization convergence, the use of the first-order information is still reasonably attractive due to its simplicity.
Different from the margin based learning, Total-Error-Rate minimization (TER) based learning (Toh, 2008) has been introduced with a classification error approximation for binary classification (see Section 2.2). The TER takes two different step loss functions for false negative and false positive involved in the confusion matrix of binary classifica-tion. In order to achieve a deterministic global solution and avoid local solutions, the TER objective function adopts a quadratic approximation to the step function for the desired convexity. However, the above TER based solution has constructed a batch learning solution that has poor scalability for large-scale and real-time applications.
In this paper, we introduce a new online learning algorithm which can put in use the properties of PA and TER and overcome their individual shortcomings by combining them in which we call PATER for online binary classifica-tion. In addition, motivated by a weighted accuracy scheme (Brodersen et al., 2010), we also propose a weighted PATER (wPATER) method for data imbalance problems. Our empirical results demonstrate that the proposed methods show promising performances on 31 real-world data sets in terms of efficiency and effectiveness.
2.1. Online Passive-Aggressive Learning (PA)
We introduce online Passive-Aggressive learning (PA) which is one of the popular online learning methods of linear classifiers for binary classification (Crammer et al.,
2006). The optimization problem of the PA is as follows:
By solving this optimization problem, the PA solution is given by:
where is a Lagrange multiplier under the Karush-Khun-Tucker (KKT) condition (Boyd & Vandenberghe, 2004). The online PA solution is passively updated when the loss is close zero, whereas it is aggressively updated when it suffered from a significant loss.
2.2. Total-Error-Rate (TER) Minimization
In (Toh, 2008), a Total-Error-Rate (TER) has been utilized in an optimization problem for binary classification. The TER is defined as the sum of the False Positive Rate (FPR) and the False Negative Rate (FNR):
where
nand n
is the numbers of negative and positive samples respectively
. The two superscripts,
and +, are for indication of each class. The TER objective function has been presented based on the quadratic approximation to the two step loss functions of FP and FN as follows:
where
which yields a closed-form solution related to weighted least-squares as
where X includes two sample matrices for negative and positive classes. W = diag
is a weight matrix that includes two different weights for the negative and positive classes. y
is the target vector. The batch mode TER solution handles binary classification problems by approximately counting the misclassified samples using the two different weights, 1n
and 1n
.
In this section, based on the online PA learning, we present an online TER minimization algorithm (PATER). We start by addressing the two step loss functions www) and
www) of TER. Similar to the hinge loss function in (2), these
www) and
www) can be rewritten as:
Based on www) and
www) in (10), a new TER loss function is given by:
Similar to the PA objective function in (3), a new TER objective function using (11) can be written as:
The Lagrangian of optimization problem for TER minimization can then be defined as follows:
Taking the derivative of L(wwwwith respect to www and setting it to zero as:
the PATER solution www is given by:
In order to transform the two summation terms into vectorized recursive forms in (15), the PATER solution www can be rewritten as:
www = wwwt
where and
are indicators to select one class between the two classes at time t. zt
, z
z
xt, and z
z
xt.
Substituting the parameter vector www with (16) in (13), we get
In (17), the linearly increasing computational complexity O(dnt) of the summation terms is observed and caused by the dot product of wwwtx for each sample. To avoid this computation, two different versions, namely PATER-I and PATER-II, of (17) are given by:
PATER-II: 12
zt
n
wwwt
xt)
Taking the derivative of in (18) and (19) with respect to
and setting it zero as:
we get:
3.1. Summary
The pseudo code of the proposed PATER algorithm is given in Algorithm 1. The main difference from the PA learning is that the PATER solution is given by the newly derived objective function (12) utilizing the PA and the TER learning together. The main difference between PATER-I and PATER-II is that PATER-I includes an instantaneous loss while PATER-II uses an approximately accumulated loss in .
In this section, we present a weighted PATER algorithm (wPATER) inspired by the concept of a balanced accuracy for performance evaluation in (Brodersen et al., 2010). In order to address imbalanced data sets, a balanced accuracy
Negative, and TP is the True Positive. In view of a generalized design, a weighted accuracy can be written as:
where bAcc is given by 5. In a similar manner to this weighted accuracy, a weighted TER (wTER) can be defined as
Based on the wTER definition, similar to (11), a new wTER loss function is given by:
Accordingly, the wTER objective function can be written as:
Similar to (13), the Lagrangian of the optimization problem can then be defined as:
Similarly, the wPATER solution www is given by:
In order to establish wPATER-I and wPATER-II, similar to (18) and (19), we get:
Taking the derivative of in (31) and (32) with respect to
and setting it zero as:
Table 1. Summary of the 31 real-world data sets for binary classification.
Table 2. Comparison of average accuracies with its standard deviations and ranks.
we get:
4.1. Summary
The pseudo code of the proposed wPATER algorithm is given in Algorithm 2. The main difference from PATER is that wPATER has two weights, and
, as hyperparameters in its loss function (26) to address the imbalance data issue.
Figure 1. The cumulative classification accuracies along the online learning process. The enlarged figure is also drawn for detailed comparison. The brackets include the average accuracies.
In this section, we evaluate the performance of the proposed PATER and wPATER for binary classification. In terms of efficiency and effectiveness, the performance comparison is presented based on 31 real-world data sets.
5.1. Data sets and experimental settings
In our evaluation, 31 real-world data sets are obtained from UCI machine learning repository1 (Lichman, 2013), LIBSVM2 (Chang & Lin, 2011) and 20 Newsgroups3 which are popular in the NLP community. The description of the 31 data sets is summarized in Table 1. Each data set is conducted in a z-score normalization. The data set size ranges from 122 to 245,057 samples. The data imbalance ratio is calculated by n.
For comparison with competing state-of-the-arts, the following algorithms are used as baselines: the perceptron (PE) (Rosenblatt, 1958) and the online passive-aggressive algorithm (PA) (Crammer et al., 2006). The average accuracies and CPU times are recorded using 10 runs of 2-fold
cross-validation for each algorithm.
In wPATER, the two weights are varied as {0.01, 0.1, 0.3, 0.5, 0.9, 0.99}. When one weight is changed, another weight is fixed at 1 (i.e.,
1) to only address one side population of the imbalance binary classification problems.
5.2. Results and Discussion
Table 2 shows the average accuracies and its standard deviations on the 31 real-world data sets. Additionally, ranks for each algorithm are shown in brackets. In terms of the average accuracy, the proposed wPATER-I performs better accuracy performance than that of the other algorithms. Moreover, the proposed wPATER-II is also shown as the second best. In Table 2, both wPATER-I and wPATER-II achieve the best performances on the 29 data sets. In order to investigate the validity of the proposed wPATER for the data imbalance issue, in Table 3, we evaluate a best weight (BW), which gives the best accuracy, between and
for each data set. We then present matching results between the best weights and needed weights (NW) which are given by the data imbalance ratios from Table 1. Consequently, Table 3 shows that our assumption of the proposed wPATER on addressing the data imbalance prob-
Table 3. Evaluation of the best weights between of the proposed wPATER for the data imbalance problems.
NOTES: ‘N’ AND ‘P’ INDICATE THE NEGATIVE AND POSIVIE CLASSES RESPECTIVELY. IN NW, ‘N’ IS GIVEN IF nIS THE DATA IMBALANCE RATIO SHOWN IN TABLE 1. IN BW, ‘N’ IS GIVEN IF THE BEST PERFORMANCE IS OBTAINED FROM THE
Table 4. Comparison of average CPU times in seconds.
lems properly works in the 25 data sets. Figure 1 presents cumulative classification accuracies along the online learning process on the last 6 data sets (the data sets 26-31) that have large training samples. Except the ‘Ijcnn1’ data set in Figure 1(e), the PATER based algorithms maintain the best performers in terms of the cumulative accuracies. For time complexity, Table 4 shows the average CPU times on the aforementioned 31 data sets. In terms of the average CPU time, the proposed wPATER-I shows better CPU performance than that of the other algorithms.
To evaluate the statistical significance on the reported accuracy and CPU time comparisons, Friedman tests (see (Demˇsar, 2006)) and then Nemenyi plots are performed as a post-hoc test to statistically group connected algorithms at a confidence threshold (e.g., p = 0.05). In terms of
Figure 2. The Nemenyi test to check statistical significance of the average (a) accuracies and (b) CPU times. The connected algorithms by the Critical Difference (CD) have no statistical signifi-cance.
the accuracy rank, Fig. 2(a) has three groups of algorithms, namely 1) wPATER-I—wPATER-II, 2) wPATERII—PA and 3) PA—PATER-I—PE—PATER-II. The proposed wPATER-I and wPATER-II are shown as the best group. In terms of the computational rank, Fig. 2(b) shows four groups of algorithms, namely 1) PE—PA—wPATERI, 2) wPATER-I—PATER-I, 3) PATER-I—wPATER-II and 4) wPATER-II—PATER-II. The proposed wPATER-I is included in the best group, although this is shown as the third best individually. In summary, the proposed wPATER-I is seen as the best algorithm in terms of efficiency and effectiveness.
We have presented a passive-aggressive total-error-rate (PATER) minimization for online learning. Consequently, the proposed PATER had the useful properties of PA and TER together. In addition, a weighted PATER method (wPATER) has also presented to address the data imbalance problems. The proposed wPATER effectively and effi-ciently outperformedthe state-of-the-art algorithms, including the proposed PATER on the real-world data sets. This work was based on the first-order information. We will improve this work to extend it to the second-order methods.
Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.
Brodersen, K. H., Ong, C. S., Stephan, K. E., and Buhmann, J. M. The balanced accuracy and its posterior distribution. In International Conference on Pattern Recognition, pp. 3121–3124. IEEE, 2010.
Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006.
Cesa-Bianchi, N., Conconi, A., and Gentile, C. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.
Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/˜cjlin/libsvm.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar):551– 585, 2006.
Crammer, K., Kulesza, A., and Dredze, M. Adaptive regularization of weight vectors. In Advances in neural information processing systems, pp. 414–422, 2009.
Demˇsar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine learning research, 7 (Jan):1–30, 2006.
Dredze, M., Crammer, K., and Pereira, F. Confidence-weighted linear classification. In International Conference on Machine Learning, pp. 264–271. ACM, 2008.
Hoi, S. C. H., Sahoo, D., Lu, J., and Zhao, P. Online learning: A comprehensive survey. arXiv preprint arXiv:1802.02871, 2018.
Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psy- chological review, 65(6):386, 1958.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
Toh, K.-A. Deterministic neural classification. Neural computation, 20(6):1565–1595, 2008.