Nonlinear classifiers for ranking problems based on kernelized SVM

2020·Arxiv

Abstract

Abstract

Many classification problems focus on maximizing the performance only on the samples with the highest relevance instead of all samples. As an example, we can mention ranking problems, accuracy at the top or search engines where only the top few queries matter. In our previous work, we derived a general framework including several classes of these linear classification problems. In this paper, we extend the framework to nonlinear classifiers. Utilizing a similarity to SVM, we dualize the problems, add kernels and propose a componentwise dual ascent method.

1 Introduction

The aim of classical linear binary classification is to separate positive and negative samples by a linear hyperplane. In many applications, it is desirable to separate only a certain number of samples. In such a case, the goal is not to maximize the performance on all samples but only the performance on the required samples with the highest relevance. Such classifiers have many applications. For example, in information retrieval systems, only the most relevant documents should be returned for a given query. Furthermore, they are useful in domains, where a large number of samples needs to be quickly screened and only a small subset of samples needs to be selected for further evaluation.

These problems can be generally written as pushing the positive samples above some decision threshold. The methods differ in the definition of the decision threshold. In our previous work [2], we introduced a general framework that unifies these methods. We showed that several problem classes, which were considered as separate problems so far, fit into the framework. As the most relevant we mention the following methods:

• Ranking problems focuses on ranking the positive samples higher than the negative ones. Many methods, such as RankBoost [14], Infinite Push [3] or p-norm push [24] employ a pairwise comparison of samples, which makes them infeasible for larger datasets. This was alleviated in TopPush [20] where the authors considered the limit p → ∞. Since the l∞ norm from TopPush is equal to the maximum, the decision threshold from our framework equals to the maximum of scores of negative samples. This was generalized into TopPushK [2] by considering the threshold to be the mean of K largest scores of negative samples.

• Accuracy at the Top [8] focuses on maximizing the number of positive samples above the top τ-quantile of scores. There are many methods on how to solve accuracy at the top. In [8], the authors assume that the top quantile is one of the samples, construct n unconstrained optimization problems with fixed thresholds, solve them and select the best solution. This method is computationally expensive. In [16] the authors propose a fast projected gradient descent method. In our previous paper, we proposed a convex approximation of the accuracy at the top called Pat&Mat-NP. This method is reasonably fast and guaranteed the existence of global optimum.

The deficiency of methods from this framework is that they usually cover only linear classifiers. However, as many problems are not linearly separable, nonlinear classifiers are needed. In this work, we show how to extend our framework into nonlinear classification problems. To do so, we use the fact that our framework is similar to the primal formulation of support vector machines [11]. The classical way to incorporate nonlinearity into SVM is to derive the dual formulation [7] and to employ the kernels method [25]. In this work, we follow this approach, derive dual formulations for the considered problems and add nonlinear kernels to them. Moreover, as dual problems are generally expensive to solve, we derive a quick method to solve them. This is a modification of the coordinate-wise dual ascent from [17]. For a review of other approaches see [4, 28].

The paper is organized as follows: In Section 2 we recall the unified framework derived in [2]. In Section and two class of problems that falls into it. Moreover, for selected methods, we derive their dual formulations. Namely, we focus on TopPush, TopPushK and Pat&Mat-NP. In Section 3.3, we show how to add nonlinear kernels into dual formulations. In Section 3.4 derive a new method for solving these dual problems and perform its complexity analysis. Since our method depends on the chosen problem and surrogate function, we provide a concrete form of the solution for TopPushK with the quadratic hinge loss. Solutions for other problems are provided in Appendix. Finally, in Section 4 we present the description of performance criteria, choice of hyperparameters and description of datasets. The rest of the section is focused on the results of numerical experiments.

2 Framework

In this section, we recall the general framework for classification at the top introduced in [2]. For simplicity, we use the following notation in the rest of the text.

Notation 2.1 (Dataset). In this work, we use label 0 to encode the negative class and label 1 to encode the positive class. By a dataset of size n ∈ N we mean a set of pairs in the following form

where xi ∈ represents samples and yi ∈ {0,1} corresponding labels. To simplify future notation, we denote a set of all indices of dataset D as I = I− ∪ I+, where

We also denote the number of negative samples in D as n− = |I−| and the number of positive samples in D as n+ = |I+|. The total number of samples is n = n− + n+.

Linear binary classification is a problem of finding a linear hyperplane that separates a group of positive samples from a group of negative samples and achieves the lowest possible error. For a sample x ∈ , the prediction for a linear classifier amounts to

Here, w ∈ is the normal vector to the separating hyperplane and t ∈ R is a decision threshold. The well-known example of such a classifier is a support vector machine [11] where the decision threshold t is a free variable. However, many important binary classification problems maximize the performance only for a certain amount of samples with the highest scores s = w⊤x. In these cases, the threshold t is not a free variable but a function of the scores. In our previous work [2], we formulated a general framework for maximizing performance above the threshold t as

where function G: × {0,1}n → R takes the scores and labels of all samples and computes the decision threshold and is the Iverson function which is used to count misclassified samples and is defined as

The concrete form of the function G that defines the decision threshold depends on the used problem. Note the important distinction from the standard binary classification: the decision threshold is no longer fixed (as

Figure 1: Comparison of the approximation quality of the Iverson function using different surrogate functions and scaling parameters.

in the case of neural networks) or trained independently (as in SVM) but is a function of scores of all samples. Therefore, the minimization in problem (1) is performed only concerning the one variable w.

The objective function in (1) is a weighted sum of false-positive and false-negative counts. Since these counts are discontinuous due to the presence of the Iverson function, the whole objective function is discontinuous too. Therefore, problem (1) is difficult to solve. One way how to simplify the problem is to derive its continuous approximation. The usual approach is to employ a surrogate function to replace the Iverson function [20, 16].

Notation 2.2 (Surrogate function). To approximate the Iverson function (2), we use any surrogate function l that is convex, non-negative, and non-decreasing with l(0) = 1, and l(s) → 0 as s → −∞. As examples of such function, we can mention the hinge loss or the quadratic hinge loss defined by

Figure 1 compares the Iverson function with the hinge and quadratic hinge loss with scaled inputs by ϑ = 2 and without scaling. We use ϑ > 0 to denote any scaling parameter.

By replacing the Iverson function in the objective function of (1) with its surrogate approximation and adding a regularization for better numerical stability, we get

The resulting objective function is continuous, and therefore the problem is easier to solve than the original problem (1).

As we derived in [2], there are many problems belonging to the general framework (1). The summary of all formulations is provided in Table 1. However, this framework handles only linear classification problems. As many problems are not linearly separable, this is often not sufficient. To generalize the framework to nonlinear classifiers, we realize that (3) is similar to the primal formulation of the SVM [11]. We will follow the standard way to incorporate nonlinearity into SVM by deriving the dual problem [7] and using the kernels methods [25].

In the next section, we introduce two problem families based on formulations from [2] and for each of them, we derive its dual formulation. Namely, we will discuss family of TopPushK formulations and family of Pat&Mat formulations.

3 Derivation of Dual Problems

In Section 2, we introduced a general framework for binary classification at the top. Moreover, we showed that several problem classes, considered separate problems so far, fit into this framework. Many formulations have nice theoretical properties such as convexity or differentiability in this specific case. However, many real-world problems are not linearly separable, and in such cases, the approach from the previous section is

Table 1: Summary of problem fomrulations that fall in the framework (3). Column Formulation shows the name of the formulation that we use in this work. Column Source is the citation of the work where the formulation was introduced. Column Ours shows whether the formulation was introduced in any of our previous papers. Column Hyperparameters shows the hyperparameters available for each formulation. The last three columns show the values of parameters C1, C2 and the form of the decision threshold for given framework (3).

not sufficient. In this section, we use the similarity of (3) to primal formulation of SVM [11] and derive dual forms for almost all formulations from Table 1. Then we use the kernel method [25] to introduce nonlinearity into the dual formulations. Moreover, as dual problems are generally computationally expensive, we propose an efficient method to solve them.

This section is dedicated to deriving dual forms for almost all formulations from Table 1. We do not discuss Grill and Grill-NP formulations in the following text since both formulations are not convex, and therefore their primal and dual formulations are not equivalent. Since many of the remaining formulations are very similar, we divide them into two families:

• TopPushK family: TopPush, TopPushK, TopMeanK and τ-FPL.

• Pat&Mat family: Pat&Mat and Pat&Mat-NP.

Both families use surrogate false-negative rate as an objective function. Moreover, all formulations from TopPushK family use the mean of K highest scores of all or negative samples as a threshold and differ only in the definition of K. Finally, both formulations from Pat&Mat family use a surrogate approximation of the top τ-quantile of scores of all or negative samples. In other words, we have two families of formulations that share the same objective function and the same form of the decision threshold. Therefore, we derive all results for the general form of these two families. Before we start, we need to introduce the concept of conjugate functions.

Definition 3.1 (Conjugate function [7]). Let l : → R. The function l⋆ : → R, defined as

is called the conjugate function of l. The domain of the conjugate function consists of y ∈ for which the supremum is finite.

These functions will play a crucial role in the resulting form of dual problems. Recall the hinge loss and quadratic hinge loss function defined in Notation 2.2

The conjugate function for the hinge loss can be found in [26] and has the following form

Similarly, the conjugate function for the quadratic hinge was computed in [18] as

Notation 3.2 (Kernel Matrix). To simplify the future notation, we introduce matrix X of all samples. Each row of X represents one sample and is defined for all i ∈ I as

In the same way, we defined matrices of all negative and positive samples with rows defined as

Moreover, for all formulations that use only negative samples to compute the threshold t, we define kernel matrix

and for all formulations that use only all samples to compute the threshold t, we define kernel matrix as

In the rest of the text, matrix K always refers to one of the kernel matrices defined above.

3.1 Family of TopPushK Formulations

In this section, we focus on the family of TopPushK formulations. The general optimization problem that covers all formulations from this family can be written in the following way

where C ∈ R. The set of indices ˜I equals I for TopMeanK and I− for other formulations. The parameter K equals 1 for TopPush, K for TopPushK, nτ for TopMeanK, and n−τ for τ-FPL. Note that we use an alternative formulation with constant C, since it is more similar to the standard SVM, and we wanted to stress this similarity. For C /λn+ the new formulation is identical to the original one.

The following theorem shows the dual form of formulation (6). The dual formulation for TopPush was originally derived in [20]. We only show, that our general dual formulation also covers this special case. To keep the readability as simple as possible, we postpone all proofs to Appendix.

Theorem 3.3 (Dual formulation for TopPushK family). Consider Notation 3.2, surrogate function l, and formulation (6). Then the corresponding dual problem has the following form

where l⋆ is conjugate function of l and

If K = 1, the upper bound in the second constraint (7c) vanishes due to the first constraint. Finally, the primal variables w can be computed from dual variables as follows

3.2 Family of Pat&Mat Formulations

In the same way, as for TopPushK family, we introduce a general optimization problem that covers all formulations from Pat&Mat family and reads

where C ∈ R. For Pat&Mat we have ˜I = I and ˜n = n. For Pat&Mat-NP we have ˜I = I− and ˜n = n−. Again, we use the alternative formulation with constant C. The following theorem shows the dual form of the formulation (9).

Theorem 3.4 (Dual formulation for Pat&Mat family). Consider Notation 3.2, surrogate function l, and formulation (9). Then the corresponding dual problem has the following form

where l⋆ is conjugate function of l, ϑ > 0 is a scaling parameter and

Finally, the primal variables w can be computed from dual variables as follows

Note 3.5. For simplicity, the rest of the section covers only the TopPushK formulation with hinge loss. We use this formulation since it is the prototypical example for the TopPushK family of formulations. The results for the rest of the formulations from this family can be derived almost identically. Moreover, results for the Pat&Mat family of formulations can be derived similarly. Therefore, derivations for the TopPushK family with quadratic hinge loss and the Pat&Mat family with hinge and quadratic hinge loss are postponed to Appendix.

3.3 Kernels

As we mentioned at the beginning of the section, our goal is to extend our framework to be usable for linearly inseparable problems. In two previous sections, we derived dual formulations for TopPushK and Pat&Mat families. In this section, we show how to employ the kernels method [25] to introduce nonlinearity into these dual formulations. For simplicity, we focus only on the TopPushK formulation that computes the decision threshold only from negative samples. As mentioned in Notation 3.2, TopPushK formulation uses kernel matrix . The following derivation is the same for all other formulations.

To add kernels, we first realize that primal variables w can be computed from dual variables α, β using (8). Therefore, the classification score for any sample x can be calculated as follows

Importantly, all samples xi in the previous formula occur only in the dot product with x and not separately. This property allows us to use the standard kernel trick from SVMs [11]. The kernel trick replaces the dot product of the vectors from input space using the so-called kernel function k : × → R. This function represents a dot product in the space of a higher dimension

where φ : → is a mapping function. The idea is to transform the input vectors using φ into some feature space in which the classification problem is easier to solve. However, getting the explicit formula for the mapping function is usually very hard. The kernel trick allows us to avoid this explicit mapping to the feature space since we can only replace the dot product in (12) by the kernel function k

The downside of this approach is, that we can not compute the primal variables using (8) if we do not know the mapping function φ. We always have to calculate the scores using the formula above, which is computationally expensive.

Now we must show how to modify the original dual problem (7) to incorporate kernels. Recall the form of the kernel matrix K for TopPushK

Since each component of the kernel matrix K is computed as a dot product of two training samples, we can replace K with a matrix in the following form

The kernel function k(·, ·) is applied to all rows of both arguments. In other words, if we use the kernel trick, the original dual problem (7) remains almost the same. The only change is in the construction of the kernel matrix.

3.4 Coordinate Descent Algorithm

In the previous sections, we derived dual formulations for TopPushK and Pat&Mat families of formulations. Moreover, we showed how to incorporate non-linear kernels into these formulations. As a result, we can use all presented formulations even for linearly non-separable problems. However, the dimension of the dual problems is at least equal to the number of all samples n, and therefore, it is computationally expensive to use standard techniques such as gradient descent. To handle this issue, the standard coordinate descent algorithm [10, 17] has been proposed in the context of SVMs. In this section, we derive a coordinate descent algorithm suitable for our dual problems (7, 10). We also show that we can reduce the whole optimization problem to a one-dimensional quadratic optimization problem with a closed-form solution in every iteration. Therefore, every iteration of our algorithm is cheap. For a review of other approaches see [4, 28].

Recall that we perform all derivations only for TopPushK with hinge loss. Classification scores can be computed directly from dual variables as shown in (13). Using the definition (14) of kernel matrix K, we can define a vector of scores s by

Note that dual scores are not identical to the primal ones (12) (even though we use the same notation). The main difference is that dual scores use kernel function k. Therefore, they are equivalent only if the kernel function is defined as a dot product in the input space, i.e., if k(x,x′) = x⊤x′. To simplify the indexing of the vector of scores (15) and kernel matrix K, we introduce a new notation in Notation 3.6.

Notation 3.6. Consider any index l that satisfies 1 ≤ l ≤ n+ + ˜n. Note that the length of dual variable α is n+ for both formulations (7) and (10). Therefore, we can define auxiliary index ˆl as

Then the index l can be safely used for kernel matrix K or vector of scores s, while its corresponding version ˆl can be used for dual variables α or β.

3.4.1 Update Rules

Consider dual formulation (7) from Theorem 3.3 and fixed feasible dual variables α, β. Our goal in this section is to derive an efficient iterative procedure for solving this problem. We follow the ideas presented in [10, 17] for solving SVMs using a coordinate descent algorithm. However, we must modify the approach since we have an additional constraint (7b). Due to this constraint, we always have to update (at least) two components of dual variables α, β. There are only three update rules which modify two components of α, β, and satisfy constraints (7b). The first one updates two components of α

where denotes i-th column of K and indices ˆk, ˆl are defined in Notation 3.6. Note that the update rule for s does not use matrix multiplication but only vector addition. The second rule updates one component of α and one component of β

and the last one updates two components of β

Using any of the update rules above, the problem (7) can be written as a one-dimensional quadratic problem in the following form

where a, b, c, ∆lb, ∆ub are constants with respect to ∆. The optimal solution to this problem is

where γ = − ba and clip[a, b] (x) amounts to clipping (projecting) x to interval [a,b]. Since we assume one of the update rules (16), the constraint (7b) is always satisfied after the update. Even though all three update rules hold for any surrogate, the calculation of the optimal ∆⋆ depends on the concrete form of surrogate function. In the following text, we show the closed-form formula for ∆⋆, when the hinge loss function from Notation 2.2 is used.

Plugging the conjugate (4) of the hinge loss into the dual formulation (7) yields

The form of K and ˜n depends on the used formulation as discussed in Theorem 3.3. Moreover, the upper bound in (18d) can be omitted for K = 1. Since we know the form of the optimal solution (17), we only need to show how to compute ∆lb, ∆ub and γ for all update rules (16). The following three propositions provide closed-form formulae for all three update rules. To keep the presentation as simple as possible, we postpone all proofs to Appendix B.1.

Proposition 3.7 (Update rule (16a) for problem (18)). Consider problem (18), update rule (16a), indices 1 ≤ k ≤ n+ and 1 ≤ l ≤ n+ and Notation 3.6. Then the optimal solution ∆⋆ is given by (17) where

Proposition 3.8 (Update rule (16b) for problem (18)). Consider problem (18), update rule (16b), indices 1 ≤ k ≤ n+ and n+ + 1 ≤ l ≤ ˜n and Notation 3.6. Let us define

Then the optimal solution ∆⋆ is given by (17) where

Proposition 3.9 (Update rule (16c) for problem (18)). Consider problem (18), update rule (16c), indices n+ + 1 ≤ k ≤ ˜n and n+ + 1 ≤ l ≤ ˜n and Notation 3.6. Then the optimal solution ∆⋆ is given by (17) where

3.4.2 Initialization

For all update rules (16) we assumed that the current solution α, β is feasible. So to create an iterative algorithm that solves problem (18) or (27), we need to have a way how to obtain an initial feasible solution. Such a task can be formally written as a projection of random variables α0, β0 to the feasible set of solutions

where the upper bound in the second constraint depends on the used surrogate function. To solve problem (19), we follow the same approach as in [1]. In the following theorem, we show that problem (19) can be written as a system of two equations of two variables λ and µ. Moreover, the theorem shows the concrete form of feasible solution α, β that depends only on λ and µ.

Theorem 3.10. Consider problem (19), some initial solution α0, β0 and denote the sorted version (in non-decreasing order) of β0 as β0[·]. Then if the following condition holds

the optimal solution of (19) amounts to α 0. In the opposite case, the following system of two equations

has a solution (λ,µ) with µ > 0, and the optimal solution of (19) is equal to

Theorem 3.10 shows the optimal solution of (19) that depends only on (λ,µ) but does not provide any way to find such a solution. In the following text, we show that the number of variables in the system of equations (21) can be reduced to one. For any fixed µ, we denote the function on the left-hand side of (21b) by

Then g is non-decreasing in λ but not necessarily strictly increasing. We denote by λ(µ) any such λ solving (21b) for a fixed µ. Denote z the sorted version of −β0. Then we have

Now we can easily compute λ(µ) by solving g(λ(µ);µ) = 0 for fixed µ. To get the solution efficiently, we derive Algorithm 1, which can be described as follows: Index i will run over z while index j will run over z + µ. At every iteration, we know the values of g(zi−1;µ) and g(zj−1 + µ;µ) and we want to evaluate g at the next point. We denote the number of indices j such that λ − zj ∈ [0,µ) by d. If zi ≤ zj + µ, then we consider λ = zi and since one index enters the set�j��� λ − zj ∈ [0,µ)�, we increase d by one. On the other hand, if zi > zj + µ, then we consider λ = zj + µ and since one index leaves the set�j��� λ − zj ∈ [0,µ)�, we decrease d by one. In both cases, g is increased by d times the difference between the new λ and old λ. Once g exceeds 0, we stop the algorithm and linearly interpolate between the last two values. To prevent an overflow, we set zm+1 = +∞. Concerning the initial values, since z1 ≤ z1 + µ, we set i = 2, j = 1 and d = 1.

Since λ(µ) can be computed for fixed µ using Algorithm 1, we can define auxiliary function h in the following form

Then the system of equations (21) is equivalent to h(µ) = 0. The following lemma describes properties of h. Since h is decreasing in µ on (0,∞), any root-finding algorithm such as bisection can be used to find the optimal solution.

Lemma 3.11. Even though λ(µ) is not unique, function h from (22) is well-defined in the sense that it gives the same value for every choice of λ(µ). Moreover, h is decreasing in µ on (0,+∞).

3.5 Summary

In this section, we derived dual formulation for TopPushK and Pat&Mat family of formulations. Moreover, we derived simple update rules that can be used to improve the current feasible solution. We also showed that these update rules have closed-form formulae, and therefore they are simple to compute. Finally, we showed how to find an initial feasible solution. For TopPushK family with hinge loss, we showed the derivation in the previous section, while the derivations for Pat&Mat family are in Appendix B.2. This section combines all these intermediate results into Algorithm 2 and discusses its computational complexity.

The left column in Algorithm 2 describe the algorithm for TopPushK family while the right column for Pat&Mat family. In step 2 we initialize α, β and δ to some feasible value using Theorem 3.10 or Theorem B.10. Then, based on (15) we compute scores s. Each repeat loop in step 3 updates two coordinates as shown in (16). In step 4 we select a random index k and in the for loop in step 5 we compute the optimal (∆l,δl) for all possible combinations (k,l) as in (16). In step 8 we select the best pair (∆l,δl) which maximizes the coresponding objective function. Finally, based on the selected update rule we update α, β, s and δ in steps 9 and 10.

Now we derive the computational complexity of each repeat loop from step 3. The computation of (∆l,δl) amounts to solving a quadratic optimization problem in one variable. As we showed in Sections 3.4.1 and B.2, there is a closed-form solution and step 6 can be performed in O(1). Since this is embedded in a for loop in step 5, the whole complexity of this loop is O(n+ + ˜n). Step 9 requires O(1) for the update of α and β while O(n+ + ˜n) for the update of s. Since the other steps are O(1), the total complexity of the repeat loop is O(n+ + ˜n). This holds only if the kernel matrix K is precomputed. In the opposite case, all complexities must be multiplied by the cost of computation of components of K, which is O(d). This complexity analysis is summarized in Table 2.

Table 2: Computational complexity of one repeat loop (which updates two coordinates of α or β) from Algorithm 2.

4 Numerical Experiments

In this section, we describe in detail all settings used for the experiments. The section consists of five subsections. The first one discusses which formulations from Table 1 we use for the experimental evaluation. In this subsection, we also introduce baseline formulations used for the comparison. In the second one, we introduce datasets used in the experiments and describe their structure. A detailed description of the datasets is then provided in separate sections with their corresponding experiment results. The third and fourth subsections contain a detailed description of performance metrics. The last subsection contains a description of tools used for implementation. All codes used for the experiments, as well as all experiment configurations, are publicly available on GitHub. We provide one respository with the code

and one repository with numerical experiments

4.1 Formulations

To simplify the setup of all experiments, we decided to focus on formulations that only use negative samples for the threshold computation, since the performance of such formulations can be compared by basic performance metrics, as shown later in Section 4.4. In total, we use four different formulations from Table 1, namely TopPush, TopPushK, τ-FPL, and Pat&Mat-NP. Moreover, for TopPushK, we use two different values of K = {5,10} and consider the resulting formulations as separate formulations, i.e., we have TopPushK (5) and TopPushK (10). Similarly, for τ-FPL and Pat&Mat-NP we use two different values of τ = {0.01,0.05}. For all formulations, we use the hinge loss defined in Notation 2.2 as a surrogate function.

As a baseline formulation for comparison, we use C-SVC variant of SVM [6, 11, 9] defined by

where yi ∈ {−1,1} for all i ∈ I and φ(xi) maps xi into a higher-dimensional space (see Section 3.3). The corresponding dual form is as follows

where the kernel matrix K is defined for all i,j = 1,2,...,n as

Note that the dual form of C-SVC is very similar to the dual forms of our formulations derived in Section 3. We will denote C-SVC as SVM.

In total, we have five different formulations for experiments, as seen in Table 3. The following section discusses which hyper-parameters are used for each formulation. Since we used a slightly different primal form (standard formulation for SVM) for the derivation of dual forms, we also show how to convert used parameters to the resulting dual forms and get identical experiment settings.

4.2 Hyper-parameters

The selected formulations differ in the number of available hyper-parameters. Therefore, we decided to use a fixed value for all but one of the hyper-parameters jfor each formulation. For most of the considered formulations, the only hyper-parameter is the regularization constant λ. In our experiments, we used the following six values of this hyper-parameter

The only exceptions are the formulations derived from Pat&Mat-NP since they also have the scaling parameter ϑ. Since the parameter is essential for the approximation quality of the threshold, we decided to fine-tune this hyper-parameter instead of the regularization constant λ. Therefore, we fixed λ to 10−3 for Pat&Mat-NP formulations and used the following six different values of the scaling parameter

Since we used a slightly different (but equivalent) primal formulation for the derivation of the dual forms, we use λ to compute the hyper-parameter C used in these dual forms

where ˜n = n for SVM and ˜n = n+ otherwise. In all experiments, the best hyperparameter is selected based on the validation data and the appropriate performance metric. A summary of all used formulations and their hyper-parameters is in Table 3.

Table 3: Summary of all formulations used for experiments. The first column shows the aliases used for the formulations when describing the experiment results. The second column shows fixed parameters used for each formulation, while the third column shows which hyper-parameters are tuned using the validation set.

4.3 Datasets

We consider various datasets summarized in Table 4 for the numerical experiments. All these datasets are from the domain of image recognition. We use this domain since it is one of the most popular with plenty of publicly available datasets. MNIST [13] and FashionMNIST [29] are grayscale datasets of digits and fashion items, respectively. CIFAR100 [19] is a dataset of colored images of different items grouped into 100 classes. CIFAR10 and CIFAR20 merge these classes into 10 and 20 superclasses, respectively. Finally, SVHN2 [22] contains colored images of house numbers. All these datasets are originally divided only into training and test sets. We select 25% samples from the training set to obtain the validation set. Moreover, all datasets are multiclass, we need to adjust the labels to get a binary classification problem. Therefore, for each data set, we select one class as the positive class and consider the rest as the negative class.

It is worth mentioning that all datasets used in the experiments are not primarily designed for the classifi-cation at the top. We use these datasets since they are publicly available and well-known.

Table 4: Structure of datasets: The training, validation and testing sets show the positive label y+, the number of features d, samples n and the fraction of positive samples n+n .

4.4 Performance Criteria

In this section, we describe which performance criteria are used for evaluation and how these criteria are related to the tested formulations.

As we discussed at the beginning of Section 4, we decided to only test formulations that minimize the false-negative rate and use only negative samples for the threshold computation. This choice allows us to use simple metrics to compare the formulations. The first metric that we use in the experiments is TPR@K defined as follows

This metric computes the true-positive rate at a threshold t defined as the mean of K-largest negative scores. For K = 1, the threshold corresponds to the threshold used by TopPush formulation. Otherwise, threshold t corresponds to the threshold used by TopPushK. Moreover, since minimizing the false-negative rate is equivalent to maximizing the true-positive rate, both TopPush and TopPushK should optimize the TPR@K metric. In the upcoming experiments, we use this metric with three different values of K ∈ {1,5,10}.

The second metric is defined in a similar way

This metric computes the true-positive rate at a specific top τ-quantile of negative scores. This metric is ideal for testing the performance of τ-FPL and Pat&Mat-NP formulations since both maximize the true-positive rate and use some approximation of the true top τ-quantile of negative scores as a threshold. In our experiments, we use this metric with two different values of τ ∈ {0.01,0.05}.

The two previous metrics are specific to the formulations from our framework. However, we should also test if the baseline formulations work correctly. Since the baseline method is designed to optimize overall performance, we use the area under the ROC curve to measure the overall performance. The summary of all used metrics is in Table 5.

Table 5: The summary of all used performance metrics used for evaluation. In total, we use six different metrics and nine different formulations. For each formulation denotes the metric in which the formulation should be the best.

4.5 Critical Difference Diagrams

All metrics from Section 4.4 can be used to compare different formulations on a single dataset. However, these metrics are unsuitable for comparing multiple formulations on multiple datasets. To address this issue, we use the Friedman test [15] as suggested in [12].

Consider that we have m datasets, and k formulations. Then for each dataset i, each formulation j is ranked by rank rij according to some performance criterium. Any performance metric from the previous section can be used. The formulation that provides the best result gets ranked 1; the second best gets ranked 2, and so on. If two formulations provide the same results, the average ranks are assigned. The average rank overall dataset for formulation j is computed as

The Friedman test compares the average ranks of formulations under the null hypothesis, which states that all formulations are equivalent. Therefore, their average ranks should be equal. If the null hypothesis is rejected, we proceed with the post hoc Nemenyi test [21] that compares all formulations to each other. The performance of the two formulations is significantly different if the corresponding average ranks differ by at least the critical difference

where critical values qα are based on the Studentized range statistic divided by√2, see Table 5(a) in [12]. The results of this post hoc test can be easily visualized using critical difference diagrams proposed in [12]. The x-axis of such a diagram shows the average rank over all datasets for each formulation. Formulations that are not significantly different according to the Nemenyi test are connected using a green horizontal line. As an example, see Figure 3.

4.6 Implementation

For the implementation of all experiments, we use the Julia programming language [5]. All formulations are implemented from scratch. Only for SVM, we use the Julia wrapper for the LIBSVM library [9].

4.7 Results

In this section, we present results for a dual form of formulations from Table 3 with a Gaussian kernel model. For training, we use the coordinate descent algorithm introduced in Section 3.4. We set a number of steps to 20 epochs. For all experiments, we use precomputed kernel matrix with a Gaussian kernel function defined as

where d is the dimension of the primal problem. We used this value of d since it is the default setting for the Gaussian kernel function in LIBSVM [9]. We only use one kernel function for computational reasons. In addition, we are more interested in the comparison of methods between each other than in obtaining the best results possible.

In Figure 2, we investigate the convergence of the coordinate descend algorithm introduced in Section 3.4 for three formulations, namely TopPush, TopPushK, and Pat&Mat-NP. In each column, we show the primal and dual objective function convergence for one formulation. To solve the primal problem, we used full gradient descent. Computation of the full gradient is computationally intensive, even for relatively small datasets such as MNIST. Therefore, for this experiment (and only for this experiment) we use the Ionosphere dataset [27], which is small. We can see that TopPush and TopPushK converge to the same objective for primal and dual problems. It means that both problems were solved to optimality. However, there is a little gap between the optimal primal and dual form solution for Pat&Mat-NP. In other words, Pat&Mat-NP may suffer from convergence issues when solving the proposed coordinate descent algorithm.

For comparison we use critical difference diagrams introduced in Section 4.5. One of the basic assumptions of the critical difference diagrams to work appropriately is a large number of used datasets. Since we performed all experiments for each formulation and each dataset ten times with different random seeds for train/valid/test split, we decided to consider each of these runs as a separate dataset. It is important to say that we use this setting only for the critical difference diagrams. Since the critical diagrams show the relative performance of the formulations against each other, we can easily see if any formulation is significantly worse or better. However, the critical diagrams do not provide any information on the actual performance of the formulations. Therefore, even if one formulation outperforms other tested formulations, it does not mean that its performance is good.

To address the issue above, we also compare concrete performance metrics on each dataset separately. Since we have six hyper-parameters for each formulation, we always select the best result for each formulation on the validation set based on the criterion for which the specific formulation is optimized. Then for each formulation, we select the median of the best results from ten independent runs. Moreover, the best result for each dataset is highlighted in green, while the worst result is highlighted in red.

From Figure 3 and Table 6, we make several observations:

• We observe that some formulations have problems with convergence and, in some cases, even diverge for some datasets. The improper choice of the kernel function parameters can be the cause. As a result, CD diagrams may provide unreliable results. If the formulation diverges in some experiments, it immediately obtains very high ranks for these experiments that skew the final diagram. It is especially evident for Pat&Mat-NP and SVM formulations.

Figure 2: Convergence of the objectives for the primal (red line) and dual (blue dashed line) forms with linear kernel.

• Figure 3 shows that Pat&Mat-NP formulations provide the worst results for all metrics. It can be caused by the bad convergence of the coordinate descent algorithm, as shown in Figure 2. However, it is important to say that Figure 3 shows only relative results. From Table 6 is clear that even though Pat&Mat-NP usually provides worse results than other formulations, the results are, in many cases, only slightly worse.

• Similarly to Pat&Mat-NP, the SVM formulation does not perform well for most metrics. However, as shown in Table 6, the results are usually only slightly worse than those of other formulations.

• Most formulations perform well on the criteria for which they are optimized. The only exceptions are SVM and Pat&Mat formulations.

• Most formulations provide an AUC greater than 99% on the MNIST and FashionMNIST datasets. These two datasets are very easy when a non-linear model is used.

• τ-FPL formulations work very well for TPR@τ = 0.01, TPR@τ = 0.05 and AUC metric.

• TopPush, TopPushK (5) and TopPushK (10) provides very good results for TPR@K = 1, TPR@K = 5 and TPR@K = 10.

5 Conclusion

In this paper, we analyzed and extended the general framework for binary classification on top samples from [2] to nonlinear problems. Achieved results can be summarized as follows:

• We showed that all presented formulations (except for Grill and Grill-NP) can be divided into two families based on the form of the constraints, namely TopPushK and Pat&Mat family of formulations. We derived dual forms for TopPushK and Pat&Mat family of formulations. Moreover, for both these formulations we show how to incorporate non-linear kernels.

• We proposed a new coordinate descent algorithm for solving dual forms of TopPushK and Pat&Mat family of formulations. The resulting algorithm depends on the used surrogate function. Therefore, we derived the closed-form formulae for selected surrogate functions. Since the algorithm needs a feasible solution for initialization, we also showed how to find such a solution.

• We performed a numerical analysis of the proposed method.

Bibliography

[1] Lukáš Adam and Václav Mácha. “Projections onto the canonical simplex with additional linear inequalities”. In: Optimization Methods and Software 37.2 (2022), pp. 451–479. doi: 10.1080/10556788.2020.1797023.

[2] Lukáš Adam et al. “General framework for binary classification on top samples”. In: Optimization Methods and Software 37.5 (2022), pp. 1636–1667. doi: 10.1080/10556788.2021.1965601.

Figure 3: Dual formulations with a gaussian kernel: Critical difference diagrams (level of importance 0.05) of the Nemenyi post hoc test for the Friedman test. Each diagram shows the mean rank of each method, with rank one being the best. The green horizontal lines group methods with mean ranks that are not significantly different. The critical difference diagrams were computed for mean rank averages over all datasets.

Table 6: Dual formulations with a gaussian kernel: Each table corresponds to one performance metric, and all presented results are medians of ten independent runs for each dataset and formulation pair. The best result for each dataset is highlighted in green, while the worst result is highlighted in red. For better readability, we have reduced the number of discussed metrics compared to Figure 3.

[3] Shivani Agarwal. “The infinite push: A new support vector ranking algorithm that directly optimizes accuracy at the absolute top of the list”. In: Proceedings of the 2011 SIAM International Conference on Data Mining (SDM). SIAM. 2011, pp. 839–850. doi: 10.1137/1.9781611972818.72.

[4] Zeynep Batmaz et al. “A review on deep learning for recommender systems: challenges and remedies”. In: Artificial Intelligence Review 52.1 (2019), pp. 1–37. doi: 10.1007/s10462-018-9654-y.

[5] Jeff Bezanson et al. “Julia: A fresh approach to numerical computing”. In: SIAM review 59.1 (2017), pp. 65–98. doi: 10.1137/141000671.

[6] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. “A training algorithm for optimal margin classifiers”. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. COLT ’92. Pittsburgh, Pennsylvania, USA: Association for Computing Machinery, 1992, pp. 144–152. isbn: 089791497X. doi: 10.1145/130385.130401.

[7] Stephen Boyd and Lieven Vandenberghe. Convex optimization. 1st ed. Cambridge University Press, 2004. isbn: 978-0521833783.

[8] Stephen Boyd et al. “Accuracy at the top”. In: Advances in neural information processing systems. Ed. by F. Pereira et al. Vol. 25. Curran Associates, Inc., 2012.

[9] Chih-Chung Chang and Chih-Jen Lin. “LIBSVM: A Library for Support Vector Machines”. In: ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011), pp. 1–27. issn: 2157-6904. doi: 10.1145/1961189.1961199.

[10] Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. “Coordinate Descent Method for Large-scale l2-loss Linear Support Vector machines”. In: Journal of Machine Learning Research 9.7 (2008), pp. 1369–1398.

[11] Corinna Cortes and Vladimir Vapnik. “Support-vector networks”. In: Machine learning 20.3 (1995), pp. 273–297. doi: 10.1007/BF00994018.

[12] Janez Demšar. “Statistical comparisons of classifiers over multiple data sets”. In: Journal of Machine Learning Research 7 (2006), pp. 1–30.

[13] Li Deng. “The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]”. In: IEEE Signal Processing Magazine 29.6 (2012), pp. 141–142. doi: 10.1109/MSP.2012.2211477.

[14] Yoav Freund et al. “An Efficient Boosting Algorithm for Combining Preferences”. In: Journal of Machine Learning Research 4 (2003), pp. 933–969.

[15] Milton Friedman. “A Comparison of Alternative Tests of Significance for the Problem of m Rankings”. In: The Annals of Mathematical Statistics 11.1 (1940), pp. 86–92. issn: 00034851.

[16] Martin Grill and Tomáš Pevný. “Learning combination of anomaly detectors for security domain”. In: Computer Networks 107 (2016), pp. 55–63. issn: 1389-1286. doi: 10.1016/j.comnet.2016.05.021.

[17] Cho-Jui Hsieh et al. “A Dual Coordinate Descent Method for Large-Scale Linear SVM”. In: Proceedings of the 25th International Conference on Machine Learning. ICML ’08. Helsinki, Finland: Association for Computing Machinery, 2008, pp. 408–415. isbn: 9781605582054. doi: 10.1145/1390156.1390208.

[18] Takafumi Kanamori, Akiko Takeda, and Taiji Suzuki. “Conjugate Relation Between Loss Functions and Uncertainty Sets in Classification Problems”. In: Journal of Machine Learning Research 14.1 (2013), pp. 1461–1504.

[19] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Citeseer, 2009.

[20] Nan Li, Rong Jin, and Zhi-Hua Zhou. “Top Rank Optimization in Linear Time”. In: Advances in Neural Information Processing Systems. Ed. by Z. Ghahramani et al. Vol. 27. NIPS’14. Curran Associates, Inc., 2014, pp. 1502–1510.

[21] Peter Bjorn Nemenyi. Distribution-free Multiple Comparisons. Princeton University, 1963.

[22] Yuval Netzer et al. “Reading digits in natural images with unsupervised feature learning”. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011. NIPS’11. 2011, pp. 1502–1510.

[23] Wlodzimierz Ogryczak and Arie Tamir. “Minimizing the sum of the k largest functions in linear time”. In: Information Processing Letters 85.3 (2003), pp. 117–122. issn: 0020-0190. doi: 10.1016/S0020-0190(02)00370-8.

[24] Cynthia Rudin. The P-Norm Push: A Simple Convex Ranking Algorithm That Concentrates at the Top of the List. Vol. 10. 2009, pp. 2233–2271.

[25] Bernhard Scholkopf and Alexander Smola. “Learning with kernels: support vector machines, regularization, optimization, and beyond”. In: Journal of the American Statistical Association 98.462 (2003), pp. 489– 489. doi: 10.1198/jasa.2003.s269.

[27] Vincent G Sigillito et al. “Classification of radar returns from the ionosphere using neural networks”. In: Johns Hopkins APL Technical Digest 10.3 (1989), pp. 262–266.

[28] Tino Werner. A review on ranking problems in statistical learning. 2019. arXiv: 10.48550/ARXIV.1909.02998.

[29] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. 2017. arXiv: 10.48550/ARXIV.1708.07747.

[30] Ao Zhang et al. tau-FPL: Tolerance-Constrained Learning in Linear Time. 2018. arXiv: 10.48550/ARXIV.1801.04701.

A Derivation of Dual Problems

A.1 Family of TopPushK Formulations

Theorem 3.3 (Dual formulation for TopPushK family). Consider Notation 3.2, surrogate function l, and formulation (6). Then the corresponding dual problem has the following form

where l⋆ is conjugate function of l and

If K = 1, the upper bound in the second constraint (7c) vanishes due to the first constraint. Finally, the primal variables w can be computed from dual variables as follows

Proof. We show the proof only for TopPushK formulation, i.e., the decision threshold is computed only from negative samples. The proof for the remaining formulations is identical. Firstly, we derive an alternative formulation to formulation (6). Using Lemma 1 from [23], we can rewrite the formula for the decision threshold to the following form

By substituting this formula into the objective function of (6), we get

where the last equality follows from the fact that the surrogate function l is non-decreasing. The max operator can be replaced using an auxiliary variable z ∈ that fulfills zj ≥ s−j − t and zj ≥ 0 for all j = 1,..., n−.

Furthermore, we use auxilliary variable y ∈ defined for all i = 1,..., n+ as

The combination of all the above relations and the use of a linear model yields to

The Lagrangian of this formulation is defined as

with feasibility conditions βj ≥ 0 and γj ≥ 0 for all j = 1,..., n−. Since the Lagrangian L is separable in primal variables, it can be minimized with respect to each variable separately. Then the dual objective function (to be maximized) reads

From optimality conditions with respect to w, we deduce

where we use Notation 3.2. It mean, that we get the first part of the objective function (7a), ane we also get the relation (8) between primal and dual variables. Optimality condition with respect to t reads

Plugging the feasibility condition γj ≥ 0 into this equality and combining it with the feasibility conditions