StochasticRank: Global Optimization of Scale-Free Discrete Functions

2020·Arxiv

Abstract

Abstract

In this paper, we introduce a powerful and effi-cient framework for direct optimization of ranking metrics. The problem is ill-posed due to the discrete structure of the loss, and to deal with that, we introduce two important techniques: stochastic smoothing and novel gradient estimate based on partial integration. We show that classic smoothing approaches may introduce bias and present a universal solution for a proper debiasing. Importantly, we can guarantee global convergence of our method by adopting a recently proposed Stochastic Gradient Langevin Boosting algorithm. Our algorithm is implemented as a part of the CatBoost gradient boosting library and outperforms the existing approaches on several learning-to-rank datasets. In addition to ranking metrics, our framework applies to any scale-free discrete loss function.

1. Introduction

The quality of ranking algorithms is traditionally measured by ranking quality metrics such as Normalized Discounted Cumulative Gain (NDCG), Expected Reciprocal Rank (ERR), Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and so on (Sakai, 2013). These metrics are defined on a list of documents sorted by their predicted relevance to a query and capture the utility of that list for users of a search engine, who are more likely to scan documents starting at the top. Direct optimization of ranking metrics is an extremely challenging problem since sorting makes them piecewise constant (as functions of predicted relevances), so they are neither convex nor smooth. Many algorithms were proposed for different ranking objectives in the learning-to-rank (LTR) research field. We refer to Liu (2009) for a systematic overview of some clas-

sic methods.

To deal with the discrete structure of a ranking loss, one can use some smooth approximation, which is easier to optimize. This technique lies behind such well-known algorithms as SoftRank (Taylor et al., 2008), ApproxNDCG (Qin et al., 2010), RankNet (Burges, 2010), etc. The obtained smooth function can be optimized by gradient-based methods and, in particular, by Stochastic Gradient Boosting (SGB) that is known to be the learning algorithm behind most state-of-the-art LTR frameworks and is commonly preferred by major search engines (Chapelle & Chang, 2011; Yin et al., 2016). Unfortunately, all known smoothing approaches suffer from bias (see Sections 4.2-4.3) which prevents them from truly direct optimization. Moreover, smoothed ranking loss functions are non-convex, and existing algorithms can guarantee only local optima.

Our ultimate goal is to solve these problems and propose a truly direct LTR algorithm with provable guarantees of global convergence and generalization. We adopt a theoretical approach, so we start with formal definitions of the class of ranking losses and its generalization to scale-free (SF) discrete loss functions (Section 3.2). Our results hold for the general class of SF losses, which, in addition to all ranking metrics, includes, e.g., a recently proposed loss function for Learning-to-Select-with-Order (Vorobev et al., 2019). Then, to mitigate the discontinuity of the loss, we use stochastic smoothing. We prove that previous smoothing-based approaches are inconsistent with the underlying loss (due to the problem of ties, which we discuss in the next section) and propose a universal solution to this problem (relevance-based consistent smoothing, see Section 4.3). Next, we derive a novel stochastic gradient estimate, which can be applied to the entire class of SF losses (see Section 5). The obtained estimate has low variance and uniformly bounded error, which is crucial for our analysis. Finally, to guarantee global convergence of the algorithm, we adopt a recently proposed Stochastic Gradient Langevin Boosting (SGLB) algorithm (Ustimenko & Prokhorenkova, 2020). SGLB is based on a well studied Stochastic Gradient Langevin Dynamics (Gelfand et al., 1992; Raginsky et al., 2017; Erdogdu et al., 2018) and converges globally for a wide range of loss functions including non-convex ones. We adapt SGLB to our setting and obtain a gradient boosting algorithm that converges globally for the entire class of SF loss functions with provable generalization guarantees (see Section 6).

To sum up, to the best of our knowledge, the proposed StochasticRank algorithm is the first globally converging LTR method with provable guarantees that optimizes exactly the underlying ranking quality loss. StochasticRank is implemented within the official CatBoost library (Prokhorenkova et al., 2018; CatBoost, 2020). Our experiments show that StochsticRank outperforms the existing approaches on several LTR datasets.

The rest of the paper is organized as follows. In the next section, we briefly overview the related research on learning to rank. In Section 3, we formalize the problem and, in particular, define a general class of ranking loss functions. In Section 4, we formulate the problem of smoothing bias and propose an unbiased solution. Then, in Section 5, we derive a novel stochastic gradient estimate for the whole class of loss functions under consideration. In Section 6, we show how SGLB can be used to achieve global convergence. Finally, Section 7 empirically compares the proposed algorithm with existing approaches, and Section 8 concludes the paper.

2. Related Work

Usually, researches divide all LTR methods into three categories: pointwise, pairwise, and listwise (Liu, 2009).

Pointwise are the earliest and simplest methods: they approximate relevance labels based on simple or ordinal regression or classification. Such methods were shown to be ineffective for LTR, since loss functions they optimize (e.g., RMSE for relevance labels) differ significantly from the target ranking metric, e.g., NDCG@k.

Pairwise methods make a step forward and focus on pairwise preferences and thus known to outperform pointwise approaches significantly. Nevertheless, pairwise approaches still suffer from the problem of solving a different task rather than optimizing a ranking quality objective.

Listwise methods try to solve the problem directly by developing either smooth proxies of the target ranking metric like SoftRank (Taylor et al., 2008), BoltzRank (Volkovs & Zemel, 2009), ApproxNDCG (Qin et al., 2010), RankNet (Burges, 2010) or by Majorization-Minimization procedure that builds a convex upper bound on the metric on each iteration like LambdaMART (Wu et al., 2010), LambdaLoss (Wang et al., 2018), PermuRank (Xu et al., 2010), SVMRank (Cao et al., 2006), etc.

As discussed in the previous section, algorithms based on smooth approximations suffer from bias and local optima. Also, there are listwise approaches that try to optimize the target loss function without smoothing. For instance, DirectRank (Tan et al., 2013) constructs an ensemble of decision trees, where the values in the leaves are chosen to optimize the original loss. However, due to greediness, this approach can guarantee only local optima.

Finally, let us note that algorithms optimizing a convex upper bound instead of the original loss cannot be truly direct since the optimum for the upper bound can potentially be far away from the true optimum. This is nicely illustrated by Nguyen & Sanner (2013) for accuracy optimization. Let us also mention a recent approach for improving learning-to-rank algorithms by adding Gumbel noise to model predictions (Bruch et al., 2020). This is a regularization technique since it builds a convex upper bound on any given convex loss (e.g., LambdaMART).1 Thus, from a theoretical point of view, this approach cannot be truly direct since it uses convex upper bounding.

The issue of smoothing bias mentioned in the introduction is connected to the problem of ties: if predicted relevances of some documents coincide, one has to order them somehow to compute a ranking metric. This situation may occur when two documents have equal features. More importantly, ties are always present in boosting algorithms based on discrete weak learners such as decision trees. Unfortunately, this problem is rarely addressed in LTR papers. In practice, it is reasonable to use the worst permutation. First, due to strong penalization, it would force an optimization algorithm to avoid ties. Second, in practice, one cannot know how a production system would rank the items, and often some attribute negatively correlated with relevance is used (e.g., sorting by a bid in online auctions). The importance of using the worst permutation is also discussed by Rudin & Wang (2018), and this ordering is adopted in some open-source libraries like CatBoost (Prokhorenkova et al., 2018). An alternative choice is to compute the expected value of a ranking metric for a random permutation. This choice is rarely used in practice, since it is computationally complex and gives non-trivial scores to trivial constant predictions, but is often assumed (explicitly or implicitly) by LTR algorithms (Kustarev et al., 2011).

3. Problem Formalization

3.1. Examples of Ranking Loss Functions

Before we introduce a general class of loss functions, let us define classic ranking quality functions widely used throughout the literature and in practice.2 These loss functions depend on z, which is a vector of scores produced by the model, and r, which is a vector of relevance labels for a given query. The length of these vectors is denoted by n and can be different for different queries.

Let s = argsort(z), i.e., is the index of a document at i-th position if documents are ordered according to their scores (if for , then we place the less relevant one first). Let us define DCG@k, where k denotes the number of top documents we are interested in:

where are relevance labels. This quality function is called Discounted Cumulative Gain: for each document, the numerator corresponds to gain for the relevance, while the denominator discounts for a lower position. NDCG@k is a normalized variant of DCG@k:

Expected Reciprocal Rank ERR@k assumes that [0, 1]:

Mean reciprocal rank (MRR) is used for binary relevance labels :

which is the inverse rank of the first relevant document.

Finally, let us define a quality function for the LSO (learning to select with order) problem introduced by Vorobev et al. (2019), which is not exactly a ranking metric, but has a similar structure. The order of elements is prede-fined (documents are sorted by their indices), but the list of documents to be included is determined by :

In the sum above, for each included document we divide its relevance by its rank.

3.2. Generalized Ranking Loss Functions

To develop a stochastic ranking theory, we first formalize the class of loss functions to which our results apply. We start with a very general class of scale-free (SF) discrete loss functions. Further, by we denote a vector of context, which may include relevance and any other factors affecting the ranking quality value (like query type or document topic).

Definition 1. A function is a Scale-Free Discrete Loss Function iff the following conditions hold:

• Uniform boundedness: There exists a constant l > 0 such that holds ;

• Discreteness on subspaces: For each and linear subspace there exist convex open subsets (w.r.t. induced topology on V ), mutually disjoint for , with everywhere dense union(X denotes the closure of X w.r.t. the ambient topology), such that for any and holds ;

• Jumps regularity: By reusing defined above, for any either of the following conditions holds:

• Scalar freeness: For any 0 holds .

We denote the class of all SF discrete loss functions by . Informally speaking, is a class of bounded discrete functions on a sphere. The jumps regularity property is needed to exclude the breaking points from arg min L. One can show that all loss functions defined in Section 3.1, including the LSO loss DCG-RR, belong to .

StochasticRank out-of-box can be applied to any SF discrete loss function. However, to guarantee global convergence, we need to use consistent smoothing (see Section 4.3), which has to be chosen based on the properties of a particular metric. We propose smoothing which is consistent for the whole class of ranking loss functions defined below.

Assume that and is a tuple , where is a vector of relevance labels. As discussed in Section 2, a particular definition of a ranking loss depends on tie resolution. When some documents have equal scores, we may either use the worst permutation (as commonly done in practice) or compute the average over all orderings of such documents (as usually assumed by LTR algorithms). The definition below assumes the worst permutation.

Definition 2. A function is a Ranking Loss Function iff the following properties hold:

Informally, is the worst direction for the loss function, i.e., near a breaking point with and for some i, j, it is better to have .

• Strong upper semi-continuity (s.u.s.c.): For each n > 0 and :

Informally, this means that if we do not know how to rank two items (i.e., for ), then we shall rank them by placing the less relevant one first.

• Translation invariance:3 For any , holds: , where .

• Pairwise decision boundary:4 Partition of the space for discreteness on subspaces for can be obtained as connected components of , similarly for an arbitrary subspace V .

We denote this class of functions by . It can be shown that includes all ranking losses defined in Section 3.1, but not the LSO loss DCG-RR which does not satisfy Relevance monotonicity.

Let us now define a class , where instead of the worst ranking for ties, we consider the expected loss of a random ranking. For this, we replace the s.u.s.c. condition by:

• Soft semi-continuity (s.s.c.): For each n > 0 and we have:

where is a normally distributed random variable.

We will show that under some restrictive conditions (that are commonly assumed in the LTR literature), it does not matter which of the two definitions we use (or ) as they coincide almost surely and have equal arg min L sets. However, we will explain why these conditions do not hold in practice and in general the minimizers for and do not coincide.

3.3. Model Assumptions

We assume that for each n > 0 and there is a model such that for some matrix , where is a vector of parameters (independent from ) and is the number of parameters. Typically, each row of is a feature vector. Gradient boosting over decision trees satisfies this assumption. Indeed, let us consider all possible trees of a fixed depth formed by a finite number of binary splits obtained by binarization of the initial feature vectors. To get a linear model, we say that is a vector of leaf weights of these trees and is a binary matrix formed by the binarized feature vectors.

We will also assume that . Indeed, instead of we can define the model as

, which is equivalent due to the translation in- variance property.

3.4. Data Distribution

Assume that we are given some distribution on meaning that also implicitly incorporates information about the number of items n, i.e., for there exists a unique number n > 0 so that is some unknown distribution, e.g., the distribution of queries submitted to a search system. We are given a finite i.i.d. sample that corresponds to the train set. Let be the empirical distribution.

3.5. Optimization Target

The assumptions and definitions above allow us to define the expected (generalized) ranking quality for the function with respect to and model parameters Our ultimate goal is to find . However, since the distribution D is unknown, we have only i.i.d. samples as defined above. So, we consider the expected ranking quality under the empirical distribution :

We want to optimize globally by optimizing . This is possible because of the stability of global minimizers even for discontinuous functions: for an almost minimizer of should be an almost minimizer of (Artstein & Wets, 1995).

Thus, we need to find a global minimizer of . Due to the discrete structure, we can ignore sets of zero Lebesgue measure. Recall that essential infimum (essinf) is infimum that ignores sets of zero measure and int U denotes an open interior of the set U.

Definition 3. For any function with , we define

We need this unusual definition because of the discrete structure of our loss: we want to exclude the breaking points from arg min. One can see that despite sat-isfies Jumps regularity, the function does not have to.

Statement 1. The set is not empty.

The proof is straightforward (see Appendix A).

4. Stochastic smoothing

4.1. Smoothing of Scores

The discrete structure of ranking loss functions prevents their effective optimization. Hence, some smoothing is needed and a natural approach for this is mollifi-cation (Ermoliev et al., 1995; Dolecki et al., 1983), i.e., adding randomness to parameters. We refer to Appendix B.1 for the formal definition and the reasons why this approach is not applicable in our case.

Thus, instead of acting on the level of parameters , we act on the level of scores , where is a random variable with p.d.f. . We multiply the noise by to preserve Scalar-freeness in a sense that for any .

In the linear case , if , it is not hard to show the convergence of minimizers. However, in general, we cannot assume . In particular, this property is violated in the presence of ties that always occur in gradient boosting due to the discrete nature of decision trees. As a result, there is a smoothing bias that alters the set of minimizers.

4.2. Simple Example of Smoothing Bias

Within this section, assume for simplicity that we are dealing with one function for some arbitrary fixed n and . Let and . To clearly see how a smoothing bias can be introduced, consider the case when , where are from the Discrete- ness on subspaces assumption for . Denote by the values of L(z) on the corresponding subsets . Consider the functions and .

The value of is fully determined by and the subsets in the following way: with

In contrast, the value depends on the values much weaker: for fixed , consider the values that correspond to such that , then the only limitation we have is (this is required by Jumps regularity), which clearly allows more flexibility than the linear combination defined above.

In LTR, the issue of smoothing bias is connected to the problems of ties: the situations when and .

4.3. Consistent Smoothing

Definition 4. We say that the family of distributions is a consistent smoothing for and for the model iff for each n > 0, the following limit holds almost surely locally uniform in :

If is smooth enough and consistent, then the function is also smooth and almost surely locally uniformly approximates the discrete loss as .

To optimize ranking losses, it is important to find a consistent smoothing for functions in . Fortunately, we can do this with an arbitrary precision by shifting the normal distribution by for large enough . Relevance monotonicity and s.u.s.c. imply the following pointwise limit:

This can be strengthened to the following theorem, which is proven in Appendix B.2.

Theorem 1. is a consistent smoothing for as . Formally, except zero measure such that and holds .

By similar arguments, one can show that is a consistent smoothing for . Note that in both cases the consistent smoothing is universal for the entire class (of ), i.e., it is independent from the choice of .

Thus, LTR problems require non-trivial smoothing to preserve consistency. However, under some restrictive assumptions on the loss and on the model, any smoothing is consistent.

Recall that and assume that . The following theorem is proven in Appendix B.3.

Theorem 2. Consider open and convex subsets . If s.t. and, then any smoothing is consistent for .

In early literature on LTR, all authors used such conditions implicitly by assuming that scores for all items are different. In contrast, we do not use this assumption as it never holds in practice (e.g., when two documents have equal features). As a result, all existing LTR approaches suffer from a smoothing bias. In contrast, for the LSO problem, any smoothing is consistent, as we discuss in Appendix B.4.

4.4. Scale-Free Acceleration

It is intuitively clear that for a scale-free function it is better to have a scalar-free approximation. However, for each 0 we have , i.e., the smoothed function is no longer scale-free. To enforce scale-freeness, we take a vector with and define

We refer to such smoothing as Scale-Free Acceleration (SFA). The obtained function is indeed scale-free: for any .

Let . In our optimization, we will be inter- ested only in the case when is the vector of scores obtained on t-th iteration of the optimization algorithm. So, we have and SFA does not change the scale .

One can imagine a sphere of radius , where we restrict and homogenize it along the rays from the origin to infinity to obtain a scalar-free function.

4.5. Smoothing Properties

Finally, let us discuss regularity assumptions for smoothing on which our optimization method relies. Consider a family of distributions with p.d.f. with for some . We require the following properties:

• Continuous differentiability: is , i.e., is differentiable with a continuous derivative.

• Uniformly bounded derivative: we have uniformly in .

• Derivative decay: we have as .

• Tractable conditional expectations: conditional densities are easy to compute.5

Clearly, satisfies these assumptions .

5. Coordinate Conditional Sampling

5.1. Gradient Estimate

In the previous section, we required the ability to easily compute . This property allows us to do the following trick: we decompose with being the marginal distribu- tion for . Then, we can represent . Note that the convolution is an associative operation that commutes with differentiation and, henceforth,

Note that we differentiate by the convolution by the same . So, if we want to estimate the gradient unbiasedly, we need to sample and then compute exactly . The resulting estimate would be unbiased by construction. The following lemma suggests how to deal with .

Lemma 1. The function for all z except zero measure has at most 1 (k is from the Discreteness on subspaces assumption) breaking points (possibly depending on and ) and can be represented as:

All results of this section are proven in Appendix C.

Based on the above lemma, we prove the following theorem. Theorem 3. The derivative is equal to:

where and are from Lemma 1.

Corollary 1. For LTR losses, the above formula becomes:

Uniform boundedness of and implies the following.

Statement 2. The estimate is uniformly bounded by .

Proceeding analogously with each coordinate {1, . . . , n}, we obtain an unbiased estimate of that is uniformly bounded, in contrast to the classic estimate (Nesterov & Spokoiny, 2017) obtained by the log-derivative trick for the normal distribution that is also known as REINFORCE (Williams, 1992). Uniform boundedness is crucial since without it we would not be able to claim global convergence. We call such estimate Conditional Coordinate Sampling (CCS) and denote it by .

Note that for each coordinate when estimating we use the shared noise vector , i.e., the components of the gradient can have non-trivial covariation, but due to the uniform boundness the covariation is also uniformly bounded by .

Finally, let us discuss the complexity of computing . The following result follows from Appendix D.

Statement 3. The estimate can be computed in:

operations and O(n) additional memory for (N)DCG@k and ERR@k.

operations and O(1) memory for MRR.

5.2. SFA Gradient Estimate

Corollary 2. Unbiased CCS estimate for SFA can be obtained by orthogonalizing and z.

Since orthogonalization reduces the norm of the estimate, it necessarily reduces the variance, so we obtain the follow- ing corollary.

Corollary 3. SFA CCS estimate has a lower variance than the original CCS.

The intuition for the orthogonalization is based on Scalarfreenees: the function does not change along z direction, so this direction in the gradient does not con- tribute to optimization.

As we need to deal with possibility of , we introduce a parameter and replace by :

Lemma 2. Bias of SFA CCS estimate is uniformly bounded:

As a consequence, if or , then the estimate is asymptotically unbiased.

Thus, for the convergence analysis we consider only since the estimate can be made unbiased by varying the parameter . In practice, we consider with fixed as we observed that this parameter performs well enough. Moreover, SFA can be seen as a bias–variance tradeoff controlled by for CCS estimate of . For prac- tical comparison of and we refer to Section 7, where we show that SFA gives a sig-nificant improvement.

6. Global Optimization by Diffusion

6.1. SGLB

Previously, we discussed the importance of global optimization of . As we show in this section, this can be achieved by global optimization of smoothed with (if smoothing is consistent) using the recently proposed Stochastic Gradient Langevin Boosting (SGLB) (Ustimenko & Prokhorenkova, 2020). SGLB is easy to apply: essentially, each iteration of standard SGB is modified via model shrinkage and adding Gaussian noise to the gradients. However, the obtained algorithm is backed by strong theoretical results, see (Ustimenko & Prokhorenkova, 2020) for the details and Appendix E.1 for a brief sketch. The global convergence is implied by the fact that as the number of iterations grows, the stationary distribution of the predictions concentrates around the global optima of the implicitly regularized loss

where is an implicitly defined regularization matrix. More formally, .

Ustimenko & Prokhorenkova (2020) related the generalization gap with the uniform spectral gap parameter for the distribution (see Raginsky et al. (2017) for the definition of a uniform spectral gap). Here represents the limiting (as the learning rate goes to zero) distribution of the vector of parameters and is induced by the distribution using the relation . The following theorem is proven in Appendix E.3.

Theorem 6. The generalization gapcan be bounded by:

7. Experiments

As baseline approaches, we consider the well-known LambdaMART framework optimized for NDCG@k (Wu et al., 2010), NDCG-Loss2++ from the LambdaLoss framework (Wang et al., 2018), and SoftRank (Taylor et al., 2008). We also apply the technique proposed by Bruch et al. (2020) to the baselines, the corresponding methods are called -MART and -Loss. Similarly to Wang et al. (2018), we set the parameter for NDCG-Loss2++ to be equal to 5. According to our experiments, NDCG-Loss2++ performed significantly better than NDCG-Loss2, which agrees with Wang et al. (2018).

Table 1. Experimental results on synthetic data.

7.1. Synthetic Data

Unfortunately, in practice, we cannot verify if we have reached the global optimum as we cannot evaluate all possible ensembles of trees. But having theoretical guarantees is important as it implies the stability of the algorithm and good generalization. In this section, we describe a simple synthetic test to verify whether StochasticRank can reach the global optimum.

The following dataset is multimodal (has several local optima) for NDCG@3: the number of queries is N = 2, first relevance vector is and the second is . We consider the following features for the first query: and for the second and (in the given order).

We consider this simple synthetic dataset for two reasons: first, it clearly shows that ranking losses are likely to be multimodal; second, it allows us to demonstrate how multimodality prevents existing approaches from reaching the global optimum.

We limited the tree depth parameter to 3, so one tree can separate all documents with different features. We set the number of iterations to 1000, learning rate to 0.1, diffusion temperature to , and model-shrink-rate to .

The results are shown in Table 1. We note that the maximum achievable NDCG@3 for this dataset is 0.917, i.e., StochasticRank successfully recovers the global optimum while all other approaches converge to a local optimum 0.903.

7.2. Real Data

Datasets For our experiments, we use the following publicly available datasets. First, we use the data from YAHOO! Learning to Rank Challenge (Chapelle & Chang, 2011): there are two datasets, each is pre-divided into training, validation, and testing parts. The other datasets are WEB10K and WEB30K released by Microsoft (Qin & Liu, 2013). Following Wang et al. (2018), we use Fold 1 for these two datasets.

Quality metrics The first metric we use is NDCG@5, which is very common in LTR research. The second one is MRR, which is a well-known click-based metric. Recall that MRR requires binary labels, so we binarize each label by . Notably, while MRR is frequently used in online evaluations, it is much less studied compared to NDCG@k and there are no effective approaches designed for it. Fortunately, our method can be easily adapted to any ranking metric via a combination of SGLB with Coordinate Conditional Sampling smoothed by Gaussian noise.

Parameter tuning For all algorithms, we set the maximum number of trees to 1000. We tune the hyperparameters using 500 iterations of random search and select the best combination using the validation set, the details are given in Appendix F.

Results The results are shown in Table 2. One can see that StochasticRank (SR-) outperforms the baseline approaches on all datasets. In all cases, the difference with the closest baseline is statistically significant with a p-value < 0.05 measured by the paired one-tailed t-test. Also, in most cases, SR-outperforms SR-, which clearly demonstrates the advantage of unbiased smoothing, which takes into account the tie resolution policy.

The results in Table 2 are comparable to previously reported numbers, although they cannot be compared directly, since experimental setup (e.g., the maximum number of trees) is not fully described in many cases (Wang et al., 2018). More importantly, the previously reported results can be overvalued, since many openly available libraries compute ranking metrics using neither worst (as in our case) nor “expected” permutation, but some fixed arbitrary one depending on a particular implementation of the sorting operation.

To further understand how different techniques proposed in this paper affect the quality of the algorithm, we show the improvement obtained from each feature using the Yahoo dataset and the NDCG metrics (see Table 3). We see that

Table 2. Experimental results.

CCS is significantly better than REINFORCE, while SFA gives an additional significant performance boost. SGLB and consistent smoothing further improve NDCG. We note that for both REINFORCE and CCS we use one sample per gradient estimate since the most time-consuming operation for both estimates is sorting (see Appendix D).

8. Conclusion

In this paper, we proposed the first truly direct LTR algorithm. We formally proved that this algorithm converges globally to the minimizer of the target loss function. This is possible due to the combination of three techniques: unbiased smoothing for consistency between the original and smoothed losses; SGLB for global optimization via gradient boosting; and CCS gradient estimate with uniformly

Table 3. Comparison of the algorithm’s features on Yahoo Set 1, where means using unbiased smoothing.

bounded error and low variance, which is required for SGLB to be applied. Our experiments clearly illustrate that the new algorithm outperforms state-of-the-art LTR methods.

References

Artstein, Z. and Wets, R. Consistency of minimizers and the SLLN for stochastic programs. Journal of Convex Analysis, 2(1-2):1–17, 1995.

Bardet, J.-B., Gozlan, N., Malrieu, F., and Zitt, P.-A. Functional inequalities for Gaussian convolutions of compactly supported measures: explicit bounds and dimension dependence. arXiv e-prints, art. arXiv:1507.02389, 2015.

Bruch, S., Han, S., Bendersky, M., and Najork, M. A stochastic treatment of learning to rank scoring functions. In Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM 2020), pp. 61–69, 2020.

Burges, C. J. C. From RankNet to LambdaRank to LambdaMART: An overview. Technical report, Microsoft Research, 2010.

Cao, Y., Xu, J., Liu, T.-Y., Li, H., Huang, Y., and Hon, H.-W. Adapting ranking SVM to document retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 186–193. ACM, 2006.

CatBoost. Ranking: objectives and metrics. https://catboost.ai/docs/concepts/loss-functions-ranking.html, 2020.

Chapelle, O. and Chang, Y. Yahoo! learning to rank challenge overview. In Proceedings of the learning to rank challenge, pp. 1–24, 2011.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.

Dolecki, S., Salinetti, G., and Wets, R. J.-B. Convergence of functions: equi-semicontinuity. Transactions of the American Mathematical Society, 276(1):409–429, 1983.

Erdogdu, M. A., Mackey, L., and Shamir, O. Global non-convex optimization with discretized diffusions. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, pp. 9694–9703, 2018.

Ermoliev, Y., Norkin, V., and Wets, R. The minimization of semicontinuous functions: mollifier subgradients. SIAM Journal on Control and Optimization, 33, 01 1995.

Gelfand, S. B., Doerschuk, P. C., and Nahhas-Mohandes, M. Theory and application of annealing algorithms for continuous optimization. In Proceedings of the 24th conference on Winter simulation, pp. 494–499, 1992.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in neural information processing systems, pp. 3146–3154, 2017.

Kustarev, A., Ustinovsky, Y., Logachev, Y., Grechnikov, E., Segalovich, I., and Serdyukov, P. Smoothing NDCG metrics using tied scores. In Proceedings of the 20th ACM international conference on Information and knowledge management, pp. 2053–2056, 2011.

Liu, T.-Y. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):225–331, 2009.

Nesterov, Y. and Spokoiny, V. G. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.

Nguyen, T. and Sanner, S. Algorithms for direct 0–1 loss optimization in binary classification. In International Conference on Machine Learning, pp. 1085–1093, 2013.

Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. Catboost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pp. 6638–6648, 2018.

Qin, T. and Liu, T. Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597, 2013.

Qin, T., Liu, T.-Y., and Li, H. A general approximation framework for direct optimization of information retrieval measures. Information retrieval, 13(4):375–397, 2010.

Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. CoRR, abs/1702.03849, 2017.

Rudin, C. and Wang, Y. Direct learning to rank and rerank. In International Conference on Artificial Intelligence and Statistics, pp. 775–783, 2018.

Sakai, T. Metrics, statistics, tests. In Bridging Between Information Retrieval and Databases - PROMISE Winter School 2013, Bressanone, Italy, February 4-8, 2013. Revised Tutorial Lectures, pp. 116–163, 2013.

Tan, M., Xia, T., Guo, L., and Wang, S. Direct optimization of ranking measures for learning to rank models. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 856– 864, 2013.

Taylor, M., Guiver, J., Robertson, S., and Minka, T. SoftRank: Optimizing non-smooth rank metrics. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08, pp. 77–86, 2008. ISBN 978-1-59593-927-2.

Ustimenko, A. and Prokhorenkova, L. SGLB: Stochastic Gradient Langevin Boosting. arXiv e-prints, art. arXiv:2001.07248, 2020.

Volkovs, M. N. and Zemel, R. S. BoltzRank: learning to maximize expected ranking gain. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1089–1096, 2009.

Vorobev, A., Ustimenko, A., Gusev, G., and Serdyukov, P. Learning to select for a predefined ranking. In International Conference on Machine Learning, pp. 6477–6486, 2019.

Wang, X., Li, C., Golbandi, N., Bendersky, M., and Najork, M. The lambdaloss framework for ranking metric optimization. In Proceedings of The 27th ACM International Conference on Information and Knowledge Management (CIKM ’18), pp. 1313–1322, 2018.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

Wu, Q., Burges, C. J., Svore, K. M., and Gao, J. Adapting boosting for information retrieval measures. Information Retrieval, 13(3):254–270, 2010.

Xu, J., Li, H., Liu, T.-Y., Peng, Y., Lu, M., and Ma, W.-Y. Direct optimization of evaluation measures in learning to rank. 2010.

Yin, D., Hu, Y., Tang, J., Daly, T., Zhou, M., Ouyang, H., Chen, J., Kang, C., Deng, H., Nobata, C., Langlois, J.-M., and Chang, Y. Ranking relevance in Yahoo search. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 323–332, 2016.

Table 4. Notation.

Appendix A. Proof of Statement 1

Let us prove that the set is not empty.

Consider being open and convex sets for (see Discreteness on subspaces in Definition 1). Then, are also open and convex. Hence- forth, the function can be written as (ignoring the sets of zero measure):

Henceforth, the function is also discrete with open convex sets on the whole space . Hence, its arg min is one of these sets or their union.

B. Stochastic smoothing

B.1. Mollification

A natural approach for smoothing is mollifica-tion (Ermoliev et al., 1995; Dolecki et al., 1983): choose a smooth enough distribution with p.d.f. , consider the family of distributions , and let . Then, the minimizers of convergence to the minimizer of . Unfortunately, despite theoretical soundness, it is hard to derive efficient gradient estimates even in the linear case . Moreover, in the gradient boosting setting, we do not have access to all possible coordinates of at each iteration. Henceforth, we cannot use the mollification approach directly.

Thus, instead of acting on the level of parameters , we act on the level of scores , where has p.d.f. . We multiply the noise by to preserve Scalar-freeness in a sense that for any .

In the linear case , if , it is not hard to show the convergence of minimizers. Indeed, we can obtain mollification by “bypassing” the noise from scores to parameters by multiplying on . However, in general, we cannot assume .

B.2. Proof of Theorem 1

The trick is to proceed with and to show that there exists an open and dense set such that the convergence is locally uniform as , .

Let us proceed with proving the existence of such . Let us define

Clearly, the set is not empty, open, and dense. Now, take an arbitrary . Consider and divide the set into disjoint subsets such that all components corresponding to one group are equal and all components corresponding to different J’s are different. Clearly, we need to “resolve” only those which are equal: for small enough we obtain that even after adding the noise the order of J’s is preserved with high probability uniformly in some vicinity of , whilst for large enough we obtain the worst case permutation of corresponding to the one group with high probability uniformly on the whole . Thus, we obtain locally uniform convergence .

B.3. Proof of Theorem 2

Clearly, the conditions of the theorem imply that for general w.l.o.g. we can assume that for some indexes . Henceforth, after adding the noise with we must obtain locally uniform approximation since the functions are locally constant in a vicinity of .

B.4. Consistent smoothing for LSO

Theorem 7. In gradient boosting, if is coming from the LSO problem, then any smoothing is consistent.

Proof. Conditions from Theorem 2 translate into a condition thatfor all j and for all almost surely. This can be enforced by adding a free constant to the linear model, but in the gradient boosting setting this condition is essentially satisfied: consider , then since the matrix is 0-1 matrix and have at least one “1” in each row (every item fells to at least one leaf of each tree). Henceforth, for any general we can assume another general , where is any random variable with absolute continuous p.d.f. This in turn impliesalmost surely. Henceforth, Theorem 2 holds ensuring the consistency of smoothing.

C. Coordinate Conditional Sampling

C.1. Proof of Lemma 1

Consider a line and subsets for from the Discretness on subspaces assumption for . Then due to opennes and convexity of for . Moreover, and, by ignoring sets of zero measure, we can assume that. After that, we can take all finite as breaking points.

C.2. Proof of Theorem 3

Observe that tautologically equals and the convolution is distributive with respect to summation, so we can write:

The convolution is equal to , allowing us to rewrite:

The above formula is ready for differentiation since each term is actually a function by the variable :

After the convolution with , we finally get the required formula.

C.3. Proof of Corollary 1

For LTR (and ), all these actually lay in due to Pairwise decision boundary assumption and, henceforth, we do not need to compute them, we just need to take coordinates of as breaking points and note that if some of is not a breaking point for , then essentially . Then, we can write

Let us note that for LSO, we can actually take and and simplify the formula to:

C.4. Proof of Theorem 4

Lemma 3. The function satisfies the following linear first order Partial Differential Equation (PDE):

Proof. The proof is a direct consequence of ScalarFreenees: we just need to differentiate the equality (holding for ) by and set .

Proof. Consider writing in the integral form:

By Fubini’s theorem, we can pass the differentiation to inside the integral and obtain:

Consider the variable , then we arrive at

Taking the absolute value of both sides and using the triangle inequality, we derive

where by the Uniform boundedness assumption and the last integral is well defined by the Derivative decay assumption.

Corollary 4. independently from .

Proof. Immediate consequence of the previous lemmas.

Now, assume that is differentiable and non-zero at z. The following lemma describes in terms of .

Lemma 5. The following formula holds:

Proof. Consider writing

Then, by Lemma 3 we obtain the formula.

D. Fast ranking metrics computation

We need to be able to compute for an arbitrary and a position i, where represents for the CCS estimate (note that there is no ambiguity in computing argsort since with probability one for ). Moreover, argsort requires O(n log n) operations.

Typically, the evaluation of costs O(n), e.g., for ERR. Fortunately, for many losses it is possible to exploit the structure of the loss that allows evaluating L in O(1) operations using some precomputed shared cumulative statistics related to the loss which can be computed in O(n) operations and O(n) memory.

For all in the worst case we need evaluations of L to compute the CCS (for each of n coordinates to sum up at most n evaluations). Thus, the overall worst case asymptotic of the algorithm would be if the evaluation costs O(1). For the sake of simplicity, we generalize both NDCG@k and ERR into one class of losses:

where are some predefined positions’ weights typically picked as for NDCG@k and for is typically picked as for and for ERR; and finally we define g(r) = r for and for .

First, we need to define and compute the following cumulative product:

where . Denote . Next, we use them we define the following cumulative sums:

All these cumulative statistics can be computed at the same time while we compute . Note that we need additional O(n) memory to store these statistics.

Now fix a position i and score . Express as . Thus, we need to compute .

If , we define ; otherwise, define — this variable represents the new position of the -th document in . Also, if , we define:

Otherwise, define:

Then, we calculate as . The meaning of the formula is simple: we measure the change of gain of the -th document if we change its score to from minus the difference of gains of all documents on positions from up to , if , and from i + 1 up to , if .

The above formulas can be verified directly by evaluating the cases when or and expanding as . Note that all differences take into account all documents on positions from j + 1 up to i inclusively.

Note that . Indeed,

Therefore, we obtain:

E. Global Optimization by Diffusion

E.1. Overview of SGLB idea

Global convergence of SGLB is guaranteed by a so-called Predictions’ Space Langevin Dynamics Stochastic Differential Equation

where denotes the predictions Markov Process on the train set is a standard Wiener process with values in is an implicit preconditioner matrix of the boosting algorithm, and is a temperature parameter that controls exploration/exploitation trade-off. Note that here we override the notation since . Further by we denote an implicitly defined regularization matrix.

The global convergence is implied by the fact that as , the stationary distribution of F(t) concentrates around the global optima of the implicitly regularized loss

More formally, the stationary distribution is . According to Ustimenko & Prokhorenkova (2020), optimization is performed within a linear space that encodes all possible predictions F of all possible ensembles formed by the weak learners associated with the boosting algorithm. We refer interested readers to (Ustimenko & Prokhorenkova, 2020) for the details.

E.2. Proof of Theorem 5

Let us first prove the following lemma.

Lemma 6. The function is uniformly bounded, Lipschitz continuous with constant , and Lipschitz smooth with constant .

Proof. The proof of Lipschitz continuity is a direct consequence of the uniform boundedness by of CCS. If we differentiate CCS estimate one more time, we obtain the estimates for the Hessian that must be uniformly bounded by due to the uniform boundedness of , thus giving Lipschitz smoothness.

In addition to Lipschitz smoothness, continuity, and boundedness from above, we also need (Ustimenko & Prokhorenkova, 2020), but that condition is satisfied since both terms are uniformly bounded by . Thus, the algorithm has limiting stationary measure .

Then, consistency of the smoothing ensures that as , where

and thus for the measures and for concentrate around the global optima of .

E.3. Proof of Theorem 6

Following Raginsky et al. (2017);

Ustimenko & Prokhorenkova (2020), we immediately obtain that with and . In general non- convex case can be of order exp(O(d)) (Raginsky et al., 2017) but for smoothed SF losses we can give a better estimate without exponential dependence on the dimension.

Observe that our measure is the sum of uniformly bounded Lipschitz smooth with constant and a Gaussian , then the more appropriate bound from the logarithmic Sobolev inequality applies according to

being dimension-free. Note that Miclo’s trick in the proof of the lemma should be skipped since is already fine enough. Coupling the spectral gap bound with the generalization gap, we obtain the theorem.

F. Parameter tuning

For tuning, we use the random search (500 samples) with the following distributions:

• For learning-rate log-uniform distribution over .

• For l2-leaf-reg log-uniform distribution over for baselines and l2-leaf-reg=0 for StochasticRank.

• For noise strength (Bruch et al., 2020) uniform distribution over [0, 1].

• For model-shrink-rate log-uniform distribution over for StochasticRank.

• For diffusion-temperature log-uniform distribution over for StochasticRank.

• For mu log-uniform distribution over for StochasticRank-.

designed for accessibility and to further open science