Ranking a set of objects: a graph based least-square approach

2020·Arxiv

Abstract

Abstract

We consider the problem of ranking N objects starting from a set of noisy pairwise comparisons provided by a crowd of equal workers. We assume that objects are endowed with intrinsic qualities and that the probability with which an object is preferred to another depends only on the difference between the qualities of the two competitors. We propose a class of non-adaptive ranking algorithms that rely on a least-squares optimization criterion for the estimation of qualities. Such algorithms are shown to be asymptotically optimal (i.e., they require comparisons to be -PAC). Numerical results show that our schemes are very efficient also in many non-asymptotic scenarios exhibiting a performance similar to the maximum-likelihood algorithm. Moreover, we show how they can be extended to adaptive schemes and test them on real-world datasets.

1 INTRODUCTION

Ranking algorithms have many applications. For example they are used for ranking pages, users preferences against advertisements on the web, hotels, restaurants, or online games [1], [2]. In general a ranking algorithm infers an estimated order relation among objects starting from a set of evaluations or comparisons. Sometimes, such evaluations are performed by human “workers” in the framework of crowdsourcing applications. However, since the behavior of humans cannot be deterministically predicted, it is usually described through the adoption of a probabilistic model. Then, the challenge in designing algorithms, is the ability to infer reliable estimates of the ranking starting from “noisy” evaluations of the objects. Often the ranking algorithm resorts to pairwise comparisons of objects. In this work, we focus on such a class of ranking algorithms. Several stochastic models have been proposed in the literature [3], [4], [5], [6] to represent the outcome of comparisons. Most of them are based on the idea that objects to be compared have an intrinsic quality and that the probability, , that object i is preferred to object j depends on their qualities and . In this context, we devise a class of efficient algorithms, which reconstruct object qualities from pairwise difference through a least-square (LS) approach. To do so, we establish a parallelism between the estimation process and the average cumulative reward of random walks on a weighted graph.

1.1 System model

Let be a compact set. We assume that N objects are available for ranking: object i is provided with an intrinsic quality, , which is unknown to the system. Qualities induce a true ranking r among objects, in which iff . A ranking algorithm resorts to a set of observations (or answers) provided by workers, which compare pairs of objects and return the identity of the object they prefer. The comparison procedure implicitly contains some randomness reflect-ing the workers’ behavior. Thus, in general, workers’ answers can be modeled as a collection of binary random variables, whose distribution depends on the qualities of the objects to be evaluated.

Due to this randomness in the evaluation process, the inferred ranking for object , does not always coincide with the true ranking r(i). The reliability of depends on how the evaluation process is organized. In particular, it depends on (i) the workers’ behavior, (ii) the choice of the set of object pairs to be compared, (iii) the number of workers assigned to each pair of objects, and (iv) the processing algorithm used to infer the ranking from workers’ answers.

We assume that all workers behave similarly and that they provide independent answers. In particular, a worker comparing objects i and j, will express a preference for object i against j with probability:

where the function is differentiable and strictly increasing in its argument (and therefore invertible) and such that . Moreover, we assume that is bounded away from zero for where . When the pair of objects (i, j) is compared, the worker’s output is modeled as a binary random variable, , whose outcomes have probability

The model in (1) is pretty general. For example, it encompasses

• the Thurstone model [5], where the preferred object (in a pair) is chosen in accordance with the qualities as perceived by the worker and defined as

respectively, where and are zero-mean random variables that represent noise terms. In this case is the cumulative distribution function of the zero-mean random variable , i.e.,

• the Bradley-Terry-Luce (BTL) model [3], [4], where

Let V = {1, . . . , N} be the set of objects. We observe that an arbitrary choice of a set of object pairs to be compared, denoted by , automatically induces an undirected graph G, whose vertex and edge sets are, respectively, V and E. Clearly, it is possible to infer a ranking among the N objects only if the graph G is connected.

Each object pair is assigned to a number of workers W. In general, an increase of W leads to a more reliable estimate of the ranking. On the other hand, the overall complexity, C, of the ranking algorithm is proportional to the total number of workers employed in the process, i.e.,

Then, an efficient ranking algorithm must find a good trade-off between the complexity C and the reliability of the inferred ranking, i.e., by returning an almost correct ranking of objects with a minimal number of pair comparisons.

About the reliability of the inferred ranking we say that an estimated ranking is -quality approximately correct (or, is an -quality ranking) if whenever . Moreover a ranking algorithm is -PAC [7], [8], [9] if it returns an -quality ranking with a probability larger than .2

1.2 Paper contribution and related work

This paper contributes to a better understanding of the fundamental limits of ranking algorithms based on noisy pairwise comparisons. Our main results complement and extend previous findings about minimal complexity of ranking algorithms under different non-parametric preference models recently derived in [?], [8], [9]. As shown in former studies the efficiency of ranking algorithm is crucially determined by the structure of the underlying preference model.

On the one hand, under a non-parametric preference model satisfying both Strong Stochastic Transitivity (SST) and Stochastic Triangle Inequality (STI) properties,3 a provably asymptotically-optimal4 adaptive algorithm has been proposed, under the restriction that . In particular, the algorithm proposed in [9] is -PAC provided that comparisons are dynamically allocated on the basis of previous outcomes. On the other hand, in [8], [9], it is shown that comparisons are strictly needed to obtain a reliable ranking as soon as either STI or SST are relaxed.

When considering parametric models, estimating a ranking is essentially related to estimating the under- lying qualities. [10], [11] provide a characterization of the expected norm-two distance between estimated and true qualities (later on referred to as mean square error (MSE)), in connection with the properties of a fixed graph G. In particular [10], under the assumption that is log-concave, provides universal (i.e., applicable to optimal algorithms, such as the maximum-likelihood (ML) algorithm) order-optimal upper and lower bounds for the MSE, relating it to the spectral gap of a certain scaled version of the Laplacian of G. The very recent paper [11], for the BTL model only, introduces a LS algorithm and provides upper and lower bounds for a variant of the MSE and the relative tail-probabilities achievable by such algorithm, characterizing it in terms of the graph resistance.

Interesting works are also [12], [13], [14], [15], [16], [17], [18]. In [16], [17] a LS approach for ranking is first introduced, but no theoretical guarantees are given. In particular, [17] proposes Sync-Rank, a semi-definite programming algorithm based on the angular synchronization framework. In [12], [13], instead, an iterative algorithm that emulates a weighted random walk of graph G is proposed and its performance analyzed under the BTL model. In particular, it provides bounds on the MSE and the corresponding tail-probabilities. A direct comparison between the performance of algorithms proposed in [11], [14], [15] is reported in [11] where the LS approach is shown to be, in general, asymptotically more efficient. Under the BTL model, [12], [13] propose and analyze algorithms able to identify the top-k quality objects. At last, [18] describes a ranking algorithm based on the singular value decomposition approach by assuming that workers return unquantized noisy estimates

of objects quality differences.

Regarding online ranking algorithms, in [7], for the BTL model, an online algorithm inspired to a finite-budget version of quick sort is described, able to obtain an -PAC ranking with compar- isons. In [19], it is shown that, for online ranking algorithms, parametric models help to reduce the complexity only by logarithmic factors, in order sense.

In this work, unlike [11], we introduce a rather general parametric preference model according to which preference probabilities are determined by an arbitrary smooth monotonic function of object-quality differences. In this scenario, we show that order-optimal non-adaptive algorithms can be defined without the necessity of introducing any restriction to parameter . In particular, differently from [10], [11], we work with the PAC framework and show that our algorithms are -PAC, provided that comparisons are blindly allocated in a single round. Observe that our preference model does not necessarily satisfy STI, while it satisfies SST. Our ranking procedure is based on the reconstruction of object qualities from pairwise quality differences, by adopting a LS approach akin to the one in [11]. Notice however that the analysis in [11] only applies to the case where total comparisons are performed. Our analysis establishes a parallelism between the quality estimation process and the cumulative reward of random walks on graphs. As an original contribution, we also introduce a weighted LS algorithm with performance very close to the more complex ML algorithm. Finally, by simulation, we show that the performance of our algorithms is extremely good also in non-asymptotic scenarios.

The paper is organized as follows: in Section 2 we introduce a ranking algorithm based on the Maximum Likelihood (ML) approach, which is used as a performance reference. In Section 3 we describe our proposed LS estimation algorithm, whose asymptotic analysis is investigated in Section 4. The LS estimation algorithm is then tested in Sections 5 and 6 against synthetic and real-world datasets, respectively. Finally, in Section 7 we draw our conclusions.

1.3 Notation

Boldface uppercase and lowercase letters denote matrices and vectors, respectively. I is the identity matrix. The transpose of the matrix A is denoted by , while indicates its (i, j)-th entry. For the sake of notation compactness we use the notation to define a matrix A whose elements are . Finally, the symbol represents the Hadamard product, while calligraphic letters denote sets or graphs.

2 MAXIMUM-LIKELIHOOD QUALITY ESTIMATION

Consider a graph G(V, E) with |V| = N vertices where each pair of objects is evaluated W times by independent workers 5. Without loss of generality we assume that the indices of the objects connected by the generic edge are such that i > j. Moreover, we assume that the m-th worker evaluating the pair of objects (i, j) outputs the binary random variable whose distribution is given by (2).

In our proposed ML approach, the estimate of the ranking can be obtained by sorting the quality estimates which are obtained as follows:

(5) When workers are independent on each other and behave similarly, the random variables can be modeled as independent and identically distributed. Therefore, the conditional probability in (5) factorizes as

By using (2) we write

where we recall that . By substituting the above result in (5), the ML estimate of the qualities q can be rewritten as

where

and . The function has a finite global maximum. Indeed, since , and , it is straightforward to show that . However, in general, is a non-linear function of q and its maximization non trivial. Nevertheless, a local maximum can be found by using standard techniques such as, for example, the Newton-Raphson method which works iteratively and requires the function to be twice differentiable.

Let be the estimate of q at iteration t = 1, 2, . . ..Then the estimate of q at iteration t + 1 can be updated as follows:

where and S(q) are, respectively, the gradient and the Hessian matrix of . Specifically,

and . In order to compute and S(q) consider a generic node and the set of edges connecting node h to its neighbors. Then, the function can be rewritten as

where the term c does not depend on . Since , we can write the partial derivatives of as follows:

and, similarly

It immediately follows that

and

Moreover, for

The above equations can be specialized for both the Thurstone model as well as for the BTL model, by using the expressions for provided, respectively, in (3) and (4).

3 LEAST-SQUARES QUALITY ESTIMATION

We propose a simpler linear estimation algorithm, based on a least-square criterion, that can be applied on the graph G(V, E). Let the distance between objects i and j be

and let be the set of binary answers, of cardinality W, provided by the workers comparing the pair (i, j). Also, let be the number of times object i is preferred to object j. Then, by construction, follows the binomial distribution , where . Out of the evaluation results, an estimate of is formed as

where is the estimate of , and represents the estimation error on the probability . Note that has zero mean and variance . As a consequence, , where represents the error on the estimate of induced by the presence of . From the set of noisy estimates , the estimate of can be obtained by solving the following LS optimization problem

where are arbitrary positive weights, whose setting is discussed in Section 3.1. The solution of (9) satisfies the following linear equations:

where represents the neighborhood of node i (i.e., the set of nodes connected to i in G), and is its generalized degree, i.e., . We can compactly express the previous linear system in terms of the matrix associated to the graph G, whose elements are defined as

Let I be the identity matrix, , and . Moreover let and be, respectively, the antisymmetric matrices of the true and estimated quality differences6. Thus, from (10) we can write:

where represents the Hadamard product and 1 = is a column vector of size N. We observe that, by construction, , i.e., is singular. Indeed , as it can be easily checked. This implies that the associated linear operator on is not injective and that, given a solution of (10), also is a solution of (10) for any . Note, however, that, for the purposes of object ranking, the actual value of is irrelevant, since every solution of the form induces the same object ranking. Therefore, we can arbitrarily fix the quality of, say, object N to 0 as a reference, i.e., . To keep into account this constraint, we define the new matrices H and M as follows:

and , respectively. We then replace and in (11) with, respectively, M and H. Since M is full rank, solving for q we obtain

where we have used the fact that .

3.1 Weight optimization

In the following, we will consider two possible choices for the weights . The first, which will be studied in the next section for its simplicity, corresponds to for all i, j, and will be called unweighted LS or simply LS. The second, which will be called weighted LS (WLS), is dictated by the fact that the estimates do not have the same reliability. Indeed, by developing (8) at the first order for , we obtain

so that, if we neglect the higher-order term, is a zero-mean random variable with variance

Given the values of , the optimal weights for in (10) are then proportional to . For our WLS algorithm, we will then set , with

where , for a small positive parameter such that exists finite. Note that, under this setting

4 ASYMPTOTIC ANALYSIS OF THE LEASTSQUARE ESTIMATOR

All the theoretical results in this section are obtained by considering the unweighted LS estimator, for simplicity. However, they can be extended to the general weighted case as long as is bounded away from 0, as for the case described in Section 3.1.

The following propositions derive the conditions for the asymptotic convergence of the estimated qualities to their true values. We start by presenting a preliminary asymptotic result on the mean square error.

Proposition 4.1: Consider the unweighted LS estimator in (10). Assume that the degree of nodes of the graph are upper-bounded and define . Then the mean

where c is a constant, is the largest eigenvalue of and for a sufficiently large .

The proof is provided in Appendix A.

Even if an expression similar to (14) is reported in [10], we recall that the latter was derived for perfect ML-estimators under the assumption that is log-concave; our results, instead, apply to LS algorithm for a generic strictly-increasing . Furthermore, (14) complements and extends results in [11] under more general settings (we recall that results in [11] apply to the BTL model only). It is also to be noted that the theoretical results in [11] only apply to the regime where W is large, i.e., . Under such constraint, for any connected graph, the total complexity of the algorithm in terms of number of comparisons is at least . From (14), we can deduce that, whenever is bounded (as for example in the case of Ramanujan graphs), by symmetry, 1, . . . , N. Thus, if for , then converges in probability to q. To find out the minimum number of comparisons under which the LS approach satisfies the -PAC conditions, we need to evaluate for . The following proposition gives sufficient conditions in order for the absolute error to converge to zero in the properly defined limiting regime. Proposition 4.2: Consider the unweighted LS estimator in (10). For any , as N grows, , provided that i) (i.e., the -norm of is bounded), ii) the total number of edges of G is O(N), and W > for some .

Assumption i) can be weakened by the following condi- tion i’):

i’) . The proof is provided in Appendix B.

Remark 4.1: Note that Proposition 4.2 provides suffi-cient conditions for the existence of a -PAC ranking algorithm with complexity . In the following subsection we characterize classes of graphs meeting condition (i) or (i’) of Proposition 4.2.

4.1 Considerations on graphs structure

computes the quality of object i as the average value of the sum of estimated quality differences along all paths joining node i to the reference node N. In other words, can be regarded as the average total reward earned by a standard random walk that starts from node i and stops as soon it hits node N, when estimates are the elementary rewards associated to graph edges [20]. Then, in Section 4.1.1 we restrict our asymptotic analysis to directed and acyclic graphs. Finally, in Section 4.1.2 we extend it to the more general class of undirected graphs.

4.1.1 G is directed and acyclic

An explicit solution of (10) can be given when graph G is turned into a directed graph, i.e., by imposing to all edges one of the two possible directions. While this assumption is suboptimal, since it constrains the random walk to a subset of possible trajectories, it greatly simpli-fies the analysis. Indeed, first observe that, for directed graphs, (10) can be rewritten as:

where represents the set of in-neighborhoods of i and . Then, when the graph is directed and acyclic, and has the reference node, N, as a common ancestor, an explicit solution for , is

where

and is the length of the longest (simple) path from node 1 to the reference node. Proposition 4.3 gives suffi-cient conditions for a directed acyclic graph to meet the requirements of Proposition 4.2. The proposition exploits the notion of proximality between nodes according to the following definition:

Definition 4.1: Given a family of graphs , we say that a node i is proximal to the reference node N, with parameters , if a random walk starting from i reaches the reference node N within h hops with a probability that is asymptotically (with N) bounded below by .

Proposition 4.3: Given a family of directed and acyclic graphs with bounded diameter, condition (i’) of Proposition 4.2 is satisfied if one of the following three conditions is met: (i) all paths from any node to the reference have bounded length, (ii) , or (iii) a fraction bounded away from 0 of the in-neighbors of any node is proximal for some and . The proof is provided in Appendix C.

4.1.2 G is undirected

Now, let us go back to the original formulation (10) on the undirected graph. In the following, we will show that, considered from the point of view of a given node, the solution of (10) for an undirected graph can be obtained by defining an equivalent problem for a properly defined directed acyclic graph. Consider the graph G = (V, E) on N nodes and let T be the matrix obtained from matrix H by removing the last row and column (i.e., those corresponding to the reference node N). Consider a given node , and notice thatgives the average number of times that node j is visited in the random walk starting from i, before ending in the reference node N [21]. Let

be the average number of times any edge incident to node j is traversed in the direction from j to its neighbors, in the standard random walk defined on G. Now, define a directed graph

if and only if and one of the two following conditions are satisfied: (i) j < N and , or (ii) , and .

of j in

only in-neighbors and node N (the reference node) has only out-neighbors. It is also easy to prove that

acyclic. Indeed, suppose that the cycle belongs to

, which is impossible.

, for which, given that the current node is j, the probability of taking outgoing edge is given by

The following proposition relates the standard random walk on G to the biased random walk on

can be obtained by solving

The proof is provided in Appendix D.

seen as the average total reward of the standard random walk on graph G or as the average total reward of the biased random walk on graph

proposition gives sufficient conditions for a family of graphs to meet the conditions of Proposition 4.2.

with bounded diameter, condition i’) of Proposition 4.2 is satisfied if, for each node one of the following conditions are satisfied: (i) all paths in

from i to the reference have bounded length, (ii) in

a fraction bounded away from 0 of the in-neighbors of any node is proximal.

The proof is provided in Appendix E.

Example 4.1: Consider the family of complete graphs on N nodes, i.e., .7 Because of symmetry, we can easily see that, after a proper permutation of the nodes, (

the same reason, in (

. Thus, the only surviving edges in (

the edges connected either to node 1 or to the reference node N. Then, the maximum path length from node 1 to the reference is 2. Thus this family of graphs meets condition i) of Prop. 4.5. In particular, the estimate of is given by

bers. Let us build the family of graphs as follows. Nodes (a set that includes the reference) are “hubs” with potentially unbounded degree. The subgraph induced by the hub nodes is a connected arbitrary graph. The remaining nodes are divided into subsets . Subset , is composed of nodes with maximum degree , which are neighbors of hub node and whose other neighbors all belong to . It is easy to see that, for this family of graphs, the diameter is bounded by .

reference must pass through the hub nodes, it is easy to see that, in (

belonging to . Whenever the biased random walk on (

hub node ) does not enter it any more. Thus, we can divide into two parts the biased random walk: the first on the subgraph of (

hub node serves as reference, and the second on the hub nodes. Then, we can deduce the following facts.

Fig. 1. Error probability achieved by several ranking algorithms plotted versus the complexity per object C/N, for N = 50. Object qualities are equally spaced in [0, 1) and the workers behave according to the Thurstone model.

Remark 4.2: Although our unweighted LS estimator is akin to the one in [11], our analysis of its performance differs substantially, also because we consider the PAC approach. Consequently, our characterization of “good” graphs does not coincide with that of [11]. For instance, a particular case of Example 4.2 is the wheel graph, which corresponds to choosing (the reference node as the only hub) and . From [11, Theorem 1], the wheel graph would require W = O(N) comparisons per edge in order for the upper bound on the estimation error to hold, since every edge belongs to a simple path from any node to the reference. Instead, Prop. 4.5 allows to conclude that is enough to achieve the -PAC.

Example 4.3: Star graphs represent a particular sequence of acyclic graphs with bounded-length paths. Therefore, they satisfy condition i) of Prop. 4.5. In such a case, an object (let us say object 1) is taken as pivot (i.e. center of the star) and qualities of all the other objects are estimated only through direct comparisons with the pivot. Observe, that, in such particular case, ranking among objects can be directly inferred from without the necessity of inverting function . In practice it is enough to rank objects according to the following rules: iff and iff . Therefore, star graphs are appealing when function (i.e. the precise worker model) is not known.

5 RESULTS WITH SYNTHETIC DATASETS

We present numerical results showing the performance of our proposed algorithms for moderate values of N. In Figures 1 and 2 we compare the error probability achieved by several ranking algorithms versus the complexity per object C/N. Objects qualities are equally

Fig. 2. Error probability achieved by several ranking algorithms plotted versus the complexity per object C/N, for N = 500. Object qualities are equally spaced in [0, 1) and the workers behave according to the Thurstone model.

spaced in the range [0, 1), i.e., object i has quality i/N where N = 50 (Fig. 1) and N = 500 (Fig. 2). Workers’ behavior is described by the Thurstone model detailed in Section 1.1 where and is the cdf of a Gaussian random variable with zero mean and standard deviation . On the y-axis we display the empirical probability of generating an output which is not an -quality ranking, for . Note that an error is counted whenever at least two objects, whose quality difference exceeds , appear swapped in the estimated ranking. The curve labeled “MergeRank” refers to the Merge-Rank algorithm proposed in [22], which we consider as a performance reference. The LS, WLS 8, and ML algorithms have been applied to randomly generated regular graphs [23] whose nodes have degree (lines with square marker) and (lines without markers). The figure shows the superior performance of our ranking algorithm. It is interesting to observe that the WLS algorithm provides significant enhancements with respect to the LS algorithm and almost perfectly matches the performance of the more (computationally) complex ML approach. As the number of nodes increases our proposed solutions substantially outperform the “MergeRank” algorithm.

Figure 3 compares the performance of the LS and of the WLS algorithms for N = 200 objects. Object qualities are randomly generated according to a uniform distribution in [0, 1). Other system parameters are set as in Figure 2. The figure reports the empirical error probability plotted versus the number of tests per node, C/N, for different values of the degree of the nodes in the graph. We first observe that, given C/N, the number

Fig. 3. Performance of the LS and WLS ranking algorithms plotted against the complexity per object C/N, for N = 200 objects. Object qualities are drawn from a uniform distribution in [0, 1) and the workers behave according to the Thurstone model.

of tests per edge of the graph decreases as the degree, , increases. Hence, as increases, distances between pairs of nodes (corresponding to edges of the graph) are estimated with a decreasing accuracy. In spite of that, a larger number of neighbors for each node (i.e., a larger ) leads a more reliable estimation of object qualities. This effect is more evident when the WLS algorithm is employed. Indeed, because of the weights , as increases, WLS is able to well exploit the increasing number of highly-reliable edges in the graph connecting objects with similar qualities; at the same time WLS is able to limit the impact of the greater number of scarcelyreliable edges that connect objects with largely different qualities.

5.1 Adaptive multistage approach

The performance of the proposed ranking algorithms can be improved by adopting a multistage approach where, at each stage, new edges are added to the graph, depending on the quality estimates obtained at previous stage. The rationale of this approach stems from the fact that such algorithms provide approximate rankings, in which the probability of swapping the order of two objects increases as their distance (in terms of their qualities) decreases. Therefore, in order to mitigate this phenomenon and, thus, improve the reliability of the estimate, it is convenient to (i) add to the graph extra edges connecting neighboring objects (in terms of their estimated qualities); (ii) assign additional workers to the already existing edges connecting the aforementioned neighboring objects. This procedure can be iterated until a desired performance level is achieved.

Fig. 4. Error probability provided by ML and WLS algorithms when a 2-stage adaptive approach is employed, for , and N = 50.

In our simulation setup, we have considered a 2-stage approach where we first apply the estimation algorithm to a random regular graph, , of degree , obtaining the vector of estimates . In the second stage, we create a new regular graph, of degree , where each node is connected to its closest neighbors, according to the estimates . Finally, the estimation algorithm is applied to the graph obtaining the output which is used to infer the ranking. In Figure 4 we show the performance of the ML and WLS algorithms when the proposed multistage approach is employed. For both algorithms we show the error probability versus the number of tests per object, C/N, for , and N = 50. We observe that the second stage allows for a significant improvement of the performance and a reduction of about 60% of the required tests per object for and of about 30% for . In both cases the performance of the WLS algorithm is very close to that provided by the ML algorithm.

6 RESULTS WITH REAL-WORLD DATASETS

In this section, we show that our algorithm works well even when considering a real scenario, where the “evaluations” are the outcome of experiments, and not synthetically generated by simulations. In particular, we consider five recent seasons of the English Premier League and build up a N = 20 complete graph, where nodes are the football teams and edges are the matches between each pair of them. The match between team i and team j is considered as lasting for 180 minutes, since it includes both the round when i is at home and the round where i is away. If team i has scored goals in the match against team j, we count evaluations in

Fig. 5. Distance between true and estimated ranking for Premier League scores. The x-axis is the season. The y-axis is the Kendall tau distance (number of inversions) between the final season ranking and the output of the WLS algorithm, for different choices of the model and of the parameters.

favor of i when compared to j, where and are constant. The total number of comparisons between i and j is then simply .

The WLS algorithm has been run with and both the Thurstone and BTL models, to see the influence of the underlying worker model. The true ranking is assumed to be the final season ranking. The results have been plotted in terms of the Kendall tau distance, which counts the number of inversions in the estimated ranking with respect to the true ranking, i.e. the number of pairs (i, j) for which i is ranked better than j in the true ranking and worse than j in the estimated one.

Results are shown in Fig. 5. First, we can observe that the performance is better with than with , since in the latter case there might be some edges for which the estimated preference probability is very close to either 0 or 1. Such edges are automatically dropped by the WLS algorithm, while in the former case each object in each comparison receives at least preferences, so that all edges are used for ranking computation. Second, the Thurstone model seems to be slightly better suited than the BTL model. Third, while in most cases, the influence of parameters is limited, there are cases (like season 16/17) that are more sensible to the choice of and . It is worth mentioning that, in [17], the Sync-Rank algorithm is applied to older Seasons of the Premier League. Comparatively, for , Kendall tau distance for our algorithm never goes beyond 20, giving rise to a Kendall correlation larger than 0.90, which is a better result than those shown in [17].

7 CONCLUSIONS

In this paper, we focused on the problem of ranking N objects starting from a set of noisy pairwise comparisons. Objects are assumed to be endowed with intrinsic qualities. The probability that object i is preferred to j is given by an arbitrary smooth monotonic function of the difference between qualities of the two competitors. For such a scenario we developed a class of order-optimal ranking algorithms, i.e. algorithms, which are provably -PAC when comparisons are blindly allocated in a single round. Our ranking procedure is based on the reconstruction of object qualities, from pairwise quality differences, by adopting a simple LS approach. The analysis establishes a parallelism between the quality estimations process and the cumulative reward accumulated by random walks on graphs. Finally, by simulation, we show that the performance of our algorithms, and further variants, is extremely good also in non-asymptotic scenarios and approaches that obtained by the ML algorithm. Our results complement and extend previous recent studies [7], [8], [9] on the minimal complexity of ranking algorithms under different non-parametric preference models.

REFERENCES

[1] M. Richardson, E. Dominowska, and R. Ragno, “Predicting clicks: Estimating the click-through rate for new ads,” in Proceedings of the 16th International Conference on World Wide Web, ser. WWW ’07, 2007, pp. 521–530.

[2] R. Herbrich, T. Minka, and T. Graepel, “TrueskillTM: A bayesian skill rating system,” in Advances in Neural Information Processing Systems 19, B. Sch¨olkopf, J. C. Platt, and T. Hoffman, Eds. MIT Press, 2007, pp. 569–576.

[3] R. L. Plackett, “The analysis of permutations,” Applied Statistics, pp. 193–202, 1975.

[4] R. D. Luce, Individual choice behavior: A theoretical analysis. Courier Corporation, 2012.

[5] L. L. Thurstone, “The method of paired comparisons for social values.” The Journal of Abnormal and Social Psychology, vol. 21, no. 4, p. 384, 1927.

[6] R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952.

[7] B. Sz¨or´enyi, R. Busa-Fekete, A. Paul, and E. H¨ullermeier, “Online rank elicitation for plackett-luce: A dueling bandits approach,” in Advances in Neural Information Processing Systems, 2015, pp. 604– 612.

[8] M. Falahatgar, A. Orlitsky, V. Pichapati, and A. T. Suresh, “Maximum selection and ranking under noisy comparisons,” arXiv preprint arXiv:1705.05366, 2017.

[9] M. Falahatgar, A. Jain, A. Orlitsky, V. Pichapati, and V. Ravindrakumar, “The limits of maxing, ranking, and preference learning,” in International Conference on Machine Learning, 2018, pp. 1426–1435.

[10] N. B. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchan- dran, and M. J. Wainwright, “Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2049–2095, 2016.

[11] J. Hendrickx, A. Olshevsky, and V. Saligrama, “Graph resistance and learning from pairwise comparisons,” in Proceedings of the 36th International Conference on Machine Learning. International Machine Learning Society, 2019.

[12] M. Jang, S. Kim, C. Suh, and S. Oh, “Top-k ranking from pairwise comparisons: When spectral ranking is optimal,” arXiv preprint arXiv:1603.04153, 2016.

[13] Y. Chen and C. Suh, “Spectral mle: Top-k rank aggregation from pairwise comparisons,” in International Conference on Machine Learning, 2015, pp. 371–380.

[14] S. Negahban, S. Oh, and D. Shah, “Iterative ranking from pair- wise comparisons,” in Advances in neural information processing systems, 2012, pp. 2474–2482.

[15] ——, “Rank centrality: Ranking from pairwise comparisons,” Operations Research, vol. 65, no. 1, pp. 266–287, 2016.

[16] A. N. Hirani, K. Kalyanaraman, and S. Watts, “Least squares ranking on graphs, hodge laplacians, time optimality, and iterative methods,” CoRR, vol. abs/1011.1716, 2010. [Online]. Available: http://arxiv.org/abs/1011.1716

[17] M. Cucuringu, “Sync-rank: Robust ranking, constrained ranking and rank aggregation via eigenvector and semidefinite programming synchronization,” CoRR, vol. abs/1504.01070, 2015. [Online]. Available: http://arxiv.org/abs/1504.01070

[18] A. d’Aspremont, M. Cucuringu, and H. Tyagi, “Ranking and synchronization from pairwise measurements via svd,” arXiv preprint arXiv:1906.02746, 2019.

[19] K. R. M. J. W. Reinhard Heckel, Nihar B. Shah, “Active ranking from pairwise comparisons and when parametric assumptions don’t help,” arXiv preprint arXiv:1606.08842, 2016.

[20] B. Ewald, J. Humpherys, and J. West, “Computing expected transition events in reducible markov chains,” SIAM Journal on Matrix Analysis and Applications, vol. 31, no. 3, pp. 1040–1054, 2010.

[21] J. G. Kemeny and J. L. Snell, Markov Chains. Springer-Verlag, New York, 1976.

[22] M. Falahatgar, Y. Hao, A. Orlitsky, V. Pichapati, and V. Ravin- drakumar, “Maxing and ranking with few assumptions,” in Advances in Neural Information Processing Systems, 2017, pp. 7063– 7073.

[23] B. Bollob´as, Random graphs. Cambridge university press, 2001.

1 Ranking a set of objects: a graph based least-square approach

Supplemental material

Given that:

we can write the MSE, on the estimate of q as

where is a deterministic matrix and is the largest eigenvalue of C. By using the definition of the matrix H the term in (20) can be expanded as

In order to evaluate the averages in (21) we consider the following linear approximation of

where are positive constants. By applying this result to (21) we obtain

thanks to the fact that and that . As for the term we can write

Since, is defined as a difference of two probabilities we have . Then . This result allows to upper-bound the term as

for some constant . In conclusion the MSE can be bounded as

where, c is a constant, we assumed that is uniformly upper-bounded for any i, we defined , and used the bound .

From (13) we can write the error on the quality estimates as . We can then upper-bound the term by using the infinity norm as follows

thanks to the fact that H is substochastic. Let . Then we can write

which converges to 0 as under the conditions ii) and iii), as stated in Appendix F, Proposition F.1.

The last statement of the proposition can be proved by considering a sequence of operators mapping the sequence of matrices with bounded norm into a set of vectors with uniformly bounded -norm. Let . We know that, as , with a probability larger than . Then, we can write where and is such that . It follows that with a probability larger than .

Let us consider the directed graph on N nodes. The proof of the proposition descends from the fact that (18) provides an explicit expression for operator , i.e.,

where

and is the length of the longest (simple) path from i to the reference node n. Now, we denote with the set of paths from i to n, and for any path we denote with L(p) and p(h) respectively the graph-theoretical length (expressed in number of hops) of p and h-th node along p. By inspection, it can be easily seen that the previous expression can be rewritten as follows:

From the above expression, it is rather immediate to check that

whereis the length of the longest path to node n on . Thus, ifis finite and , also is finite, and condition i) of Proposition 4.3 is demonstrated. Observe that, if we interpret as the elementary reward associated to edge (p(h), p(h + 1)), can be regarded as the total reward associated to path p. Furthermore, is equal to

the probability, for a Random Walker (henceforth, RWer) on that starts in i and ends in n, to take path p. As a result, can be interpreted as the expected total reward accumulated by the RWer starting in i and stopping as soon as it reaches node n.

Now, assuming , we have that, for i = 1, . . . , n,

P{RWer is still active after t hops}

where represents the average stopping (i.e., absorbing) time for the RWer. Therefore, from the above considerations we can write if . Now, we show that under assumptions ii) .

Let D denote the diameter of . By construction, from any node i < n, there exists one path of length at most D that leads to node n. We call the shortest path from i to n as critical. Define the following events:

A[k] = {RWer is still active after t = kD hops}

RWer in i (active) at time t = kD } RWer takes the critical path from i

We have

Now, uniformly over i,

Therefore, since by construction we have that:

from which we get:

The assertion concerning condition ii) follows immediately,

Now, the previous argument can be easily extended under iii). Let D be now any finite integer sufficiently large and keep the same definitions of the events and . We denote with B the set of proximal nodes, i.e., of the nodes satisfying the property. Let alsobe the set of non-proximal nodes. We qualify as critical any path that, from a given node, reaches the reference node within hops. In particular let be a uniform lower bound to the fraction of in-edges connecting an arbitrary node v to B.

Note that, by construction, at every instant the RWer, if active, is visiting a node in B with a probability at least . Therefore:

Then, proceeding as before we get the result.

The proof descends from the analysis of (13). Let us define

so that

Thus, from (13), we get

where is defined in (16) for , and we have used the fact that .

Now consider the directed graph

. Without loss of generality, let us consider the case i = 1 and, whenever

possible, let us drop the subscript and write simply

, etc. Let and be the out-neighbors and in-neighbors of node j, respectively. We first prove the following lemma.

Lemma D.1: With the previous definitions,

Proof: For 1 < j < N, we have on the indirect graph G

Now, from the definition of

E , it turns out that , where is the subset of neighbors of

j such that . Reordering the terms in (39), we obtain (38). For j = 1, we can proceed in the same way, with the additional remark that is empty.

Consider a perturbed Random Walker (RWer), which starts from node 1 and, given that it has reached node

j, proceeds through edge

E with probability defined in (17). The proof of Proposition 4.4 consists in

showing that the probability for the RWer to pass through edge

We know by definition that . Moreover, for satisfies the linear equation

H be the random-walk Laplacian matrix for the perturbed random walk defined above, i.e., (

Let also

. Finally, let be the row vector of node probabilities. Now, (41) can be written in matrix form as

which can be solved univocally as

Now notice also that satisfies (41), thanks to the definition of in (17) and Lemma D.1.

Then, since (41) has a unique solution given by (43), we conclude that .

As a consequence, the probability for the RWer to pass through edge

Thus, the average total reward accumulated by the RWer before being absorbed will be given by

which coincides with (37). This concludes the proof.

The proof descends from the fact that (18) provides an explicit expression for the operator :

where is the set of paths from i to n, and, for any path and are the graph-theoretical length (expressed in number of hops) of and h-th node along , respectively. Notice that (45) is similar to (28), except for the expression of the probabilities associated to the graph edges.

Notice that conditions i) and ii) of Prop. 3.5 coincide with conditions i) and iii) of Prop. 3.3. The proof of Prop. 3.5 then follows immediately from the fact that the proofs of conditions i) and iii) of Prop. 3.3 do not explicitly depend on the expression of the edge probabilities.

Proposition F.1: For any and , there exists such that, as ,

< δ and P

provided that with and the total number of edges of graph G is |E| = O(N). Proof: We first use the union bound and write . We then observe that the MGF of is given by:

Then we bound by applying the Chernoff bound:

By setting , and , for a sufficiently large , we have

. Now, for sufficiently large, we can always

by imposing that , we get that for the more general case.

Similarly, the term also tends to 0 as N grows. As for the second claim of the proposition we can write again . We then recall that , , and . It follows that

By defining the convergence of to 0 as N grows immediately follows. Similarly, it is straightforward to prove the convergence to 0 of the term .