1.1 Pairwise Comparison Data and the Bradley-Terry Model
Pairwise comparison data are very common in daily life, especially in cases where the goal is to rank several objects. Rather than directly ranking all objects simultaneously, it is usually much easier and more ef-ficient to first obtain results of pairwise comparisons and then use them to derive a global ranking across all individuals in a principled manner. Since the required global rankings are not directly observable, developing a statistical framework for the estimating rankings is a challenging unsupervised learning problem. One such
Proceedings of the 23International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).
statistical model for deriving global rankings using pairwise comparisons was presented in the classic paper (Bradley and Terry, 1952), and thereafter commonly referred to as the Bradley-Terry model in the literature. A similar model was also studied by Zermelo (Zermelo, 1929). The Bradley-Terry model is one of the most popular models to analyze paired comparison data due to its interpretability and computational efficiency in parameter estimation. The Bradley-Terry model along with its variants has been studied and applied in various ranking applications across many domains. This includes the ranking of sports teams (Masarotto and Varin, 2012; Cattelan et al., 2013; Fahrmeir and Tutz, 1994), scientific journals (Stigler, 1994; Varin et al., 2016), and the quality of several brands (Agresti, 2013; Radlinski and Joachims, 2007), to name a few.
In order to introduce the Bradley-Terry model, suppose that we have N distinct teams, each with a positive score , quantifying its propensity to be picked or win over other items. The model postulates that the comparisons between different pairs are independent and the results of the comparisons between a given pair, say team i and team j, are independent and identically distributed Bernoulli random variables, with winning probability defined as
A common way to parametrize the model is to set, for each ), where (
) are real parameters such that
is to ensure identifiability). In this case, equation (1) is usually expressed as
, where, for
1.2 The time-varying (dynamic) Bradley-Terry Model
In many applications it is very common to observe paired comparison data spanning over multiple (discrete) time periods. A natural question of interest is then to understand how the global rankings vary over time i.e. dynamically. For example, in sports analytics the performance of teams often changes across match rounds and thus explicitly incorporating the time-varying dependence into the model is crucial. In particular the paper (Fahrmeir and Tutz, 1994) considers a state-space generalization of the Bradley-Terry model to analyze sports tournaments data. In a similar manner, bayesian frameworks for the dynamic BradleyTerry model are studied further in (Glickman, 1993; Glickman and Stern, 1998; Lopez et al., 2018). Such dynamic ranking analysis is becoming increasingly important because of the rapid growth of openly available time-dependent paired comparison data.
Our main focus in this paper is to tackle the problem of generalizing the Bradley-Terry model to the time-varying setting with statistical guarantees. Our approach estimates the changes in the Bradley-Terry model parameters over time nonparametrically. This enables the derivation of time-varying global dynamic rankings with minimal parametric assumptions. Unlike previous time-varying Bradley-Terry estimation approaches, we seek to establish guarantees in the model-agnostic setting where the Bradley-Terry model is not the true data generating process. This is in contrast to more assumption-heavy parametric frequentist dynamic Bradley-Terry models including (Cattelan et al., 2013). Our method is also computationally effi-cient compared to some state-space methods including but not limited to (Fahrmeir and Tutz, 1994; Glickman and Stern, 1998; Maystre et al., 2019).
2.1 Model Setup
In our time-varying generalization of the original Bradley-Terry model we assume that N distinct teams play against each other at possibly different times, over a given time interval which, without loss of generality, is taken to be [0, 1]. The result of a game between team i and team j at time t is determined by the timestamped winning probability ) := P(i defeats j at time t) which we assume arises from a distinct Bradley-Terry model, one for each time point. In detail, for each i, j, and t
where is an unknown vector such that
We observe the outcomes of M timestamped pairwise matches among the Here (
) indicates that team
played at time
, where
. The result of the m-th match can be expressed in a
data matrix as follows:
Our goal is to estimate the underlying parameters where
[0, 1], and then derive the corresponding global ranking of teams. In order to make the estimation problem tractable we assume that the parameters
) vary smoothly as a function of time
[0, 1]. It is worth noting that the naive strategy of estimating the model parameters separately at each observed discrete time point on the original data will suffer from two major drawbacks: (i) it will in general not guarantee smoothly varying estimates and, perhaps more importantly, (ii) computing the maximum likelihood estimator (MLE) of the parameters in the static Bradley-Terry model may be infeasible due to sparsity in the data (e.g., at each time point we may observe only one match), as we discuss below in 4. To overcome these issues we propose a nonparametric methodology which involves kernel-smoothing the observed data over time.
2.2 Estimation
Our approach in estimating time-varying global rankings is described in the following three step procedure:
1. Data pre-processing: Kernel smooth the pairwise comparison data across all time periods. This is used to obtain the smoothed pairwise data at each time t:
where is an appropriate kernel function with bandwidth h, which controls the extent of data smoothing. The higher the value of h is, the smoother ˜
) is over time.
2. Model fitting: Fit the regular Bradley-Terry model on the smoothed data ˜). The model estimates the performance of each team at time t using the estimated score vector
3. Derive global rankings: Rank each team at time t by its score from ˆ
We observe that if then step 1 reduces to the original (static) Bradley-Terry model. In this case,
where . Thus, fitting the model on ˜X(t) in step 2 is equivalent to the original method on data X. In this sense our proposed ranking approach represents a time-varying generalization of the original Bradley-Terry model.
This data pre-processing is a critical step in our method and is similar to the approach adopted in (Zhou et al., 2010) where it was used in the context of estimating smoothed time varying undirected graphs. This approach has two main advantages. First, applying kernel smoothing on the input pairwise comparison data enables borrowing of information across timepoints. In sparse settings, this reduces the data requirement at each time point to meet necessary and sufficient conditions required for the Bradley-Terry model to have a unique solution as detailed in Section 4. Second, kernel smoothing is computationally efficient in a single dimension and is readily available in open source scientific libraries.
Our main contributions in this paper are summarized as follows:
1. We obtain necessary and sufficient conditions for the existence and uniqueness for our time-varying estimator ) for each
[0, 1] simulataneously. See Section 4.
2. We extend the results of Simons and Yao (1999) and obtain statistical guarantees for our proposed method in the form of convergence results of the estimated model parameters uniformly over all times. We express such guarantees in the form of oracle inequalities in the model-agnostic setting where the Bradley-Terry model is not necessarily the true data generating process. See Section 5.
3. We apply our estimator with an data-driven tuned (by LOOCV) hyperparameter successfully to simulations and to real life applications including a comparison to 5 seasons of NFL ELO ratings. See Section 6 and Section 7.
The existence and uniqueness of solutions for model (5) is not guaranteed in general. This is an innate property of the original Bradley-Terry model (Bradley and Terry, 1952). As pointed out in (Ford, 1957) existence of the MLE for the Bradley-Terry model parameters demands a sufficient amount of pairwise comparison data so that there is enough information of relative performance between any pair of two teams for parameter estimation purposes. For example, if there is a team which has never been compared to the others, there is no information which the model can exploit to assign a score for the team. As such its derived rank could be arbitrary. In addition if there are several teams which have never outperformed the others then the Bradley-Terry model would assign negative infinity for the performance of these teams. It would not be possible to compare amongst them for global ranking purposes. In all such cases, the model parameters are not estimable.
Ford (1957) derived the necessary and sufficient condition for the existence and uniqueness of the MLE in the original Bradley-Terry model. Below we show how this condition can also be adapted to guarantee the existence and uniqueness of the solution in our time-varying Bradley-Terry model. The condition can be stated for each time t in terms of the corresponding kernel-smoothed data ˜X(t) as follows:
Condition 4.1. In every possible partition of the teams into two nonempty subsets, some team i in the first set and some team j in the second set satisfy ˜. Or equivalently, for each ordered pair (i, j), there exists a sequence of indices
such that ˜
Remark 1. If we regard [] as the adjacency matrix of a (weighted) directed graph, then Condition 4.1 is equivalent to the strong connectivity of the graph.
Under condition 4.1 we obtain the following existence and uniqueness theorem on the solution set of the time-varying Bradley-Terry model.
Theorem 4.1. If the smoothed data ˜X(t) satisfies Condition 4.1, then the solution of (5) uniquely exists at time t.
Hence, in the proposed time-varying Bradley-Terry model we do not require the strong conditions of (Ford, 1957) to be met at each time point, but simply require the aggregated conditions in Theorem 4.1 to hold. This is a significant weakening of the data requirement. For example, even if one team did not play any game in a match round – a situation that would preclude the MLE in the standard Bradley-Terry model – it is still possible to assign a rank to this team in their missing round, as long a game is recorded in another round (with at least one win and one loss). In this sense, the kernel-smoothing of the data in our time-varying BradleyTerry model reduces the required sample complexity for a unique solution.
5.1 Preliminaries
Existing results (Simons and Yao, 1999; Negahban et al., 2017) demonstrate the consistency of the estimated static Bradley-Terry scores provided that the data were generated from the Bradley-Terry model. However this assumption may be too restrictive for data generation processes in real world applications. In the rest of this section, we will consider model-agnostic time-varying settings where the Bradley-Terry model is not necessarily the true pairwise data generating model. In order to investigate the statistical properties of the proposed estimator, we impose the following relatively mild assumptions.
Assumption 5.1. Each pair of teams (times at time points
, where each
satisfy the following conditions, for fixed constants
2. for every interval (
We remark that the second condition further implies that
for k = 1, 2, . . . , T
Assumption 5.1 allows for different team pairs to play against each other a different number of times and for the game times to be spaced irregularly, though in a controlled manner. To enable statistical analyses on time-varying quantities, we further require that the winning probabilities satisfy a minimal degree of smoothness and that their rate of decay is controlled. [0, 1]
dent to N and T.
The quantity need not be bounded away from 0 as function of T and N. However, in order to guarantee estimability of the model parameters in time-varying settings, we will need to control the rate at which it is allowed to vanish. See Theorem 5.1 below.
Finally, we assume that the kernel used to smooth over time satisfy the following regularity conditions, which are quite standard in the nonparametric literature.
Assumption 5.3. The kernel function W : (is a symmetric function such that
where V(f(x)) is the total variation of a function f. For each we further write
It is easy to see that these conditions imply
Thus, without loss of generality, we assume that 1; the general case can be handled by modifying the constants accordingly. The use of kernels satisfying the above assumptions is standard in nonparametric problems involving H¨older continuous functions of order 1, such as the winning probabilities function of Assumption 5.2.
5.2 Existence and uniqueness of solutions
Simons and Yao (1999) showed that the necessary and sufficient condition for the existence and uniqueness of the MLE in the original Bradley-Terry model is satisfied asymptotically almost surely under minimal assumptions. Below, we show that this type of result can be extended to our more general time-varying settings.
Theorem 5.1.
Remark 2. As we remarked above, needs not be bounded away from zero, but is allowed to vanish slowly enough in relation to N and T so that condition (5.1) is fulfilled as long as
5.3 Oracle Properties
In our general agnostic time-varying setting the Bradley-Terry model is not assumed to be the true data generating process. It follows that, for each t, there is no true parameter to which to compare the estimator defined in (5). Instead, we may compare it to the projection parameter , which is the best approximation to the winning probabilities at time t using the dynamic Bradley Terry model; see (2). In detail, the oracle parameter is defined as
where
We note that, when the winning probabilities obey a Bradley-Terry model, the projection parameter corresponds to the true model parameters.
Next, for each fixed time 0, we introduce two quantities, namely
), that can be thought of as conditions numbers of sort, affecting directly both the estimation and prediction accuracy of the proposed estimator. In detail, we set
and
where ˜) + ˜
) and ˜
). The ratio M(t) quantifies the maximal discrepancy in winning scores among all possible pairs at time t, and, as shown in Simons and Yao (1999), determines the consistency rate of the MLE in the traditional Bradely-Terry model (see also the comments following Remark 3 below). The quantity
) is instead a measure of regularity in how evenly the teams play against each other. In particular
t and h when there is a constant number of matches among each pair of teams, for each time. Since we allow for the possibility of different number of matches between teams and across time, it becomes necessary to quantity such degree of design irregularity.
In order to verify the quality of the proposed estimator ), we will consider high-probability oracle bound on estimation error
. In the following theorems, we present both a point-wise and a uniform in t version of this bound in the asymptotic regime of
and under only minimal assumptions on the ground-truth winning probabilities
be the probability that Condition 4.1 fails and suppose that the kernel bandwidth is chosen as
for any and some universal constant
ing only to
, and W. Then, for each fixed time
and sufficiently large N and T,
with probability at least 1 as long as the right hand side is smaller than
Next, we strengthen our previous result, which is valid for each fixed time t, to a uniform guarantee over the entire time course.
be the probability that Condition 4.1 fails and suppose that the kernel bandwidth is chosen as
and that W is -Lipschitz. Then, for sufficiently large N and T,
with probability at least 1 as long as the right hand side is smaller than
Remark 3. The rate of point-wise convergence for the estimation error implied by the previous result is
while the rate for uniform convergence is
M(t)
Importantly, as we can see in the previous results, the proposed time varying estimator ˆ) is consistent only provided that the design regularity parameter
) goes to zero. Of course, if all teams play each other a constant number of times, then
In general, however, the impact of the design on the estimation accuracy needs to be assessed on a case-by-case basis.
The rate (18) should be compared with the convergence rate to the true parameters under the static Bradley Terry model, which (Simons and Yao, 1999) show to be ). Thus, not surprisingly, in the more challenging dynamic settings with smoothly varying winning probabilities the estimation accuracy decreases. The exponent of
in the rate (18) matches the familiar rate for estimating H¨older continuous function of order 1.
From (18) and (19) we observe that the desired oracle property on estimated parameters requires rate constraints on M(t). These constraints appear to be strong assumptions without a direct connection to our model-agnostic setting. Instead, we circumvent this issue by introducing a more interpretable condition number K (or
), dependent on N, T, given by
and proving that for each fixed time (0, 1) our desired oracle property follows from a bound on M(t).
Theorem 5.4. Under the conditions in Theorem 5.2
and under the conditions Theorem 5.3
with probability at least 1
Remark 4. We note that assuming to be bounded away from 0 ensures sup
0, with high probability. This assumption means that no team is uniformly dominated by or dominates others (since this implies 1
) is bounded away from 1). This is a reasonable assumption in real-world data such as sports match histories where teams are screened to be competitive with each other. Therefore it is reasonable to only consider matches between teams that do not have vastly different skills, which is reflected in winning probabilities that are bounded away from {0, 1}.
In summary, our proposed method achieves high-probability oracle bounds on the estimation error in our general model-agnostic time-varying setting. We provide the proofs for the stated theorems in Appendix Section 9.1.
We compare our method with some other methods on synthetic data. We consider both cases where the pairwise comparison data are generated from the Bradley-Terry model and from a different, nonparametric model.
6.1 Bradley-Terry Model as the True Model
First we conduct simulation experiments in which the Bradley-Terry model is the true model. Given the number of teams N and the number of time points M, the synthetic data generation process is as follows:
1. For ], simulate
as described below;
2. For 1 and
], set
and simulate X(t) by
(1 +
For each ], we generate
from a Gaussian process GP(
)) as follows:
1. Set the values of the mean process [M] and get mean vector
2. Set the values of the variance process ) at (
, to derive Σ
3. Generate a sample from Normal(
In our experiment, we generate the parameter Gaussian process for
for all t. See appendix for full details. We compare the true
, the win rate, and ˆ
by different methods in Fig. 1.
Figure 1: Comparison of and different estimators. First row: true
with our dynamic BT (right); second row: win rate (left), original BT (right).
We use LOOCV to select the kernel parameter h. The CV curve can be found in Section 6 of Appendix. In this relatively sparse simulated data we see that our dynamic Bradley-Terry estimator ˆrecovers the comprehensive global ranking of each team in line with the
Table 1: Comparison of Different estimators with re- sults based on 20 repeats. Rank Diff. is the average absolute difference between the estimated and true rankings of teams. LOO Prob means average leave-one-out prediction error of win/loss. LOO nll means average leave-one-out negative log-likelihood.
original Bradley-Terry model. We also observe that due to the kernel-smoothing that our estimator has relatively more stable paths over time.
Table 1 compares the three estimators across key metrics. The results are averaged over 20 repeats. As expected, in this sparse data setting, our dynamic Bradley-Terry method performs better than the original Bradley-Terry model.
6.2 Model-Agnostic Setting
In our second experiment we adopt a model-agnostic setting where we assume that Bradley-Terry model is not necessarily the true data generating model, as described in Section 5.1. With the same notation, for 1 and
] we first simulate
), and then set
and simulate X(t) by
Binom
and
. To
generate a ), we again use Gaussian process. Specifically, first we generate
) for 1
and
] from a Gaussian process. Then we scale those
)’s uniformly to make the values fall within a range [
]. In our experiment we set
95] and
. The projection parameter
, the win rate, and
by different methods are compared in Fig. 2. Again the kernel parameter h in our model is selected by LOOCV.
Table 2: Comparison of different estimators when the underlying model is not the Bradley-Terry model.
By comparing curves in Fig. 2, we note that our estimator ˆrecovers the global rankings better than the win rate and the original Bradley-Terry model, and produces relatively more stable paths over time. The same conclusion is confirmed by Table 2, which compares the three estimators in some metrics with 20 repetitions.
Figure 2: Comparison of and different estimators when the underlying model is not the Bradley-Terry model. First row: projection
(left), our dynamic BT (right); second row: win rate (left), original BT (right).
Remark 5. In this sparse data setting where ) is fairly small, the original Bradley-Terry model performs worse than our model for two reasons: 1. Condition 4.1 can fail to hold at some time points, whence the MLE does not exist; 2. even when the MLE exists, it can fluctuate significantly over time because of the relatively small sample size at each time point. As we show in Section 9.3.4 in the appendix, when M = 50, N = 50 and
frequency. Still, our model performs much better than the original Bradley-Terry model.
Remark 6. Since our method is aimed at obtaining accurate estimates of smoothly changing beta/rankings with strong prediction power, it may fail to capture some changes in rankings, especially when these changes are relatively small (as in the present case). However the winrate and original Bradley-Terry methods perform much worse, as they appear to miss some true ranking changes while returning many false change points.
Additional details about experiments, including running time efficiency in simulated settings, can be found in the Appendix Section 9.3.
In order to test our model in practical settings we consider ranking National Football League (NFL) teams over multiple seasons. Specifically we source 5 seasons of openly available NFL data from 2011-2015 (inclusive) using the nflWAR package (Yurko et al., 2018). Each NFL season is comprised of N = 32 teams play-
Table 3: BT within season vs. ELO NFL top 10 rankings. Blue: perfect match, yellow: top 10 match. Our dynamic BT model is fitted on 16 rounds of each season, and the ranking of a season is based on the ranking at the last round.
ing M = 16 games each over the season. This means that at each point in time t the pairwise comparison matrix based on scores across all 32 teams is sparsely populated with only 16 entries. We fit our time-varying Bradley Terry estimator over all 16 rounds in the season using a standard Gaussian Kernel and tune h using the LOOCV approach described in section 9.2. In order to gauge whether the rankings produced by our model are reasonable we compare our season-ending rankings (fit over all games played in that season) with the relevant openly available NFL ELO ratings (Paine, 2015). The top 10 season-ending rankings from each method across NFL seasons 2011-2015 are summarized in Table 3.
Based on Table 3 we observe that we roughly match between 6 to 10 of the top 10 ELO teams consistently over all 5 seasons. There is often misalignment with specific ranking values across both ranking methods. We note that the unlike our method, the NFL ELO rankings use pairwise match data and also additional features including an adjustment for margin of victory. This demonstrates an advantage of our model in only requiring the minimal time-varying pairwise match data and smoothness assumptions to deliver comparable results to this more feature rich ELO ranking method. Furthermore, since our model aggregates data across time it can provide a reasonable minimalist ranking benchmark in modern sparse time-varying data settings with limited “expert knowledge e.g. e-sports.
We propose a time-varying generalization of the Bradley-Terry model that captures temporal dependencies in a nonparametric fashion. This enables the modeling of dynamic global rankings of distinct teams in settings in which the parameteres of the ordinary Bradley Terry model would not be estimable.
From a theoretical perspective we adapt the results of (Ford, 1957) to obtain the necessary and sufficient con- dition for the existence and uniqueness of our BradleyTerry estimator in the time-varying setting. We extend the previous analysis of (Simons and Yao, 1999) to derive oracle inequalities on for our proposed method for both the estimation error and the excess risk under smoothness conditions on the winning probabilities. The resulting rates of consistency are of nonparametric type.
From an implementation perspective we provide a general strategy for tuning the kernel bandwidth hyperparameter using an efficient data-driven approach specific to our unsupervised time-varying setting. Finally, we test the practical effectiveness of our estimator under both simulated and real world settings. In the latter case we separately rank 5 consecutive seasons of open National Football League (NFL) team data (Yurko et al., 2018) from 2011-2015. Our NFL ranking results compare favourably to the well-accepted NFL ELO model rankings (Paine, 2015). We thus view our nonparametric time-varying Bradley-Terry estimator as a useful benchmarking tool for other feature-rich time-varying ranking models since it simply relies on the minimalist time-varying score information for modeling.
Agresti, A. (2013). Categorical data analysis. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, third edition.
Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs. I. The method of paired comparisons. Biometrika, 39:324–345.
Cattelan, M., Varin, C., and Firth, D. (2013). Dy- namic Bradley-Terry modelling of sports tournaments. J. R. Stat. Soc. Ser. C. Appl. Stat., 62(1):135–150.
Fahrmeir, L. and Tutz, G. (1994). Dynamic stochastic models for time-dependent ordered paired comparison systems. Journal of the American Statistical Association, 89(428):1438–1449.
Ford, Jr., L. R. (1957). Solution of a ranking problem from binary comparisons. Amer. Math. Monthly, 64(8, part II):28–33.
Glickman, M. E. (1993). Paired comparison models with time-varying parameters. ProQuest LLC, Ann Arbor, MI. Thesis (Ph.D.)–Harvard University.
Glickman, M. E. and Stern, H. S. (1998). A state-space model for national football league scores. Journal of the American Statistical Association, 93(441):25–35.
Lopez, M. J., Matthews, G. J., and Baumer, B. S. (2018). How often does the best team win? A uni-fied approach to understanding randomness in North American sport. Ann. Appl. Stat., 12(4):2483–2516.
Masarotto, G. and Varin, C. (2012). The ranking lasso and its application to sport tournaments. Ann. Appl. Stat., 6(4):1949–1970.
Maystre, L., Kristof, V., and Grossglauser, M. (2019). Pairwise comparisons with flexible time-dynamics. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1236–1246.
Negahban, S., Oh, S., and Shah, D. (2017). Rank centrality: ranking from pairwise comparisons. Oper. Res., 65(1):266–287.
Paine, N. (2015). NFL Elo Ratings Are Back! https://fivethirtyeight.com/features/ nfl-elo-ratings-are-back/.
Radlinski, F. and Joachims, T. (2007). Active explo- ration for learning rankings from clickthrough data. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, pages 570–579, New York, NY, USA. ACM.
Raghavan, P. (1988). Probabilistic construction of de- terministic algorithms: approximating packing integer programs. Journal of Computer and System Sciences, 37(2):130–143.
Simons, G. and Yao, Y.-C. (1999). Asymptotics when the number of parameters tends to infinity in the Bradley-Terry model for paired comparisons. Ann. Statist., 27(3):1041–1060.
Stigler, S. M. (1994). Citation patterns in the journals of statistics and probability. Statist. Sci., 9:94108.
Varin, C., Cattelan, M., and Firth, D. (2016). Statis- tical modelling of citation exchange between statistics journals. J. Roy. Statist. Soc. Ser. A, 179(1):1–63.
von Luxburg, U. (2007). A tutorial on spectral clus- tering. Stat. Comput., 17(4):395–416.
Yurko, R., Ventura, S., and Horowitz, M. (2018). nflwar: A reproducible method for offensive player evaluation in football. arXiv preprint arXiv:1802.00998.
Zermelo, E. (1929). Die Berechnung der TurnierErgebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Math. Z., 29(1):436–460.
Zhou, S., Lafferty, J., and Wasserman, L. (2010). Time varying undirected graphs. Mach. Learn., 80(2-3):295– 319.
9.1 Proofs of Theorems
9.1.1 Proof of Theorem 4.1
Uniqueness of the solution The elements of the Hessian for ˆ) in (6) are:
Note that the Hessian has positive diagonal elements, non-positive off-diagonal elements, and zero column sums. With Condition 4.1, this implies that the Hessian can be regarded as a graph Laplacian for a connected graph. Following the classical proof of the property of graph Laplacian (von Luxburg, 2007),
and only if v = c1. This proves the uniqueness up to constant shifts.
Existence of solution Plugging in , we get an upperbound for the minimum loss function ˆ
ˆ
As ˆ) is continuous with respect to
, it suffices to show that the level set of ˆ
) at log 2 within
is bounded so that it is compact.
Suppose that is in the intersection between the levelset and
. Since each summand of ˆ
) in (6) is non-negative, i.e.,
if i and j satisfies ˜0 then the corresponding summand should be smaller than log 2 so that:
By Condition 4.1, for any distinct i and j, there exists an index sequence (such that
where
In sum,
and this proves the existence part of the theorem.
9.1.2 Proof of Theorem 5.1
The proof of this theorem is based on the proof of Lemma 1 in Simons and Yao (1999).
Since the kernel function W in Assumption 5.3 has support (), ˜
0 if and only if team i defeated team j at least once any time. In other words, if Condition 4.1 holds for at least one time point, then so it does for every time point. Here, we prove that the probability of Condition 4.1 to hold at at least one time point converge to 1 as
Given instead of max
), the prob- ability of the event S that no team in a subset S loses against a team not of S is no larger than
Hence, we bound the probability that data does not meet Condition 4.1 by a union bound
as long as . We note that (1+
+ 1. Hence,
as long as
Since 2 implies 4
to be larger than 1, the probability bound holds for any N,
9.1.3 Proof of Theorem 5.2
For readability, in our notation we will omit the dependence on the time point t in the expressions for ˆ) and
), unless required for clarity.
In our proofs we rely on the results and arguments of Simons and Yao (1999) to demonstrate consistency for the maximum likelihood estimator in the static BradleyTerry model with an increasing number of parameters. In that setting, the authors parametrize the winning probabilities as where ˆ
is the MLE of
it follows from the proof of Lemma 3 of Simons and Yao (1999) that
where ). Next, the authors derived a high probability upper bound on
using the facts that
where is the number of matches in which i defeated j. The second identity comes from the first order optimality condition of ˆ
In our time-varying setting, however, the subgradient optimality of ˆ) only imply that, for each j,
Thus, Eq. (35) does not hold in the dynamic setting, due to different ˜) across all
we have that
and
To make the bias-variance trade-off due to kernel smoothing more explicit, we decompose the term
where, for brevity, and
) here stand for
and
), respectively.
For the first term, we have that
Next, )
1 and hence that multiplicative Chernoff bound (see, e.g. Raghavan, 1988) yields that
for each i as long as
This condition holds for
We note that we have also used the bounds
for any i, j and sufficiently small h, which were shown in Section 9.1.4.
Then using the union bound,
Hence, plugging in
that, with probability at least 1
To handle the deterministic bias terms ∆on the following bound, whose proof is given below in Section 9.1.4.
Lemma 9.1. Suppose that
1. satisfies Eq. (8) and
2.
Then, for a -Lipschitz function
with a universal constant depending only on
Accordingly,
for some constant depending only on
and
Thus, combining all the pieces,
with probability at least 1
Plugging in
leads to the bound
with probability at least 1 when
. We note that, given our choice for
small h, if the right hand side is smaller than, say, then
with probability at least 1
9.1.4 Proof of Lemma 9.1
Since has a finite total variation,
On the other hand, with a similar argument,
=
implying that
As long as
is bounded below from 0 (in particular, we consider a small enough h so that it is bounded below by, say, and the term
and
in Eqs. (57) and (59) become asymptotically negligible. As a result,
where is a universal constant depending only on
, and furthermore
for a univeral constant depending only on
9.1.5 Proof of Theorem 5.3
In Section 9.1.3, we showed that
Since the bound for ∆) depends only on
,
, it is sufficient to find a bound for
We use a covering approach. For , let
for l = 1, 2, . . . , L. Then for any
[0, 1] there exists
where here stands
for brevity.
In order to bound the second term in the curly brackets, we bound each of its summands as follows:
where we denote ),
), and
) for brevity.
We have seen in Section 9.1.4 that, for any a sufficiently small h,
Thus,
as 1 and W is
-Lipschitz by assumption. Hence,
� � � � � �
� � � � �
On the other hand,
where, again,) here stands
simplicity.
Using Eq. (42) and a union bound, we get that
for
Next we plug in
the bounds
and, in turn,
with probability at least 1and as long as
and , we conclude that
with probability at least 1 given the choice of h, this bound holds for all sufficiently small h.
9.1.6 Proof of Theorem 5.4
For convenience, we omit the time index t for ˆ),
), unless it is required for clarification.
We seek to replace M(t) in Eq. (16) by a term of This requires exp(
) to be bounded above by a function of
. The following lemma provides the desired bound. The proof is in Section 9.1.7.
Lemma 9.2.
is upper-bounded by a universal constant, and hence
for some universal constant 1
Plugging in the new bound on exp(
This result easily extends to the uniform case Eq. (21).
9.1.7 Proof of Lemma 9.2
Let be the difference in scores which implies a bias of probability
Suppose that
and that
Then, the maximal difference between
Let . Plugged in
, Eq. (34) implies
Hence,
Now, let . Summing Eq. (34) plugged in
and hence
Similarly for and
Now, without loss of generality we assume that 2
. Then, by the optimality of
Thus, ging in
, we get that
for some universal constant C since the derivative of 2 log (2is positive and converges to 0 as
has a upperbounding tangent line with slope 1
is its y-intercept. This also yields that
for some universal constant . In particular, 1 <
9.2 Tuning kernel bandwidth in practical settings
As noted in Section 2.2, the bandwidth kernel function serves as an effective global smoothing parameter between subsequent time periods and allows to borrow information across contiguous time points. Increasing h, all else held constant, leads to parameter estimates (and hence the derived global rankings) becoming “smoothed” together across time.
Naturally the question remains on how to tune h in practical applications in a principled data-driven manner. This is a fundamentally challenging question not just in our problem but, more generally, in nonparametric regression. Here we present a way to tuning h with some degree of objectivity based on leave-one-out cross-validation (LOOCV).
In general settings where we have independent and identically distributed (i.i.d.) samples, LOOCV assesses the performance of a predictive model on a single held-out i.i.d. sample. In our case, each pairwise comparison can be considered an i.i.d. sample if we take the compared teams and the time point on which they are compared as covariates. Remember that (th pairwise comparison where team
won against team
at time point
for m = 1, . . . , M. Then, for a given smoothing penalty parameter h, LOOCV is adapted to our estimation approach as follows:
(a) fit our model with kernel bandwidth h on the dataset with the m-th comparison held-out;
(b) calculate the negative log-likelihood (nll) of the previous solution to (
2. Take the average of the negative log-likelihoods to obtain nllas a loss in the predictive performance of time-varying Bradley-Terry estimator for given h on our dataset.
3. Choose the bandwith with the smallest nll
value.
We apply this data-driven methodology to the experiments and real-life application in Section 6 and Section 7.
9.3 Details of Experiments
Here we explain some details of the setting of the numerical experiments in Section 6.
9.3.1 Bradley-Terry Model as the True Model
We set the number of teams N = 50 and the number of time points and
For the Gaussian process to generate a path for at t = 1, . . . , M, we use the same covariance matrix Σ
for all
], and Σ is set to be a Teoplitz matrix defined by
and in our experiment we set (1). The mean vector is set to be a constant over time, i.e.,
for t = 1, . . . , M, and
are i.i.d. generated from uniform distribution on [0, 1].
Figure 3: LOOCV curve of our Dynamic Bradley-Terry model fitted with Gaussian kernel. y-axis: averaged negative log-likelihood. The optimal
Fig. 3 shows the curve of LOOCV of our dynamic Bradley-Terry model fitted with a Gaussian kernel in one repetition of our experiment. The curve is for the setting here and for the agnostic model setting the CV curve has similar shape. The curve shows a typical shape of CV curve for tuning parameter. The kernel bandwidth, h, with smallest nllis
03. The LOOCV procedure is described in Section 9.2.
9.3.2 Agnostic Model Setting
Again we set the number of teams N = 50, the number of time points and
]. The covariance matrix is also the same as section 9.3.1. The only difference lies in the mean vector. Now the mean vector is still constant over time, or
for t = 1, . . . , M, but
’s are generated in a following group-wise way:
1. Set the number of groups G and the index set of each group base support to be [0, b] and the group gap to be a.
2. For each , generate
from Uniform(
In our experiment we set G = 5 with each group containing two randomly picked indices, b = 0.5 and a = 1.5. Such group-wise generation intends to ensure that different teams have distinguishable perofrmance in pairwise comparison so that the ranking is more reasonable.
9.3.3 Running Time
Fig. 4 compares the time it takes to fit our model and the original Bradley-Terry model under 3 different settings, where N is the number of teams and M is the number of time points:
• Fix N, vary M.
• Fix M, vary N.
• Vary N and M together while keeping N = M.
Figure 4: Comparison of running time of original Bradley Terry model (oBT) and our Dynamic Bradley Terry model (DBT). The values are averaged over 20 repetitions.
For our dynamic Bradley-Terry model, the running time here is measured for fitting the model with a given kernel parameter h, hence it contains the time cost of kernel smoothing step and the optimization step. In real applications, if one wants to select the best h from a range of values with cross-validation, then the total computation time would be approximately the running time here multiplied by the number of cross-validations.
The results in Fig. 4 shows that with all advantages our model can bring with, it does not cost much more in terms of computation time. Furthermore, when the number of time points M is large while N is relatively small, our model can cost even less time than the original Bradley-Terry model.
If one wants to do LOOCV to select h when N and M are huge, then it could take a long time to finish the whole procedure. However, in this case we observed in some extended experiments that with a pre-determined h in a reasonable range, our model can give fairly good estimate close to the one given by the best by LOOCV. The supporting files of our experiments can be found in our GitHub repository
9.3.4 MLE of the Bradley-Terry Model
Table 4 shows the frequency with which Condition 4.1 holds at a single time point for the original pairwise comparison data for different 1 for all i, j, t. To be clear, here we just regard the matrix ˜X(t) in Condition 4.1 as the original data rather than the smoothed data, as it originally was in Ford (1957). Given
, the frequency here refers to #{t : The condition holds for X(t)}/M.
The data are generated as described in Section 6.1, and the frequency is averaged over 50 repetitions. When , the frequency arises to 0.988, illustrating how
) controls the sparsity of the game matrix and consequently whether Condition 4.1 holds or not.
Table 4: Frequency that Condition 4.1 holds at a single time point for the original pairwise comparison data.
As a comparison, under the same setting, for the kernel-smoothed pairwise comparison data, Condition 4.1 always holds in the experiment. This fact demonstrates the advantage of using kernel-smooth, and partly explains why in our experiments where the data is sparse our model performs the best.
The frequencies in Table 4 seem high for N > 20, but from a global perspective, the induced frequency that Condition 4.1 holds for all M time points could be much lower. Table 5 shows such frequency in some settings. Again the values are averaged over 50 repetitions. Remember that in these settings the condition always holds for kernel-smoothed data.
Table 5: Frequency that Condition 4.1 holds for all M time points for the original pairwise comparison data.
To make it clearer how ) affects the global connectivity, we make Table 6. In the table we fix (N, M) = (10, 10).
Table 6: Frequency that Condition 4.1 holds for all M time points for the original pairwise comparison data. (N, M) = (10, 10).
Figure 5: Divergence of maxwhen MLE does not exist at some time points for the original Bradley-Terry model.
By direct inspection of the likelihood of the original Bradley-Terry model, it can be seen that, when the MLE does not exist, the norm of ˆwill go to infinity if one uses gradient descent without any regularization. Fig. 5 shows an example where N = M = 10 and