b

DiscoverSearch
About
My stuff
Fenton-Wilkinson Order Statistics and German Tanks: A Case Study of an Orienteering Relay Race
2019·arXiv
Abstract
Abstract

Ordinal regression falls between discretevalued classification and continuous-valued regression. Ordinal target variables can be associated with ranked random variables. These random variables are known as order statistics and they are closely related to ordinal regression. However, the challenge of using order statistics for ordinal regression prediction is finding a suitable parent distribution. In this work, we provide a case study of a real-world orienteering relay race by viewing it as a random process. For this process, we show that accurate order statistical ordinal regression predictions of final team rankings, or places, can be obtained by assuming a lognormal distribution of individual leg times. Moreover, we apply Fenton-Wilkinson approximations to intermediate changeover times alongside an estimator for the total number of teams as in the notorious German tank problem. The purpose of this work is, in part, to spark interest in studying the applicability of order statistics in ordinal regression problems.

Machine learning includes various classification, regression, clustering and dimensionality reduction methods. In classification models, the target is a discrete class, while in regression, the target is typically a continuous variable. Ordinal regression, also known as ordinal clas-sification, is regression with a target that is discrete and ordered. Ordinal classification can be thus considered to be a hybrid of classification and regression.

Typical applications of ordinal classification include age estimation with an integer-valued target, advertising systems, recommender systems, and movie ratings on a scale from 1 to 5. For further insight into recent developments in machine learning, including ordinal regression, the reader is kindly directed to (Gutierrez et al., 2016; Cao et al., 2019; Vargas et al., 2019; Wang and Zhu, 2019; Gambella et al., 2019) and the references therein.

In this work, we perform a case study of ordinal regression on the ranks of a sorted sum of random variables corresponding to the duration of an orienteering relay viewed as a random process with a large number of subsamples of changeover times. We compare three widelyused regression schemes to our original method that is based on manipulations of certain properties of ordered random variables, otherwise known as order statistics.

We study whether expectations of order statistics in conjunction with statistical inference can forecast final team places when certain intricate, educated guesses are made about the underlying random process. Specifically, we assume lognormality of both individual leg times and, more importantly, team changeover times.

Generally, in competitive sports, the final place is the hard, quantitative result that both individuals and teams want to minimize. For our case study, both plots and numerical prediction error values show that ordinal regression based on lognormal order statistics of time duration can provide an accurate fit to real-world relay race data.

2.1 Relay Race System Model

Consider an orienteering relay race. Let n denote the number of finishing teams as we ignore disqualified and retired teams. Each team has m runners and each runner runs one leg. Especially note that m is given, whereas n is estimated, as will be discussed later.

We say that leg time results correspond to random variables  {Zi}with  i ∈ {1, 2, . . . , m}. We define the changeover time  T (l)after leg  l ∈ {1, 2, . . . , m}as

image

The final team result list of the relay race is a length-n sample of  T (m)sorted in ascending order.

Let  FT (m)(·)denote the cdf of  T (m), and  T (m)rm:nde- note the  rthmorder statistic of a length-n sample of  T (m). These order statistics satisfy  T (m)1:n ≤ T (m)2:n ≤ · · · ≤T (m)rm:n ≤ · · · ≤ T (m)n:n. We refer to  rmas the (final) place of the corresponding team, where  rm = 1corresponds to the winning team.

Let c denote the number of training observations. We are given a changeover time training vector  xl =�x(l)1 , . . . , x(l)c �and a final place training vector y =

[y1, . . . , yc]. Hence,  xlconsists of realizations of  T (l)

and y includes the corresponding observed places. We wish to find regression functions  h(l)(·)that satisfy

image

as accurately as possible. In (2), we use vector notation to emphasize that we find the parameters of  h(l)(·)that minimize a loss function with a set of observation pairs rather than with a single observation pair.

2.2 The Four Regression Models

1) Linear regression refers here to Ordinary Least Squares (OLS) regression rounded to the nearest integer. OLS finds the intercept and slope that minimize the residual sum of squares between the observed targets and the targets predicted by the linear approximation.

2) Gaussian Process (GP) regression is a nonparametric model that can manage exact regression up to a million data points on commodity hardware (Wang et al., 2019).

For a pair of training vectors  (xl, y), a GP is defined by its kernel function  k(·, ·), a  c×ckernel matrix  Kxlxlwith covariance values for all training point pairs, and a c-dimensional vector  kxlxwith evaluations of the kernel function between training points  xland a given point x. A GP predicts an unknown function  g(·). For kernel matrix �Kxlxl = Kxlxl + σ20Iwith additive Gaussian noise with zero mean and variance  σ20, the expected value of the zero mean GP predictive posterior distribution with a Gaussian likelihood is (Rasmussen and Williams, 2006)

image

We use (3) rounded to the nearest integer as the GP place predictor. The GP regression function thus becomes

image

For practical, numerical implementation of exact GP, we utilize the readily available GPyTorch library with a radial basis function (RBF) kernel as in the “GPyTorch Regression Tutorial” on (GPyTorch, 2019).

3) Ordinal regression refers here to the rounded to the nearest integer regression-based model from the readily available Python mord package for ordered ordinal ridge regression. This model overwrites the ridge regression function from the scikit-learn library and uses the (minus) absolute error as its score function (Mord, 2019; Pedregosa-Izquierdo, 2015).

4) Fenton-Wilkinson Order Statistics (FWOS) regression is our original regression model. For this model we make the following two well-educated assumptions.

Assumption 1: Individual leg time  Ziis lognormal.

Assumption 2: Changeover time  T (l)is lognormal.

The lognormal distribution often appears in sciences (Limpert et al., 2001). Assumption 1 is based on the lognormality of vehicle travel time (Chen et al., 2018).

Assumption 2 paraphrases what in the literature is known as the Fenton-Wilkinson approximation (Wilkin- son, 1934; Fenton, 1960; Cobb, 2012). The FentonWilkinson approximation method is the method of approximating the distribution of the sum of lognormal random variables with another lognormal distribution.

Note that  Ziand  T (l)are both lognormal and independent but not identically distributed. With this in mind, we can now derive the FWOS regression prediction function h(l)OS(·). For this purpose we use the following two well- known preliminary tools in probability theory.

Tool 1: Let W follow the standard uniform distribution U(0, 1) and  T (l)follow distribution F. Let  Wr:ndenote the  rthorder statistic of a length-n sample of W. The  rthorder statistic of a length-n sample of  T (l)has the same distribution as the inverse cdf of F at  Wr:n

Tool 2: The  rthstandard uniform order statistic follows Beta(r, n − r + 1). Therefore,  E(Wr:n) = r/(n + 1).

The inverse cdf of F is known as the quantile function QF (·). Tool 1 can be therefore expressed as

image

where “d=” reads “has the same distribution as”, and applying Tool 2 to (5) hence yields

image

Let F be the lognormal distribution with cdf  FT (l)(·). Now (6) directly implies  FT (l)�E(T (l)r:n)�= r/(n + 1)and, further, most interestingly for our purposes, that

image

which resembles (2) as r is associated with y.

For large n, as in our case study, it is reasonable to assume that  ∀x ∈ R+, ∃rsuch that

image

The well-known lognormal cdf is given by

image

where  Φ(·)is the standard normal cdf,  µland  σlare the lognormal parameters and  log(·)is the natural logarithm.

We combine (7), (8) and (9), and define the FWOS place prediction function for leg l as

image

where  [·]again denotes rounding to the nearest integer.

Maximum likelihood estimation (MLE) for the normal distribution yields lognormal estimators for  µland  σlwhen  x(l)iare replaced by  q(l)i := log x(l)i. Hence,

image

What remains to be done is finding an estimate for the total number of teams n. We assume that there are no ties, which is equivalent to stating that the elements in y are unique. Thus, y is a sample, without replacement, of the discrete uniform distribution U[1, n]. Let D denote a random variable that follows this distribution. Now recall that y is a random c-subset of {1, 2, . . . , n}.

Estimating the parameter n of U[1, n] with a sample drawn without replacement is in the literature known as the German tank problem (Ruggles and Brodie, 1947), a solution to which is a uniformly minimum-variance unbiased estimator (UMVUE) (Goodman, 1952)

image

is the realization of the  cthorder statistic (maximum) of a length-c sample of D.

We replace n and  (µl, σl)in (10) with the  ˆnof (12) and ( ˆµl, ˆσl), respectively, and note that numerical values for ( ˆµl, ˆσl)are obtained through (11). We obtain the following proposition.

Proposition 1. For a training dataset of c observations, lognormal parameter estimates  (ˆµl, ˆσl)that are obtained through the maximum likelihood estimators of changeover time observations  xl, and  d(c), the maximum of final place observations y, the FWOS ordinal regression function

image

predicts final place with changeover time  x ∈ R+at changeover l.

Loosely speaking, Proposition 1 states that the FWOS method approximates final place with expected changeover place. We anticipate that this approximation holds to a satisfactory degree and that it improves with l.

We note that c, the dimension of the training vectors, is constant for all changeovers l as is  ˆn, the estimate of n, while the pair  (µl, σl)is separately estimated for each changeover l with changeover time training vector  xl.

Further note that FWOS regression only requires an easily acquirable pair�c, d(c)�of dimension 2 as opposed to the other three regression methods that require the whole place training vector y of dimension c.

Real-world data are acquired from the results of Jukola 2019 (Jukola, 2019) with n = 1653 teams with m = 7 runners per team. Training of the regression models is conducted for two cases: for  c = 1322 (c/n ≈ 80%)and also for  c = 82 (c/n ≈ 5%)training observations. The rest of the data are used for testing the models.

image

Figure 1: Place predictions (orange) at changeover l = 4 with training set size 80% and test set (blue) size 20%.

image

Figure 2: Place predictions (orange) at changeover l = 4 with training set size 5% and test set (blue) size 95%.

image

Figure 3: Place prediction root-mean-square errors (RMSEs) after each leg.

3.1 Regression Curves

Figure 1 plots place against team changeover time after leg l = 4 for the predictions (orange) and the test set (blue) for all the four regression models. The size of the training set is 80% and the size of the test set is 20%.

We see that linear regression and GP regression provide accurate fits for a large portion of the test set points, but behave poorly when time is large. For average teams, with approximately 350 to 550 minutes of elapsed time after four legs, place seems to grow linearly with time, though overall linearity is clearly an oversimplification.

We further notice that ordinal regression captures the effect where place saturates for large values of time, but fails to provide a smooth transition. FWOS regression, unlike the other models, does indeed exhibit the smooth sigmoidal behavior of the data.

In the setting of Fig. 2, there are significantly fewer training data compared to the setting of Fig. 1, namely, 5% compared to 80%. Interestingly, linear regression, ordinal regression and FWOS regression maintain high performance, whereas GP regression greatly suffers from the lack of training data.

Overall, we make the following two key remarks.

Remark 1: The “place against changeover time” curve resembles a scaled lognormal cumulative distribution function. This further suggests that elite teams “pull away” from the rest of the teams, while extremely slow teams fall far behind the rest.

Remark 2: A lognormal cumulative distribution function, alongside a German tank estimator, effectively predicts places with changeover times even when the number of training observations is small.

3.2 Root-Mean-Square Errors

Here we illustrate the error between place prediction h(l)(xi)and the corresponding true value  yiwith label i ∈ {1, 2, . . . , v}, where v is the size of the test set with n = c + v. We use the root-mean-square error (RMSE)

image

Fig. 3a plots the RMSEs after each of the m = 7 changeovers for a random test set of v = 331 points (v/n ≈ 0.20as in Fig. 1), while Fig. 3b plots the RMSEs for a random test set of v = 1571 points (v/n ≈ 0.95as in Fig. 2).

When comparing Fig. 3a with Fig. 3b, we notice similar RMSEs for both training set sizes, except for GP regression. It is clear that GP regression requires more training data than the other regression models to achieve comparable prediction error performance.

In every case, FWOS exhibits the best RMSE performance. However, the RMSEs are not zero even for the  7thchangeover, i.e., after the anchor leg when the team fin-ishes. This is due to the imperfections of the regression models and the random fluctuations of the data.

For our case study of an orienteering relay race, we have shown that rankings of ordered sums, here referred to as places, can be relatively accurately predicted by intermediate sums by first taking an educated guess that leg times are lognormal, and then by scaling the cumulative distribution function of another lognormal distribution corresponding to observed changeover times. The latter distribution can be approximated by the Fenton-Wilkinson method, while the scaling factor can be estimated with a well-known solution to the German tank problem.

The use of order statistics for ordinal regression provides a powerful approximation for our case study. However, intricate prior insight into the underlying random process is assumed, which is unnecessary for nonparametric models such as the Gaussian process. Similarly, linear regression and standard ordinal regression require no knowledge about the underlying distributions.

An event that may complicate leg time distribution fit-ting and erode our prediction performance is the restart, where the runners that have not started before a specific cut-off time start simultaneously. Restarts may skew the leg time distributions in the same way as the first leg runners that often run in packs. Running in packs may decrease our prediction accuracy.

Fenton-Wilkinson order statistics, alongside German tank estimators, could be applied whenever a longer random process can be modeled as a sum of shorter random processes with random durations and when the outcomes of the longer process are uniquely ordered by the corresponding total time duration. Such processes are plentiful in sports – those sources of inspiration for ordinal classi-fication order statistics research.

Cao, W., Mirjalili, V., and Raschka, S. (2019). Rank- consistent ordinal regression for neural networks. [online] Available: arXiv:1901.07884.

Chen, P., Tong, R., Lu, G., and Wang, Y. (2018). Ex- ploring travel time distribution and variability patterns using probe vehicle data: Case study in beijing. J. Adv. Transp.

Cobb, R. R. B. (2012). Approximating the distribution of a sum of log-normal random variables. In Proc. 6th Eur. Workshop Probab. Graph. Models.

Fenton, L. F. (1960). The sum of log-normal probabil- ity distibutions in scattered transmission systems. IRE Trans. Commun. Syst., 8:57–67.

Gambella, C., Ghaddar, B., and Naoum-Sawaya, J. (2019). Optimization models for machine learning: A survey. [online] Available: arXiv:1901.05331.

Goodman, L. A. (1952). Serial number analysis. J. Amer. Statist. Assoc., 47(270):622–634.

GPyTorch (2019). [online] https://gpytorch.ai.

Gutierrez, P. A., Perez-Ortiz, M., Sanchez-Monedero, J., Fernandez-Navarro, F., and Hervas-Martinez, C.

(2016). Ordinal regression methods: survey and experimental study. IEEE Trans. Knowl. and Data Eng., 28(1):127–146.

Jukola (2019). [online] https://www.jukola.com/2019/.

Limpert, E., Stahel, W. A., and Abbt, M. (2001). Log- normal distributions across the sciences: Keys and clues. Bioscience, 51:341–352.

Mord (2019). [online] https://pythonhosted.org/mord/.

P¨a¨akk¨onen, J., Dharmawansa, P., Freij-Hollanti, R., Hol- lanti, C., and Tirkkonen, O. (2019). File size distributions and caching for offloading. In Proc. IEEE Signal Process. Advances in Wireless Commun. (SPAWC).

Pedregosa-Izquierdo, F. (2015). Feature extraction and supervised learning on fMRI: from practice to theory. PhD thesis, Universit´e Pierre et Marie Curie.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaus- sian processes for machine learning. The MIT Press.

Ruggles, R. and Brodie, H. (1947). An empirical ap- proach to economic intelligence in world war ii. J. Amer. Statist. Assoc., 42(237):72–91.

Vargas, V.-M., Guti´errez, P.-A., and Herv´as-Mart´ınez, C. (2019). Cumulative link models for deep ordinal clas-sification. [online] Available: arXiv:1905.13392.

Wang, K. A., Pleiss, G., Gardner, J. R., Tyree, S., Wein- berger, K. Q., and Wilson, A. G. (2019). Exact gaussian processes on a million data points. [online] Available: arXiv:1903.08114.

Wang, L. and Zhu, D. (2019). Tackling multiple ordinal regression problems: Sparse and deep multi-task learning approaches. [online] Available: arXiv:1907.12508.

Wilkinson, R. I. (1934). Unpublished, cited in 1967. Bell Telephone Labs.


Designed for Accessibility and to further Open Science